Recently, I encountered a hang during a memory hot-remove operation.
It turns out that the hang is caused by pinned user pages in
ZONE_MOVABLE.
The kernel expects that all pages in ZONE_MOVABLE can be migrated, but
this is not the case if a user application (for example, through the
DPDK libraries) has pinned them via a VFIO DMA map. The kernel keeps
trying to hot-remove them, but the refcount never drops to zero, so we
loop until the hardware watchdog kicks in.
We cannot do DMA unmaps before hot-remove, because hot-remove is a
slow operation, and we have thousands of network flows handled by
DPDK that we simply cannot suspend for the duration of the hot-remove
operation.
The solution is for DPDK to allocate pages from a zone below
ZONE_MOVABLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible:
there is no user interface that allows applications to select which
zone the memory should come from.
I've spoken with Stephen Hemminger, and he said that DPDK is moving in
the direction of using transparent huge pages instead of HugeTLB pages,
which means that we need to allow at least anonymous pages and anonymous
transparent huge pages to come from non-movable zones on demand.
Here is what I am proposing:
1. Add a new flag that is passed through pin_user_pages_* down to
fault handlers, and allow the fault handler to allocate from a
non-movable zone.
A sample call stack through which this information needs to be passed:
pin_user_pages_remote(gup_flags)
 __get_user_pages_remote(gup_flags)
  __gup_longterm_locked(gup_flags)
   __get_user_pages_locked(gup_flags)
    __get_user_pages(gup_flags)
     faultin_page(gup_flags)
      Convert gup_flags into fault_flags
      handle_mm_fault(fault_flags)
From handle_mm_fault(), the stack diverges into various faults,
examples include:
Transparent Huge Page
handle_mm_fault(fault_flags)
 __handle_mm_fault(fault_flags)
  Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
  create_huge_pmd(vmf);
   do_huge_pmd_anonymous_page(vmf);
    mm_get_huge_zero_page(vma->vm_mm); -> flag is lost, so flag from
    vmf.gfp_mask should be passed as well.
There are several other similar paths in the transparent huge page
code. There is also a named (file-backed) path where allocation is done
by the filesystem; the flag should be honored there as well, but it does
not have to be added at the same time.
Regular Pages
handle_mm_fault(fault_flags)
 __handle_mm_fault(fault_flags)
  Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
  handle_pte_fault(vmf)
   do_anonymous_page(vmf);
    page = alloc_zeroed_user_highpage_movable(vma, vmf->address); ->
    change this call according to gfp_mask.
The above only takes care of the case where the user application faults
on the page at pinning time, but there are also cases where the pages
already exist.
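
To make the flag plumbing in item 1 concrete, here is a minimal user-space sketch of how the request could travel from gup_flags to fault_flags and finally into the gfp_mask of the fault structure. Every name and value below (the SK_* constants, struct sk_vm_fault, the sk_* helpers) is made up for illustration and is not an existing kernel definition; only the idea of clearing the movable bit is taken from the proposal above.

```c
#include <stdbool.h>

/* Hypothetical flag values chosen for this sketch; not the kernel's. */
#define SK_FOLL_NOMOVABLE       0x1u /* gup flag: long-term pin, avoid ZONE_MOVABLE */
#define SK_FAULT_FLAG_NOMOVABLE 0x1u /* fault flag carrying the same request */
#define SK_GFP_HIGHUSER         0x2u /* stand-in for the non-movable part of the mask */
#define SK_GFP_MOVABLE          0x8u /* stand-in for __GFP_MOVABLE */

/* Stand-in for struct vm_fault: only the field the proposal relies on. */
struct sk_vm_fault {
    unsigned int gfp_mask;
};

/* Models the faultin_page() step: convert gup_flags into fault_flags. */
static unsigned int sk_gup_to_fault_flags(unsigned int gup_flags)
{
    unsigned int fault_flags = 0;

    if (gup_flags & SK_FOLL_NOMOVABLE)
        fault_flags |= SK_FAULT_FLAG_NOMOVABLE;
    return fault_flags;
}

/* Models the __handle_mm_fault() step: build vmf with the right gfp_mask. */
static struct sk_vm_fault sk_make_vmf(unsigned int fault_flags)
{
    struct sk_vm_fault vmf = { .gfp_mask = SK_GFP_HIGHUSER | SK_GFP_MOVABLE };

    /* Drop the movable bit so the allocation cannot land in ZONE_MOVABLE. */
    if (fault_flags & SK_FAULT_FLAG_NOMOVABLE)
        vmf.gfp_mask &= ~SK_GFP_MOVABLE;
    return vmf;
}

/* True if an allocation using this vmf could still come from ZONE_MOVABLE. */
static bool sk_may_use_zone_movable(const struct sk_vm_fault *vmf)
{
    return (vmf->gfp_mask & SK_GFP_MOVABLE) != 0;
}
```

The point of the sketch is only that the decision is made once, when the fault structure is created, and every allocation site further down reads it from gfp_mask instead of hard-coding a movable allocation.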
2. Add an internal move_pages_zone(), similar to the move_pages()
syscall, but instead of migrating to a different NUMA node, migrate
pages from ZONE_MOVABLE to another zone.
Call move_pages_zone() on demand prior to pinning pages, for instance
from vfio_pin_map_dma().
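
The ordering in item 2 (migrate first, pin second) can be modeled in a few lines of user-space C. This is only a sketch under simplifying assumptions: struct sk_mpage, the zone enum, and the "one free normal-zone page per migration" accounting are all invented here, and sk_move_pages_zone() merely stands in for the proposed move_pages_zone(), which does not exist yet.

```c
#include <stddef.h>

enum sk_zone { SK_ZONE_NORMAL, SK_ZONE_MOVABLE };

/* Toy page descriptor: which zone the page lives in and how often it is pinned. */
struct sk_mpage {
    enum sk_zone zone;
    int pincount;
};

/*
 * Stand-in for the proposed move_pages_zone(): move every page that sits
 * in the movable zone to the normal zone, consuming one free normal-zone
 * page per move. Returns the number of pages migrated, or -1 if the
 * normal zone runs out of free pages.
 */
static int sk_move_pages_zone(struct sk_mpage *pages, size_t n, size_t *free_normal)
{
    int moved = 0;

    for (size_t i = 0; i < n; i++) {
        if (pages[i].zone != SK_ZONE_MOVABLE)
            continue;
        if (*free_normal == 0)
            return -1;
        (*free_normal)--;
        pages[i].zone = SK_ZONE_NORMAL;
        moved++;
    }
    return moved;
}

/*
 * Pin only after migration succeeded, so that no pinned page is left in
 * the movable zone. On failure nothing is pinned, although pages that
 * were already migrated stay where they are.
 */
static int sk_pin_longterm(struct sk_mpage *pages, size_t n, size_t *free_normal)
{
    if (sk_move_pages_zone(pages, n, free_normal) < 0)
        return -1;
    for (size_t i = 0; i < n; i++)
        pages[i].pincount++;
    return 0;
}
```

The failure branch is where the open question lives: what the caller should do when the non-movable zones have no room left is not decided by this sketch.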
3. Perhaps it also makes sense to add an madvise() flag to allocate
pages from a non-movable zone. When a user application knows that it
will do DMA mapping and pin pages for a long time, the memory that it
allocates should never be migrated or hot-removed, so we should make
sure that it comes from the appropriate place.
The benefit of adding an madvise() flag is that we won't have to deal
with slow page migration at pin time; the disadvantage is that we
would need to change the user interface.
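
For reference, this is roughly what the application side of item 3 would look like. The sketch uses only calls that exist today (mmap() and madvise() with MADV_HUGEPAGE); the non-movable hint itself is hypothetical, so it appears only as a comment. The helper name sk_alloc_dma_buffer() and its hint_ok out-parameter are assumptions made for this example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS and madvise() under strict modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map an anonymous buffer intended for DMA and apply an madvise() hint.
 * Returns the mapping, or NULL if mmap() fails. *hint_ok reports whether
 * the kernel accepted the hint; a rejected hint is not treated as fatal,
 * because hints of this kind are optimizations.
 */
static void *sk_alloc_dma_buffer(size_t len, int *hint_ok)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED)
        return NULL;

    /* Existing hint: ask for transparent huge pages in this range. */
    *hint_ok = (madvise(buf, len, MADV_HUGEPAGE) == 0);

    /*
     * A non-movable hint, if one were ever added, would be passed in the
     * same way, before the range is first touched.
     */
    return buf;
}
```

The hint has to be given before the pages are faulted in; otherwise the memory may already sit in ZONE_MOVABLE and would have to be migrated anyway.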
Before I start working on the above approaches, I would like to get an
opinion from the community on an appropriate path forward for this
problem. Please let me know whether what I described sounds reasonable,
or whether there are other ideas on how to address the problem that I
am seeing.
Thank you,
Pasha
> On 20.11.2020 at 21:28, Pavel Tatashin <[email protected]> wrote:
>
> Recently, I encountered a hang that is happening during memory hot
> remove operation. It turns out that the hang is caused by pinned user
> pages in ZONE_MOVABLE.
>
> Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
> this is not the case if a user applications such as through dpdk
> libraries pinned them via vfio dma map. Kernel keeps trying to
> hot-remove them, but refcnt never gets to zero, so we are looping
> until the hardware watchdog kicks in.
>
> We cannot do dma unmaps before hot-remove, because hot-remove is a
> slow operation, and we have thousands for network flows handled by
> dpdk that we just cannot suspend for the duration of hot-remove
> operation.
>
Hi!
It's a known problem for VMs using vfio as well. I thought about this a while ago and came to the same conclusion: before performing long-term pinnings, we have to migrate pages off the movable zone. After that, it's too late.
What happens when we can't migrate (OOM on !MOVABLE memory, short-term pinning)? TBD.
> The solution is for dpdk to allocate pages from a zone below
> ZONE_MOVABLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
> There is no user interface that we have that allows applications to
> select what zone the memory should come from.
>
> I've spoken with Stephen Hemminger, and he said that DPDK is moving in
> the direction of using transparent huge pages instead of HugeTLBs,
> which means that we need to allow at least anonymous, and anonymous
> transparent huge pages to come from non-movable zones on demand.
>
> Here is what I am proposing:
> 1. Add a new flag that is passed through pin_user_pages_* down to
> fault handlers, and allow the fault handler to allocate from a
> non-movable zone.
>
> Sample function stacks through which this info needs to be passed is this:
>
> pin_user_pages_remote(gup_flags)
> __get_user_pages_remote(gup_flags)
> __gup_longterm_locked(gup_flags)
> __get_user_pages_locked(gup_flags)
> __get_user_pages(gup_flags)
> faultin_page(gup_flags)
> Convert gup_flags into fault_flags
> handle_mm_fault(fault_flags)
>
> From handle_mm_fault(), the stack diverges into various faults,
> examples include:
>
> Transparent Huge Page
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> create_huge_pmd(vmf);
> do_huge_pmd_anonymous_page(vmf);
> mm_get_huge_zero_page(vma->vm_mm); -> flag is lost, so flag from
> vmf.gfp_mask should be passed as well.
>
> There are several other similar paths in a transparent huge page, also
> there is a named path where allocation is based on filesystems, and
> the flag should be honored there as well, but it does not have to be
> added at the same time.
>
> Regular Pages
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> handle_pte_fault(vmf)
> do_anonymous_page(vmf);
> page = alloc_zeroed_user_highpage_movable(vma, vmf->address); ->
> replace change this call according to gfp_mask.
>
> The above only take care of the case if user application faults on the
> page during pinning time, but there are also cases where pages already
> exist.
>
> 2. Add an internal move_pages_zone() similar to move_pages() syscall
> but instead of migrating to a different NUMA node, migrate pages from
> ZONE_MOVABLE to another zone.
> Call move_pages_zone() on demand prior to pinning pages from
> vfio_pin_map_dma() for instance.
>
> 3. Perhaps, it also makes sense to add madvise() flag, to allocate
> pages from non-movable zone. When a user application knows that it
> will do DMA mapping, and pin pages for a long time, the memory that it
> allocates should never be migrated or hot-removed, so make sure that
> it comes from the appropriate place.
> The benefit of adding madvise() flag is that we won't have to deal
> with slow page migration during pin time, but the disadvantage is that
> we would need to change the user interface.
>
Hm, I am not sure we want to expose these details. What would be the semantics? "Might pin"? Hm, not sure.
Assume you start a fresh VM via QEMU with vfio. When we start mapping guest memory via vfio, that's usually the time memory will get populated. Not really much has to be migrated. I think this is even true during live migration.
I think selective DMA pinning (e.g., vIOMMU in QEMU) is different, where we keep pinning/unpinning on demand. But I guess even here, we will often reuse some pages over and over again.
> Before I start working on the above approaches, I would like to get an
> opinion from the community on an appropriate path forward for this
> problem. If what I described sounds reasonable, or if there are other
> ideas on how to address the problem that I am seeing.
At least 1 and 2 sound sane. 3 is TBD - but it's a pure optimization, so it can wait.
Thanks!
>
> Thank you,
> Pasha
>
On Fri, Nov 20, 2020 at 09:59:24PM +0100, David Hildenbrand wrote:
>
> Hi!
>
> It's a known problem for VMs using vfio as well. I thought about this a while ago and came to the same conclusion: before performing long-term pinnings, we have to migrate pages off the movable zone. After that, it's too late.
We can't, though. VMs using vfio pin their entire address space (right?)
so we end up with basically all of the !MOVABLE memory used for VMs and
the MOVABLE memory goes unused (I'm thinking about the case of a machine
which only hosts VMs and has nothing else to do with its memory). In
that case, the sysadmin is going to reconfigure ZONE_MOVABLE away, and
now we just don't have any ZONE_MOVABLE. So what's the point?
ZONE_MOVABLE can also be pinned by mlock() and other such system calls.
The kernel needs to understand that ZONE_MOVABLE memory may not actually
be movable, and skip the unmovable stuff.
> On 20.11.2020 at 22:17, Matthew Wilcox <[email protected]> wrote:
>
> We can't, though. VMs using vfio pin their entire address space (right?)
> so we end up with basically all of the !MOVABLE memory used for VMs and
> the MOVABLE memory goes unused (I'm thinking about the case of a machine
> which only hosts VMs and has nothing else to do with its memory). In
> that case, the sysadmin is going to reconfigure ZONE_MOVABLE away, and
> now we just don't have any ZONE_MOVABLE. So what's the point?
When the guest is using a vIOMMU, it will only pin what's currently mapped by the guest into the vIOMMU. Otherwise: yes.
If you assume all memory will be used for VMs with vfio, then yes: no ZONE_MOVABLE, no memory hotunplug. If it's only some VMs, it's a different story.
>
> ZONE_MOVABLE can also be pinned by mlock() and other such system calls.
Mlocked pages can be migrated, no? They are simply not swappable iirc.
> The kernel needs to understand that ZONE_MOVABLE memory may not actually
> be movable, and skip the unmovable stuff.
>
Then you don't have unplug guarantees; memory unplug is broken by design. In that case there is no point in optimizing at all, and we just tell customers "vfio and memory hotunplug are incompatible". The only ugly thing is the endless loop.
On Fri, Nov 20, 2020 at 4:34 PM David Hildenbrand <[email protected]> wrote:
>
>
> > On 20.11.2020 at 22:17, Matthew Wilcox <[email protected]> wrote:
> > We can't, though. VMs using vfio pin their entire address space (right?)
> > so we end up with basically all of the !MOVABLE memory used for VMs and
> > the MOVABLE memory goes unused (I'm thinking about the case of a machine
> > which only hosts VMs and has nothing else to do with its memory). In
> > that case, the sysadmin is going to reconfigure ZONE_MOVABLE away, and
> > now we just don't have any ZONE_MOVABLE. So what's the point?
>
> When the guest is using a vIOMMU, it will only pin what's currently mapped by the guest into the vIOMMU. Otherwise: yes.
Right, not all guest memory needs to be pinned, so ZONE_MOVABLE can
still be used for a vast amount of allocations.
>
> If you assume all memory will be used for VMs with vfio, then yes: no ZONE_MOVABLE, no memory hotunplug. If it's only some VMs, it's a different story.
In such an extreme case it sounds reasonable to assume no hot-plug.
But when you have 8G and need to remove a 2G movable zone, and cannot
guarantee that even with 6G of free memory, that is unreasonable.
>
> >
> > ZONE_MOVABLE can also be pinned by mlock() and other such system calls.
>
> Mlocked pages can be migrated, no? They are simply not swappable iirc.
Yes, mlocked pages simply stay resident in memory, but their contents
can be migrated to a different place in RAM.
>
> > The kernel needs to understand that ZONE_MOVABLE memory may not actually
> > be movable, and skip the unmovable stuff.
> >
>
> Then you don't have unplug guarantees; memory unplug is broken by design. In that case there is no point in optimizing at all, and we just tell customers "vfio and memory hotunplug are incompatible". The only ugly thing is the endless loop.
Right, if memory in ZONE_MOVABLE is not guaranteed to be movable, we
can never guarantee memory hot-remove even when we have a lot of free
memory to migrate to.
>
On Fri, Nov 20, 2020 at 3:59 PM David Hildenbrand <[email protected]> wrote:
>
>
> > On 20.11.2020 at 21:28, Pavel Tatashin <[email protected]> wrote:
> > 3. Perhaps, it also makes sense to add madvise() flag, to allocate
> > pages from non-movable zone. When a user application knows that it
> > will do DMA mapping, and pin pages for a long time, the memory that it
> > allocates should never be migrated or hot-removed, so make sure that
> > it comes from the appropriate place.
> > The benefit of adding madvise() flag is that we won't have to deal
> > with slow page migration during pin time, but the disadvantage is that
> > we would need to change the user interface.
> >
>
> Hm, I am not sure we want to expose these details. What would be the semantics? "Might pin"? Hm, not sure.
The semantics would be "the PA must not change", which is something
DPDK currently expects from huge pages. That expectation, by the way,
does not hold, as huge pages are migratable.
>
> Assume you start a fresh VM via QEMU with vfio. When we start mapping guest memory via vfio, that's usually the time memory will get populated. Not really much has to be migrated. I think this is even true during live migration.
>
> I think selective DMA pinning (e.g., vIOMMU in QEMU) is different, where we keep pinning/unpinning on demand. But I guess even here, we will often reuse some pages over and over again.
>
>
> > Before I start working on the above approaches, I would like to get an
> > opinion from the community on an appropriate path forward for this
> > problem. If what I described sounds reasonable, or if there are other
> > ideas on how to address the problem that I am seeing.
>
> At least 1 and 2 sound sane. 3 is TBD - but it's a pure optimization, so it can wait.
Makes sense. I am also worried about 3, but most madvise() flags
exist purely for optimization purposes: MADV_HUGEPAGE, MADV_SEQUENTIAL,
MADV_WILLNEED, etc.
>
> Thanks!
>
> >
> > Thank you,
> > Pasha
> >
>
> On 20.11.2020 at 22:58, Pavel Tatashin <[email protected]> wrote:
> Makes sense, I am also worried about 3, but most of madvise() flags
> are for pure optimization purposes: MADV_HUGEPAGE, MADV_SEQUENTIAL,
> MADV_WILLNEED etc.
BTW, I assume we should also directly tackle migrating pages off CMA regions when pinning; I guess quite a few people will be interested in that as well.
Have a nice weekend and thanks for looking into this issue :)
On Fri, 20 Nov 2020, Pavel Tatashin wrote:
> Recently, I encountered a hang that is happening during memory hot
> remove operation. It turns out that the hang is caused by pinned user
> pages in ZONE_MOVABLE.
>
> Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
> this is not the case if a user applications such as through dpdk
> libraries pinned them via vfio dma map. Kernel keeps trying to
> hot-remove them, but refcnt never gets to zero, so we are looping
> until the hardware watchdog kicks in.
>
> We cannot do dma unmaps before hot-remove, because hot-remove is a
> slow operation, and we have thousands for network flows handled by
> dpdk that we just cannot suspend for the duration of hot-remove
> operation.
>
> The solution is for dpdk to allocate pages from a zone below
> ZONE_MOVABLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
> There is no user interface that we have that allows applications to
> select what zone the memory should come from.
>
> I've spoken with Stephen Hemminger, and he said that DPDK is moving in
> the direction of using transparent huge pages instead of HugeTLBs,
> which means that we need to allow at least anonymous, and anonymous
> transparent huge pages to come from non-movable zones on demand.
>
I'd like to know more about this use case. ZONE_MOVABLE is typically a
great way to optimize for thp availability because, absent memory pinning,
this memory can always be defragmented. So is the idea that DPDK will now
allocate all of its thp from ZONE_NORMAL, or only a small subset? This
seems like an invitation for oom kill if the sizing of ZONE_NORMAL is
insufficient.
> Here is what I am proposing:
> 1. Add a new flag that is passed through pin_user_pages_* down to
> fault handlers, and allow the fault handler to allocate from a
> non-movable zone.
>
> Sample function stacks through which this info needs to be passed is this:
>
> pin_user_pages_remote(gup_flags)
> __get_user_pages_remote(gup_flags)
> __gup_longterm_locked(gup_flags)
> __get_user_pages_locked(gup_flags)
> __get_user_pages(gup_flags)
> faultin_page(gup_flags)
> Convert gup_flags into fault_flags
> handle_mm_fault(fault_flags)
>
> From handle_mm_fault(), the stack diverges into various faults,
> examples include:
>
> Transparent Huge Page
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> create_huge_pmd(vmf);
> do_huge_pmd_anonymous_page(vmf);
> mm_get_huge_zero_page(vma->vm_mm); -> flag is lost, so flag from
> vmf.gfp_mask should be passed as well.
>
> There are several other similar paths in a transparent huge page, also
> there is a named path where allocation is based on filesystems, and
> the flag should be honored there as well, but it does not have to be
> added at the same time.
>
> Regular Pages
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> handle_pte_fault(vmf)
> do_anonymous_page(vmf);
> page = alloc_zeroed_user_highpage_movable(vma, vmf->address); ->
> replace change this call according to gfp_mask.
>
This would likely be useful for AMD SEV as well, which requires guest
pages to be pinned because the encryption algorithm depends on the host
physical address. This ensures that identical plaintext in two different
pages doesn't result in the same ciphertext.
On Fri 20-11-20 15:27:46, Pavel Tatashin wrote:
> Recently, I encountered a hang that is happening during memory hot
> remove operation. It turns out that the hang is caused by pinned user
> pages in ZONE_MOVABLE.
>
> Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
> this is not the case if a user applications such as through dpdk
> libraries pinned them via vfio dma map.
Long-term or effectively time-unbounded pinning on the movable zone is
fundamentally broken. The sole reason for ZONE_MOVABLE's existence is to
guarantee migratability. If the consumer of this memory cannot guarantee
that, then it shouldn't use __GFP_MOVABLE in the first place.
> Kernel keeps trying to
> hot-remove them, but refcnt never gets to zero, so we are looping
> until the hardware watchdog kicks in.
Yeah, the existing offlining behavior doesn't stop trying because the
current implementation of migration cannot tell the difference between
short-term and long-term failures. Maybe the recent ref count for
long-term pinning can be used to help out there.
Anyway, I am wondering what do you mean by watchdog firing. The
operation should trigger neither of soft, hard or hung detectors.
> We cannot do dma unmaps before hot-remove, because hot-remove is a
> slow operation, and we have thousands for network flows handled by
> dpdk that we just cannot suspend for the duration of hot-remove
> operation.
>
> The solution is for dpdk to allocate pages from a zone below
> ZONE_MOVAVLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
> There is no user interface that we have that allows applications to
> select what zone the memory should come from.
Our existing interface is __GFP_MOVABLE. It is the responsibility of the
driver to know whether the resulting memory is migratable. Users
shouldn't even have to think about that.
> I've spoken with Stephen Hemminger, and he said that DPDK is moving in
> the direction of using transparent huge pages instead of HugeTLBs,
> which means that we need to allow at least anonymous, and anonymous
> transparent huge pages to come from non-movable zones on demand.
You can migrate before pinning.
> Here is what I am proposing:
> 1. Add a new flag that is passed through pin_user_pages_* down to
> fault handlers, and allow the fault handler to allocate from a
> non-movable zone.
gup already tries to deal with long term pins on CMA regions and migrates
to a non CMA region. Have a look at __gup_longterm_locked. Migrating out of
the movable zone sounds like a reasonable solution to me.
> 2. Add an internal move_pages_zone() similar to move_pages() syscall
> but instead of migrating to a different NUMA node, migrate pages from
> ZONE_MOVABLE to another zone.
> Call move_pages_zone() on demand prior to pinning pages from
> vfio_pin_map_dma() for instance.
Why is the existing migration API insufficient?
> 3. Perhaps, it also makes sense to add madvise() flag, to allocate
> pages from non-movable zone. When a user application knows that it
> will do DMA mapping, and pin pages for a long time, the memory that it
> allocates should never be migrated or hot-removed, so make sure that
> it comes from the appropriate place.
> The benefit of adding madvise() flag is that we won't have to deal
> with slow page migration during pin time, but the disadvantage is that
> we would need to change the user interface.
No, ZONE_MOVABLE, like the other zone types, is an internal implementation
detail of the MM. I do not think we want to expose that to userspace
and carve it into stone.
--
Michal Hocko
SUSE Labs
+CC John Hubbard
On 11/20/20 9:27 PM, Pavel Tatashin wrote:
> Recently, I encountered a hang that is happening during memory hot
> remove operation. It turns out that the hang is caused by pinned user
> pages in ZONE_MOVABLE.
>
> Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
> this is not the case if a user applications such as through dpdk
> libraries pinned them via vfio dma map. Kernel keeps trying to
> hot-remove them, but refcnt never gets to zero, so we are looping
> until the hardware watchdog kicks in.
>
> We cannot do dma unmaps before hot-remove, because hot-remove is a
> slow operation, and we have thousands for network flows handled by
> dpdk that we just cannot suspend for the duration of hot-remove
> operation.
>
> The solution is for dpdk to allocate pages from a zone below
> ZONE_MOVAVLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
> There is no user interface that we have that allows applications to
> select what zone the memory should come from.
>
> I've spoken with Stephen Hemminger, and he said that DPDK is moving in
> the direction of using transparent huge pages instead of HugeTLBs,
> which means that we need to allow at least anonymous, and anonymous
> transparent huge pages to come from non-movable zones on demand.
>
> Here is what I am proposing:
> 1. Add a new flag that is passed through pin_user_pages_* down to
> fault handlers, and allow the fault handler to allocate from a
> non-movable zone.
>
> Sample function stacks through which this info needs to be passed is this:
>
> pin_user_pages_remote(gup_flags)
> __get_user_pages_remote(gup_flags)
> __gup_longterm_locked(gup_flags)
> __get_user_pages_locked(gup_flags)
> __get_user_pages(gup_flags)
> faultin_page(gup_flags)
> Convert gup_flags into fault_flags
> handle_mm_fault(fault_flags)
>
> From handle_mm_fault(), the stack diverges into various faults,
> examples include:
>
> Transparent Huge Page
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> create_huge_pmd(vmf);
> do_huge_pmd_anonymous_page(vmf);
> mm_get_huge_zero_page(vma->vm_mm); -> flag is lost, so flag from
> vmf.gfp_mask should be passed as well.
>
> There are several other similar paths in a transparent huge page, also
> there is a named path where allocation is based on filesystems, and
> the flag should be honored there as well, but it does not have to be
> added at the same time.
>
> Regular Pages
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> handle_pte_fault(vmf)
> do_anonymous_page(vmf);
> page = alloc_zeroed_user_highpage_movable(vma, vmf->address); ->
> replace change this call according to gfp_mask.
>
> The above only take care of the case if user application faults on the
> page during pinning time, but there are also cases where pages already
> exist.
Makes sense, as this means no userspace change.
> 2. Add an internal move_pages_zone() similar to move_pages() syscall
> but instead of migrating to a different NUMA node, migrate pages from
> ZONE_MOVABLE to another zone.
> Call move_pages_zone() on demand prior to pinning pages from
> vfio_pin_map_dma() for instance.
As others already said, migrating away before the longterm pin should be
the solution. IIRC it was one of the goals of the long term pinning API
proposed a long time ago by Peter Zijlstra, I think? The implementation
that was merged relatively recently doesn't do that (yet?) for all
movable pages, just CMA, but it could.
> 3. Perhaps, it also makes sense to add madvise() flag, to allocate
> pages from non-movable zone. When a user application knows that it
> will do DMA mapping, and pin pages for a long time, the memory that it
> allocates should never be migrated or hot-removed, so make sure that
> it comes from the appropriate place.
> The benefit of adding madvise() flag is that we won't have to deal
> with slow page migration during pin time, but the disadvantage is that
> we would need to change the user interface.
It's best if we avoid involving userspace until it's shown that the
in-kernel approach is insufficient.
> Before I start working on the above approaches, I would like to get an
> opinion from the community on an appropriate path forward for this
> problem. If what I described sounds reasonable, or if there are other
> ideas on how to address the problem that I am seeing.
>
> Thank you,
> Pasha
>
> > I've spoken with Stephen Hemminger, and he said that DPDK is moving in
> > the direction of using transparent huge pages instead of HugeTLBs,
> > which means that we need to allow at least anonymous, and anonymous
> > transparent huge pages to come from non-movable zones on demand.
> >
>
> I'd like to know more about this use case, ZONE_MOVABLE is typically a
> great way to optimize for thp availability because, absent memory pinning,
> this memory can always be defragmented. So the idea is that DPDK will now
> allocate all of its thp from ZONE_NORMAL or only a small subset? Seems
> like an invitation for oom kill if the sizing of ZONE_NORMAL is
> insufficient.
The idea is to allocate only those THP and anon pages that are long
term pinned from ZONE_NORMAL, the rest can still be allocated from
ZONE_MOVABLE.
>
> > Here is what I am proposing:
> > 1. Add a new flag that is passed through pin_user_pages_* down to
> > fault handlers, and allow the fault handler to allocate from a
> > non-movable zone.
> >
> > Sample function stacks through which this info needs to be passed is this:
> >
> > pin_user_pages_remote(gup_flags)
> > __get_user_pages_remote(gup_flags)
> > __gup_longterm_locked(gup_flags)
> > __get_user_pages_locked(gup_flags)
> > __get_user_pages(gup_flags)
> > faultin_page(gup_flags)
> > Convert gup_flags into fault_flags
> > handle_mm_fault(fault_flags)
> >
> > From handle_mm_fault(), the stack diverges into various faults,
> > examples include:
> >
> > Transparent Huge Page
> > handle_mm_fault(fault_flags)
> > __handle_mm_fault(fault_flags)
> > Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> > create_huge_pmd(vmf);
> > do_huge_pmd_anonymous_page(vmf);
> > mm_get_huge_zero_page(vma->vm_mm); -> flag is lost, so flag from
> > vmf.gfp_mask should be passed as well.
> >
> > There are several other similar paths in a transparent huge page, also
> > there is a named path where allocation is based on filesystems, and
> > the flag should be honored there as well, but it does not have to be
> > added at the same time.
> >
> > Regular Pages
> > handle_mm_fault(fault_flags)
> > __handle_mm_fault(fault_flags)
> > Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> > handle_pte_fault(vmf)
> > do_anonymous_page(vmf);
> > page = alloc_zeroed_user_highpage_movable(vma, vmf->address); ->
> > replace change this call according to gfp_mask.
> >
>
> This would likely be useful for AMD SEV as well, which requires guest
> pages to be pinned because the encryption algorithm depends on the host
> physical address. This ensures that plaintext memory for two pages don't
> result in the same ciphertext.
On Mon, Nov 23, 2020 at 4:01 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 20-11-20 15:27:46, Pavel Tatashin wrote:
> > Recently, I encountered a hang that is happening during memory hot
> > remove operation. It turns out that the hang is caused by pinned user
> > pages in ZONE_MOVABLE.
> >
> > Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
> > this is not the case if a user applications such as through dpdk
> > libraries pinned them via vfio dma map.
>
> Long term or effectively time unbound pinning on zone movable is
> fundamentaly broken. The sole reason of ZONE_MOVABLE existence is to
> guarantee migrateability. If the cosumer of this memory cannot guarantee
> that then it shouldn't use __GFP_MOVABLE in the first place.
Exactly, this is what I am trying to solve, and started this thread to
figure out what is the best approach to address this problem.
>
> > Kernel keeps trying to
> > hot-remove them, but refcnt never gets to zero, so we are looping
> > until the hardware watchdog kicks in.
>
> Yeah, the existing offlining behavior doesn't stop trying because the
> current implementation of the migration cannot tell a diffence between
> short and long term failures. Maybe the recent ref count for long term
> pinning can be used to help out there.
>
> Anyway, I am wondering what do you mean by watchdog firing. The
> operation should trigger neither of soft, hard or hung detectors.
You are right, hot-remove is a killable operation. In our case,
however, systemd stops petting the watchdog during kexec reboot to ensure
that the reboot finishes. Because we hot-remove memory during
shutdown, and the kernel is unable to hot-remove the memory within 60s,
we get a watchdog reset.
>
> > We cannot do dma unmaps before hot-remove, because hot-remove is a
> > slow operation, and we have thousands for network flows handled by
> > dpdk that we just cannot suspend for the duration of hot-remove
> > operation.
> >
> > The solution is for dpdk to allocate pages from a zone below
> > ZONE_MOVAVLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
> > There is no user interface that we have that allows applications to
> > select what zone the memory should come from.
>
> Our existing interface is __GFP_MOVABLE. It is a responsibility of the
> driver to know whether the resulting memory is migratable. Users
> shouldn't even have to think about that.
Sure, so let's migrate existing pages, and fault in new memory from a
non-movable zone, when drivers do long term pinning. That is 1 and 2 in
my proposal.
> > I've spoken with Stephen Hemminger, and he said that DPDK is moving in
> > the direction of using transparent huge pages instead of HugeTLBs,
> > which means that we need to allow at least anonymous, and anonymous
> > transparent huge pages to come from non-movable zones on demand.
>
> You can migrate before pinning.
Yes.
>
> > Here is what I am proposing:
> > 1. Add a new flag that is passed through pin_user_pages_* down to
> > fault handlers, and allow the fault handler to allocate from a
> > non-movable zone.
>
> gup already tries to deal with long term pins on CMA regions and migrate
> to a non CMA region. Have a look at __gup_longterm_locked. Migrating of
> the movable zone sounds like a reasonable solution to me.
Yes, CMA is doing something similar, but it is migrating before
pinning from CMA to movable zone to avoid fragmentation of CMA. What
we need to do is migrate before pinning to a non-movable zone for all
pages.
>
> > 2. Add an internal move_pages_zone() similar to move_pages() syscall
> > but instead of migrating to a different NUMA node, migrate pages from
> > ZONE_MOVABLE to another zone.
> > Call move_pages_zone() on demand prior to pinning pages from
> > vfio_pin_map_dma() for instance.
>
> Why is the existing migration API insufficient?
Here I am talking about the internal implementation, not the user API. We
do not have a function that migrates pages in a user address space from
one zone to another zone. We only have a function, exposed as a
syscall, that migrates pages from one node to another node.
>
> > 3. Perhaps, it also makes sense to add madvise() flag, to allocate
> > pages from non-movable zone. When a user application knows that it
> > will do DMA mapping, and pin pages for a long time, the memory that it
> > allocates should never be migrated or hot-removed, so make sure that
> > it comes from the appropriate place.
> > The benefit of adding madvise() flag is that we won't have to deal
> > with slow page migration during pin time, but the disadvantage is that
> > we would need to change the user interface.
>
> No, the MOVABLE_ZONE like other zone types are internal implementation
> detail of the MM. I do not think we want to expose that to the userspace
> and carve this into stone.
What I mean here is allowing users to guarantee that the page's PA is
going to stay the same. Sort of a stronger mlock. Mlock only
guarantees that the page is not swapped, but something like
MADV_PINNED would guarantee that page is not going to be swapped and
also not migrated. If a user determines the PA of that page, that PA
is going to stay the same throughout the life of the page. This is not
exposing internal implementation in any way; this guarantee could be
honored in various ways, e.g. by pinning or by allocating from ZONE_NORMAL.
The fact that we would honor it by allocating memory from ZONE_NORMAL
is an implementation detail that would not be exposed to the user.
This is from DPDK's description:
https://software.intel.com/content/www/us/en/develop/articles/memory-in-dpdk-part-1-general-concepts.html
"
Whenever a memory area is made available for DPDK to use, DPDK figures
out its physical address by asking the kernel at that time. Since DPDK
uses pinned memory, generally in the form of huge pages, the physical
address of the underlying memory area is not expected to change, so
the hardware can rely on those physical addresses to be valid at all
times, even if the memory itself is not used for some time. DPDK then
uses these physical addresses when preparing I/O transactions to be
done by the hardware, and configures the hardware in such a way that
the hardware is allowed to initiate DMA transactions itself. This
allows DPDK to avoid needless overhead and to perform I/O entirely
from user space.
"
I just think it is inefficient to first allocate memory from
ZONE_MOVABLE, and later migrate it to ZONE_NORMAL.
That said, I agree, we probably should not be adding a new flag at
least as part of this work.
> Makes sense, as this means no userspace change.
>
> > 2. Add an internal move_pages_zone() similar to move_pages() syscall
> > but instead of migrating to a different NUMA node, migrate pages from
> > ZONE_MOVABLE to another zone.
> > Call move_pages_zone() on demand prior to pinning pages from
> > vfio_pin_map_dma() for instance.
>
> As others already said, migrating away before the longterm pin should be
> the solution. IIRC it was one of the goals of long term pinning api
> proposed long time ago by Peter Ziljstra I think? The implementation
> that was merged relatively recently doesn't do that (yet?) for all
> movable pages, just CMA, but it could.
From what I can tell, CMA is not solving exactly this problem. It
migrates pages from CMA before pinning, but it migrates them to
ZONE_MOVABLE. Also, we still need to take care of the fault scenario.
>
> > 3. Perhaps, it also makes sense to add madvise() flag, to allocate
> > pages from non-movable zone. When a user application knows that it
> > will do DMA mapping, and pin pages for a long time, the memory that it
> > allocates should never be migrated or hot-removed, so make sure that
> > it comes from the appropriate place.
> > The benefit of adding madvise() flag is that we won't have to deal
> > with slow page migration during pin time, but the disadvantage is that
> > we would need to change the user interface.
>
> It's best if we avoid involving userspace until it's shown that's it's
> insufficient.
Agree.
Thank you,
Pasha
On Mon, Nov 23, 2020 at 11:06:21AM -0500, Pavel Tatashin wrote:
> What I mean here is allowing users to guarantee that the page's PA is
> going to stay the same. Sort of a stronger mlock. Mlock only
> guarantees that the page is not swapped, but something like
You've just described get/pin_user_pages(), that is exactly what it is
for.
I agree with the other emails, ZONE_MOVABLE needs to be reconciled
with FOLL_LONGTERM - most likely by preventing ZONE_MOVABLE pages from
being returned. This will need migration like CMA does and the point
about faulting is only an optimization to prevent fault then immediate
migration.
Jason
On Mon, Nov 23, 2020 at 12:15 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Mon, Nov 23, 2020 at 11:06:21AM -0500, Pavel Tatashin wrote:
>
> > What I mean here is allowing users to guarantee that the page's PA is
> > going to stay the same. Sort of a stronger mlock. Mlock only
> > guarantees that the page is not swapped, but something like
>
> You've just described get/pin_user_pages(), that is exactly what it is
> for.
You are right. No need for the madvise() flag at all. (The slight
difference of being able to mark memory pinned prior to touching is
really insignificant).
>
> I agree with the other emails, ZONE_MOVABLE needs to be reconciled
> with FOLL_LONGTERM - most likely by preventing ZONE_MOVABLE pages from
> being returned. This will need migration like CMA does and the point
> about faulting is only an optimization to prevent fault then immediate
> migration.
That is right, as the first step we could just do fault and immediate
migration, which is silly, but still better than what we have now.
>
> Jason
On Mon, Nov 23, 2020 at 12:54:16PM -0500, Pavel Tatashin wrote:
> > I agree with the other emails, ZONE_MOVABLE needs to be reconciled
> > with FOLL_LONGTERM - most likely by preventing ZONE_MOVABLE pages from
> > being returned. This will need migration like CMA does and the point
> > about faulting is only an optimization to prevent fault then immediate
> > migration.
>
> That is right, as the first step we could just do fault and immediate
> migration, which is silly, but still better than what we have now.
I was looking at this CMA code lately and would love to see a
cleaner/faster implementation.
If you really understand how this works maybe it is an opportunity to
make it all work better.
Jason
On 11/20/20 12:27 PM, Pavel Tatashin wrote:
> Recently, I encountered a hang that is happening during memory hot
> remove operation. It turns out that the hang is caused by pinned user
> pages in ZONE_MOVABLE.
>
> Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
> this is not the case if a user applications such as through dpdk
> libraries pinned them via vfio dma map. Kernel keeps trying to
> hot-remove them, but refcnt never gets to zero, so we are looping
> until the hardware watchdog kicks in.
>
> We cannot do dma unmaps before hot-remove, because hot-remove is a
> slow operation, and we have thousands for network flows handled by
> dpdk that we just cannot suspend for the duration of hot-remove
> operation.
>
> The solution is for dpdk to allocate pages from a zone below
> ZONE_MOVAVLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
> There is no user interface that we have that allows applications to
> select what zone the memory should come from.
>
> I've spoken with Stephen Hemminger, and he said that DPDK is moving in
> the direction of using transparent huge pages instead of HugeTLBs,
> which means that we need to allow at least anonymous, and anonymous
> transparent huge pages to come from non-movable zones on demand.
>
> Here is what I am proposing:
> 1. Add a new flag that is passed through pin_user_pages_* down to
> fault handlers, and allow the fault handler to allocate from a
> non-movable zone.
I like where the discussion so far (in the other threads) has taken
this. And the current plan also implies, I think, that you can probably
avoid any new flags at all: just check that both FOLL_LONGTERM and
FOLL_PIN are set, and if they are, then make your attempt to migrate
away from ZONE_MOVABLE.
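To make that check concrete, here is a minimal userspace sketch of the
decision (it is not kernel code). The MODEL_* bit values, the enum and
both function names are placeholders I made up for illustration; the
real FOLL_* definitions live in the kernel headers and may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for illustration only; not the kernel's. */
#define MODEL_FOLL_LONGTERM (1u << 0)
#define MODEL_FOLL_PIN      (1u << 1)

enum model_zone { MODEL_ZONE_NORMAL, MODEL_ZONE_MOVABLE };

/* A long-term pin is a request that carries both flags. */
static bool model_is_longterm_pin(unsigned int gup_flags)
{
	const unsigned int both = MODEL_FOLL_LONGTERM | MODEL_FOLL_PIN;

	return (gup_flags & both) == both;
}

/* Migrate only when a long-term pin lands on a movable-zone page. */
static bool model_should_migrate(unsigned int gup_flags, enum model_zone zone)
{
	assert(zone == MODEL_ZONE_NORMAL || zone == MODEL_ZONE_MOVABLE);
	return model_is_longterm_pin(gup_flags) && zone == MODEL_ZONE_MOVABLE;
}
```

With this shape, a short-term pin or a plain get_user_pages() style
lookup leaves movable pages where they are, and only the long-term case
pays for migration.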
>
> Sample function stacks through which this info needs to be passed is this:
>
> pin_user_pages_remote(gup_flags)
> __get_user_pages_remote(gup_flags)
> __gup_longterm_locked(gup_flags)
> __get_user_pages_locked(gup_flags)
> __get_user_pages(gup_flags)
> faultin_page(gup_flags)
> Convert gup_flags into fault_flags
> handle_mm_fault(fault_flags)
I'm pleased that the gup_flags have pretty much been plumbed through all the
main places that they were missing, so there shouldn't be too much required
at this point.
>
> From handle_mm_fault(), the stack diverges into various faults,
> examples include:
>
> Transparent Huge Page
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> create_huge_pmd(vmf);
> do_huge_pmd_anonymous_page(vmf);
> mm_get_huge_zero_page(vma->vm_mm); -> flag is lost, so flag from
> vmf.gfp_mask should be passed as well.
>
> There are several other similar paths in a transparent huge page, also
> there is a named path where allocation is based on filesystems, and
> the flag should be honored there as well, but it does not have to be
> added at the same time.
>
> Regular Pages
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> handle_pte_fault(vmf)
> do_anonymous_page(vmf);
> page = alloc_zeroed_user_highpage_movable(vma, vmf->address); ->
> replace change this call according to gfp_mask.
>
> The above only take care of the case if user application faults on the
> page during pinning time, but there are also cases where pages already
> exist.
>
> 2. Add an internal move_pages_zone() similar to move_pages() syscall
> but instead of migrating to a different NUMA node, migrate pages from
> ZONE_MOVABLE to another zone.
> Call move_pages_zone() on demand prior to pinning pages from
> vfio_pin_map_dma() for instance.
>
> 3. Perhaps, it also makes sense to add madvise() flag, to allocate
> pages from non-movable zone. When a user application knows that it
> will do DMA mapping, and pin pages for a long time, the memory that it
> allocates should never be migrated or hot-removed, so make sure that
> it comes from the appropriate place.
> The benefit of adding madvise() flag is that we won't have to deal
> with slow page migration during pin time, but the disadvantage is that
> we would need to change the user interface.
>
> Before I start working on the above approaches, I would like to get an
> opinion from the community on an appropriate path forward for this
> problem. If what I described sounds reasonable, or if there are other
> ideas on how to address the problem that I am seeing.
>
I'm also in favor of avoiding (3) for now and maybe forever, depending on
how it goes. Good luck... :)
thanks,
--
John Hubbard
NVIDIA
On Mon 23-11-20 11:06:21, Pavel Tatashin wrote:
> On Mon, Nov 23, 2020 at 4:01 AM Michal Hocko <[email protected]> wrote:
> >
> > On Fri 20-11-20 15:27:46, Pavel Tatashin wrote:
> > > Recently, I encountered a hang that is happening during memory hot
> > > remove operation. It turns out that the hang is caused by pinned user
> > > pages in ZONE_MOVABLE.
> > >
> > > Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
> > > this is not the case if a user applications such as through dpdk
> > > libraries pinned them via vfio dma map.
> >
> > Long term or effectively time unbound pinning on zone movable is
> > fundamentaly broken. The sole reason of ZONE_MOVABLE existence is to
> > guarantee migrateability. If the cosumer of this memory cannot guarantee
> > that then it shouldn't use __GFP_MOVABLE in the first place.
>
> Exactly, this is what I am trying to solve, and started this thread to
> figure out what is the best approach to address this problem.
>
> >
> > > Kernel keeps trying to
> > > hot-remove them, but refcnt never gets to zero, so we are looping
> > > until the hardware watchdog kicks in.
> >
> > Yeah, the existing offlining behavior doesn't stop trying because the
> > current implementation of the migration cannot tell a diffence between
> > short and long term failures. Maybe the recent ref count for long term
> > pinning can be used to help out there.
> >
> > Anyway, I am wondering what do you mean by watchdog firing. The
> > operation should trigger neither of soft, hard or hung detectors.
>
> You are right, the hot-remove is killable operation. In our case,
> however, systemd stops petting watchdog during kexec reboot to ensure
> that reboot finishes, however, because we hot-remove memory during
> shutdown, and kernel is unable to hot-remove memory within 60s we get
> a watchdog reset.
Well, this should be worked around quite trivially. You can kill your
attempt before the timeout fires.
[...]
> > > 2. Add an internal move_pages_zone() similar to move_pages() syscall
> > > but instead of migrating to a different NUMA node, migrate pages from
> > > ZONE_MOVABLE to another zone.
> > > Call move_pages_zone() on demand prior to pinning pages from
> > > vfio_pin_map_dma() for instance.
> >
> > Why is the existing migration API insufficient?
>
> Here I am talking about internal implementation not user API. We do
> not have a function that migrates pages in a user address space from
> one zone to another zone. We only have a function that is exposed as a
> syscall that migrates pages from one node to another node.
We do have migrate_pages and its interface should make it trivial enough
that a new general purpose helper shouldn't be really needed.
	struct migration_target_control mtc = {
		.gfp_mask = GFP_USER | __GFP_RETRY_MAYFAIL,
	};
	migrate_pages(&list_of_pages, alloc_migration_target, NULL,
		      (unsigned long)&mtc, MIGRATE_SYNC, MR_PINNING);
Note that MR_PINNING would have to be added.
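For illustration only, the step that builds the list handed to the
migration call could look roughly like the userspace model below. The
sketch_* types and functions are hypothetical stand-ins, not kernel
structures, and the "migration" here merely retargets the zone field.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_zone { SKETCH_ZONE_NORMAL, SKETCH_ZONE_MOVABLE };

/* Hypothetical stand-in for a page; only the zone matters here. */
struct sketch_page {
	enum sketch_zone zone;
};

/*
 * Collect the pages that would go on the migration list: those
 * sitting in the movable zone. Returns how many were collected.
 */
static size_t sketch_collect_movable(struct sketch_page *pages, size_t n,
				     struct sketch_page **list)
{
	size_t count = 0;

	assert(pages != NULL && list != NULL);
	for (size_t i = 0; i < n; i++) {
		if (pages[i].zone == SKETCH_ZONE_MOVABLE)
			list[count++] = &pages[i];
	}
	return count;
}

/* Stand-in for migration: move every collected page to a non-movable zone. */
static void sketch_migrate(struct sketch_page **list, size_t count)
{
	for (size_t i = 0; i < count; i++)
		list[i]->zone = SKETCH_ZONE_NORMAL;
}
```

The point of the model is only that nothing beyond "collect, then
migrate with a non-movable target" is needed before the pin is taken.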
> > > 3. Perhaps, it also makes sense to add madvise() flag, to allocate
> > > pages from non-movable zone. When a user application knows that it
> > > will do DMA mapping, and pin pages for a long time, the memory that it
> > > allocates should never be migrated or hot-removed, so make sure that
> > > it comes from the appropriate place.
> > > The benefit of adding madvise() flag is that we won't have to deal
> > > with slow page migration during pin time, but the disadvantage is that
> > > we would need to change the user interface.
> >
> > No, the MOVABLE_ZONE like other zone types are internal implementation
> > detail of the MM. I do not think we want to expose that to the userspace
> > and carve this into stone.
>
> What I mean here is allowing users to guarantee that the page's PA is
> going to stay the same. Sort of a stronger mlock. Mlock only
> guarantees that the page is not swapped, but something like
> MADV_PINNED would guarantee that page is not going to be swapped and
> also not migrated.
There were some discussions around vmpin/unpin syscalls. These didn't
really lead anywhere. One of the roadblocks was proper accounting, IIRC.
You might want to look for those discussions in email archives.
> If a user determines the PA of that page, that PA
> is going to stay the same throughout the life of the page. This is not
> exposing internal implementation in any way, this guarantee could be
> honored in various ways: i.e. pinned or allocating from ZONE_NORMAL.
> The fact that we would honor it by allocating memory from ZONE_NORMAL
> is implementation detail that would not be exposed to the user.
Jason has already replied to this and I do not have much to add.
[...]
> I just think it is inefficient to first allocate memory from
> ZONE_MOVABLE, and later migrate it to ZONE_NORMAL.
Yes it is inefficient. Is it usual that the memory is already faulted in
when it is pinned?
--
Michal Hocko
SUSE Labs
On Mon 23-11-20 11:31:59, Pavel Tatashin wrote:
[...]
> Also, we still need to take care of the fault scenario.
Forgot to reply to this part. I believe you mean this to be a fault at gup
time, right? Then the easiest way forward would be to either add yet
another scoped flag or (maybe) better to generalize memalloc_nocma_* to
imply that the allocated memory is going to be unmovable, so drop
__GFP_MOVABLE and also forbid CMA. I have to admit that I do not
remember why a long term pin on CMA pages is ok to go to movable but I
strongly suspect this is just shifting the problem around.
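A rough userspace model of such a scoped flag, following the usual
save/restore pattern, is below. All scope_* names and SCOPE_* bit
values are invented for this sketch; a plain static variable stands in
for the per-task flags word.

```c
#include <assert.h>

/* Placeholder bits for illustration; not the kernel's gfp or PF_ values. */
#define SCOPE_GFP_MOVABLE  (1u << 0)
#define SCOPE_GFP_USER     (1u << 1)
#define SCOPE_PF_NOMOVABLE (1u << 0)

/* Stand-in for the per-task flags word. */
static unsigned int scope_task_flags;

/* Enter the scope; return the previous state of the bit for restore. */
static unsigned int scope_nomovable_save(void)
{
	unsigned int old = scope_task_flags & SCOPE_PF_NOMOVABLE;

	scope_task_flags |= SCOPE_PF_NOMOVABLE;
	return old;
}

/* Leave the scope, reinstating whatever the bit was before the save. */
static void scope_nomovable_restore(unsigned int old)
{
	assert((old & ~SCOPE_PF_NOMOVABLE) == 0);
	scope_task_flags = (scope_task_flags & ~SCOPE_PF_NOMOVABLE) | old;
}

/* Filter a gfp mask: inside the scope, drop the movable bit. */
static unsigned int scope_gfp_context(unsigned int gfp)
{
	if (scope_task_flags & SCOPE_PF_NOMOVABLE)
		gfp &= ~SCOPE_GFP_MOVABLE;
	return gfp;
}
```

Returning the old bit from save is what makes nested scopes safe: an
inner restore does not clear a flag that an outer caller still relies on.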
--
Michal Hocko
SUSE Labs
On Mon 23-11-20 11:31:59, Pavel Tatashin wrote:
> > Makes sense, as this means no userspace change.
> >
> > > 2. Add an internal move_pages_zone() similar to move_pages() syscall
> > > but instead of migrating to a different NUMA node, migrate pages from
> > > ZONE_MOVABLE to another zone.
> > > Call move_pages_zone() on demand prior to pinning pages from
> > > vfio_pin_map_dma() for instance.
> >
> > As others already said, migrating away before the longterm pin should be
> > the solution. IIRC it was one of the goals of long term pinning api
> > proposed long time ago by Peter Ziljstra I think? The implementation
> > that was merged relatively recently doesn't do that (yet?) for all
> > movable pages, just CMA, but it could.
>
> From what I can tell, CMA is not solving exactly this problem. It
> migrates pages from CMA before pinning, but it migrates them to
> ZONE_MOVABLE.
CMA suffers from a very similar problem. The existing solution is
migrating out from the CMA region and it allows MOVABLE zones as well
but that is merely an implementation detail and something that breaks
movability on its own. So something to fix up, ideally for both cases.
--
Michal Hocko
SUSE Labs
On 24.11.20 09:43, Michal Hocko wrote:
> On Mon 23-11-20 11:31:59, Pavel Tatashin wrote:
> [...]
>> Also, we still need to take care of the fault scenario.
>
> Forgot to reply to this part. I believe you mean this to be fault at gup
> time, right? Then the easiest way forward would be to either add yet
> another scoped flag or (maybe) better to generalize memalloc_nocma_* to
> imply that the allocated memory is going to be unmovable so drop
> __GFP_MOVABLE and also forbid CMA. I have to admit that I do not
> remember why long term pin on CMA pages is ok to go to movable but I
> strongly suspect this is just shifting problem around.
Agreed.
--
Thanks,
David / dhildenb