This issue was discovered by Jan Stancek as described in
https://lkml.kernel.org/r/[email protected]
Error paths in hugetlb_cow() and hugetlb_no_page() do not properly clean
up reservation entries when freeing a newly allocated huge page. This
issue was introduced with commit 67961f9db8c4 ("mm/hugetlb: fix huge page
reserve accounting for private mappings). That commit uses the information
in private mapping reserve maps to determine if a reservation was already
consumed. This is important in the case of hole punch and truncate as the
pages are released, but reservation entries are not restored.
This patch restores the reserve entries in hugetlb_cow and hugetlb_no_page
such that reserve entries are consistent with the global reservation count.
The huge page reservation code is quite hard to follow, and this patch
makes it even more complex. One thought I had was to change the way
hole punch and truncate work so that private mapping pages are not thrown
away. This would eliminate the need for this patch as well as 67961f9db8c4.
It would change the existing semantics (as seen by the user) in this area,
but I believe the documentation (man pages) say the behavior is unspecified.
This could be a future change as well as rewriting the existing reservation
code to make it easier to understand/maintain. Thoughts?
In any case, this patch addresses the immediate issue.
Mike Kravetz (1):
mm/hugetlb: fix huge page reservation leak in private mapping error
paths
mm/hugetlb.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
--
2.7.4
Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
allocated huge page. If a reservation was associated with the huge
page, alloc_huge_page() consumed the reservation while allocating.
When the newly allocated page is freed in free_huge_page(), it will
increment the global reservation count. However, the reservation entry
in the reserve map will remain. This is not an issue for shared
mappings as the entry in the reserve map indicates a reservation exists.
But, an entry in a private mapping reserve map indicates the reservation
was consumed and no longer exists. This results in an inconsistency
between the reserve map and the global reservation count. This 'leaks'
a reserved huge page.
Create a new routine restore_reserve_on_error() to restore the reserve
entry in these specific error paths. This routine makes use of a new
function vma_add_reservation() which will add a reserve entry for a
specific address/page.
In general, these error paths were rarely (if ever) taken on most
architectures. However, powerpc contained arch specific code that
that resulted in an extra fault and execution of these error paths
on all private mappings.
Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings)
Cc: [email protected]
Reported-by: Jan Stancek <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ec49d9e..418bf01 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1826,11 +1826,17 @@ static void return_unused_surplus_pages(struct hstate *h,
* is not the case is if a reserve map was changed between calls. It
* is the responsibility of the caller to notice the difference and
* take appropriate action.
+ *
+ * vma_add_reservation is used in error paths where a reservation must
+ * be restored when a newly allocated huge page must be freed. It is
+ * to be called after calling vma_needs_reservation to determine if a
+ * reservation exists.
*/
enum vma_resv_mode {
VMA_NEEDS_RESV,
VMA_COMMIT_RESV,
VMA_END_RESV,
+ VMA_ADD_RESV,
};
static long __vma_reservation_common(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr,
@@ -1856,6 +1862,14 @@ static long __vma_reservation_common(struct hstate *h,
region_abort(resv, idx, idx + 1);
ret = 0;
break;
+ case VMA_ADD_RESV:
+ if (vma->vm_flags & VM_MAYSHARE)
+ ret = region_add(resv, idx, idx + 1);
+ else {
+ region_abort(resv, idx, idx + 1);
+ ret = region_del(resv, idx, idx + 1);
+ }
+ break;
default:
BUG();
}
@@ -1903,6 +1917,56 @@ static void vma_end_reservation(struct hstate *h,
(void)__vma_reservation_common(h, vma, addr, VMA_END_RESV);
}
+static long vma_add_reservation(struct hstate *h,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV);
+}
+
+/*
+ * This routine is called to restore a reservation on error paths. In the
+ * specific error paths, a huge page was allocated (via alloc_huge_page)
+ * and is about to be freed. If a reservation for the page existed,
+ * alloc_huge_page would have consumed the reservation and set PagePrivate
+ * in the newly allocated page. When the page is freed via free_huge_page,
+ * the global reservation count will be incremented if PagePrivate is set.
+ * However, free_huge_page can not adjust the reserve map. Adjust the
+ * reserve map here to be consistent with global reserve count adjustments
+ * to be made by free_huge_page.
+ */
+static void restore_reserve_on_error(struct hstate *h,
+ struct vm_area_struct *vma, unsigned long address,
+ struct page *page)
+{
+ if (unlikely(PagePrivate(page))) {
+ long rc = vma_needs_reservation(h, vma, address);
+
+ if (unlikely(rc < 0)) {
+ /*
+ * Rare out of memory condition in reserve map
+ * manipulation. Clear PagePrivate so that
+ * global reserve count will not be incremented
+ * by free_huge_page. This will make it appear
+ * as though the reservation for this page was
+ * consumed. This may prevent the task from
+ * faulting in the page at a later time. This
+ * is better than inconsistent global huge page
+ * accounting of reserve counts.
+ */
+ ClearPagePrivate(page);
+ } else if (rc) {
+ rc = vma_add_reservation(h, vma, address);
+ if (unlikely(rc < 0))
+ /*
+ * See above comment about rare out of
+ * memory condition.
+ */
+ ClearPagePrivate(page);
+ } else
+ vma_end_reservation(h, vma, address);
+ }
+}
+
struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve)
{
@@ -3498,6 +3562,7 @@ retry_avoidcopy:
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
out_release_all:
+ restore_reserve_on_error(h, vma, address, new_page);
put_page(new_page);
out_release_old:
put_page(old_page);
@@ -3680,6 +3745,7 @@ backout:
spin_unlock(ptl);
backout_unlocked:
unlock_page(page);
+ restore_reserve_on_error(h, vma, address, page);
put_page(page);
goto out;
}
--
2.7.4
----- Original Message -----
> From: "Mike Kravetz" <[email protected]>
> To: [email protected], [email protected]
> Cc: "Aneesh Kumar K . V" <[email protected]>, "Naoya Horiguchi" <[email protected]>, "Michal
> Hocko" <[email protected]>, "Kirill A . Shutemov" <[email protected]>, "Hillf Danton"
> <[email protected]>, "Dave Hansen" <[email protected]>, "Jan Stancek" <[email protected]>, "Mike
> Kravetz" <[email protected]>
> Sent: Thursday, 20 October, 2016 5:11:16 AM
> Subject: [PATCH 0/1] mm/hugetlb: fix huge page reservation leak in private mapping error paths
>
> This issue was discovered by Jan Stancek as described in
> https://lkml.kernel.org/r/[email protected]
>
> Error paths in hugetlb_cow() and hugetlb_no_page() do not properly clean
> up reservation entries when freeing a newly allocated huge page. This
> issue was introduced with commit 67961f9db8c4 ("mm/hugetlb: fix huge page
> reserve accounting for private mappings). That commit uses the information
> in private mapping reserve maps to determine if a reservation was already
> consumed. This is important in the case of hole punch and truncate as the
> pages are released, but reservation entries are not restored.
>
> This patch restores the reserve entries in hugetlb_cow and hugetlb_no_page
> such that reserve entries are consistent with the global reservation count.
>
> The huge page reservation code is quite hard to follow, and this patch
> makes it even more complex. One thought I had was to change the way
> hole punch and truncate work so that private mapping pages are not thrown
> away. This would eliminate the need for this patch as well as 67961f9db8c4.
> It would change the existing semantics (as seen by the user) in this area,
> but I believe the documentation (man pages) say the behavior is unspecified.
> This could be a future change as well as rewriting the existing reservation
> code to make it easier to understand/maintain. Thoughts?
>
> In any case, this patch addresses the immediate issue.
Mike,
Just to confirm, I ran this patch on my setup (without the patch from Aneesh)
with libhugetlbfs testsuite in loop for several hours. There were no
ENOMEM/OOM failures, I did not observe resv leak after it finished.
Regards,
Jan
>
> Mike Kravetz (1):
> mm/hugetlb: fix huge page reservation leak in private mapping error
> paths
>
> mm/hugetlb.c | 66
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 66 insertions(+)
>
> --
> 2.7.4
>
>
On 10/20/2016 08:44 AM, Jan Stancek wrote:
>
>
>
>
> ----- Original Message -----
>> From: "Mike Kravetz" <[email protected]>
>> To: [email protected], [email protected]
>> Cc: "Aneesh Kumar K . V" <[email protected]>, "Naoya Horiguchi" <[email protected]>, "Michal
>> Hocko" <[email protected]>, "Kirill A . Shutemov" <[email protected]>, "Hillf Danton"
>> <[email protected]>, "Dave Hansen" <[email protected]>, "Jan Stancek" <[email protected]>, "Mike
>> Kravetz" <[email protected]>
>> Sent: Thursday, 20 October, 2016 5:11:16 AM
>> Subject: [PATCH 0/1] mm/hugetlb: fix huge page reservation leak in private mapping error paths
>>
>> This issue was discovered by Jan Stancek as described in
>> https://lkml.kernel.org/r/[email protected]
>>
>> Error paths in hugetlb_cow() and hugetlb_no_page() do not properly clean
>> up reservation entries when freeing a newly allocated huge page. This
>> issue was introduced with commit 67961f9db8c4 ("mm/hugetlb: fix huge page
>> reserve accounting for private mappings). That commit uses the information
>> in private mapping reserve maps to determine if a reservation was already
>> consumed. This is important in the case of hole punch and truncate as the
>> pages are released, but reservation entries are not restored.
>>
>> This patch restores the reserve entries in hugetlb_cow and hugetlb_no_page
>> such that reserve entries are consistent with the global reservation count.
>>
>> The huge page reservation code is quite hard to follow, and this patch
>> makes it even more complex. One thought I had was to change the way
>> hole punch and truncate work so that private mapping pages are not thrown
>> away. This would eliminate the need for this patch as well as 67961f9db8c4.
>> It would change the existing semantics (as seen by the user) in this area,
>> but I believe the documentation (man pages) say the behavior is unspecified.
>> This could be a future change as well as rewriting the existing reservation
>> code to make it easier to understand/maintain. Thoughts?
>>
>> In any case, this patch addresses the immediate issue.
>
> Mike,
>
> Just to confirm, I ran this patch on my setup (without the patch from Aneesh)
> with libhugetlbfs testsuite in loop for several hours. There were no
> ENOMEM/OOM failures, I did not observe resv leak after it finished.
Thanks for the testing Jan.
I do not have access to a Power system, so I simulated the condition to
test.
--
Mike Kravetz
>
> Regards,
> Jan
>
>>
>> Mike Kravetz (1):
>> mm/hugetlb: fix huge page reservation leak in private mapping error
>> paths
>>
>> mm/hugetlb.c | 66
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 66 insertions(+)
>>
>> --
>> 2.7.4
>>
>>
Mike Kravetz <[email protected]> writes:
> Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
> allocated huge page. If a reservation was associated with the huge
> page, alloc_huge_page() consumed the reservation while allocating.
> When the newly allocated page is freed in free_huge_page(), it will
> increment the global reservation count. However, the reservation entry
> in the reserve map will remain. This is not an issue for shared
> mappings as the entry in the reserve map indicates a reservation exists.
> But, an entry in a private mapping reserve map indicates the reservation
> was consumed and no longer exists. This results in an inconsistency
> between the reserve map and the global reservation count. This 'leaks'
> a reserved huge page.
>
> Create a new routine restore_reserve_on_error() to restore the reserve
> entry in these specific error paths. This routine makes use of a new
> function vma_add_reservation() which will add a reserve entry for a
> specific address/page.
>
> In general, these error paths were rarely (if ever) taken on most
> architectures. However, powerpc contained arch specific code that
> that resulted in an extra fault and execution of these error paths
> on all private mappings.
>
> Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings)
>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
> Cc: [email protected]
> Reported-by: Jan Stancek <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 66 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ec49d9e..418bf01 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1826,11 +1826,17 @@ static void return_unused_surplus_pages(struct hstate *h,
> * is not the case is if a reserve map was changed between calls. It
> * is the responsibility of the caller to notice the difference and
> * take appropriate action.
> + *
> + * vma_add_reservation is used in error paths where a reservation must
> + * be restored when a newly allocated huge page must be freed. It is
> + * to be called after calling vma_needs_reservation to determine if a
> + * reservation exists.
> */
> enum vma_resv_mode {
> VMA_NEEDS_RESV,
> VMA_COMMIT_RESV,
> VMA_END_RESV,
> + VMA_ADD_RESV,
> };
> static long __vma_reservation_common(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr,
> @@ -1856,6 +1862,14 @@ static long __vma_reservation_common(struct hstate *h,
> region_abort(resv, idx, idx + 1);
> ret = 0;
> break;
> + case VMA_ADD_RESV:
> + if (vma->vm_flags & VM_MAYSHARE)
> + ret = region_add(resv, idx, idx + 1);
> + else {
> + region_abort(resv, idx, idx + 1);
> + ret = region_del(resv, idx, idx + 1);
> + }
> + break;
> default:
> BUG();
> }
> @@ -1903,6 +1917,56 @@ static void vma_end_reservation(struct hstate *h,
> (void)__vma_reservation_common(h, vma, addr, VMA_END_RESV);
> }
>
> +static long vma_add_reservation(struct hstate *h,
> + struct vm_area_struct *vma, unsigned long addr)
> +{
> + return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV);
> +}
> +
> +/*
> + * This routine is called to restore a reservation on error paths. In the
> + * specific error paths, a huge page was allocated (via alloc_huge_page)
> + * and is about to be freed. If a reservation for the page existed,
> + * alloc_huge_page would have consumed the reservation and set PagePrivate
> + * in the newly allocated page. When the page is freed via free_huge_page,
> + * the global reservation count will be incremented if PagePrivate is set.
> + * However, free_huge_page can not adjust the reserve map. Adjust the
> + * reserve map here to be consistent with global reserve count adjustments
> + * to be made by free_huge_page.
> + */
> +static void restore_reserve_on_error(struct hstate *h,
> + struct vm_area_struct *vma, unsigned long address,
> + struct page *page)
> +{
> + if (unlikely(PagePrivate(page))) {
> + long rc = vma_needs_reservation(h, vma, address);
> +
> + if (unlikely(rc < 0)) {
> + /*
> + * Rare out of memory condition in reserve map
> + * manipulation. Clear PagePrivate so that
> + * global reserve count will not be incremented
> + * by free_huge_page. This will make it appear
> + * as though the reservation for this page was
> + * consumed. This may prevent the task from
> + * faulting in the page at a later time. This
> + * is better than inconsistent global huge page
> + * accounting of reserve counts.
> + */
> + ClearPagePrivate(page);
> + } else if (rc) {
> + rc = vma_add_reservation(h, vma, address);
> + if (unlikely(rc < 0))
> + /*
> + * See above comment about rare out of
> + * memory condition.
> + */
> + ClearPagePrivate(page);
> + } else
> + vma_end_reservation(h, vma, address);
> + }
> +}
> +
> struct page *alloc_huge_page(struct vm_area_struct *vma,
> unsigned long addr, int avoid_reserve)
> {
> @@ -3498,6 +3562,7 @@ retry_avoidcopy:
> spin_unlock(ptl);
> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> out_release_all:
> + restore_reserve_on_error(h, vma, address, new_page);
> put_page(new_page);
> out_release_old:
> put_page(old_page);
> @@ -3680,6 +3745,7 @@ backout:
> spin_unlock(ptl);
> backout_unlocked:
> unlock_page(page);
> + restore_reserve_on_error(h, vma, address, page);
> put_page(page);
> goto out;
> }
> --
> 2.7.4
Mike Kravetz <[email protected]> writes:
> Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
> allocated huge page. If a reservation was associated with the huge
> page, alloc_huge_page() consumed the reservation while allocating.
> When the newly allocated page is freed in free_huge_page(), it will
> increment the global reservation count. However, the reservation entry
> in the reserve map will remain. This is not an issue for shared
> mappings as the entry in the reserve map indicates a reservation exists.
> But, an entry in a private mapping reserve map indicates the reservation
> was consumed and no longer exists. This results in an inconsistency
> between the reserve map and the global reservation count. This 'leaks'
> a reserved huge page.
>
> Create a new routine restore_reserve_on_error() to restore the reserve
> entry in these specific error paths. This routine makes use of a new
> function vma_add_reservation() which will add a reserve entry for a
> specific address/page.
>
> In general, these error paths were rarely (if ever) taken on most
> architectures. However, powerpc contained arch specific code that
> that resulted in an extra fault and execution of these error paths
> on all private mappings.
>
> Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings)
>
> Cc: [email protected]
> Reported-by: Jan Stancek <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 66 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ec49d9e..418bf01 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1826,11 +1826,17 @@ static void return_unused_surplus_pages(struct hstate *h,
> * is not the case is if a reserve map was changed between calls. It
> * is the responsibility of the caller to notice the difference and
> * take appropriate action.
> + *
> + * vma_add_reservation is used in error paths where a reservation must
> + * be restored when a newly allocated huge page must be freed. It is
> + * to be called after calling vma_needs_reservation to determine if a
> + * reservation exists.
> */
> enum vma_resv_mode {
> VMA_NEEDS_RESV,
> VMA_COMMIT_RESV,
> VMA_END_RESV,
> + VMA_ADD_RESV,
> };
> static long __vma_reservation_common(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr,
> @@ -1856,6 +1862,14 @@ static long __vma_reservation_common(struct hstate *h,
> region_abort(resv, idx, idx + 1);
> ret = 0;
> break;
> + case VMA_ADD_RESV:
> + if (vma->vm_flags & VM_MAYSHARE)
> + ret = region_add(resv, idx, idx + 1);
> + else {
> + region_abort(resv, idx, idx + 1);
> + ret = region_del(resv, idx, idx + 1);
> + }
It is confusing to find ADD_RESV doing region_del, but I don't have
suggestion for a better name.
-aneesh
On 10/23/2016 04:36 AM, Aneesh Kumar K.V wrote:
> Mike Kravetz <[email protected]> writes:
>
>> Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
>> allocated huge page. If a reservation was associated with the huge
>> page, alloc_huge_page() consumed the reservation while allocating.
>> When the newly allocated page is freed in free_huge_page(), it will
>> increment the global reservation count. However, the reservation entry
>> in the reserve map will remain. This is not an issue for shared
>> mappings as the entry in the reserve map indicates a reservation exists.
>> But, an entry in a private mapping reserve map indicates the reservation
>> was consumed and no longer exists. This results in an inconsistency
>> between the reserve map and the global reservation count. This 'leaks'
>> a reserved huge page.
>>
>> Create a new routine restore_reserve_on_error() to restore the reserve
>> entry in these specific error paths. This routine makes use of a new
>> function vma_add_reservation() which will add a reserve entry for a
>> specific address/page.
>>
>> In general, these error paths were rarely (if ever) taken on most
>> architectures. However, powerpc contained arch specific code that
>> that resulted in an extra fault and execution of these error paths
>> on all private mappings.
>>
>> Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings)
>>
>> Cc: [email protected]
>> Reported-by: Jan Stancek <[email protected]>
>> Signed-off-by: Mike Kravetz <[email protected]>
>> ---
>> mm/hugetlb.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 66 insertions(+)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index ec49d9e..418bf01 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1826,11 +1826,17 @@ static void return_unused_surplus_pages(struct hstate *h,
>> * is not the case is if a reserve map was changed between calls. It
>> * is the responsibility of the caller to notice the difference and
>> * take appropriate action.
>> + *
>> + * vma_add_reservation is used in error paths where a reservation must
>> + * be restored when a newly allocated huge page must be freed. It is
>> + * to be called after calling vma_needs_reservation to determine if a
>> + * reservation exists.
>> */
>> enum vma_resv_mode {
>> VMA_NEEDS_RESV,
>> VMA_COMMIT_RESV,
>> VMA_END_RESV,
>> + VMA_ADD_RESV,
>> };
>> static long __vma_reservation_common(struct hstate *h,
>> struct vm_area_struct *vma, unsigned long addr,
>> @@ -1856,6 +1862,14 @@ static long __vma_reservation_common(struct hstate *h,
>> region_abort(resv, idx, idx + 1);
>> ret = 0;
>> break;
>> + case VMA_ADD_RESV:
>> + if (vma->vm_flags & VM_MAYSHARE)
>> + ret = region_add(resv, idx, idx + 1);
>> + else {
>> + region_abort(resv, idx, idx + 1);
>> + ret = region_del(resv, idx, idx + 1);
>> + }
>
> It is confusing to find ADD_RESV doing region_del, but I don't have
> suggestion for a better name.
Thanks for the review Aneesh.
Of course, this naming is the result of shared and private mappings having
completely opposite reserve map semantics. In shared mappings, an entry
in the reserve map indicates a reservation exists. For private mappings,
the absence of an entry in the reserve map indicates a reservation exists.
--
Mike Kravetz
>
> -aneesh
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
On 10/19/2016 08:11 PM, Mike Kravetz wrote:
> Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
> allocated huge page. If a reservation was associated with the huge
> page, alloc_huge_page() consumed the reservation while allocating.
> When the newly allocated page is freed in free_huge_page(), it will
> increment the global reservation count. However, the reservation entry
> in the reserve map will remain. This is not an issue for shared
> mappings as the entry in the reserve map indicates a reservation exists.
> But, an entry in a private mapping reserve map indicates the reservation
> was consumed and no longer exists. This results in an inconsistency
> between the reserve map and the global reservation count. This 'leaks'
> a reserved huge page.
>
> Create a new routine restore_reserve_on_error() to restore the reserve
> entry in these specific error paths. This routine makes use of a new
> function vma_add_reservation() which will add a reserve entry for a
> specific address/page.
>
> In general, these error paths were rarely (if ever) taken on most
> architectures. However, powerpc contained arch specific code that
> that resulted in an extra fault and execution of these error paths
> on all private mappings.
>
> Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings)
>
Any additional comments on this?
It does address a regression with private mappings that appears to only be
visible on powerpc. Aneesh submitted a patch to workaround the issue on
powerpc that is in mmotm/linux-next (71271479df7e/955f9aa468e0). Aneesh's
patch makes the symptoms go away. This patch addresses root cause.
Adding Andrew to Cc as I accidentally left him off originally, and he
may have
sparked additional comments.
--
Mike Kravetz
> Cc: [email protected]
> Reported-by: Jan Stancek <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 66 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ec49d9e..418bf01 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1826,11 +1826,17 @@ static void return_unused_surplus_pages(struct hstate *h,
> * is not the case is if a reserve map was changed between calls. It
> * is the responsibility of the caller to notice the difference and
> * take appropriate action.
> + *
> + * vma_add_reservation is used in error paths where a reservation must
> + * be restored when a newly allocated huge page must be freed. It is
> + * to be called after calling vma_needs_reservation to determine if a
> + * reservation exists.
> */
> enum vma_resv_mode {
> VMA_NEEDS_RESV,
> VMA_COMMIT_RESV,
> VMA_END_RESV,
> + VMA_ADD_RESV,
> };
> static long __vma_reservation_common(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr,
> @@ -1856,6 +1862,14 @@ static long __vma_reservation_common(struct hstate *h,
> region_abort(resv, idx, idx + 1);
> ret = 0;
> break;
> + case VMA_ADD_RESV:
> + if (vma->vm_flags & VM_MAYSHARE)
> + ret = region_add(resv, idx, idx + 1);
> + else {
> + region_abort(resv, idx, idx + 1);
> + ret = region_del(resv, idx, idx + 1);
> + }
> + break;
> default:
> BUG();
> }
> @@ -1903,6 +1917,56 @@ static void vma_end_reservation(struct hstate *h,
> (void)__vma_reservation_common(h, vma, addr, VMA_END_RESV);
> }
>
> +static long vma_add_reservation(struct hstate *h,
> + struct vm_area_struct *vma, unsigned long addr)
> +{
> + return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV);
> +}
> +
> +/*
> + * This routine is called to restore a reservation on error paths. In the
> + * specific error paths, a huge page was allocated (via alloc_huge_page)
> + * and is about to be freed. If a reservation for the page existed,
> + * alloc_huge_page would have consumed the reservation and set PagePrivate
> + * in the newly allocated page. When the page is freed via free_huge_page,
> + * the global reservation count will be incremented if PagePrivate is set.
> + * However, free_huge_page can not adjust the reserve map. Adjust the
> + * reserve map here to be consistent with global reserve count adjustments
> + * to be made by free_huge_page.
> + */
> +static void restore_reserve_on_error(struct hstate *h,
> + struct vm_area_struct *vma, unsigned long address,
> + struct page *page)
> +{
> + if (unlikely(PagePrivate(page))) {
> + long rc = vma_needs_reservation(h, vma, address);
> +
> + if (unlikely(rc < 0)) {
> + /*
> + * Rare out of memory condition in reserve map
> + * manipulation. Clear PagePrivate so that
> + * global reserve count will not be incremented
> + * by free_huge_page. This will make it appear
> + * as though the reservation for this page was
> + * consumed. This may prevent the task from
> + * faulting in the page at a later time. This
> + * is better than inconsistent global huge page
> + * accounting of reserve counts.
> + */
> + ClearPagePrivate(page);
> + } else if (rc) {
> + rc = vma_add_reservation(h, vma, address);
> + if (unlikely(rc < 0))
> + /*
> + * See above comment about rare out of
> + * memory condition.
> + */
> + ClearPagePrivate(page);
> + } else
> + vma_end_reservation(h, vma, address);
> + }
> +}
> +
> struct page *alloc_huge_page(struct vm_area_struct *vma,
> unsigned long addr, int avoid_reserve)
> {
> @@ -3498,6 +3562,7 @@ retry_avoidcopy:
> spin_unlock(ptl);
> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> out_release_all:
> + restore_reserve_on_error(h, vma, address, new_page);
> put_page(new_page);
> out_release_old:
> put_page(old_page);
> @@ -3680,6 +3745,7 @@ backout:
> spin_unlock(ptl);
> backout_unlocked:
> unlock_page(page);
> + restore_reserve_on_error(h, vma, address, page);
> put_page(page);
> goto out;
> }
>
On Wednesday, November 02, 2016 12:37 AM Mike Kravetz wrote:
> On 10/19/2016 08:11 PM, Mike Kravetz wrote:
> > Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
> > allocated huge page. If a reservation was associated with the huge
> > page, alloc_huge_page() consumed the reservation while allocating.
> > When the newly allocated page is freed in free_huge_page(), it will
> > increment the global reservation count. However, the reservation entry
> > in the reserve map will remain. This is not an issue for shared
> > mappings as the entry in the reserve map indicates a reservation exists.
> > But, an entry in a private mapping reserve map indicates the reservation
> > was consumed and no longer exists. This results in an inconsistency
> > between the reserve map and the global reservation count. This 'leaks'
> > a reserved huge page.
> >
> > Create a new routine restore_reserve_on_error() to restore the reserve
> > entry in these specific error paths. This routine makes use of a new
> > function vma_add_reservation() which will add a reserve entry for a
> > specific address/page.
> >
> > In general, these error paths were rarely (if ever) taken on most
> > architectures. However, powerpc contained arch specific code that
> > that resulted in an extra fault and execution of these error paths
> > on all private mappings.
> >
> > Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings)
> >
>
> Any additional comments on this?
>
> It does address a regression with private mappings that appears to only be
> visible on powerpc. Aneesh submitted a patch to workaround the issue on
> powerpc that is in mmotm/linux-next (71271479df7e/955f9aa468e0). Aneesh's
> patch makes the symptoms go away. This patch addresses root cause.
>
Both works are needed, thanks.
Acked-by: Hillf Danton <[email protected]>