2017-08-25 22:03:07

by Nadav Amit

[permalink] [raw]
Subject: Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree

Michal Hocko <[email protected]> wrote:

> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>
> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>> From: Eric Biggers <[email protected]>
>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>
>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>> put_page() before unlock_page(). This was wrong because put_page() can
>> free the page, e.g. if a concurrent madvise(..., MADV_DONTNEED) has
>> removed it from the memory mapping. put_page() then rightfully complained
>> about freeing a locked page.
>>
>> Fix this by moving the unlock_page() before put_page().

Quick grep shows that a similar flow (put_page() followed by an
unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
well?



2017-08-25 22:31:35

by Mike Kravetz

[permalink] [raw]
Subject: Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree

On 08/25/2017 03:02 PM, Nadav Amit wrote:
> Michal Hocko <[email protected]> wrote:
>
>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>
>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>> From: Eric Biggers <[email protected]>
>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>
>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>> put_page() before unlock_page(). This was wrong because put_page() can
>>> free the page, e.g. if a concurrent madvise(..., MADV_DONTNEED) has
>>> removed it from the memory mapping. put_page() then rightfully complained
>>> about freeing a locked page.
>>>
>>> Fix this by moving the unlock_page() before put_page().
>
> Quick grep shows that a similar flow (put_page() followed by an
> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
> well?

I assume you are asking about this block of code?

/*
* page_put due to reference from alloc_huge_page()
* unlock_page because locked by add_to_page_cache()
*/
put_page(page);
unlock_page(page);

Well, there is a typo (page_put) in the comment. :(

However, in this case we have just added the huge page to a hugetlbfs
file. The put_page() is there just to drop the reference count on the
page (taken when allocated). It will still be non-zero as we have
successfully added it to the page cache. So, we are not freeing the
page here, just dropping the reference count.

This should not cause a problem like that seen in madvise.
--
Mike Kravetz

2017-08-25 22:51:37

by Nadav Amit

[permalink] [raw]
Subject: Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree

Mike Kravetz <[email protected]> wrote:

> On 08/25/2017 03:02 PM, Nadav Amit wrote:
>> Michal Hocko <[email protected]> wrote:
>>
>>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>>
>>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>>> From: Eric Biggers <[email protected]>
>>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>>
>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>>> put_page() before unlock_page(). This was wrong because put_page() can
>>>> free the page, e.g. if a concurrent madvise(..., MADV_DONTNEED) has
>>>> removed it from the memory mapping. put_page() then rightfully complained
>>>> about freeing a locked page.
>>>>
>>>> Fix this by moving the unlock_page() before put_page().
>>
>> Quick grep shows that a similar flow (put_page() followed by an
>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
>> well?
>
> I assume you are asking about this block of code?

Yes.

>
> /*
> * page_put due to reference from alloc_huge_page()
> * unlock_page because locked by add_to_page_cache()
> */
> put_page(page);
> unlock_page(page);
>
> Well, there is a typo (page_put) in the comment. :(
>
> However, in this case we have just added the huge page to a hugetlbfs
> file. The put_page() is there just to drop the reference count on the
> page (taken when allocated). It will still be non-zero as we have
> successfully added it to the page cache. So, we are not freeing the
> page here, just dropping the reference count.
>
> This should not cause a problem like that seen in madvise.

Thanks for the quick response.

I am not too familiar with this piece of code, so just for the matter of
understanding: what prevents the page from being removed from the page cache
shortly after it is added (even if it is highly unlikely)? The page lock? The
inode lock?

Thanks again,
Nadav


2017-08-25 23:41:52

by Mike Kravetz

[permalink] [raw]
Subject: Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree

On 08/25/2017 03:51 PM, Nadav Amit wrote:
> Mike Kravetz <[email protected]> wrote:
>
>> On 08/25/2017 03:02 PM, Nadav Amit wrote:
>>> Michal Hocko <[email protected]> wrote:
>>>
>>>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>>>
>>>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>>>> From: Eric Biggers <[email protected]>
>>>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>>>
>>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>>>> put_page() before unlock_page(). This was wrong because put_page() can
>>>>> free the page, e.g. if a concurrent madvise(..., MADV_DONTNEED) has
>>>>> removed it from the memory mapping. put_page() then rightfully complained
>>>>> about freeing a locked page.
>>>>>
>>>>> Fix this by moving the unlock_page() before put_page().
>>>
>>> Quick grep shows that a similar flow (put_page() followed by an
>>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
>>> well?
>>
>> I assume you are asking about this block of code?
>
> Yes.
>
>>
>> /*
>> * page_put due to reference from alloc_huge_page()
>> * unlock_page because locked by add_to_page_cache()
>> */
>> put_page(page);
>> unlock_page(page);
>>
>> Well, there is a typo (page_put) in the comment. :(
>>
>> However, in this case we have just added the huge page to a hugetlbfs
>> file. The put_page() is there just to drop the reference count on the
>> page (taken when allocated). It will still be non-zero as we have
>> successfully added it to the page cache. So, we are not freeing the
>> page here, just dropping the reference count.
>>
>> This should not cause a problem like that seen in madvise.
>
> Thanks for the quick response.
>
> I am not too familiar with this piece of code, so just for the matter of
> understanding: what prevents the page from being removed from the page cache
> shortly after it is added (even if it is highly unlikely)? The page lock? The
> inode lock?

Someone would need to acquire the inode lock to remove the page. This
is held until we exit the routine. Also note that put_page for this
type of huge page almost always results in the page being put back
on a free list within the hugetlb(fs) subsystem. It is not returned
to the 'normal' memory allocators for general use.

--
Mike Kravetz

2017-08-26 21:09:12

by Eric Biggers

[permalink] [raw]
Subject: Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree

On Fri, Aug 25, 2017 at 04:41:36PM -0700, Mike Kravetz wrote:
> >>>>>
> >>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
> >>>>> put_page() before unlock_page(). This was wrong because put_page() can
> >>>>> free the page, e.g. if a concurrent madvise(..., MADV_DONTNEED) has
> >>>>> removed it from the memory mapping. put_page() then rightfully complained
> >>>>> about freeing a locked page.
> >>>>>
> >>>>> Fix this by moving the unlock_page() before put_page().
> >>>
> >>> Quick grep shows that a similar flow (put_page() followed by an
> >>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
> >>> well?
> >>
> >> I assume you are asking about this block of code?
> >
> > Yes.
> >
> >>
> >> /*
> >> * page_put due to reference from alloc_huge_page()
> >> * unlock_page because locked by add_to_page_cache()
> >> */
> >> put_page(page);
> >> unlock_page(page);
> >>
> >> Well, there is a typo (page_put) in the comment. :(
> >>
> >> However, in this case we have just added the huge page to a hugetlbfs
> >> file. The put_page() is there just to drop the reference count on the
> >> page (taken when allocated). It will still be non-zero as we have
> >> successfully added it to the page cache. So, we are not freeing the
> >> page here, just dropping the reference count.
> >>
> >> This should not cause a problem like that seen in madvise.
> >
> > Thanks for the quick response.
> >
> > I am not too familiar with this piece of code, so just for the matter of
> > understanding: what prevents the page from being removed from the page cache
> > shortly after it is added (even if it is highly unlikely)? The page lock? The
> > inode lock?
>
> Someone would need to acquire the inode lock to remove the page. This
> is held until we exit the routine. Also note that put_page for this
> type of huge page almost always results in the page being put back
> on a free list within the hugetlb(fs) subsystem. It is not returned
> to the 'normal' memory allocators for general use.
>

I'm not sure about that. What about sys_fadvise64(..., POSIX_FADV_DONTNEED)?
That removes pages from the page cache without taking the inode lock. It won't
remove locked pages though, so presumably it is only the page lock that prevents
the race with hugetlbfs_fallocate().

But in any case, why not do the "obviously correct" thing --- unlock before put?

Eric

2017-08-27 02:21:20

by Nadav Amit

[permalink] [raw]
Subject: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()

hugetlfs_fallocate() currently performs put_page() before unlock_page().
This scenario opens a small time window, from the time the page is added
to the page cache, until it is unlocked, in which the page might be
removed from the page-cache by another core. If the page is removed
during this time windows, it might cause a memory corruption, as the
wrong page will be unlocked.

It is arguable whether this scenario can happen in a real system, and
there are several mitigating factors. The issue was found by code
inspection (actually grep), and not by actually triggering the flow.
Yet, since putting the page before unlocking is incorrect it should be
fixed, if only to prevent future breakage or someone copy-pasting this
code.

Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")

cc: Eric Biggers <[email protected]>
cc: Mike Kravetz <[email protected]>

Signed-off-by: Nadav Amit <[email protected]>
---
fs/hugetlbfs/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 28d2753be094..9475fee79cee 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
mutex_unlock(&hugetlb_fault_mutex_table[hash]);

/*
- * page_put due to reference from alloc_huge_page()
* unlock_page because locked by add_to_page_cache()
+ * page_put due to reference from alloc_huge_page()
*/
- put_page(page);
unlock_page(page);
+ put_page(page);
}

if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
--
2.11.0

2017-08-27 17:15:42

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()

On 08/26/2017 12:11 PM, Nadav Amit wrote:
> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> This scenario opens a small time window, from the time the page is added
> to the page cache, until it is unlocked, in which the page might be
> removed from the page-cache by another core. If the page is removed
> during this time windows, it might cause a memory corruption, as the
> wrong page will be unlocked.
>
> It is arguable whether this scenario can happen in a real system, and
> there are several mitigating factors. The issue was found by code
> inspection (actually grep), and not by actually triggering the flow.
> Yet, since putting the page before unlocking is incorrect it should be
> fixed, if only to prevent future breakage or someone copy-pasting this
> code.
>
> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>
> cc: Eric Biggers <[email protected]>
> cc: Mike Kravetz <[email protected]>
>
> Signed-off-by: Nadav Amit <[email protected]>

Thank you Nadav.

Reviewed-by: Mike Kravetz <[email protected]>

Since hugetlbfs is an in memory filesystem, the only way one 'should' be
able to remove a page (file content) is through an inode operation such as
truncate, hole punch, or unlink. That was the basis for my response that
the inode lock would be required for page freeing.

Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
I was expecting to see a check for hugetlbfs pages and exit (without
modification) if encountered. A quick review of the code did not find
any such checks.

I'll take a closer look to determine exactly how hugetlbfs files are
handled. IMO, there should be something similar to the DAX check where
the routine quickly exits.
--
Mike Kravetz

> ---
> fs/hugetlbfs/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 28d2753be094..9475fee79cee 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>
> /*
> - * page_put due to reference from alloc_huge_page()
> * unlock_page because locked by add_to_page_cache()
> + * page_put due to reference from alloc_huge_page()
> */
> - put_page(page);
> unlock_page(page);
> + put_page(page);
> }
>
> if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
>

2017-08-27 20:09:05

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()

Mike Kravetz <[email protected]> wrote:

> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>> This scenario opens a small time window, from the time the page is added
>> to the page cache, until it is unlocked, in which the page might be
>> removed from the page-cache by another core. If the page is removed
>> during this time windows, it might cause a memory corruption, as the
>> wrong page will be unlocked.
>>
>> It is arguable whether this scenario can happen in a real system, and
>> there are several mitigating factors. The issue was found by code
>> inspection (actually grep), and not by actually triggering the flow.
>> Yet, since putting the page before unlocking is incorrect it should be
>> fixed, if only to prevent future breakage or someone copy-pasting this
>> code.
>>
>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>
>> cc: Eric Biggers <[email protected]>
>> cc: Mike Kravetz <[email protected]>
>>
>> Signed-off-by: Nadav Amit <[email protected]>
>
> Thank you Nadav.

No problem.

>
> Reviewed-by: Mike Kravetz <[email protected]>
>
> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
> able to remove a page (file content) is through an inode operation such as
> truncate, hole punch, or unlink. That was the basis for my response that
> the inode lock would be required for page freeing.
>
> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
> I was expecting to see a check for hugetlbfs pages and exit (without
> modification) if encountered. A quick review of the code did not find
> any such checks.
>
> I'll take a closer look to determine exactly how hugetlbfs files are
> handled. IMO, there should be something similar to the DAX check where
> the routine quickly exits.

I did not cc stable when submitting the patch, based on your previous
response. Let me know if you want me to send v2 which does so.

Thanks,
Nadav


2017-08-28 13:46:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()

[CC Andrew]

On Sat 26-08-17 12:11:24, Nadav Amit wrote:
> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> This scenario opens a small time window, from the time the page is added
> to the page cache, until it is unlocked, in which the page might be
> removed from the page-cache by another core. If the page is removed
> during this time windows, it might cause a memory corruption, as the
> wrong page will be unlocked.
>
> It is arguable whether this scenario can happen in a real system, and
> there are several mitigating factors. The issue was found by code
> inspection (actually grep), and not by actually triggering the flow.
> Yet, since putting the page before unlocking is incorrect it should be
> fixed, if only to prevent future breakage or someone copy-pasting this
> code.

Even if this a theoretical problem it is definitely worth fixing because
it is a anti-pattern of how the reference counted object should be treated.

> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>
> cc: Eric Biggers <[email protected]>
> cc: Mike Kravetz <[email protected]>
>
> Signed-off-by: Nadav Amit <[email protected]>

Acked-by: Michal Hocko <[email protected]>

Thanks!

> ---
> fs/hugetlbfs/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 28d2753be094..9475fee79cee 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>
> /*
> - * page_put due to reference from alloc_huge_page()
> * unlock_page because locked by add_to_page_cache()
> + * page_put due to reference from alloc_huge_page()
> */
> - put_page(page);
> unlock_page(page);
> + put_page(page);
> }
>
> if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> --
> 2.11.0

--
Michal Hocko
SUSE Labs

2017-08-28 17:46:09

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()

Adding Andrew, Michal on CC

On 08/27/2017 01:08 PM, Nadav Amit wrote:
> Mike Kravetz <[email protected]> wrote:
>
>> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>>> This scenario opens a small time window, from the time the page is added
>>> to the page cache, until it is unlocked, in which the page might be
>>> removed from the page-cache by another core. If the page is removed
>>> during this time windows, it might cause a memory corruption, as the
>>> wrong page will be unlocked.
>>>
>>> It is arguable whether this scenario can happen in a real system, and
>>> there are several mitigating factors. The issue was found by code
>>> inspection (actually grep), and not by actually triggering the flow.
>>> Yet, since putting the page before unlocking is incorrect it should be
>>> fixed, if only to prevent future breakage or someone copy-pasting this
>>> code.
>>>
>>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>>
>>> cc: Eric Biggers <[email protected]>
>>> cc: Mike Kravetz <[email protected]>
>>>
>>> Signed-off-by: Nadav Amit <[email protected]>
>>
>> Thank you Nadav.
>
> No problem.
>
>>
>> Reviewed-by: Mike Kravetz <[email protected]>
>>
>> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
>> able to remove a page (file content) is through an inode operation such as
>> truncate, hole punch, or unlink. That was the basis for my response that
>> the inode lock would be required for page freeing.
>>
>> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
>> I was expecting to see a check for hugetlbfs pages and exit (without
>> modification) if encountered. A quick review of the code did not find
>> any such checks.
>>
>> I'll take a closer look to determine exactly how hugetlbfs files are
>> handled. IMO, there should be something similar to the DAX check where
>> the routine quickly exits.
>
> I did not cc stable when submitting the patch, based on your previous
> response. Let me know if you want me to send v2 which does so.

I still do not believe there is a need to change this in stable. Your patch
should be sufficient to ensure we do the right thing going forward.

Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
hugetlbfs does indeed show a more general problem. One can use
sys_fadvise64() to remove a huge page from a hugetlbfs file. :( This does
not go through the special hugetlbfs page handling code, but rather the
normal mm paths. As a result hugetlbfs accounting (like reserve counts)
gets out of sync and the hugetlbfs filesystem may become unusable. Sigh!!!

I will address this issue in a separate patch.
--
Mike Kravetz

2017-08-28 18:09:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()

On Mon 28-08-17 10:45:58, Mike Kravetz wrote:
> Adding Andrew, Michal on CC
>
> On 08/27/2017 01:08 PM, Nadav Amit wrote:
> > Mike Kravetz <[email protected]> wrote:
> >
> >> On 08/26/2017 12:11 PM, Nadav Amit wrote:
> >>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> >>> This scenario opens a small time window, from the time the page is added
> >>> to the page cache, until it is unlocked, in which the page might be
> >>> removed from the page-cache by another core. If the page is removed
> >>> during this time windows, it might cause a memory corruption, as the
> >>> wrong page will be unlocked.
> >>>
> >>> It is arguable whether this scenario can happen in a real system, and
> >>> there are several mitigating factors. The issue was found by code
> >>> inspection (actually grep), and not by actually triggering the flow.
> >>> Yet, since putting the page before unlocking is incorrect it should be
> >>> fixed, if only to prevent future breakage or someone copy-pasting this
> >>> code.
> >>>
> >>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
> >>>
> >>> cc: Eric Biggers <[email protected]>
> >>> cc: Mike Kravetz <[email protected]>
> >>>
> >>> Signed-off-by: Nadav Amit <[email protected]>
> >>
> >> Thank you Nadav.
> >
> > No problem.
> >
> >>
> >> Reviewed-by: Mike Kravetz <[email protected]>
> >>
> >> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
> >> able to remove a page (file content) is through an inode operation such as
> >> truncate, hole punch, or unlink. That was the basis for my response that
> >> the inode lock would be required for page freeing.
> >>
> >> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
> >> I was expecting to see a check for hugetlbfs pages and exit (without
> >> modification) if encountered. A quick review of the code did not find
> >> any such checks.
> >>
> >> I'll take a closer look to determine exactly how hugetlbfs files are
> >> handled. IMO, there should be something similar to the DAX check where
> >> the routine quickly exits.
> >
> > I did not cc stable when submitting the patch, based on your previous
> > response. Let me know if you want me to send v2 which does so.
>
> I still do not believe there is a need to change this in stable. Your patch
> should be sufficient to ensure we do the right thing going forward.
>
> Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
> hugetlbfs does indeed show a more general problem. One can use
> sys_fadvise64() to remove a huge page from a hugetlbfs file. :( This does
> not go through the special hugetlbfs page handling code, but rather the
> normal mm paths. As a result hugetlbfs accounting (like reserve counts)
> gets out of sync and the hugetlbfs filesystem may become unusable. Sigh!!!
>
> I will address this issue in a separate patch.

I didn't check very carefully but it seems that
http://ozlabs.org/~akpm/mmotm/broken-out/mm-fadvise-avoid-fadvise-for-fs-without-backing-device.patch
should help here, right?

--
Michal Hocko
SUSE Labs

2017-08-28 18:52:07

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()

On 08/28/2017 11:09 AM, Michal Hocko wrote:
> On Mon 28-08-17 10:45:58, Mike Kravetz wrote:
>> Adding Andrew, Michal on CC
>>
>> On 08/27/2017 01:08 PM, Nadav Amit wrote:
>>> Mike Kravetz <[email protected]> wrote:
>>>
>>>> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>>>>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>>>>> This scenario opens a small time window, from the time the page is added
>>>>> to the page cache, until it is unlocked, in which the page might be
>>>>> removed from the page-cache by another core. If the page is removed
>>>>> during this time windows, it might cause a memory corruption, as the
>>>>> wrong page will be unlocked.
>>>>>
>>>>> It is arguable whether this scenario can happen in a real system, and
>>>>> there are several mitigating factors. The issue was found by code
>>>>> inspection (actually grep), and not by actually triggering the flow.
>>>>> Yet, since putting the page before unlocking is incorrect it should be
>>>>> fixed, if only to prevent future breakage or someone copy-pasting this
>>>>> code.
>>>>>
>>>>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>>>>
>>>>> cc: Eric Biggers <[email protected]>
>>>>> cc: Mike Kravetz <[email protected]>
>>>>>
>>>>> Signed-off-by: Nadav Amit <[email protected]>
>>>>
>>>> Thank you Nadav.
>>>
>>> No problem.
>>>
>>>>
>>>> Reviewed-by: Mike Kravetz <[email protected]>
>>>>
>>>> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
>>>> able to remove a page (file content) is through an inode operation such as
>>>> truncate, hole punch, or unlink. That was the basis for my response that
>>>> the inode lock would be required for page freeing.
>>>>
>>>> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
>>>> I was expecting to see a check for hugetlbfs pages and exit (without
>>>> modification) if encountered. A quick review of the code did not find
>>>> any such checks.
>>>>
>>>> I'll take a closer look to determine exactly how hugetlbfs files are
>>>> handled. IMO, there should be something similar to the DAX check where
>>>> the routine quickly exits.
>>>
>>> I did not cc stable when submitting the patch, based on your previous
>>> response. Let me know if you want me to send v2 which does so.
>>
>> I still do not believe there is a need to change this in stable. Your patch
>> should be sufficient to ensure we do the right thing going forward.
>>
>> Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
>> hugetlbfs does indeed show a more general problem. One can use
>> sys_fadvise64() to remove a huge page from a hugetlbfs file. :( This does
>> not go through the special hugetlbfs page handling code, but rather the
>> normal mm paths. As a result hugetlbfs accounting (like reserve counts)
>> gets out of sync and the hugetlbfs filesystem may become unusable. Sigh!!!
>>
>> I will address this issue in a separate patch.
>
> I didn't check very carefully but it seems that
> http://ozlabs.org/~akpm/mmotm/broken-out/mm-fadvise-avoid-fadvise-for-fs-without-backing-device.patch
> should help here, right?

Thanks Michal.

Yes, that patch addresses the above issue with hugetlbfs. I was also
wondering if there were similar issues with other in memory filesystems.
Looks like there are.

--
Mike Kravetz

2017-11-29 03:23:44

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()

[CC Andrew, Michal]

On 11/28/2017 06:37 PM, Eric Biggers wrote:
> On Sat, Aug 26, 2017 at 12:11:24PM -0700, Nadav Amit wrote:
>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>> This scenario opens a small time window, from the time the page is added
>> to the page cache, until it is unlocked, in which the page might be
>> removed from the page-cache by another core. If the page is removed
>> during this time windows, it might cause a memory corruption, as the
>> wrong page will be unlocked.
>>
>> It is arguable whether this scenario can happen in a real system, and
>> there are several mitigating factors. The issue was found by code
>> inspection (actually grep), and not by actually triggering the flow.
>> Yet, since putting the page before unlocking is incorrect it should be
>> fixed, if only to prevent future breakage or someone copy-pasting this
>> code.
>>
>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>
>> cc: Eric Biggers <[email protected]>
>> cc: Mike Kravetz <[email protected]>
>>
>> Signed-off-by: Nadav Amit <[email protected]>
>> ---
>> fs/hugetlbfs/inode.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 28d2753be094..9475fee79cee 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>>
>> /*
>> - * page_put due to reference from alloc_huge_page()
>> * unlock_page because locked by add_to_page_cache()
>> + * page_put due to reference from alloc_huge_page()
>> */
>> - put_page(page);
>> unlock_page(page);
>> + put_page(page);
>> }
>>
>> if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
>> --
>
> This patch wasn't ever applied. Nadia, do you take patches for hugetlbfs, or
> does this need to go through Andrew Morton?
>
> Eric

Nadia has not been active for some time on hugetlbfs, so best to go through
Andrew. Added Andrew and Michal on CC.

This patch has a:
Reviewed-by: Mike Kravetz <[email protected]>
Acked-by: Michal Hocko <[email protected]>

I am still of the opinion that this does not need to be sent to stable.
Although the ordering is current code is incorrect, there is no way for
this to be a problem with current locking. In addition, I verified
that the perhaps bigger issue with sys_fadvise64(POSIX_FADV_DONTNEED)
for hugetlbfs and other filesystems is addressed in commit 3a77d214807c.
--
Mike Kravetz

From 1585366376880266776@xxx Wed Nov 29 02:39:55 +0000 2017
X-GM-THRID: 1576849154661183388
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread

2017-11-29 02:39:55

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()

On Sat, Aug 26, 2017 at 12:11:24PM -0700, Nadav Amit wrote:
> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> This scenario opens a small time window, from the time the page is added
> to the page cache, until it is unlocked, in which the page might be
> removed from the page-cache by another core. If the page is removed
> during this time windows, it might cause a memory corruption, as the
> wrong page will be unlocked.
>
> It is arguable whether this scenario can happen in a real system, and
> there are several mitigating factors. The issue was found by code
> inspection (actually grep), and not by actually triggering the flow.
> Yet, since putting the page before unlocking is incorrect it should be
> fixed, if only to prevent future breakage or someone copy-pasting this
> code.
>
> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>
> cc: Eric Biggers <[email protected]>
> cc: Mike Kravetz <[email protected]>
>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 28d2753be094..9475fee79cee 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>
> /*
> - * page_put due to reference from alloc_huge_page()
> * unlock_page because locked by add_to_page_cache()
> + * page_put due to reference from alloc_huge_page()
> */
> - put_page(page);
> unlock_page(page);
> + put_page(page);
> }
>
> if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> --

This patch wasn't ever applied. Nadia, do you take patches for hugetlbfs, or
does this need to go through Andrew Morton?

Eric

From 1577002066639586664@xxx Mon Aug 28 18:52:47 +0000 2017
X-GM-THRID: 1576849154661183388
X-Gmail-Labels: Inbox,Category Forums