2019-11-23 01:07:15

by Yang Shi

[permalink] [raw]
Subject: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially

Currently when truncating shmem file, if the range is partial of THP
(start or end is in the middle of THP), the pages actually will just get
cleared rather than being freed unless the range cover the whole THP.
Even though all the subpages are truncated (randomly or sequentially),
the THP may still be kept in page cache. This might be fine for some
usecases which prefer preserving THP.

But, when doing balloon inflation in QEMU, QEMU actually does hole punch
or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
So, when using shmem THP as memory backend QEMU inflation actually doesn't
work as expected since it doesn't free memory. But, the inflation
usecase really needs get the memory freed. Anonymous THP will not get
freed right away too but it will be freed eventually when all subpages are
unmapped, but shmem THP would still stay in page cache.

To protect the usecases which may prefer preserving THP, introduce a
new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP is
preferred behavior if truncating partial THP. This mode just makes
sense to tmpfs for the time being.

Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
drivers/gpu/drm/drm_gem_shmem_helper.c | 2 +-
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 3 +-
include/linux/shmem_fs.h | 3 +-
include/uapi/linux/falloc.h | 7 +++
mm/shmem.c | 99 +++++++++++++++++++++++++++----
5 files changed, 99 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
index f591870..d44780e 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -408,7 +408,7 @@ void drm_gem_shmem_purge_locked(struct drm_gem_object *obj)
* To do this we must instruct the shmfs to drop all of its
* backing pages, *now*.
*/
- shmem_truncate_range(file_inode(obj->filp), 0, (loff_t)-1);
+ shmem_truncate_range(file_inode(obj->filp), 0, (loff_t)-1, false);

invalidate_mapping_pages(file_inode(obj->filp)->i_mapping,
0, (loff_t)-1);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 4c4954e..cdee286 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -222,7 +222,8 @@ static int shmem_get_pages(struct drm_i915_gem_object *obj)
* To do this we must instruct the shmfs to drop all of its
* backing pages, *now*.
*/
- shmem_truncate_range(file_inode(obj->base.filp), 0, (loff_t)-1);
+ shmem_truncate_range(file_inode(obj->base.filp), 0, (loff_t)-1,
+ false);
obj->mm.madv = __I915_MADV_PURGED;
obj->mm.pages = ERR_PTR(-EFAULT);
}
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index de8e4b7..42c6420 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -73,7 +73,8 @@ static inline bool shmem_mapping(struct address_space *mapping)
extern void shmem_unlock_mapping(struct address_space *mapping);
extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
-extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
+extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end,
+ bool split);
extern int shmem_unuse(unsigned int type, bool frontswap,
unsigned long *fs_pages_to_unuse);

diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 51398fa..26fd272 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -77,4 +77,11 @@
*/
#define FALLOC_FL_UNSHARE_RANGE 0x40

+/*
+ * FALLOC_FL_SPLIT_HPAGE is used with FALLOC_FL_PUNCH_HOLE together to
+ * split huge page if the hole punch range is start or end in the middle
+ * of THP. So far it only makes sense with tmpfs.
+ */
+#define FALLOC_FL_SPLIT_HPAGE 0x80
+
#endif /* _UAPI_FALLOC_H_ */
diff --git a/mm/shmem.c b/mm/shmem.c
index 220be9f..66e2a82 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -793,7 +793,7 @@ void shmem_unlock_mapping(struct address_space *mapping)
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
*/
static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
- bool unfalloc)
+ bool unfalloc, bool split)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
@@ -806,12 +806,14 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
long nr_swaps_freed = 0;
pgoff_t index;
int i;
+ struct page *page = NULL;

if (lend == -1)
end = -1; /* unsigned, so actually very big */

pagevec_init(&pvec);
index = start;
+retry:
while (index < end) {
pvec.nr = find_get_entries(mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE),
@@ -819,7 +821,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
if (!pvec.nr)
break;
for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
+ page = pvec.pages[i];

index = indices[i];
if (index >= end)
@@ -839,9 +841,16 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
continue;

if (PageTransTail(page)) {
- /* Middle of THP: zero out the page */
+ /*
+ * Middle of THP: zero out the page. We
+ * still need clear the page even though
+ * the THP is going to be split since the
+ * split may fail.
+ */
clear_highpage(page);
unlock_page(page);
+ if (!unfalloc && split)
+ goto split;
continue;
} else if (PageTransHuge(page)) {
if (index == round_down(end, HPAGE_PMD_NR)) {
@@ -851,6 +860,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
*/
clear_highpage(page);
unlock_page(page);
+ if (!unfalloc && split)
+ goto split;
continue;
}
index += HPAGE_PMD_NR - 1;
@@ -866,9 +877,34 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}
unlock_page(page);
}
+split:
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
cond_resched();
+
+ if (split && PageTransCompound(page)) {
+ /* The THP may get freed under us */
+ if (!get_page_unless_zero(compound_head(page)))
+ goto out;
+
+ if (!trylock_page(page))
+ goto out_put;
+
+ /*
+ * The extra pins from page cache lookup have been
+ * released by pagevec_release().
+ */
+ if (!split_huge_page(page)) {
+ unlock_page(page);
+ put_page(page);
+ /* Re-look up page cache from current index */
+ goto retry;
+ }
+ unlock_page(page);
+out_put:
+ put_page(page);
+ }
+out:
index++;
}

@@ -901,6 +937,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
return;

index = start;
+again:
while (index < end) {
cond_resched();

@@ -937,7 +974,12 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
lock_page(page);

if (PageTransTail(page)) {
- /* Middle of THP: zero out the page */
+ /*
+ * Middle of THP: zero out the page. We
+ * still need clear the page even though the
+ * THP is going to be split since the split
+ * may fail.
+ */
clear_highpage(page);
unlock_page(page);
/*
@@ -947,6 +989,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
*/
if (index != round_down(end, HPAGE_PMD_NR))
start++;
+ if (!unfalloc && split)
+ goto rescan_split;
continue;
} else if (PageTransHuge(page)) {
if (index == round_down(end, HPAGE_PMD_NR)) {
@@ -956,6 +1000,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
*/
clear_highpage(page);
unlock_page(page);
+ if (!unfalloc && split)
+ goto rescan_split;
continue;
}
index += HPAGE_PMD_NR - 1;
@@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}
unlock_page(page);
}
+rescan_split:
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
+
+ if (split && PageTransCompound(page)) {
+ /* The THP may get freed under us */
+ if (!get_page_unless_zero(compound_head(page)))
+ goto rescan_out;
+
+ lock_page(page);
+
+ /*
+ * The extra pins from page cache lookup have been
+ * released by pagevec_release().
+ */
+ if (!split_huge_page(page)) {
+ unlock_page(page);
+ put_page(page);
+ /* Re-look up page cache from current index */
+ goto again;
+ }
+ unlock_page(page);
+ put_page(page);
+ }
+rescan_out:
index++;
}

@@ -987,9 +1056,10 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
spin_unlock_irq(&info->lock);
}

-void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
+void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend,
+ bool split)
{
- shmem_undo_range(inode, lstart, lend, false);
+ shmem_undo_range(inode, lstart, lend, false, split);
inode->i_ctime = inode->i_mtime = current_time(inode);
}
EXPORT_SYMBOL_GPL(shmem_truncate_range);
@@ -1049,7 +1119,8 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
holebegin, 0, 1);
if (info->alloced)
shmem_truncate_range(inode,
- newsize, (loff_t)-1);
+ newsize, (loff_t)-1,
+ false);
/* unmap again to remove racily COWed private pages */
if (oldsize > holebegin)
unmap_mapping_range(inode->i_mapping,
@@ -1089,7 +1160,7 @@ static void shmem_evict_inode(struct inode *inode)
if (inode->i_mapping->a_ops == &shmem_aops) {
shmem_unacct_size(info->flags, inode->i_size);
inode->i_size = 0;
- shmem_truncate_range(inode, 0, (loff_t)-1);
+ shmem_truncate_range(inode, 0, (loff_t)-1, false);
if (!list_empty(&info->shrinklist)) {
spin_lock(&sbinfo->shrinklist_lock);
if (!list_empty(&info->shrinklist)) {
@@ -2724,12 +2795,14 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
pgoff_t start, index, end;
int error;

- if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+ FALLOC_FL_SPLIT_HPAGE))
return -EOPNOTSUPP;

inode_lock(inode);

if (mode & FALLOC_FL_PUNCH_HOLE) {
+ bool split = mode & FALLOC_FL_SPLIT_HPAGE;
struct address_space *mapping = file->f_mapping;
loff_t unmap_start = round_up(offset, PAGE_SIZE);
loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
@@ -2751,7 +2824,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
if ((u64)unmap_end > (u64)unmap_start)
unmap_mapping_range(mapping, unmap_start,
1 + unmap_end - unmap_start, 0);
- shmem_truncate_range(inode, offset, offset + len - 1);
+ shmem_truncate_range(inode, offset, offset + len - 1, split);
/* No need to unmap again: hole-punching leaves COWed pages */

spin_lock(&inode->i_lock);
@@ -2808,7 +2881,8 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
if (index > start) {
shmem_undo_range(inode,
(loff_t)start << PAGE_SHIFT,
- ((loff_t)index << PAGE_SHIFT) - 1, true);
+ ((loff_t)index << PAGE_SHIFT) - 1, true,
+ false);
}
goto undone;
}
@@ -4068,7 +4142,8 @@ unsigned long shmem_get_unmapped_area(struct file *file,
}
#endif

-void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
+void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend,
+ bool split)
{
truncate_inode_pages_range(inode->i_mapping, lstart, lend);
}
--
1.8.3.1


2019-11-25 09:41:31

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially

On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
> Currently when truncating shmem file, if the range is partial of THP
> (start or end is in the middle of THP), the pages actually will just get
> cleared rather than being freed unless the range cover the whole THP.
> Even though all the subpages are truncated (randomly or sequentially),
> the THP may still be kept in page cache. This might be fine for some
> usecases which prefer preserving THP.
>
> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
> So, when using shmem THP as memory backend QEMU inflation actually doesn't
> work as expected since it doesn't free memory. But, the inflation
> usecase really needs get the memory freed. Anonymous THP will not get
> freed right away too but it will be freed eventually when all subpages are
> unmapped, but shmem THP would still stay in page cache.
>
> To protect the usecases which may prefer preserving THP, introduce a
> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP is
> preferred behavior if truncating partial THP. This mode just makes
> sense to tmpfs for the time being.

We need to clarify interaction with khugepaged. This implementation
doesn't do anything to prevent khugepaged from collapsing the range back
to THP just after the split.

> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> }
> unlock_page(page);
> }
> +rescan_split:
> pagevec_remove_exceptionals(&pvec);
> pagevec_release(&pvec);
> +
> + if (split && PageTransCompound(page)) {
> + /* The THP may get freed under us */
> + if (!get_page_unless_zero(compound_head(page)))
> + goto rescan_out;
> +
> + lock_page(page);
> +
> + /*
> + * The extra pins from page cache lookup have been
> + * released by pagevec_release().
> + */
> + if (!split_huge_page(page)) {
> + unlock_page(page);
> + put_page(page);
> + /* Re-look up page cache from current index */
> + goto again;
> + }
> + unlock_page(page);
> + put_page(page);
> + }
> +rescan_out:
> index++;
> }

Doing get_page_unless_zero() just after you've dropped the pin for the
page looks very suboptimal.

--
Kirill A. Shutemov

2019-11-25 18:56:03

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially



On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>> Currently when truncating shmem file, if the range is partial of THP
>> (start or end is in the middle of THP), the pages actually will just get
>> cleared rather than being freed unless the range cover the whole THP.
>> Even though all the subpages are truncated (randomly or sequentially),
>> the THP may still be kept in page cache. This might be fine for some
>> usecases which prefer preserving THP.
>>
>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>> work as expected since it doesn't free memory. But, the inflation
>> usecase really needs get the memory freed. Anonymous THP will not get
>> freed right away too but it will be freed eventually when all subpages are
>> unmapped, but shmem THP would still stay in page cache.
>>
>> To protect the usecases which may prefer preserving THP, introduce a
>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP is
>> preferred behavior if truncating partial THP. This mode just makes
>> sense to tmpfs for the time being.
> We need to clarify interaction with khugepaged. This implementation
> doesn't do anything to prevent khugepaged from collapsing the range back
> to THP just after the split.

Yes, it doesn't. Will clarify this in the commit log.

>
>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>> }
>> unlock_page(page);
>> }
>> +rescan_split:
>> pagevec_remove_exceptionals(&pvec);
>> pagevec_release(&pvec);
>> +
>> + if (split && PageTransCompound(page)) {
>> + /* The THP may get freed under us */
>> + if (!get_page_unless_zero(compound_head(page)))
>> + goto rescan_out;
>> +
>> + lock_page(page);
>> +
>> + /*
>> + * The extra pins from page cache lookup have been
>> + * released by pagevec_release().
>> + */
>> + if (!split_huge_page(page)) {
>> + unlock_page(page);
>> + put_page(page);
>> + /* Re-look up page cache from current index */
>> + goto again;
>> + }
>> + unlock_page(page);
>> + put_page(page);
>> + }
>> +rescan_out:
>> index++;
>> }
> Doing get_page_unless_zero() just after you've dropped the pin for the
> page looks very suboptimal.

If I don't drop the pins the THP can't be split. And, there might be
more than one pins from find_get_entries() if I read the code correctly.
For example, truncate 8K length in the middle of THP, the THP's refcount
would get bumpped twice since  two sub pages would be returned.

>

2019-11-25 18:56:42

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially

On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>
>
> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
> > On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
> > > Currently when truncating shmem file, if the range is partial of THP
> > > (start or end is in the middle of THP), the pages actually will just get
> > > cleared rather than being freed unless the range cover the whole THP.
> > > Even though all the subpages are truncated (randomly or sequentially),
> > > the THP may still be kept in page cache. This might be fine for some
> > > usecases which prefer preserving THP.
> > >
> > > But, when doing balloon inflation in QEMU, QEMU actually does hole punch
> > > or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
> > > So, when using shmem THP as memory backend QEMU inflation actually doesn't
> > > work as expected since it doesn't free memory. But, the inflation
> > > usecase really needs get the memory freed. Anonymous THP will not get
> > > freed right away too but it will be freed eventually when all subpages are
> > > unmapped, but shmem THP would still stay in page cache.
> > >
> > > To protect the usecases which may prefer preserving THP, introduce a
> > > new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP is
> > > preferred behavior if truncating partial THP. This mode just makes
> > > sense to tmpfs for the time being.
> > We need to clarify interaction with khugepaged. This implementation
> > doesn't do anything to prevent khugepaged from collapsing the range back
> > to THP just after the split.
>
> Yes, it doesn't. Will clarify this in the commit log.

Okay, but I'm not sure that documention alone will be enough. We need
proper design.

> > > @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> > > }
> > > unlock_page(page);
> > > }
> > > +rescan_split:
> > > pagevec_remove_exceptionals(&pvec);
> > > pagevec_release(&pvec);
> > > +
> > > + if (split && PageTransCompound(page)) {
> > > + /* The THP may get freed under us */
> > > + if (!get_page_unless_zero(compound_head(page)))
> > > + goto rescan_out;
> > > +
> > > + lock_page(page);
> > > +
> > > + /*
> > > + * The extra pins from page cache lookup have been
> > > + * released by pagevec_release().
> > > + */
> > > + if (!split_huge_page(page)) {
> > > + unlock_page(page);
> > > + put_page(page);
> > > + /* Re-look up page cache from current index */
> > > + goto again;
> > > + }
> > > + unlock_page(page);
> > > + put_page(page);
> > > + }
> > > +rescan_out:
> > > index++;
> > > }
> > Doing get_page_unless_zero() just after you've dropped the pin for the
> > page looks very suboptimal.
>
> If I don't drop the pins the THP can't be split. And, there might be more
> than one pins from find_get_entries() if I read the code correctly. For
> example, truncate 8K length in the middle of THP, the THP's refcount would
> get bumpped twice since? two sub pages would be returned.

Pin the page before pagevec_release() and avoid get_page_unless_zero().

Current code is buggy. You need to check that the page is still belong to
the file after speculative lookup.

--
Kirill A. Shutemov

2019-11-25 19:38:54

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially



On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>
>> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>> Currently when truncating shmem file, if the range is partial of THP
>>>> (start or end is in the middle of THP), the pages actually will just get
>>>> cleared rather than being freed unless the range cover the whole THP.
>>>> Even though all the subpages are truncated (randomly or sequentially),
>>>> the THP may still be kept in page cache. This might be fine for some
>>>> usecases which prefer preserving THP.
>>>>
>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
>>>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>>>> work as expected since it doesn't free memory. But, the inflation
>>>> usecase really needs get the memory freed. Anonymous THP will not get
>>>> freed right away too but it will be freed eventually when all subpages are
>>>> unmapped, but shmem THP would still stay in page cache.
>>>>
>>>> To protect the usecases which may prefer preserving THP, introduce a
>>>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP is
>>>> preferred behavior if truncating partial THP. This mode just makes
>>>> sense to tmpfs for the time being.
>>> We need to clarify interaction with khugepaged. This implementation
>>> doesn't do anything to prevent khugepaged from collapsing the range back
>>> to THP just after the split.
>> Yes, it doesn't. Will clarify this in the commit log.
> Okay, but I'm not sure that documention alone will be enough. We need
> proper design.

Maybe we could try to hold inode lock with read during collapse_file().
The shmem fallocate does acquire inode lock with write, this should be
able to synchronize hole punch and khugepaged. And, shmem just needs
hold inode lock for llseek and fallocate, I'm supposed they are should
be called not that frequently to have impact on khugepaged. The llseek
might be often, but it should be quite fast. However, they might get
blocked by khugepaged.

It sounds safe to hold a rwsem during collapsing THP.

Or we could set VM_NOHUGEPAGE in shmem inode's flag with hole punch and
clear it after truncate, then check the flag before doing collapse in
khugepaged. khugepaged should not need hold the inode lock during
collapse since it could be released after the flag is checked.

>
>>>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>> }
>>>> unlock_page(page);
>>>> }
>>>> +rescan_split:
>>>> pagevec_remove_exceptionals(&pvec);
>>>> pagevec_release(&pvec);
>>>> +
>>>> + if (split && PageTransCompound(page)) {
>>>> + /* The THP may get freed under us */
>>>> + if (!get_page_unless_zero(compound_head(page)))
>>>> + goto rescan_out;
>>>> +
>>>> + lock_page(page);
>>>> +
>>>> + /*
>>>> + * The extra pins from page cache lookup have been
>>>> + * released by pagevec_release().
>>>> + */
>>>> + if (!split_huge_page(page)) {
>>>> + unlock_page(page);
>>>> + put_page(page);
>>>> + /* Re-look up page cache from current index */
>>>> + goto again;
>>>> + }
>>>> + unlock_page(page);
>>>> + put_page(page);
>>>> + }
>>>> +rescan_out:
>>>> index++;
>>>> }
>>> Doing get_page_unless_zero() just after you've dropped the pin for the
>>> page looks very suboptimal.
>> If I don't drop the pins the THP can't be split. And, there might be more
>> than one pins from find_get_entries() if I read the code correctly. For
>> example, truncate 8K length in the middle of THP, the THP's refcount would
>> get bumpped twice since  two sub pages would be returned.
> Pin the page before pagevec_release() and avoid get_page_unless_zero().
>
> Current code is buggy. You need to check that the page is still belong to
> the file after speculative lookup.

Yes, I missed this point. Thanks for the suggestion.

>

2019-11-26 23:37:04

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially



On 11/25/19 11:33 AM, Yang Shi wrote:
>
>
> On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
>> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>>
>>> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
>>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>>> Currently when truncating shmem file, if the range is partial of THP
>>>>> (start or end is in the middle of THP), the pages actually will
>>>>> just get
>>>>> cleared rather than being freed unless the range cover the whole THP.
>>>>> Even though all the subpages are truncated (randomly or
>>>>> sequentially),
>>>>> the THP may still be kept in page cache.  This might be fine for some
>>>>> usecases which prefer preserving THP.
>>>>>
>>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole
>>>>> punch
>>>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not
>>>>> used.
>>>>> So, when using shmem THP as memory backend QEMU inflation actually
>>>>> doesn't
>>>>> work as expected since it doesn't free memory.  But, the inflation
>>>>> usecase really needs get the memory freed.  Anonymous THP will not
>>>>> get
>>>>> freed right away too but it will be freed eventually when all
>>>>> subpages are
>>>>> unmapped, but shmem THP would still stay in page cache.
>>>>>
>>>>> To protect the usecases which may prefer preserving THP, introduce a
>>>>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting
>>>>> THP is
>>>>> preferred behavior if truncating partial THP.  This mode just makes
>>>>> sense to tmpfs for the time being.
>>>> We need to clarify interaction with khugepaged. This implementation
>>>> doesn't do anything to prevent khugepaged from collapsing the range
>>>> back
>>>> to THP just after the split.
>>> Yes, it doesn't. Will clarify this in the commit log.
>> Okay, but I'm not sure that documention alone will be enough. We need
>> proper design.
>
> Maybe we could try to hold inode lock with read during
> collapse_file(). The shmem fallocate does acquire inode lock with
> write, this should be able to synchronize hole punch and khugepaged.
> And, shmem just needs hold inode lock for llseek and fallocate, I'm
> supposed they are should be called not that frequently to have impact
> on khugepaged. The llseek might be often, but it should be quite fast.
> However, they might get blocked by khugepaged.
>
> It sounds safe to hold a rwsem during collapsing THP.
>
> Or we could set VM_NOHUGEPAGE in shmem inode's flag with hole punch
> and clear it after truncate, then check the flag before doing collapse
> in khugepaged. khugepaged should not need hold the inode lock during
> collapse since it could be released after the flag is checked.

By relooking the code, it looks the latter one (check VM_NOHUGEPAGE)
doesn't make sense, it can't prevent khugepaged from collapsing THP in
parallel.

>
>>
>>>>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode
>>>>> *inode, loff_t lstart, loff_t lend,
>>>>>                }
>>>>>                unlock_page(page);
>>>>>            }
>>>>> +rescan_split:
>>>>>            pagevec_remove_exceptionals(&pvec);
>>>>>            pagevec_release(&pvec);
>>>>> +
>>>>> +        if (split && PageTransCompound(page)) {
>>>>> +            /* The THP may get freed under us */
>>>>> +            if (!get_page_unless_zero(compound_head(page)))
>>>>> +                goto rescan_out;
>>>>> +
>>>>> +            lock_page(page);
>>>>> +
>>>>> +            /*
>>>>> +             * The extra pins from page cache lookup have been
>>>>> +             * released by pagevec_release().
>>>>> +             */
>>>>> +            if (!split_huge_page(page)) {
>>>>> +                unlock_page(page);
>>>>> +                put_page(page);
>>>>> +                /* Re-look up page cache from current index */
>>>>> +                goto again;
>>>>> +            }
>>>>> +            unlock_page(page);
>>>>> +            put_page(page);
>>>>> +        }
>>>>> +rescan_out:
>>>>>            index++;
>>>>>        }
>>>> Doing get_page_unless_zero() just after you've dropped the pin for the
>>>> page looks very suboptimal.
>>> If I don't drop the pins the THP can't be split. And, there might be
>>> more
>>> than one pins from find_get_entries() if I read the code correctly. For
>>> example, truncate 8K length in the middle of THP, the THP's refcount
>>> would
>>> get bumpped twice since  two sub pages would be returned.
>> Pin the page before pagevec_release() and avoid get_page_unless_zero().
>>
>> Current code is buggy. You need to check that the page is still
>> belong to
>> the file after speculative lookup.
>
> Yes, I missed this point. Thanks for the suggestion.
>
>>
>

2019-11-28 03:07:55

by Hugh Dickins

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially

On Tue, 26 Nov 2019, Yang Shi wrote:
> On 11/25/19 11:33 AM, Yang Shi wrote:
> > On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
> > > On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
> > > > On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
> > > > > On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
> > > > > > Currently when truncating shmem file, if the range is partial of
> > > > > > THP
> > > > > > (start or end is in the middle of THP), the pages actually will
> > > > > > just get
> > > > > > cleared rather than being freed unless the range cover the whole
> > > > > > THP.
> > > > > > Even though all the subpages are truncated (randomly or
> > > > > > sequentially),
> > > > > > the THP may still be kept in page cache.  This might be fine for
> > > > > > some
> > > > > > usecases which prefer preserving THP.
> > > > > >
> > > > > > But, when doing balloon inflation in QEMU, QEMU actually does hole
> > > > > > punch
> > > > > > or MADV_DONTNEED in base page size granulairty if hugetlbfs is not
> > > > > > used.
> > > > > > So, when using shmem THP as memory backend QEMU inflation actually
> > > > > > doesn't
> > > > > > work as expected since it doesn't free memory.  But, the inflation
> > > > > > usecase really needs get the memory freed.  Anonymous THP will not
> > > > > > get
> > > > > > freed right away too but it will be freed eventually when all
> > > > > > subpages are
> > > > > > unmapped, but shmem THP would still stay in page cache.
> > > > > >
> > > > > > To protect the usecases which may prefer preserving THP, introduce
> > > > > > a
> > > > > > new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP
> > > > > > is
> > > > > > preferred behavior if truncating partial THP.  This mode just makes
> > > > > > sense to tmpfs for the time being.

Sorry, I haven't managed to set aside enough time for this until now.

First off, let me say that I firmly believe this punch-split behavior
should be the standard behavior (like in my huge tmpfs implementation),
and we should not need a special FALLOC_FL_SPLIT_HPAGE to do it.
But I don't know if I'll be able to persuade Kirill of that.

If the caller wants to write zeroes into the file, she can do so with the
write syscall: the caller has asked to punch a hole or truncate the file,
and in our case, like your QEMU case, hopes that memory and memcg charge
will be freed by doing so. I'll be surprised if changing the behavior
to yours and mine turns out to introduce a regression, but if it does,
I guess we'll then have to put it behind a sysctl or whatever.

IIUC the reason that it's currently implemented by clearing the hole
is because split_huge_page() (unlike in older refcounting days) cannot
be guaranteed to succeed. Which is unfortunate, and none of us is very
keen to build a filesystem on unreliable behavior; but the failure cases
appear in practice to be rare enough, that it's on balance better to give
the punch-hole-truncate caller what she asked for whenever possible.

> > > > > We need to clarify interaction with khugepaged. This implementation
> > > > > doesn't do anything to prevent khugepaged from collapsing the range
> > > > > back
> > > > > to THP just after the split.
> > > > Yes, it doesn't. Will clarify this in the commit log.
> > > Okay, but I'm not sure that documention alone will be enough. We need
> > > proper design.
> >
> > Maybe we could try to hold inode lock with read during collapse_file(). The
> > shmem fallocate does acquire inode lock with write, this should be able to
> > synchronize hole punch and khugepaged. And, shmem just needs hold inode
> > lock for llseek and fallocate, I'm supposed they are should be called not
> > that frequently to have impact on khugepaged. The llseek might be often,
> > but it should be quite fast. However, they might get blocked by khugepaged.
> >
> > It sounds safe to hold a rwsem during collapsing THP.

No, I don't think we want to take any more locks while collapsing THP,
but that wasn't really the point. We're not concerned about a *race*
between splitting and khugepaged reassembling (I'm assuming that any
such race would already exist, and be well-guarded against by all the
refcount checks, punchhole not adding anything new here; but perhaps
I'm assuming too blithely, and it is worth checking over).

The point, as I see it anyway, is the contradiction in effort: the
caller asks for hole to be punched, we do that, then a few seconds
or minutes later, khugepaged comes along and fills in the hole (if
huge page availability and memcg limit allow it).

I agree that's not very satisfactory, but I think it's a side issue:
we don't have a good mechanism to tell khugepaged to keep off a range.
As it is, fallocate and ftruncate ought to do the job expected of them,
and khugepaged ought to do the job expected of it. And in many cases,
the punched file will not even be mapped (visible to khugepaged), or
max_ptes_none set to exclude working on such holes.

Is khugepaged's action an issue for your QEMU case?

> >
> > Or we could set VM_NOHUGEPAGE in shmem inode's flag with hole punch and
> > clear it after truncate, then check the flag before doing collapse in
> > khugepaged. khugepaged should not need hold the inode lock during collapse
> > since it could be released after the flag is checked.
>
> By relooking the code, it looks the latter one (check VM_NOHUGEPAGE) doesn't
> make sense, it can't prevent khugepaged from collapsing THP in parallel.
>
> >
> > >
> > > > > > @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode
> > > > > > *inode, loff_t lstart, loff_t lend,
> > > > > >                }
> > > > > >                unlock_page(page);
> > > > > >            }
> > > > > > +rescan_split:
> > > > > >            pagevec_remove_exceptionals(&pvec);
> > > > > >            pagevec_release(&pvec);
> > > > > > +
> > > > > > +        if (split && PageTransCompound(page)) {
> > > > > > +            /* The THP may get freed under us */
> > > > > > +            if (!get_page_unless_zero(compound_head(page)))
> > > > > > +                goto rescan_out;
> > > > > > +
> > > > > > +            lock_page(page);
> > > > > > +
> > > > > > +            /*
> > > > > > +             * The extra pins from page cache lookup have been
> > > > > > +             * released by pagevec_release().
> > > > > > +             */
> > > > > > +            if (!split_huge_page(page)) {
> > > > > > +                unlock_page(page);
> > > > > > +                put_page(page);
> > > > > > +                /* Re-look up page cache from current index */
> > > > > > +                goto again;
> > > > > > +            }
> > > > > > +            unlock_page(page);
> > > > > > +            put_page(page);
> > > > > > +        }
> > > > > > +rescan_out:
> > > > > >            index++;
> > > > > >        }
> > > > > Doing get_page_unless_zero() just after you've dropped the pin for
> > > > > the
> > > > > page looks very suboptimal.
> > > > If I don't drop the pins the THP can't be split. And, there might be
> > > > more
> > > > than one pins from find_get_entries() if I read the code correctly. For
> > > > example, truncate 8K length in the middle of THP, the THP's refcount
> > > > would
> > > > get bumpped twice since  two sub pages would be returned.
> > > Pin the page before pagevec_release() and avoid get_page_unless_zero().
> > >
> > > Current code is buggy. You need to check that the page is still belong to
> > > the file after speculative lookup.

Yes indeed (I think you can even keep the page locked, but I may be wrong).

> >
> > Yes, I missed this point. Thanks for the suggestion.

The main problem I see is your "goto retry" and "goto again":
split_huge_page() may fail because a get_page() somewhere is holding
a transient reference to the page, or it may fail because there's a
GUP that holds a reference for days: you do not want to be stuck here
going round and around the loop waiting for that GUP to be released!

It's nice that we already have a trylock_page() loop followed by a
lock_page() loop. When split_huge_page() fails, the trylock_page()
loop can simply move on to the next page (skip over the compound page,
or retry it subpage by subpage? I've forgotten the pros and cons),
and leave final resolution to the later lock_page() loop: which has to
accept when split_huge_page() failed, and fall back to clearing instead.

I would prefer a smaller patch than your RFC: making split the
default behavior cuts out a lot of it, but I think there's still
more that can be cut. Here's the patch we've been using internally,
which deletes quite a lot of the old code; but you'll quickly notice
has a "Revisit later" hack in find_get_entries(), which I've not got
around to revisiting yet. Please blend what you can from my patch
into yours, or vice versa.

Hugh

---

mm/filemap.c | 3 +
mm/shmem.c | 86 +++++++++++++++++--------------------------------
2 files changed, 34 insertions(+), 55 deletions(-)

--- v5.4/mm/filemap.c 2019-11-24 16:32:01.000000000 -0800
+++ linux/mm/filemap.c 2019-11-27 16:21:16.316801433 -0800
@@ -1752,6 +1752,9 @@ unsigned find_get_entries(struct address
goto put_page;
page = find_subpage(page, xas.xa_index);

+ /* Revisit later: make shmem_undo_range() easier for now */
+ if (PageTransCompound(page))
+ nr_entries = ret + 1;
export:
indices[ret] = xas.xa_index;
entries[ret] = page;
--- v5.4/mm/shmem.c 2019-11-24 16:32:01.000000000 -0800
+++ linux/mm/shmem.c 2019-11-27 16:21:16.320801450 -0800
@@ -788,6 +788,20 @@ void shmem_unlock_mapping(struct address
}
}

+static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
+{
+ if (!PageTransCompound(page))
+ return true;
+
+ /* Just proceed to delete a huge page wholly within the range punched */
+ if (PageHead(page) &&
+ page->index >= start && page->index + HPAGE_PMD_NR <= end)
+ return true;
+
+ /* Try to split huge page, so we can truly punch the hole or truncate */
+ return split_huge_page(page) >= 0;
+}
+
/*
* Remove range of pages and swap entries from page cache, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -838,31 +852,11 @@ static void shmem_undo_range(struct inod
if (!trylock_page(page))
continue;

- if (PageTransTail(page)) {
- /* Middle of THP: zero out the page */
- clear_highpage(page);
- unlock_page(page);
- continue;
- } else if (PageTransHuge(page)) {
- if (index == round_down(end, HPAGE_PMD_NR)) {
- /*
- * Range ends in the middle of THP:
- * zero out the page
- */
- clear_highpage(page);
- unlock_page(page);
- continue;
- }
- index += HPAGE_PMD_NR - 1;
- i += HPAGE_PMD_NR - 1;
- }
-
- if (!unfalloc || !PageUptodate(page)) {
- VM_BUG_ON_PAGE(PageTail(page), page);
- if (page_mapping(page) == mapping) {
- VM_BUG_ON_PAGE(PageWriteback(page), page);
+ if ((!unfalloc || !PageUptodate(page)) &&
+ page_mapping(page) == mapping) {
+ VM_BUG_ON_PAGE(PageWriteback(page), page);
+ if (shmem_punch_compound(page, start, end))
truncate_inode_page(mapping, page);
- }
}
unlock_page(page);
}
@@ -936,43 +930,25 @@ static void shmem_undo_range(struct inod

lock_page(page);

- if (PageTransTail(page)) {
- /* Middle of THP: zero out the page */
- clear_highpage(page);
- unlock_page(page);
- /*
- * Partial thp truncate due 'start' in middle
- * of THP: don't need to look on these pages
- * again on !pvec.nr restart.
- */
- if (index != round_down(end, HPAGE_PMD_NR))
- start++;
- continue;
- } else if (PageTransHuge(page)) {
- if (index == round_down(end, HPAGE_PMD_NR)) {
- /*
- * Range ends in the middle of THP:
- * zero out the page
- */
- clear_highpage(page);
- unlock_page(page);
- continue;
- }
- index += HPAGE_PMD_NR - 1;
- i += HPAGE_PMD_NR - 1;
- }
-
if (!unfalloc || !PageUptodate(page)) {
- VM_BUG_ON_PAGE(PageTail(page), page);
- if (page_mapping(page) == mapping) {
- VM_BUG_ON_PAGE(PageWriteback(page), page);
- truncate_inode_page(mapping, page);
- } else {
+ if (page_mapping(page) != mapping) {
/* Page was replaced by swap: retry */
unlock_page(page);
index--;
break;
}
+ VM_BUG_ON_PAGE(PageWriteback(page), page);
+ if (shmem_punch_compound(page, start, end))
+ truncate_inode_page(mapping, page);
+ else {
+ /* Wipe the page and don't get stuck */
+ clear_highpage(page);
+ flush_dcache_page(page);
+ set_page_dirty(page);
+ if (index <
+ round_up(start, HPAGE_PMD_NR))
+ start = index + 1;
+ }
}
unlock_page(page);
}

2019-11-28 11:37:39

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially

On Wed, Nov 27, 2019 at 07:06:01PM -0800, Hugh Dickins wrote:
> On Tue, 26 Nov 2019, Yang Shi wrote:
> > On 11/25/19 11:33 AM, Yang Shi wrote:
> > > On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
> > > > On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
> > > > > On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
> > > > > > On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
> > > > > > > Currently when truncating shmem file, if the range is partial of
> > > > > > > THP
> > > > > > > (start or end is in the middle of THP), the pages actually will
> > > > > > > just get
> > > > > > > cleared rather than being freed unless the range cover the whole
> > > > > > > THP.
> > > > > > > Even though all the subpages are truncated (randomly or
> > > > > > > sequentially),
> > > > > > > the THP may still be kept in page cache.? This might be fine for
> > > > > > > some
> > > > > > > usecases which prefer preserving THP.
> > > > > > >
> > > > > > > But, when doing balloon inflation in QEMU, QEMU actually does hole
> > > > > > > punch
> > > > > > > or MADV_DONTNEED in base page size granulairty if hugetlbfs is not
> > > > > > > used.
> > > > > > > So, when using shmem THP as memory backend QEMU inflation actually
> > > > > > > doesn't
> > > > > > > work as expected since it doesn't free memory.? But, the inflation
> > > > > > > usecase really needs get the memory freed.? Anonymous THP will not
> > > > > > > get
> > > > > > > freed right away too but it will be freed eventually when all
> > > > > > > subpages are
> > > > > > > unmapped, but shmem THP would still stay in page cache.
> > > > > > >
> > > > > > > To protect the usecases which may prefer preserving THP, introduce
> > > > > > > a
> > > > > > > new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP
> > > > > > > is
> > > > > > > preferred behavior if truncating partial THP.? This mode just makes
> > > > > > > sense to tmpfs for the time being.
>
> Sorry, I haven't managed to set aside enough time for this until now.
>
> First off, let me say that I firmly believe this punch-split behavior
> should be the standard behavior (like in my huge tmpfs implementation),
> and we should not need a special FALLOC_FL_SPLIT_HPAGE to do it.
> But I don't know if I'll be able to persuade Kirill of that.
>
> If the caller wants to write zeroes into the file, she can do so with the
> write syscall: the caller has asked to punch a hole or truncate the file,
> and in our case, like your QEMU case, hopes that memory and memcg charge
> will be freed by doing so. I'll be surprised if changing the behavior
> to yours and mine turns out to introduce a regression, but if it does,
> I guess we'll then have to put it behind a sysctl or whatever.
>
> IIUC the reason that it's currently implemented by clearing the hole
> is because split_huge_page() (unlike in older refcounting days) cannot
> be guaranteed to succeed. Which is unfortunate, and none of us is very
> keen to build a filesystem on unreliable behavior; but the failure cases
> appear in practice to be rare enough, that it's on balance better to give
> the punch-hole-truncate caller what she asked for whenever possible.

I don't have a firm position here. Maybe you are right and we should try
to split pages right away.

It might be useful to consider case wider than shmem.

On traditional filesystem with a backing storage semantics of the same
punch hole operation is somewhat different. It doesn't have explicit
implications on memory footprint. It's about managing persistent storage.
With shmem/tmpfs it is lumped together.

It might be nice to write down pages that can be discarded under memory
pressure and leave the huge page intact until then...

[ I don't see a problem with your patch as long as we agree that it's
desired semantics for the interface. ]

--
Kirill A. Shutemov

2019-12-02 23:09:41

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially



On 11/27/19 7:06 PM, Hugh Dickins wrote:
> On Tue, 26 Nov 2019, Yang Shi wrote:
>> On 11/25/19 11:33 AM, Yang Shi wrote:
>>> On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
>>>> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>>>> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
>>>>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>>>>> Currently when truncating shmem file, if the range is partial of
>>>>>>> THP
>>>>>>> (start or end is in the middle of THP), the pages actually will
>>>>>>> just get
>>>>>>> cleared rather than being freed unless the range cover the whole
>>>>>>> THP.
>>>>>>> Even though all the subpages are truncated (randomly or
>>>>>>> sequentially),
>>>>>>> the THP may still be kept in page cache.  This might be fine for
>>>>>>> some
>>>>>>> usecases which prefer preserving THP.
>>>>>>>
>>>>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole
>>>>>>> punch
>>>>>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not
>>>>>>> used.
>>>>>>> So, when using shmem THP as memory backend QEMU inflation actually
>>>>>>> doesn't
>>>>>>> work as expected since it doesn't free memory.  But, the inflation
>>>>>>> usecase really needs get the memory freed.  Anonymous THP will not
>>>>>>> get
>>>>>>> freed right away too but it will be freed eventually when all
>>>>>>> subpages are
>>>>>>> unmapped, but shmem THP would still stay in page cache.
>>>>>>>
>>>>>>> To protect the usecases which may prefer preserving THP, introduce
>>>>>>> a
>>>>>>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP
>>>>>>> is
>>>>>>> preferred behavior if truncating partial THP.  This mode just makes
>>>>>>> sense to tmpfs for the time being.
> Sorry, I haven't managed to set aside enough time for this until now.
>
> First off, let me say that I firmly believe this punch-split behavior
> should be the standard behavior (like in my huge tmpfs implementation),
> and we should not need a special FALLOC_FL_SPLIT_HPAGE to do it.
> But I don't know if I'll be able to persuade Kirill of that.
>
> If the caller wants to write zeroes into the file, she can do so with the
> write syscall: the caller has asked to punch a hole or truncate the file,
> and in our case, like your QEMU case, hopes that memory and memcg charge
> will be freed by doing so. I'll be surprised if changing the behavior
> to yours and mine turns out to introduce a regression, but if it does,
> I guess we'll then have to put it behind a sysctl or whatever.

I'm not sure if there may be regression or not so I added the flag to
have a choice. But, sysctl may seem better. Anyway, I agree we don't
have to consider the potential regression until the real regression is
found.

>
> IIUC the reason that it's currently implemented by clearing the hole
> is because split_huge_page() (unlike in older refcounting days) cannot
> be guaranteed to succeed. Which is unfortunate, and none of us is very
> keen to build a filesystem on unreliable behavior; but the failure cases
> appear in practice to be rare enough, that it's on balance better to give
> the punch-hole-truncate caller what she asked for whenever possible.
>
>>>>>> We need to clarify interaction with khugepaged. This implementation
>>>>>> doesn't do anything to prevent khugepaged from collapsing the range
>>>>>> back
>>>>>> to THP just after the split.
>>>>> Yes, it doesn't. Will clarify this in the commit log.
>>>> Okay, but I'm not sure that documention alone will be enough. We need
>>>> proper design.
>>> Maybe we could try to hold inode lock with read during collapse_file(). The
>>> shmem fallocate does acquire inode lock with write, this should be able to
>>> synchronize hole punch and khugepaged. And, shmem just needs hold inode
>>> lock for llseek and fallocate, I'm supposed they are should be called not
>>> that frequently to have impact on khugepaged. The llseek might be often,
>>> but it should be quite fast. However, they might get blocked by khugepaged.
>>>
>>> It sounds safe to hold a rwsem during collapsing THP.
> No, I don't think we want to take any more locks while collapsing THP,
> but that wasn't really the point. We're not concerned about a *race*
> between splitting and khugepaged reassembling (I'm assuming that any
> such race would already exist, and be well-guarded against by all the
> refcount checks, punchhole not adding anything new here; but perhaps
> I'm assuming too blithely, and it is worth checking over).
>
> The point, as I see it anyway, is the contradiction in effort: the
> caller asks for hole to be punched, we do that, then a few seconds
> or minutes later, khugepaged comes along and fills in the hole (if
> huge page availability and memcg limit allow it).
>
> I agree that's not very satisfactory, but I think it's a side issue:
> we don't have a good mechanism to tell khugepaged to keep off a range.
> As it is, fallocate and ftruncate ought to do the job expected of them,
> and khugepaged ought to do the job expected of it. And in many cases,
> the punched file will not even be mapped (visible to khugepaged), or
> max_ptes_none set to exclude working on such holes.
>
> Is khugepaged's action an issue for your QEMU case?

So far not.

>
>>> Or we could set VM_NOHUGEPAGE in shmem inode's flag with hole punch and
>>> clear it after truncate, then check the flag before doing collapse in
>>> khugepaged. khugepaged should not need hold the inode lock during collapse
>>> since it could be released after the flag is checked.
>> By relooking the code, it looks the latter one (check VM_NOHUGEPAGE) doesn't
>> make sense, it can't prevent khugepaged from collapsing THP in parallel.
>>
>>>>>>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode
>>>>>>> *inode, loff_t lstart, loff_t lend,
>>>>>>>                }
>>>>>>>                unlock_page(page);
>>>>>>>            }
>>>>>>> +rescan_split:
>>>>>>>            pagevec_remove_exceptionals(&pvec);
>>>>>>>            pagevec_release(&pvec);
>>>>>>> +
>>>>>>> +        if (split && PageTransCompound(page)) {
>>>>>>> +            /* The THP may get freed under us */
>>>>>>> +            if (!get_page_unless_zero(compound_head(page)))
>>>>>>> +                goto rescan_out;
>>>>>>> +
>>>>>>> +            lock_page(page);
>>>>>>> +
>>>>>>> +            /*
>>>>>>> +             * The extra pins from page cache lookup have been
>>>>>>> +             * released by pagevec_release().
>>>>>>> +             */
>>>>>>> +            if (!split_huge_page(page)) {
>>>>>>> +                unlock_page(page);
>>>>>>> +                put_page(page);
>>>>>>> +                /* Re-look up page cache from current index */
>>>>>>> +                goto again;
>>>>>>> +            }
>>>>>>> +            unlock_page(page);
>>>>>>> +            put_page(page);
>>>>>>> +        }
>>>>>>> +rescan_out:
>>>>>>>            index++;
>>>>>>>        }
>>>>>> Doing get_page_unless_zero() just after you've dropped the pin for
>>>>>> the
>>>>>> page looks very suboptimal.
>>>>> If I don't drop the pins the THP can't be split. And, there might be
>>>>> more
>>>>> than one pins from find_get_entries() if I read the code correctly. For
>>>>> example, truncate 8K length in the middle of THP, the THP's refcount
>>>>> would
>>>>> get bumpped twice since  two sub pages would be returned.
>>>> Pin the page before pagevec_release() and avoid get_page_unless_zero().
>>>>
>>>> Current code is buggy. You need to check that the page is still belong to
>>>> the file after speculative lookup.
> Yes indeed (I think you can even keep the page locked, but I may be wrong).
>
>>> Yes, I missed this point. Thanks for the suggestion.
> The main problem I see is your "goto retry" and "goto again":
> split_huge_page() may fail because a get_page() somewhere is holding
> a transient reference to the page, or it may fail because there's a
> GUP that holds a reference for days: you do not want to be stuck here
> going round and around the loop waiting for that GUP to be released!

I think my code already handled this case. Once the split is failed, it
just falls through to next index. It just does retry as long as split
succeeds.

But my patch does clear before split.

>
> It's nice that we already have a trylock_page() loop followed by a
> lock_page() loop. When split_huge_page() fails, the trylock_page()
> loop can simply move on to the next page (skip over the compound page,
> or retry it subpage by subpage? I've forgotten the pros and cons),
> and leave final resolution to the later lock_page() loop: which has to
> accept when split_huge_page() failed, and fall back to clearing instead.
>
> I would prefer a smaller patch than your RFC: making split the
> default behavior cuts out a lot of it, but I think there's still
> more that can be cut. Here's the patch we've been using internally,
> which deletes quite a lot of the old code; but you'll quickly notice
> has a "Revisit later" hack in find_get_entries(), which I've not got
> around to revisiting yet. Please blend what you can from my patch
> into yours, or vice versa.

Thank you very much. It looks your "Revisit later" hack tries to keep
just one extra pin for the THP so that split could succeed.

I will try to blend the two patches.

>
> Hugh
>
> ---
>
> mm/filemap.c | 3 +
> mm/shmem.c | 86 +++++++++++++++++--------------------------------
> 2 files changed, 34 insertions(+), 55 deletions(-)
>
> --- v5.4/mm/filemap.c 2019-11-24 16:32:01.000000000 -0800
> +++ linux/mm/filemap.c 2019-11-27 16:21:16.316801433 -0800
> @@ -1752,6 +1752,9 @@ unsigned find_get_entries(struct address
> goto put_page;
> page = find_subpage(page, xas.xa_index);
>
> + /* Revisit later: make shmem_undo_range() easier for now */
> + if (PageTransCompound(page))
> + nr_entries = ret + 1;
> export:
> indices[ret] = xas.xa_index;
> entries[ret] = page;
> --- v5.4/mm/shmem.c 2019-11-24 16:32:01.000000000 -0800
> +++ linux/mm/shmem.c 2019-11-27 16:21:16.320801450 -0800
> @@ -788,6 +788,20 @@ void shmem_unlock_mapping(struct address
> }
> }
>
> +static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
> +{
> + if (!PageTransCompound(page))
> + return true;
> +
> + /* Just proceed to delete a huge page wholly within the range punched */
> + if (PageHead(page) &&
> + page->index >= start && page->index + HPAGE_PMD_NR <= end)
> + return true;
> +
> + /* Try to split huge page, so we can truly punch the hole or truncate */
> + return split_huge_page(page) >= 0;
> +}
> +
> /*
> * Remove range of pages and swap entries from page cache, and free them.
> * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> @@ -838,31 +852,11 @@ static void shmem_undo_range(struct inod
> if (!trylock_page(page))
> continue;
>
> - if (PageTransTail(page)) {
> - /* Middle of THP: zero out the page */
> - clear_highpage(page);
> - unlock_page(page);
> - continue;
> - } else if (PageTransHuge(page)) {
> - if (index == round_down(end, HPAGE_PMD_NR)) {
> - /*
> - * Range ends in the middle of THP:
> - * zero out the page
> - */
> - clear_highpage(page);
> - unlock_page(page);
> - continue;
> - }
> - index += HPAGE_PMD_NR - 1;
> - i += HPAGE_PMD_NR - 1;
> - }
> -
> - if (!unfalloc || !PageUptodate(page)) {
> - VM_BUG_ON_PAGE(PageTail(page), page);
> - if (page_mapping(page) == mapping) {
> - VM_BUG_ON_PAGE(PageWriteback(page), page);
> + if ((!unfalloc || !PageUptodate(page)) &&
> + page_mapping(page) == mapping) {
> + VM_BUG_ON_PAGE(PageWriteback(page), page);
> + if (shmem_punch_compound(page, start, end))
> truncate_inode_page(mapping, page);
> - }
> }
> unlock_page(page);
> }
> @@ -936,43 +930,25 @@ static void shmem_undo_range(struct inod
>
> lock_page(page);
>
> - if (PageTransTail(page)) {
> - /* Middle of THP: zero out the page */
> - clear_highpage(page);
> - unlock_page(page);
> - /*
> - * Partial thp truncate due 'start' in middle
> - * of THP: don't need to look on these pages
> - * again on !pvec.nr restart.
> - */
> - if (index != round_down(end, HPAGE_PMD_NR))
> - start++;
> - continue;
> - } else if (PageTransHuge(page)) {
> - if (index == round_down(end, HPAGE_PMD_NR)) {
> - /*
> - * Range ends in the middle of THP:
> - * zero out the page
> - */
> - clear_highpage(page);
> - unlock_page(page);
> - continue;
> - }
> - index += HPAGE_PMD_NR - 1;
> - i += HPAGE_PMD_NR - 1;
> - }
> -
> if (!unfalloc || !PageUptodate(page)) {
> - VM_BUG_ON_PAGE(PageTail(page), page);
> - if (page_mapping(page) == mapping) {
> - VM_BUG_ON_PAGE(PageWriteback(page), page);
> - truncate_inode_page(mapping, page);
> - } else {
> + if (page_mapping(page) != mapping) {
> /* Page was replaced by swap: retry */
> unlock_page(page);
> index--;
> break;
> }
> + VM_BUG_ON_PAGE(PageWriteback(page), page);
> + if (shmem_punch_compound(page, start, end))
> + truncate_inode_page(mapping, page);
> + else {
> + /* Wipe the page and don't get stuck */
> + clear_highpage(page);
> + flush_dcache_page(page);
> + set_page_dirty(page);
> + if (index <
> + round_up(start, HPAGE_PMD_NR))
> + start = index + 1;
> + }
> }
> unlock_page(page);
> }

2019-12-02 23:16:01

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially



On 11/28/19 3:34 AM, Kirill A. Shutemov wrote:
> On Wed, Nov 27, 2019 at 07:06:01PM -0800, Hugh Dickins wrote:
>> On Tue, 26 Nov 2019, Yang Shi wrote:
>>> On 11/25/19 11:33 AM, Yang Shi wrote:
>>>> On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
>>>>> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>>>>> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
>>>>>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>>>>>> Currently when truncating shmem file, if the range is partial of
>>>>>>>> THP
>>>>>>>> (start or end is in the middle of THP), the pages actually will
>>>>>>>> just get
>>>>>>>> cleared rather than being freed unless the range cover the whole
>>>>>>>> THP.
>>>>>>>> Even though all the subpages are truncated (randomly or
>>>>>>>> sequentially),
>>>>>>>> the THP may still be kept in page cache.  This might be fine for
>>>>>>>> some
>>>>>>>> usecases which prefer preserving THP.
>>>>>>>>
>>>>>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole
>>>>>>>> punch
>>>>>>>> or MADV_DONTNEED in base page size granulairty if hugetlbfs is not
>>>>>>>> used.
>>>>>>>> So, when using shmem THP as memory backend QEMU inflation actually
>>>>>>>> doesn't
>>>>>>>> work as expected since it doesn't free memory.  But, the inflation
>>>>>>>> usecase really needs get the memory freed.  Anonymous THP will not
>>>>>>>> get
>>>>>>>> freed right away too but it will be freed eventually when all
>>>>>>>> subpages are
>>>>>>>> unmapped, but shmem THP would still stay in page cache.
>>>>>>>>
>>>>>>>> To protect the usecases which may prefer preserving THP, introduce
>>>>>>>> a
>>>>>>>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP
>>>>>>>> is
>>>>>>>> preferred behavior if truncating partial THP.  This mode just makes
>>>>>>>> sense to tmpfs for the time being.
>> Sorry, I haven't managed to set aside enough time for this until now.
>>
>> First off, let me say that I firmly believe this punch-split behavior
>> should be the standard behavior (like in my huge tmpfs implementation),
>> and we should not need a special FALLOC_FL_SPLIT_HPAGE to do it.
>> But I don't know if I'll be able to persuade Kirill of that.
>>
>> If the caller wants to write zeroes into the file, she can do so with the
>> write syscall: the caller has asked to punch a hole or truncate the file,
>> and in our case, like your QEMU case, hopes that memory and memcg charge
>> will be freed by doing so. I'll be surprised if changing the behavior
>> to yours and mine turns out to introduce a regression, but if it does,
>> I guess we'll then have to put it behind a sysctl or whatever.
>>
>> IIUC the reason that it's currently implemented by clearing the hole
>> is because split_huge_page() (unlike in older refcounting days) cannot
>> be guaranteed to succeed. Which is unfortunate, and none of us is very
>> keen to build a filesystem on unreliable behavior; but the failure cases
>> appear in practice to be rare enough, that it's on balance better to give
>> the punch-hole-truncate caller what she asked for whenever possible.
> I don't have a firm position here. Maybe you are right and we should try
> to split pages right away.
>
> It might be useful to consider case wider than shmem.
>
> On traditional filesystem with a backing storage semantics of the same
> punch hole operation is somewhat different. It doesn't have explicit
> implications on memory footprint. It's about managing persistent storage.
> With shmem/tmpfs it is lumped together.
>
> It might be nice to write down pages that can be discarded under memory
> pressure and leave the huge page intact until then...

Sounds like another deferred split queue. It could be an option, but our
usecase needs get memory freed right away since the memory might be
reused by others very soon.

>
> [ I don't see a problem with your patch as long as we agree that it's
> desired semantics for the interface. ]
>