2023-09-30 06:08:41

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault()

That Trinity livelock shmem_falloc avoidance block is unlikely, and a
distraction from the proper business of shmem_fault(): separate it out.
(This used to help compilers save stack on the fault path too, but both
gcc and clang nowadays seem to make better choices anyway.)

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 126 +++++++++++++++++++++++++++++------------------------
1 file changed, 69 insertions(+), 57 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 824eb55671d2..5501a5bc8d8c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2148,87 +2148,99 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop,
* entry unconditionally - even if something else had already woken the
* target.
*/
-static int synchronous_wake_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
+static int synchronous_wake_function(wait_queue_entry_t *wait,
+ unsigned int mode, int sync, void *key)
{
int ret = default_wake_function(wait, mode, sync, key);
list_del_init(&wait->entry);
return ret;
}

+/*
+ * Trinity finds that probing a hole which tmpfs is punching can
+ * prevent the hole-punch from ever completing: which in turn
+ * locks writers out with its hold on i_rwsem. So refrain from
+ * faulting pages into the hole while it's being punched. Although
+ * shmem_undo_range() does remove the additions, it may be unable to
+ * keep up, as each new page needs its own unmap_mapping_range() call,
+ * and the i_mmap tree grows ever slower to scan if new vmas are added.
+ *
+ * It does not matter if we sometimes reach this check just before the
+ * hole-punch begins, so that one fault then races with the punch:
+ * we just need to make racing faults a rare case.
+ *
+ * The implementation below would be much simpler if we just used a
+ * standard mutex or completion: but we cannot take i_rwsem in fault,
+ * and bloating every shmem inode for this unlikely case would be sad.
+ */
+static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode)
+{
+ struct shmem_falloc *shmem_falloc;
+ struct file *fpin = NULL;
+ vm_fault_t ret = 0;
+
+ spin_lock(&inode->i_lock);
+ shmem_falloc = inode->i_private;
+ if (shmem_falloc &&
+ shmem_falloc->waitq &&
+ vmf->pgoff >= shmem_falloc->start &&
+ vmf->pgoff < shmem_falloc->next) {
+ wait_queue_head_t *shmem_falloc_waitq;
+ DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
+
+ ret = VM_FAULT_NOPAGE;
+ fpin = maybe_unlock_mmap_for_io(vmf, NULL);
+ shmem_falloc_waitq = shmem_falloc->waitq;
+ prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
+ TASK_UNINTERRUPTIBLE);
+ spin_unlock(&inode->i_lock);
+ schedule();
+
+ /*
+ * shmem_falloc_waitq points into the shmem_fallocate()
+ * stack of the hole-punching task: shmem_falloc_waitq
+ * is usually invalid by the time we reach here, but
+ * finish_wait() does not dereference it in that case;
+ * though i_lock needed lest racing with wake_up_all().
+ */
+ spin_lock(&inode->i_lock);
+ finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
+ }
+ spin_unlock(&inode->i_lock);
+ if (fpin) {
+ fput(fpin);
+ ret = VM_FAULT_RETRY;
+ }
+ return ret;
+}
+
static vm_fault_t shmem_fault(struct vm_fault *vmf)
{
- struct vm_area_struct *vma = vmf->vma;
- struct inode *inode = file_inode(vma->vm_file);
+ struct inode *inode = file_inode(vmf->vma->vm_file);
gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
struct folio *folio = NULL;
+ vm_fault_t ret = 0;
int err;
- vm_fault_t ret = VM_FAULT_LOCKED;

/*
* Trinity finds that probing a hole which tmpfs is punching can
- * prevent the hole-punch from ever completing: which in turn
- * locks writers out with its hold on i_rwsem. So refrain from
- * faulting pages into the hole while it's being punched. Although
- * shmem_undo_range() does remove the additions, it may be unable to
- * keep up, as each new page needs its own unmap_mapping_range() call,
- * and the i_mmap tree grows ever slower to scan if new vmas are added.
- *
- * It does not matter if we sometimes reach this check just before the
- * hole-punch begins, so that one fault then races with the punch:
- * we just need to make racing faults a rare case.
- *
- * The implementation below would be much simpler if we just used a
- * standard mutex or completion: but we cannot take i_rwsem in fault,
- * and bloating every shmem inode for this unlikely case would be sad.
+ * prevent the hole-punch from ever completing: noted in i_private.
*/
if (unlikely(inode->i_private)) {
- struct shmem_falloc *shmem_falloc;
-
- spin_lock(&inode->i_lock);
- shmem_falloc = inode->i_private;
- if (shmem_falloc &&
- shmem_falloc->waitq &&
- vmf->pgoff >= shmem_falloc->start &&
- vmf->pgoff < shmem_falloc->next) {
- struct file *fpin;
- wait_queue_head_t *shmem_falloc_waitq;
- DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
-
- ret = VM_FAULT_NOPAGE;
- fpin = maybe_unlock_mmap_for_io(vmf, NULL);
- if (fpin)
- ret = VM_FAULT_RETRY;
-
- shmem_falloc_waitq = shmem_falloc->waitq;
- prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
- TASK_UNINTERRUPTIBLE);
- spin_unlock(&inode->i_lock);
- schedule();
-
- /*
- * shmem_falloc_waitq points into the shmem_fallocate()
- * stack of the hole-punching task: shmem_falloc_waitq
- * is usually invalid by the time we reach here, but
- * finish_wait() does not dereference it in that case;
- * though i_lock needed lest racing with wake_up_all().
- */
- spin_lock(&inode->i_lock);
- finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
- spin_unlock(&inode->i_lock);
-
- if (fpin)
- fput(fpin);
+ ret = shmem_falloc_wait(vmf, inode);
+ if (ret)
return ret;
- }
- spin_unlock(&inode->i_lock);
}

+ WARN_ON_ONCE(vmf->page != NULL);
err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
gfp, vmf, &ret);
if (err)
return vmf_error(err);
- if (folio)
+ if (folio) {
vmf->page = folio_file_page(folio, vmf->pgoff);
+ ret |= VM_FAULT_LOCKED;
+ }
return ret;
}

--
2.35.3


2023-10-03 13:19:47

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault()

On Fri 29-09-23 20:27:53, Hugh Dickins wrote:
> That Trinity livelock shmem_falloc avoidance block is unlikely, and a
> distraction from the proper business of shmem_fault(): separate it out.
> (This used to help compilers save stack on the fault path too, but both
> gcc and clang nowadays seem to make better choices anyway.)
>
> Signed-off-by: Hugh Dickins <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Looking at the code I'm just wondering whether the livelock with
shmem_undo_range() couldn't be more easy to avoid by making
shmem_undo_range() always advance the index by 1 after evicting a page and
thus guaranteeing a forward progress... Because the forward progress within
find_get_entries() is guaranteed these days, it should be enough.

Honza

> ---
> mm/shmem.c | 126 +++++++++++++++++++++++++++++------------------------
> 1 file changed, 69 insertions(+), 57 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 824eb55671d2..5501a5bc8d8c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2148,87 +2148,99 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop,
> * entry unconditionally - even if something else had already woken the
> * target.
> */
> -static int synchronous_wake_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
> +static int synchronous_wake_function(wait_queue_entry_t *wait,
> + unsigned int mode, int sync, void *key)
> {
> int ret = default_wake_function(wait, mode, sync, key);
> list_del_init(&wait->entry);
> return ret;
> }
>
> +/*
> + * Trinity finds that probing a hole which tmpfs is punching can
> + * prevent the hole-punch from ever completing: which in turn
> + * locks writers out with its hold on i_rwsem. So refrain from
> + * faulting pages into the hole while it's being punched. Although
> + * shmem_undo_range() does remove the additions, it may be unable to
> + * keep up, as each new page needs its own unmap_mapping_range() call,
> + * and the i_mmap tree grows ever slower to scan if new vmas are added.
> + *
> + * It does not matter if we sometimes reach this check just before the
> + * hole-punch begins, so that one fault then races with the punch:
> + * we just need to make racing faults a rare case.
> + *
> + * The implementation below would be much simpler if we just used a
> + * standard mutex or completion: but we cannot take i_rwsem in fault,
> + * and bloating every shmem inode for this unlikely case would be sad.
> + */
> +static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode)
> +{
> + struct shmem_falloc *shmem_falloc;
> + struct file *fpin = NULL;
> + vm_fault_t ret = 0;
> +
> + spin_lock(&inode->i_lock);
> + shmem_falloc = inode->i_private;
> + if (shmem_falloc &&
> + shmem_falloc->waitq &&
> + vmf->pgoff >= shmem_falloc->start &&
> + vmf->pgoff < shmem_falloc->next) {
> + wait_queue_head_t *shmem_falloc_waitq;
> + DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
> +
> + ret = VM_FAULT_NOPAGE;
> + fpin = maybe_unlock_mmap_for_io(vmf, NULL);
> + shmem_falloc_waitq = shmem_falloc->waitq;
> + prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> + TASK_UNINTERRUPTIBLE);
> + spin_unlock(&inode->i_lock);
> + schedule();
> +
> + /*
> + * shmem_falloc_waitq points into the shmem_fallocate()
> + * stack of the hole-punching task: shmem_falloc_waitq
> + * is usually invalid by the time we reach here, but
> + * finish_wait() does not dereference it in that case;
> + * though i_lock needed lest racing with wake_up_all().
> + */
> + spin_lock(&inode->i_lock);
> + finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> + }
> + spin_unlock(&inode->i_lock);
> + if (fpin) {
> + fput(fpin);
> + ret = VM_FAULT_RETRY;
> + }
> + return ret;
> +}
> +
> static vm_fault_t shmem_fault(struct vm_fault *vmf)
> {
> - struct vm_area_struct *vma = vmf->vma;
> - struct inode *inode = file_inode(vma->vm_file);
> + struct inode *inode = file_inode(vmf->vma->vm_file);
> gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> struct folio *folio = NULL;
> + vm_fault_t ret = 0;
> int err;
> - vm_fault_t ret = VM_FAULT_LOCKED;
>
> /*
> * Trinity finds that probing a hole which tmpfs is punching can
> - * prevent the hole-punch from ever completing: which in turn
> - * locks writers out with its hold on i_rwsem. So refrain from
> - * faulting pages into the hole while it's being punched. Although
> - * shmem_undo_range() does remove the additions, it may be unable to
> - * keep up, as each new page needs its own unmap_mapping_range() call,
> - * and the i_mmap tree grows ever slower to scan if new vmas are added.
> - *
> - * It does not matter if we sometimes reach this check just before the
> - * hole-punch begins, so that one fault then races with the punch:
> - * we just need to make racing faults a rare case.
> - *
> - * The implementation below would be much simpler if we just used a
> - * standard mutex or completion: but we cannot take i_rwsem in fault,
> - * and bloating every shmem inode for this unlikely case would be sad.
> + * prevent the hole-punch from ever completing: noted in i_private.
> */
> if (unlikely(inode->i_private)) {
> - struct shmem_falloc *shmem_falloc;
> -
> - spin_lock(&inode->i_lock);
> - shmem_falloc = inode->i_private;
> - if (shmem_falloc &&
> - shmem_falloc->waitq &&
> - vmf->pgoff >= shmem_falloc->start &&
> - vmf->pgoff < shmem_falloc->next) {
> - struct file *fpin;
> - wait_queue_head_t *shmem_falloc_waitq;
> - DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
> -
> - ret = VM_FAULT_NOPAGE;
> - fpin = maybe_unlock_mmap_for_io(vmf, NULL);
> - if (fpin)
> - ret = VM_FAULT_RETRY;
> -
> - shmem_falloc_waitq = shmem_falloc->waitq;
> - prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> - TASK_UNINTERRUPTIBLE);
> - spin_unlock(&inode->i_lock);
> - schedule();
> -
> - /*
> - * shmem_falloc_waitq points into the shmem_fallocate()
> - * stack of the hole-punching task: shmem_falloc_waitq
> - * is usually invalid by the time we reach here, but
> - * finish_wait() does not dereference it in that case;
> - * though i_lock needed lest racing with wake_up_all().
> - */
> - spin_lock(&inode->i_lock);
> - finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> - spin_unlock(&inode->i_lock);
> -
> - if (fpin)
> - fput(fpin);
> + ret = shmem_falloc_wait(vmf, inode);
> + if (ret)
> return ret;
> - }
> - spin_unlock(&inode->i_lock);
> }
>
> + WARN_ON_ONCE(vmf->page != NULL);
> err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
> gfp, vmf, &ret);
> if (err)
> return vmf_error(err);
> - if (folio)
> + if (folio) {
> vmf->page = folio_file_page(folio, vmf->pgoff);
> + ret |= VM_FAULT_LOCKED;
> + }
> return ret;
> }
>
> --
> 2.35.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-10-06 03:48:41

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault()

On Tue, 3 Oct 2023, Jan Kara wrote:
> On Fri 29-09-23 20:27:53, Hugh Dickins wrote:
> > That Trinity livelock shmem_falloc avoidance block is unlikely, and a
> > distraction from the proper business of shmem_fault(): separate it out.
> > (This used to help compilers save stack on the fault path too, but both
> > gcc and clang nowadays seem to make better choices anyway.)
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
>
> Looks good. Feel free to add:
>
> Reviewed-by: Jan Kara <[email protected]>

Thanks a lot for all these reviews, Jan. (And I particularly enjoyed
your "Autumn cleaning" remark: sweeping up the leaves, I've been glad
to have "Autumn Almanac" running through my head since reading that.)

>
> Looking at the code I'm just wondering whether the livelock with
> shmem_undo_range() couldn't be more easy to avoid by making
> shmem_undo_range() always advance the index by 1 after evicting a page and
> thus guaranteeing a forward progress... Because the forward progress within
> find_get_entries() is guaranteed these days, it should be enough.

I'm not sure that I understand your "advance the index by 1" comment.
Since the "/* If all gone or hole-punch or unfalloc, we're done */"
change went in, I believe shmem_undo_range() does make guaranteed
forward progress; but its forward progress is not enough.

I would love to delete all that shmem_falloc_wait() strangeness;
and your comment excited me to look, hey, can we just delete all that
stuff now, instead of shifting it aside? That would be much nicer.

And if I'd replied to you yesterday, I'd have been saying yes we can.
But that's because I hadn't got far enough through re-reading the
various July 2014 3.16-rc mail threads. I had been excited to find
myself posting a revert of the patch; before reaching Sasha's later
livelock which ended up with "belt and braces" retaining the
shmem_falloc wait while adding the "If all gone or hole-punch" mod.

https://marc.info/?l=linux-kernel&m=140487864819409&w=2
though that thread did not resolve, and morphed into lockdep moans.

So I've reverted to my usual position: that it's conceivable that
something has changed meanwhile, to make that Trinity livelock no
longer an issue (in particular, i_mmap_mutex changed to i_mmap_rwsem,
and much later unmap_mapping_range() changed to only take it for read:
but though that change gives hope, I suspect it would turn out to be
ineffectual against the livelock); but that would have to be proved.

If there's someone who can re-demonstrate Sasha's Trinity livelock
on 3.16-with-shmem-falloc-wait-disabled, or re-demonstrate it on any
later release-with-shmem-falloc-wait-disabled, but demonstrate that
the livelock does not occur on 6.6-rc-with-shmem-falloc-wait-disabled
(or that plus some simple adjustment of hacker's choosing): that would
be great news, and we could delete all this - though probably not
without bisecting to satisfy ourselves on what was the crucial change.

But I never got around to running up Trinity to work on it originally,
nor in the years since, nor do I expect to soon. (Vlastimil had a
good simple technique for demonstrating a part of the problem, but
fixing that part turned out not fix the whole issue with Trinity.)

Hugh

2023-10-06 11:02:04

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault()

On Thu 05-10-23 20:48:00, Hugh Dickins wrote:
> On Tue, 3 Oct 2023, Jan Kara wrote:
> > On Fri 29-09-23 20:27:53, Hugh Dickins wrote:
> > > That Trinity livelock shmem_falloc avoidance block is unlikely, and a
> > > distraction from the proper business of shmem_fault(): separate it out.
> > > (This used to help compilers save stack on the fault path too, but both
> > > gcc and clang nowadays seem to make better choices anyway.)
> > >
> > > Signed-off-by: Hugh Dickins <[email protected]>
> >
> > Looks good. Feel free to add:
> >
> > Reviewed-by: Jan Kara <[email protected]>
>
> Thanks a lot for all these reviews, Jan. (And I particularly enjoyed
> your "Autumn cleaning" remark: sweeping up the leaves, I've been glad
> to have "Autumn Almanac" running through my head since reading that.)

:-) You have widened my musical horizon.

> > Looking at the code I'm just wondering whether the livelock with
> > shmem_undo_range() couldn't be more easy to avoid by making
> > shmem_undo_range() always advance the index by 1 after evicting a page and
> > thus guaranteeing a forward progress... Because the forward progress within
> > find_get_entries() is guaranteed these days, it should be enough.
>
> I'm not sure that I understand your "advance the index by 1" comment.
> Since the "/* If all gone or hole-punch or unfalloc, we're done */"
> change went in, I believe shmem_undo_range() does make guaranteed
> forward progress; but its forward progress is not enough.

Right, I have missed that retry when glancing over the code. And the
"advance the index by 1" was also wrong because find_get_entries() already
does it these days.

> I would love to delete all that shmem_falloc_wait() strangeness;
> and your comment excited me to look, hey, can we just delete all that
> stuff now, instead of shifting it aside? That would be much nicer.

Well, even if you decided to keep the synchronization what you could do
these days is to use inode->i_mapping->invalidate_lock for synchronization
instead of your home-grown solution (like all other filesystems do for
these kind of races). If you don't want to pay the cost of rwsem
acquisition in the fault fast path, you could do it in the hole-punch
running case only like currently.

> And if I'd replied to you yesterday, I'd have been saying yes we can.
> But that's because I hadn't got far enough through re-reading the
> various July 2014 3.16-rc mail threads. I had been excited to find
> myself posting a revert of the patch; before reaching Sasha's later
> livelock which ended up with "belt and braces" retaining the
> shmem_falloc wait while adding the "If all gone or hole-punch" mod.
>
> https://marc.info/?l=linux-kernel&m=140487864819409&w=2
> though that thread did not resolve, and morphed into lockdep moans.
>
> So I've reverted to my usual position: that it's conceivable that
> something has changed meanwhile, to make that Trinity livelock no
> longer an issue (in particular, i_mmap_mutex changed to i_mmap_rwsem,
> and much later unmap_mapping_range() changed to only take it for read:
> but though that change gives hope, I suspect it would turn out to be
> ineffectual against the livelock); but that would have to be proved.
>
> If there's someone who can re-demonstrate Sasha's Trinity livelock
> on 3.16-with-shmem-falloc-wait-disabled, or re-demonstrate it on any
> later release-with-shmem-falloc-wait-disabled, but demonstrate that
> the livelock does not occur on 6.6-rc-with-shmem-falloc-wait-disabled
> (or that plus some simple adjustment of hacker's choosing): that would
> be great news, and we could delete all this - though probably not
> without bisecting to satisfy ourselves on what was the crucial change.
>
> But I never got around to running up Trinity to work on it originally,
> nor in the years since, nor do I expect to soon. (Vlastimil had a
> good simple technique for demonstrating a part of the problem, but
> fixing that part turned out not fix the whole issue with Trinity.)

Fair enough. I agree that we should do some testing before we actually
remove the serialization because the problem was not well understood even
back then and likely had something to do with unmap_mapping_folio()
inefficiency (e.g. unmapping one page at a time acquiring heavily contended
i_mmap_mutex for each page) rather than page cache eviction itself.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR