by Darrick J. Wong

[permalink] [raw]

Subject: Re: [PATCH 07/13] xfs: allow merging ioends over append boundaries

On Fri, Jun 28, 2019 at 07:51:43AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 27, 2019 at 11:23:09AM -0700, Darrick J. Wong wrote:
> > On Thu, Jun 27, 2019 at 12:48:30PM +0200, Christoph Hellwig wrote:
> > > There is no real problem merging ioends that go beyond i_size into an
> > > ioend that doesn't. We just need to move the append transaction to the
> > > base ioend. Also use the opportunity to use a real error code instead
> > > of the magic 1 to cancel the transactions, and write a comment
> > > explaining the scheme.
> > >
> > > Signed-off-by: Christoph Hellwig <[email protected]>
> >
> > Reading through this patch, I have a feeling it fixes the crash that
> > Zorro has been seeing occasionally with generic/475...
>
> So you think for some reason the disk i_size changes underneath and thus
> the xfs_ioend_is_append misfired vs the actual transaction allocations?
> I didn't even think of that, but using the different checks sure sounds
> dangerous. So yes, we'd either need to backport my patch, or at least
> replace the checks in xfs_ioend_can_merge with direct checks of
> io_append_trans.

That's my working theory, yes.

1. Dirty pages 0 and 2 of an empty file.

2. Writeback gets scheduled for pages 0 and 2, creating ioends A and C.
Both ioends describe writes past the on-disk isize so we allocate
transactions.

3. ioend C completes immediately, sets the ondisk isize to (3 * PAGESIZE).

4. Dirty page 1 of the file and immediately schedule writeback for it,
creating ioend B. ioend B describes a write within the on-disk isize so
we do not allocate setfilesize transaction.

5. ioend A and B complete and are sorted into the per-inode ioend
completion list. xfs_ioend_try_merge looks at ioend A, sees that ioend
can be merged with ioend B (same type, same cow status, same current
setfilesize status (which does not reflect the setfilesize status when A
was created)) and therefore decides to merge them.

6. A has a setfilesize transaction so _try_merge calls
xfs_setfilesize_ioend(ioend B, -1) to cancel ioend B's transaction, but
as we saw in (4), ioend B has no transaction and crashes.

--D

2019-06-28 17:42:43

by Darrick J. Wong

[permalink] [raw]

Subject: Re: [PATCH 06/13] xfs: remove XFS_TRANS_NOFS

On Fri, Jun 28, 2019 at 07:37:17AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 27, 2019 at 03:30:30PM -0700, Darrick J. Wong wrote:
> > I think the wording of this is too indirect. The reason we need to set
> > NOFS is because we could be doing writeback as part of reclaiming
> > memory, which means that we cannot recurse back into filesystems to
> > satisfy the memory allocation needed to create a transaction. The NOFS
> > part applies to any memory allocation, of course.
> >
> > If you're fine with the wording below I'll just edit that into the
> > patch:
> >
> > /*
> > * We can allocate memory here while doing writeback on behalf of
> > * memory reclaim. To avoid memory allocation deadlocks set the
> > * task-wide nofs context for the following operations.
> > */
> > nofs_flag = memalloc_nofs_save();
>
> Fine with me.
>
> > > trace_xfs_end_io_direct_write(ip, offset, size);
> > > @@ -395,10 +396,11 @@ xfs_dio_write_end_io(
> > > */
> > > XFS_STATS_ADD(ip->i_mount, xs_write_bytes, size);
> > >
> > > + nofs_flag = memalloc_nofs_save();
> >
> > Hmm, do we need this here? I can't remember if there was a need for
> > setting NOFS under xfs_reflink_end_cow from a dio completion or if that
> > was purely the buffered writeback case...
>
> We certainly had to add it for the unwritten extent conversion, maybe
> the corner case just didn't manage to show up for COW yet:

AFAICT the referenced patch solved the problem of writeback ioend
completion deadlocking with memory reclaim by changing the transaction
allocation call in the xfs_iomap_write_unwritten function, which is
called by writeback ioend completion.

However, the directio endio code also calls xfs_iomap_write_unwritten.
I can't tell if NOFS is actually needed in that context, or if we've
just been operating like that for a decade because that's the method
that was chosen to solve the deadlock.

I think this boils down to -- does writeback induced by memory reclaim
ever block on directio?

I don't think it does, but as a counterpoint xfs has been like this for
10 years and there don't seem to be many complaints about directio endio
pushing memory reclaim too hard...

--D

>
> commit 80641dc66a2d6dfb22af4413227a92b8ab84c7bb
> Author: Christoph Hellwig <[email protected]>
> Date: Mon Oct 19 04:00:03 2009 +0000
>
> xfs: I/O completion handlers must use NOFS allocations
>
> When completing I/O requests we must not allow the memory allocator to
> recurse into the filesystem, as we might deadlock on waiting for the
> I/O completion otherwise. The only thing currently allocating normal
> GFP_KERNEL memory is the allocation of the transaction structure for
> the unwritten extent conversion. Add a memflags argument to
> _xfs_trans_alloc to allow controlling the allocator behaviour.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Reported-by: Thomas Neumann <[email protected]>
> Tested-by: Thomas Neumann <[email protected]>
> Reviewed-by: Alex Elder <[email protected]>
> Signed-off-by: Alex Elder <[email protected]>
>
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 2d0b3e1da9e6..6f83f58c099f 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -611,7 +611,7 @@ xfs_fs_log_dummy(
> xfs_inode_t *ip;
> int error;
>
> - tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1);
> + tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1, KM_SLEEP);
> error = xfs_trans_reserve(tp, 0, XFS_ICHANGE_LOG_RES(mp), 0, 0, 0);
> if (error) {
> xfs_trans_cancel(tp, 0);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 67ae5555a30a..7294abce6ef2 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -860,8 +860,15 @@ xfs_iomap_write_unwritten(
> * set up a transaction to convert the range of extents
> * from unwritten to real. Do allocations in a loop until
> * we have covered the range passed in.
> + *
> + * Note that we open code the transaction allocation here
> + * to pass KM_NOFS--we can't risk to recursing back into
> + * the filesystem here as we might be asked to write out
> + * the same inode that we complete here and might deadlock
> + * on the iolock.
> */
> - tp = xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE);
> + xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> + tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
> tp->t_flags |= XFS_TRANS_RESERVE;
> error = xfs_trans_reserve(tp, resblks,
> XFS_WRITE_LOG_RES(mp), 0,
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 8b6c9e807efb..4d509f742bd2 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1471,7 +1471,7 @@ xfs_log_sbcount(
> if (!xfs_sb_version_haslazysbcount(&mp->m_sb))
> return 0;
>
> - tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT);
> + tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT, KM_SLEEP);
> error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
> XFS_DEFAULT_LOG_COUNT);
> if (error) {
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 66b849358e62..237badcbac3b 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -236,19 +236,20 @@ xfs_trans_alloc(
> uint type)
> {
> xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> - return _xfs_trans_alloc(mp, type);
> + return _xfs_trans_alloc(mp, type, KM_SLEEP);
> }
>
> xfs_trans_t *
> _xfs_trans_alloc(
> xfs_mount_t *mp,
> - uint type)
> + uint type,
> + uint memflags)
> {
> xfs_trans_t *tp;
>
> atomic_inc(&mp->m_active_trans);
>
> - tp = kmem_zone_zalloc(xfs_trans_zone, KM_SLEEP);
> + tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
> tp->t_magic = XFS_TRANS_MAGIC;
> tp->t_type = type;
> tp->t_mountp = mp;
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index ed47fc77759c..a0574f593f52 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -924,7 +924,7 @@ typedef struct xfs_trans {
> * XFS transaction mechanism exported interfaces.
> */
> xfs_trans_t *xfs_trans_alloc(struct xfs_mount *, uint);
> -xfs_trans_t *_xfs_trans_alloc(struct xfs_mount *, uint);
> +xfs_trans_t *_xfs_trans_alloc(struct xfs_mount *, uint, uint);
> xfs_trans_t *xfs_trans_dup(xfs_trans_t *);
> int xfs_trans_reserve(xfs_trans_t *, uint, uint, uint,
> uint, uint);