by Anand Jain

Subject: Re: [PATCH 4/7] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT

On 13/08/2021 18:39, Qu Wenruo wrote:
>
>
> On 2021/8/13 6:30 PM, Anand Jain wrote:
>>
>>
>> On 13/08/2021 18:26, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/8/13 5:55 PM, Anand Jain wrote:
>>>> From: Qu Wenruo <[email protected]>
>>>>
>>>> commit c53e9653605dbf708f5be02902de51831be4b009 upstream
>>>
>>> This lacks certain upstream fixes for it:
>>>
>>> f9baa501b4fd6962257853d46ddffbc21f27e344 btrfs: fix deadlock when
>>> cloning inline extents and using qgroups
>>>
>>> 4d14c5cde5c268a2bc26addecf09489cb953ef64 btrfs: don't flush from
>>> btrfs_delayed_inode_reserve_metadata
>>>
>>> 6f23277a49e68f8a9355385c846939ad0b1261e7 btrfs: qgroup: don't commit
>>> transaction when we already hold the handle
>>>
>>> All these fixes are to ensure we don't try to flush in contexts where we
>>> shouldn't.
>>>
>>> Without them, it can hit various deadlocks.
>>>
>>
>> Qu,
>>
>>     Thanks for taking a look. I will send it in v2.
>
> I guess you only need to add the missing fixes?

Yeah, maybe it's better to send it as a new set.

Thx.
Anand

> Thanks,
> Qu
>>
>> -Anand
>>
>>
>>> Thanks,
>>> Qu
>>>>
>>>> [PROBLEM]
>>>> There are known problems related to how btrfs handles qgroup reserved
>>>> space. One of the most obvious cases is the test case btrfs/153,
>>>> which does fallocate, then writes into the preallocated range.
>>>>
>>>>    btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
>>>>        --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
>>>>        +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
>>>>        @@ -1,2 +1,5 @@
>>>>         QA output created by 153
>>>>        +pwrite: Disk quota exceeded
>>>>        +/mnt/scratch/testfile2: Disk quota exceeded
>>>>        +/mnt/scratch/testfile2: Disk quota exceeded
>>>>         Silence is golden
>>>>        ...
>>>>        (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
>>>>
>>>> [CAUSE]
>>>> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
>>>> we always reserve space, regardless of whether the write is COW or not.
>>>>
>>>> Such behavior change is mostly for performance, and reverting it is not
>>>> a good idea anyway.
>>>>
>>>> For a preallocated extent, we have already reserved qgroup data space,
>>>> and since we also reserve qgroup data space at buffered write time,
>>>> writing into preallocated space needs twice the space.
>>>>
>>>> This leads to -EDQUOT in the buffered write routine.
>>>>
>>>> And we can't follow the same solution: unlike the data/meta space check,
>>>> qgroup reserved space is shared between data and metadata.
>>>> The -EDQUOT can happen at the metadata reservation, so doing a NODATACOW
>>>> check after a qgroup reservation failure is not a solution.
>>>>
>>>> [FIX]
>>>> To solve the problem, we don't return -EDQUOT directly, but every time
>>>> we get an -EDQUOT, we try to flush qgroup space:
>>>>
>>>> - Flush all inodes of the root
>>>>    NODATACOW writes will free the qgroup reserved space at
>>>>    run_delalloc_range().
>>>>    However, we don't have the infrastructure to flush only NODATACOW
>>>>    inodes, so here we flush all inodes anyway.
>>>>
>>>> - Wait for ordered extents
>>>>    This would convert the preallocated metadata space into per-trans
>>>>    metadata, which can be freed in a later transaction commit.
>>>>
>>>> - Commit transaction
>>>>    This will free all per-trans metadata space.
>>>>
>>>> Also, we don't want to trigger the flush multiple times, so here we
>>>> introduce a per-root wait list and a new root status, to ensure only
>>>> one thread starts the flushing.
>>>>
>>>> Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
>>>> Reviewed-by: Josef Bacik <[email protected]>
>>>> Signed-off-by: Qu Wenruo <[email protected]>
>>>> Reviewed-by: David Sterba <[email protected]>
>>>> Signed-off-by: David Sterba <[email protected]>
>>>> Signed-off-by: Anand Jain <[email protected]>
>>>> ---
>>>> fs/btrfs/ctree.h   |   3 ++
>>>> fs/btrfs/disk-io.c |   1 +
>>>> fs/btrfs/qgroup.c | 100 +++++++++++++++++++++++++++++++++++++++++----
>>>> 3 files changed, 96 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>> index 7960359dbc70..5448dc62e915 100644
>>>> --- a/fs/btrfs/ctree.h
>>>> +++ b/fs/btrfs/ctree.h
>>>> @@ -945,6 +945,8 @@ enum {
>>>>       BTRFS_ROOT_DEAD_TREE,
>>>>       /* The root has a log tree. Used only for subvolume roots. */
>>>>       BTRFS_ROOT_HAS_LOG_TREE,
>>>> +    /* Qgroup flushing is in progress */
>>>> +    BTRFS_ROOT_QGROUP_FLUSHING,
>>>> };
>>>>
>>>> /*
>>>> @@ -1097,6 +1099,7 @@ struct btrfs_root {
>>>>       spinlock_t qgroup_meta_rsv_lock;
>>>>       u64 qgroup_meta_rsv_pertrans;
>>>>       u64 qgroup_meta_rsv_prealloc;
>>>> +    wait_queue_head_t qgroup_flush_wait;
>>>>
>>>>       /* Number of active swapfiles */
>>>>       atomic_t nr_swapfiles;
>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>> index e6aa94a583e9..e3bcab38a166 100644
>>>> --- a/fs/btrfs/disk-io.c
>>>> +++ b/fs/btrfs/disk-io.c
>>>> @@ -1154,6 +1154,7 @@ static void __setup_root(struct btrfs_root
>>>> *root, struct btrfs_fs_info *fs_info,
>>>>       mutex_init(&root->log_mutex);
>>>>       mutex_init(&root->ordered_extent_mutex);
>>>>       mutex_init(&root->delalloc_mutex);
>>>> +    init_waitqueue_head(&root->qgroup_flush_wait);
>>>>       init_waitqueue_head(&root->log_writer_wait);
>>>>       init_waitqueue_head(&root->log_commit_wait[0]);
>>>>       init_waitqueue_head(&root->log_commit_wait[1]);
>>>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>>>> index 50c45b4fcfd4..b312ac645e08 100644
>>>> --- a/fs/btrfs/qgroup.c
>>>> +++ b/fs/btrfs/qgroup.c
>>>> @@ -3479,17 +3479,58 @@ static int qgroup_unreserve_range(struct
>>>> btrfs_inode *inode,
>>>> }
>>>>
>>>> /*
>>>> - * Reserve qgroup space for range [start, start + len).
>>>> + * Try to free some space for qgroup.
>>>>    *
>>>> - * This function will either reserve space from related qgroups or
>>>> doing
>>>> - * nothing if the range is already reserved.
>>>> + * For qgroup, there are only 3 ways to free qgroup space:
>>>> + * - Flush nodatacow write
>>>> + *   Any nodatacow write will free its reserved data space at
>>>> run_delalloc_range().
>>>> + *   In theory, we should only flush nodatacow inodes, but it's not
>>>> yet
>>>> + *   possible, so we need to flush the whole root.
>>>>    *
>>>> - * Return 0 for successful reserve
>>>> - * Return <0 for error (including -EQUOT)
>>>> + * - Wait for ordered extents
>>>> + *   When ordered extents are finished, their reserved metadata is
>>>> finally
>>>> + *   converted to per_trans status, which can be freed by later commit
>>>> + *   transaction.
>>>>    *
>>>> - * NOTE: this function may sleep for memory allocation.
>>>> + * - Commit transaction
>>>> + *   This would free the meta_per_trans space.
>>>> + *   In theory this shouldn't provide much space, but any more
>>>> qgroup space
>>>> + *   is needed.
>>>>    */
>>>> -int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
>>>> +static int try_flush_qgroup(struct btrfs_root *root)
>>>> +{
>>>> +    struct btrfs_trans_handle *trans;
>>>> +    int ret;
>>>> +
>>>> +    /*
>>>> +     * We don't want to run flush again and again, so if there is a
>>>> running
>>>> +     * one, we won't try to start a new flush, but exit directly.
>>>> +     */
>>>> +    if (test_and_set_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state)) {
>>>> +        wait_event(root->qgroup_flush_wait,
>>>> +            !test_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state));
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    ret = btrfs_start_delalloc_snapshot(root);
>>>> +    if (ret < 0)
>>>> +        goto out;
>>>> +    btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
>>>> +
>>>> +    trans = btrfs_join_transaction(root);
>>>> +    if (IS_ERR(trans)) {
>>>> +        ret = PTR_ERR(trans);
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    ret = btrfs_commit_transaction(trans);
>>>> +out:
>>>> +    clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state);
>>>> +    wake_up(&root->qgroup_flush_wait);
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int qgroup_reserve_data(struct btrfs_inode *inode,
>>>>               struct extent_changeset **reserved_ret, u64 start,
>>>>               u64 len)
>>>> {
>>>> @@ -3542,6 +3583,34 @@ int btrfs_qgroup_reserve_data(struct
>>>> btrfs_inode *inode,
>>>>       return ret;
>>>> }
>>>>
>>>> +/*
>>>> + * Reserve qgroup space for range [start, start + len).
>>>> + *
>>>> + * This function will either reserve space from related qgroups or
>>>> do nothing
>>>> + * if the range is already reserved.
>>>> + *
>>>> + * Return 0 for successful reservation
>>>> + * Return <0 for error (including -EQUOT)
>>>> + *
>>>> + * NOTE: This function may sleep for memory allocation, dirty page
>>>> flushing and
>>>> + *     commit transaction. So caller should not hold any dirty page
>>>> locked.
>>>> + */
>>>> +int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
>>>> +            struct extent_changeset **reserved_ret, u64 start,
>>>> +            u64 len)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    ret = qgroup_reserve_data(inode, reserved_ret, start, len);
>>>> +    if (ret <= 0 && ret != -EDQUOT)
>>>> +        return ret;
>>>> +
>>>> +    ret = try_flush_qgroup(inode->root);
>>>> +    if (ret < 0)
>>>> +        return ret;
>>>> +    return qgroup_reserve_data(inode, reserved_ret, start, len);
>>>> +}
>>>> +
>>>> /* Free ranges specified by @reserved, normally in error path */
>>>> static int qgroup_free_reserved_data(struct btrfs_inode *inode,
>>>>               struct extent_changeset *reserved, u64 start, u64 len)
>>>> @@ -3712,7 +3781,7 @@ static int sub_root_meta_rsv(struct btrfs_root
>>>> *root, int num_bytes,
>>>>       return num_bytes;
>>>> }
>>>>
>>>> -int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int
>>>> num_bytes,
>>>> +static int qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
>>>>                   enum btrfs_qgroup_rsv_type type, bool enforce)
>>>> {
>>>>       struct btrfs_fs_info *fs_info = root->fs_info;
>>>> @@ -3739,6 +3808,21 @@ int __btrfs_qgroup_reserve_meta(struct
>>>> btrfs_root *root, int num_bytes,
>>>>       return ret;
>>>> }
>>>>
>>>> +int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int
>>>> num_bytes,
>>>> +                enum btrfs_qgroup_rsv_type type, bool enforce)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    ret = qgroup_reserve_meta(root, num_bytes, type, enforce);
>>>> +    if (ret <= 0 && ret != -EDQUOT)
>>>> +        return ret;
>>>> +
>>>> +    ret = try_flush_qgroup(root);
>>>> +    if (ret < 0)
>>>> +        return ret;
>>>> +    return qgroup_reserve_meta(root, num_bytes, type, enforce);
>>>> +}
>>>> +
>>>> void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root)
>>>> {
>>>>       struct btrfs_fs_info *fs_info = root->fs_info;
>>>>
>>
>
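
For readers following the thread, the retry-on-EDQUOT pattern that the quoted patch adds around the reservation helpers can be sketched in user space roughly as below. This is only an illustrative sketch under stated assumptions, not the kernel code: `struct toy_root`, `toy_reserve()`, `toy_try_flush()` and `toy_reserve_with_flush()` are hypothetical names, the "flush" simply releases a counter instead of flushing delalloc, waiting for ordered extents and committing a transaction, and a C11 `atomic_flag` stands in for the `BTRFS_ROOT_QGROUP_FLUSHING` bit. A caller that finds a flush already running returns immediately here, whereas the patch makes it sleep on `qgroup_flush_wait`.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for a root with a qgroup limit. */
struct toy_root {
	uint64_t limit;        /* qgroup limit in bytes */
	uint64_t reserved;     /* bytes currently reserved */
	uint64_t reclaimable;  /* part of 'reserved' that a flush can release */
	atomic_flag flushing;  /* stands in for BTRFS_ROOT_QGROUP_FLUSHING */
	unsigned int flush_count;
};

static void toy_root_init(struct toy_root *root, uint64_t limit)
{
	root->limit = limit;
	root->reserved = 0;
	root->reclaimable = 0;
	root->flush_count = 0;
	atomic_flag_clear(&root->flushing);
}

/* Plain reservation: fails with -EDQUOT when the limit would be exceeded. */
static int toy_reserve(struct toy_root *root, uint64_t bytes)
{
	assert(root->reserved <= root->limit);
	if (bytes > root->limit - root->reserved)
		return -EDQUOT;
	root->reserved += bytes;
	return 0;
}

/*
 * Release whatever can be released. Only one caller flushes at a time;
 * a caller that finds the flag already set just returns 0 in this sketch
 * (the kernel patch waits for the running flush to finish instead).
 */
static int toy_try_flush(struct toy_root *root)
{
	if (atomic_flag_test_and_set(&root->flushing))
		return 0;

	assert(root->reclaimable <= root->reserved);
	root->reserved -= root->reclaimable;
	root->reclaimable = 0;
	root->flush_count++;

	atomic_flag_clear(&root->flushing);
	return 0;
}

/* Wrapper: on -EDQUOT, flush once and retry the reservation once. */
static int toy_reserve_with_flush(struct toy_root *root, uint64_t bytes)
{
	int ret = toy_reserve(root, bytes);

	if (ret != -EDQUOT)
		return ret;

	ret = toy_try_flush(root);
	if (ret < 0)
		return ret;
	return toy_reserve(root, bytes);
}
```

The point the sketch tries to show is that the wrapper retries exactly once: if the flush releases nothing, the second attempt still returns -EDQUOT to the caller. It does not model the deadlock cases raised above, which come from calling the flush in contexts where flushing or committing a transaction is not allowed.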

2021-08-30 22:32:07

by Anand Jain

Subject: Re: [PATCH 4/7] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT

On 13/08/2021 19:06, Anand Jain wrote:
>
>
> On 13/08/2021 18:56, Greg KH wrote:
>> On Fri, Aug 13, 2021 at 06:41:53PM +0800, Anand Jain wrote:
>>>
>>>
>>> On 13/08/2021 18:39, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/8/13 6:30 PM, Anand Jain wrote:
>>>>>
>>>>>
>>>>> On 13/08/2021 18:26, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2021/8/13 5:55 PM, Anand Jain wrote:
>>>>>>> From: Qu Wenruo <[email protected]>
>>>>>>>
>>>>>>> commit c53e9653605dbf708f5be02902de51831be4b009 upstream
>>>>>>
>>>>>> This lacks certain upstream fixes for it:
>>>>>>
>>>>>> f9baa501b4fd6962257853d46ddffbc21f27e344 btrfs: fix deadlock when
>>>>>> cloning inline extents and using qgroups
>>>>>>
>>>>>> 4d14c5cde5c268a2bc26addecf09489cb953ef64 btrfs: don't flush from
>>>>>> btrfs_delayed_inode_reserve_metadata
>>>>>>
>>>>>> 6f23277a49e68f8a9355385c846939ad0b1261e7 btrfs: qgroup: don't commit
>>>>>> transaction when we already hold the handle
>>>>>>
>>>>>> All these fixes are to ensure we don't try to flush in contexts
>>>>>> where we shouldn't.
>>>>>>
>>>>>> Without them, it can hit various deadlocks.
>>>>>>
>>>>>
>>>>> Qu,
>>>>>
>>>>> Thanks for taking a look. I will send it in v2.
>>>>
>>>> I guess you only need to add the missing fixes?
>>>
>>> Yeah, maybe it's better to send it as a new set.
>>
>> So should I drop the existing patches and wait for a whole new series,
>> or will you send these as an additional set?
>
> Greg, I am sending it as an additional set.
>

>> And at least one of the above commits needs to go to the 5.10.y tree, I
>> did not check them all...
>
> I need to look into it.

We don't need 1/7 in 5.10.y; it was a preparatory patch in 5.4.y:
[PATCH 1/7] btrfs: make qgroup_free_reserved_data take btrfs_inode

The rest of the patches (in patchsets 1 and 2) are already in
stable-5.10.y.

Thx, Anand

>
> Thanks, Anand
>
>> thanks,
>>
>> greg k-h
>>