2023-09-14 14:03:23

by Johannes Thumshirn

[permalink] [raw]
Subject: Re: [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents

On 14.09.23 11:25, Qu Wenruo wrote:
>> +static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
>> + int num_stripes,
>> + struct btrfs_io_context *bioc)
>> +{
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> + struct btrfs_key stripe_key;
>> + struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
>> + u8 encoding = btrfs_bg_type_to_raid_encoding(bioc->map_type);
>> + struct btrfs_stripe_extent *stripe_extent;
>> + size_t item_size;
>> + int ret;
>> +
>> + item_size = struct_size(stripe_extent, strides, num_stripes);
>
> I guess David has already pointed out this can be done at initialization
> and make it const.

Will do

>
>> +
>> + stripe_extent = kzalloc(item_size, GFP_NOFS);
>> + if (!stripe_extent) {
>> + btrfs_abort_transaction(trans, -ENOMEM);
>> + btrfs_end_transaction(trans);
>> + return -ENOMEM;
>> + }
>> +
>> + btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
>> + for (int i = 0; i < num_stripes; i++) {
>> + u64 devid = bioc->stripes[i].dev->devid;
>> + u64 physical = bioc->stripes[i].physical;
>> + u64 length = bioc->stripes[i].length;
>> + struct btrfs_raid_stride *raid_stride =
>> + &stripe_extent->strides[i];
>> +
>> + if (length == 0)
>> + length = bioc->size;
>> +
>> + btrfs_set_stack_raid_stride_devid(raid_stride, devid);
>> + btrfs_set_stack_raid_stride_physical(raid_stride, physical);
>> + btrfs_set_stack_raid_stride_length(raid_stride, length);
>> + }
>> +
>> + stripe_key.objectid = bioc->logical;
>> + stripe_key.type = BTRFS_RAID_STRIPE_KEY;
>> + stripe_key.offset = bioc->size;
>> +
>> + ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
>> + item_size);
>
> Have you tested in near-real-world on how continous the RST items could
> be for RAID0/RAID10?
>
> My concern here is, we may want to try our best to reduce the size of
> RST, due to the 64K BTRFS_STRIPE_LEN.
>

There are two things I can do for it. First is trying to merge contiguus
RST items and second make BTRFS_STRIPE_LEN a mkfs time constant instead
of a compile time constant.

But both can be done in a second step.

>> + switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
>> + case BTRFS_BLOCK_GROUP_DUP:
>> + case BTRFS_BLOCK_GROUP_RAID1:
>> + case BTRFS_BLOCK_GROUP_RAID1C3:
>> + case BTRFS_BLOCK_GROUP_RAID1C4:
>> + ret = btrfs_insert_mirrored_raid_extents(trans, ordered_extent,
>> + map_type);
>> + break;
>> + case BTRFS_BLOCK_GROUP_RAID0:
>> + ret = btrfs_insert_striped_raid_extents(trans, ordered_extent,
>> + map_type);
>> + break;
>> + case BTRFS_BLOCK_GROUP_RAID10:
>> + ret = btrfs_insert_striped_mirrored_raid_extents(trans, ordered_extent, map_type);
>> + break;
>> + default:
>> + ret = -EINVAL;
>
> Maybe we want to be a little more noisy?

OK.


2023-09-14 15:35:54

by Qu Wenruo

[permalink] [raw]
Subject: Re: [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents



On 2023/9/14 19:21, Johannes Thumshirn wrote:
> On 14.09.23 11:25, Qu Wenruo wrote:
>>> +static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
>>> + int num_stripes,
>>> + struct btrfs_io_context *bioc)
>>> +{
>>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>>> + struct btrfs_key stripe_key;
>>> + struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
>>> + u8 encoding = btrfs_bg_type_to_raid_encoding(bioc->map_type);
>>> + struct btrfs_stripe_extent *stripe_extent;
>>> + size_t item_size;
>>> + int ret;
>>> +
>>> + item_size = struct_size(stripe_extent, strides, num_stripes);
>>
>> I guess David has already pointed out this can be done at initialization
>> and make it const.
>
> Will do
>
>>
>>> +
>>> + stripe_extent = kzalloc(item_size, GFP_NOFS);
>>> + if (!stripe_extent) {
>>> + btrfs_abort_transaction(trans, -ENOMEM);
>>> + btrfs_end_transaction(trans);
>>> + return -ENOMEM;
>>> + }
>>> +
>>> + btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
>>> + for (int i = 0; i < num_stripes; i++) {
>>> + u64 devid = bioc->stripes[i].dev->devid;
>>> + u64 physical = bioc->stripes[i].physical;
>>> + u64 length = bioc->stripes[i].length;
>>> + struct btrfs_raid_stride *raid_stride =
>>> + &stripe_extent->strides[i];
>>> +
>>> + if (length == 0)
>>> + length = bioc->size;
>>> +
>>> + btrfs_set_stack_raid_stride_devid(raid_stride, devid);
>>> + btrfs_set_stack_raid_stride_physical(raid_stride, physical);
>>> + btrfs_set_stack_raid_stride_length(raid_stride, length);
>>> + }
>>> +
>>> + stripe_key.objectid = bioc->logical;
>>> + stripe_key.type = BTRFS_RAID_STRIPE_KEY;
>>> + stripe_key.offset = bioc->size;
>>> +
>>> + ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
>>> + item_size);
>>
>> Have you tested in near-real-world on how continous the RST items could
>> be for RAID0/RAID10?
>>
>> My concern here is, we may want to try our best to reduce the size of
>> RST, due to the 64K BTRFS_STRIPE_LEN.
>>
>
> There are two things I can do for it. First is trying to merge contiguus
> RST items

This is much easier, as the RST lookup code is already taking the length
into consideration, thus only the add path need some work.

Although I'm not sure how effective it would be in real world.
As if the merge rate is only 5%, then it barely makes a difference.

Maybe you don't need to implement a full merge in this version, but just
do some trace events to see the merge rate?

> and second make BTRFS_STRIPE_LEN a mkfs time constant instead
> of a compile time constant.

Please be very careful about this, we have quite some bitmap relying on
this. (IIRC RAID56 and scrub)

Currently unsigned long can only support up to 64 bits, thus the maximum
stripe length would be 256K, but I'm pretty sure there would be other
hidden traps somewhere else.

Otherwise the main workflow of RST looks good to me.

Thanks,
Qu
>
> But both can be done in a second step.
>
>>> + switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
>>> + case BTRFS_BLOCK_GROUP_DUP:
>>> + case BTRFS_BLOCK_GROUP_RAID1:
>>> + case BTRFS_BLOCK_GROUP_RAID1C3:
>>> + case BTRFS_BLOCK_GROUP_RAID1C4:
>>> + ret = btrfs_insert_mirrored_raid_extents(trans, ordered_extent,
>>> + map_type);
>>> + break;
>>> + case BTRFS_BLOCK_GROUP_RAID0:
>>> + ret = btrfs_insert_striped_raid_extents(trans, ordered_extent,
>>> + map_type);
>>> + break;
>>> + case BTRFS_BLOCK_GROUP_RAID10:
>>> + ret = btrfs_insert_striped_mirrored_raid_extents(trans, ordered_extent, map_type);
>>> + break;
>>> + default:
>>> + ret = -EINVAL;
>>
>> Maybe we want to be a little more noisy?
>
> OK.
>

2023-09-14 18:26:10

by Johannes Thumshirn

[permalink] [raw]
Subject: Re: [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents

On 14.09.23 12:07, Qu Wenruo wrote:
>>>> + ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
>>>> + item_size);
>>>
>>> Have you tested in near-real-world on how continous the RST items could
>>> be for RAID0/RAID10?
>>>
>>> My concern here is, we may want to try our best to reduce the size of
>>> RST, due to the 64K BTRFS_STRIPE_LEN.
>>>
>>
>> There are two things I can do for it. First is trying to merge contiguus
>> RST items
>
> This is much easier, as the RST lookup code is already taking the length
> into consideration, thus only the add path need some work.
>
> Although I'm not sure how effective it would be in real world.
> As if the merge rate is only 5%, then it barely makes a difference.
>

I think this will be very much workload dependent. Also just having
logically contiguous entries doesn't help much. All the physical strides
have to be contiguous as well.

> Maybe you don't need to implement a full merge in this version, but just
> do some trace events to see the merge rate?
>
>> and second make BTRFS_STRIPE_LEN a mkfs time constant instead
>> of a compile time constant.
>
> Please be very careful about this, we have quite some bitmap relying on
> this. (IIRC RAID56 and scrub)
>
> Currently unsigned long can only support up to 64 bits, thus the maximum
> stripe length would be 256K, but I'm pretty sure there would be other
> hidden traps somewhere else.
>

Yeah, this will be (both) a longer term research project as well. RAID56
has priority. Then Erasure Coding.