We had a report from the field where filling a zoned drive with one file
to 60% of the drive's capacity and then overwriting this file results in
ENOSPC.
If said drive is fast and small enough, the problem can be easily
triggered, as both reclaim of dirty block-groups and deletion of unused
block-groups only happen at transaction commit time. But if the whole test
runs faster than we can commit transactions, we unnecessarily run out of
usable space on a zoned drive.
This can easily be reproduced by the following fio snippet:
fio --name=foo --filename=$TEST/foo --size=$60_PERCENT_OF_DRIVE --rw=write\
--loops=2
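The $60_PERCENT_OF_DRIVE placeholder is left to the reader; a minimal sketch of deriving it (our own helper, not part of the report; on a real device the capacity would come from blockdev --getsize64):

```shell
# Sketch: derive the fio --size value as 60% of the drive's capacity.
# On a real device: CAP=$(blockdev --getsize64 "$DEV")
CAP=${CAP:-100000000000}     # example capacity in bytes
SIZE=$((CAP * 60 / 100))     # 60% of capacity, integer arithmetic
echo "$SIZE"
```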
An fstests testcase for this issue will be sent as well.
---
Johannes Thumshirn (2):
btrfs: zoned: use rcu list for iterating devices to collect stats
btrfs: zoned: wake up cleaner sooner if needed
fs/btrfs/free-space-cache.c | 6 ++++++
fs/btrfs/zoned.c | 6 +++---
2 files changed, 9 insertions(+), 3 deletions(-)
---
base-commit: d9796b728dcbf25e0190e542be33902222098fac
change-id: 20240122-reclaim-fix-1fcae9c27fc8
Best regards,
--
Johannes Thumshirn <[email protected]>
On very fast but small devices, waiting for a transaction commit can be
too long a wait before waking up the cleaner kthread to remove unused
and reclaimable block-groups.
Check every time we add free space back to a block group whether we
need to wake up the cleaner kthread.
Signed-off-by: Johannes Thumshirn <[email protected]>
---
fs/btrfs/free-space-cache.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index d372c7ce0e6b..2d98b9ca0e83 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -30,6 +30,7 @@
#include "file-item.h"
#include "file.h"
#include "super.h"
+#include "zoned.h"
#define BITS_PER_BITMAP (PAGE_SIZE * 8UL)
#define MAX_CACHE_BYTES_PER_GIG SZ_64K
@@ -2694,6 +2695,7 @@ int __btrfs_add_free_space(struct btrfs_block_group *block_group,
static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
u64 bytenr, u64 size, bool used)
{
+ struct btrfs_fs_info *fs_info = block_group->fs_info;
struct btrfs_space_info *sinfo = block_group->space_info;
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
u64 offset = bytenr - block_group->start;
@@ -2745,6 +2747,10 @@ static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
btrfs_mark_bg_to_reclaim(block_group);
}
+ if (btrfs_zoned_should_reclaim(fs_info) &&
+ !test_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags))
+ wake_up_process(fs_info->cleaner_kthread);
+
return 0;
}
--
2.43.0
As btrfs_zoned_should_reclaim() only iterates the device list to
collect stats on the devices' total and used bytes, we don't need to
take the full-blown mutex, but can iterate the device list in an RCU
read-side context.
Signed-off-by: Johannes Thumshirn <[email protected]>
---
fs/btrfs/zoned.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 168af9d000d1..b7e7b5a5a6fa 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2423,15 +2423,15 @@ bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
if (fs_info->bg_reclaim_threshold == 0)
return false;
- mutex_lock(&fs_devices->device_list_mutex);
- list_for_each_entry(device, &fs_devices->devices, dev_list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
if (!device->bdev)
continue;
total += device->disk_total_bytes;
used += device->bytes_used;
}
- mutex_unlock(&fs_devices->device_list_mutex);
+ rcu_read_unlock();
factor = div64_u64(used * 100, total);
return factor >= fs_info->bg_reclaim_threshold;
--
2.43.0
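The decision at the end of btrfs_zoned_should_reclaim() boils down to a percentage check. A standalone userspace sketch of that logic (the helper name is ours, not kernel API, and the kernel's div64_u64() is replaced by ordinary 64-bit division):

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the reclaim decision: trigger reclaim once the used
 * fraction of the filesystem's devices reaches the threshold (in
 * percent). A threshold of 0 disables reclaim entirely. */
static bool should_reclaim(uint64_t used, uint64_t total,
			   uint64_t threshold_percent)
{
	if (threshold_percent == 0)
		return false;
	/* div64_u64(used * 100, total) in the kernel code. */
	return used * 100 / total >= threshold_percent;
}
```

For example, with a threshold of 75, a drive that is 60% full does not trigger reclaim while one that is 80% full does.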
On Mon, Jan 22, 2024 at 02:51:03AM -0800, Johannes Thumshirn wrote:
> As btrfs_zoned_should_reclaim() only iterates the device list to
> collect stats on the devices' total and used bytes, we don't need to
> take the full-blown mutex, but can iterate the device list in an RCU
> read-side context.
>
> Signed-off-by: Johannes Thumshirn <[email protected]>
Looks good.
Reviewed-by: Naohiro Aota <[email protected]>
On Mon, Jan 22, 2024 at 02:51:04AM -0800, Johannes Thumshirn wrote:
> On very fast but small devices, waiting for a transaction commit can be
> too long a wait before waking up the cleaner kthread to remove unused
> and reclaimable block-groups.
>
> Check every time we add free space back to a block group whether we
> need to wake up the cleaner kthread.
>
> Signed-off-by: Johannes Thumshirn <[email protected]>
> ---
> fs/btrfs/free-space-cache.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index d372c7ce0e6b..2d98b9ca0e83 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -30,6 +30,7 @@
> #include "file-item.h"
> #include "file.h"
> #include "super.h"
> +#include "zoned.h"
>
> #define BITS_PER_BITMAP (PAGE_SIZE * 8UL)
> #define MAX_CACHE_BYTES_PER_GIG SZ_64K
> @@ -2694,6 +2695,7 @@ int __btrfs_add_free_space(struct btrfs_block_group *block_group,
> static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
> u64 bytenr, u64 size, bool used)
> {
> + struct btrfs_fs_info *fs_info = block_group->fs_info;
> struct btrfs_space_info *sinfo = block_group->space_info;
> struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> u64 offset = bytenr - block_group->start;
> @@ -2745,6 +2747,10 @@ static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
> btrfs_mark_bg_to_reclaim(block_group);
> }
>
> + if (btrfs_zoned_should_reclaim(fs_info) &&
> + !test_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags))
> + wake_up_process(fs_info->cleaner_kthread);
> +
Isn't it too costly to call btrfs_zoned_should_reclaim() every time
something is updated? Can we wake it up in btrfs_mark_bg_to_reclaim
and btrfs_mark_bg_unused?
Also, looking into btrfs_zoned_should_reclaim(), it sums device->bytes_used
for each fs_devices->devices. And, device->bytes_used is set at
create_chunk() or at btrfs_remove_chunk(). Isn't it feasible to do the
calculation only there?
> return 0;
> }
>
>
> --
> 2.43.0
>
On 22.01.24 13:22, Naohiro Aota wrote:
> On Mon, Jan 22, 2024 at 02:51:04AM -0800, Johannes Thumshirn wrote:
>> On very fast but small devices, waiting for a transaction commit can be
>> too long a wait before waking up the cleaner kthread to remove unused
>> and reclaimable block-groups.
>>
>> Check every time we add free space back to a block group whether we
>> need to wake up the cleaner kthread.
>>
>> Signed-off-by: Johannes Thumshirn <[email protected]>
>> ---
>> fs/btrfs/free-space-cache.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index d372c7ce0e6b..2d98b9ca0e83 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -30,6 +30,7 @@
>> #include "file-item.h"
>> #include "file.h"
>> #include "super.h"
>> +#include "zoned.h"
>>
>> #define BITS_PER_BITMAP (PAGE_SIZE * 8UL)
>> #define MAX_CACHE_BYTES_PER_GIG SZ_64K
>> @@ -2694,6 +2695,7 @@ int __btrfs_add_free_space(struct btrfs_block_group *block_group,
>> static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
>> u64 bytenr, u64 size, bool used)
>> {
>> + struct btrfs_fs_info *fs_info = block_group->fs_info;
>> struct btrfs_space_info *sinfo = block_group->space_info;
>> struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
>> u64 offset = bytenr - block_group->start;
>> @@ -2745,6 +2747,10 @@ static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
>> btrfs_mark_bg_to_reclaim(block_group);
>> }
>>
>> + if (btrfs_zoned_should_reclaim(fs_info) &&
>> + !test_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags))
>> + wake_up_process(fs_info->cleaner_kthread);
>> +
>
> Isn't it too costly to call btrfs_zoned_should_reclaim() every time
> something is updated? Can we wake it up in btrfs_mark_bg_to_reclaim
> and btrfs_mark_bg_unused?
Hmm yes, I've thought of adding a list_count() for the reclaim and
unused lists, and only calling into should_reclaim if these lists have
entries. Or even better !list_is_singular().
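For reference, list_is_singular() in the kernel's list API is true only for a list with exactly one entry; a minimal userspace model of the relevant helpers (same semantics, simplified from include/linux/list.h):

```c
#include <stdbool.h>

/* Userspace model of the kernel's circular doubly-linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* True iff the list contains exactly one entry. */
static bool list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && (head->next == head->prev);
}
```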
>
> Also, looking into btrfs_zoned_should_reclaim(), it sums device->bytes_used
> for each fs_devices->devices. And, device->bytes_used is set at
> create_chunk() or at btrfs_remove_chunk(). Isn't it feasible to do the
> calculation only there?
Oh sh*t! Right we should check bytes_used from all space_infos in
btrfs_zoned_should_reclaim() and compare that to the disk total bytes.
On Mon, Jan 22, 2024 at 12:30:52PM +0000, Johannes Thumshirn wrote:
> On 22.01.24 13:22, Naohiro Aota wrote:
> > On Mon, Jan 22, 2024 at 02:51:04AM -0800, Johannes Thumshirn wrote:
> >> On very fast but small devices, waiting for a transaction commit can be
> >> too long a wait before waking up the cleaner kthread to remove unused
> >> and reclaimable block-groups.
> >>
> >> Check every time we add free space back to a block group whether we
> >> need to wake up the cleaner kthread.
> >>
> >> Signed-off-by: Johannes Thumshirn <[email protected]>
> >> ---
> >> fs/btrfs/free-space-cache.c | 6 ++++++
> >> 1 file changed, 6 insertions(+)
> >>
> >> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> >> index d372c7ce0e6b..2d98b9ca0e83 100644
> >> --- a/fs/btrfs/free-space-cache.c
> >> +++ b/fs/btrfs/free-space-cache.c
> >> @@ -30,6 +30,7 @@
> >> #include "file-item.h"
> >> #include "file.h"
> >> #include "super.h"
> >> +#include "zoned.h"
> >>
> >> #define BITS_PER_BITMAP (PAGE_SIZE * 8UL)
> >> #define MAX_CACHE_BYTES_PER_GIG SZ_64K
> >> @@ -2694,6 +2695,7 @@ int __btrfs_add_free_space(struct btrfs_block_group *block_group,
> >> static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
> >> u64 bytenr, u64 size, bool used)
> >> {
> >> + struct btrfs_fs_info *fs_info = block_group->fs_info;
> >> struct btrfs_space_info *sinfo = block_group->space_info;
> >> struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> >> u64 offset = bytenr - block_group->start;
> >> @@ -2745,6 +2747,10 @@ static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
> >> btrfs_mark_bg_to_reclaim(block_group);
> >> }
> >>
> >> + if (btrfs_zoned_should_reclaim(fs_info) &&
> >> + !test_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags))
> >> + wake_up_process(fs_info->cleaner_kthread);
> >> +
> >
> > Isn't it too costly to call btrfs_zoned_should_reclaim() every time
> > something is updated? Can we wake it up in btrfs_mark_bg_to_reclaim
> > and btrfs_mark_bg_unused?
>
> Hmm yes, I've thought of adding a list_count() for the reclaim and
> unused lists, and only calling into should_reclaim if these lists have
> entries. Or even better !list_is_singular().
That sounds reasonable.
> >
> > Also, looking into btrfs_zoned_should_reclaim(), it sums device->bytes_used
> > for each fs_devices->devices. And, device->bytes_used is set at
> > create_chunk() or at btrfs_remove_chunk(). Isn't it feasible to do the
> > calculation only there?
>
> Oh sh*t! Right we should check bytes_used from all space_infos in
> btrfs_zoned_should_reclaim() and compare that to the disk total bytes.
You mean device->bytes_used? space_info->bytes_used does not count free
space and zone_unusable in BGs, so using that changes the behavior. Worse,
it won't kick the thread if there is a lot of zone_unusable space but
little used space.
>
On 22.01.24 15:39, Naohiro Aota wrote:
>>>
>>> Also, looking into btrfs_zoned_should_reclaim(), it sums device->bytes_used
>>> for each fs_devices->devices. And, device->bytes_used is set at
>>> create_chunk() or at btrfs_remove_chunk(). Isn't it feasible to do the
>>> calculation only there?
>>
>> Oh sh*t! Right we should check bytes_used from all space_infos in
>> btrfs_zoned_should_reclaim() and compare that to the disk total bytes.
>
> You mean device->bytes_used? space_info->bytes_used does not count free
> space and zone_unusable in BGs, so using that changes the behavior. Even,
> it won't kick the thread if there are many zone_unusable but small used
> space.
>
I did mean btrfs_space_info_used():
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index b7e7b5a5a6fa..d5242c96c97c 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2414,6 +2414,7 @@ bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
{
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
struct btrfs_device *device;
+ struct btrfs_space_info *space_info;
u64 used = 0;
u64 total = 0;
u64 factor;
@@ -2429,10 +2430,15 @@ bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
continue;
total += device->disk_total_bytes;
- used += device->bytes_used;
}
rcu_read_unlock();
+ list_for_each_entry(space_info, &fs_info->space_info, list) {
+ spin_lock(&space_info->lock);
+ used += btrfs_space_info_used(space_info, true);
+ spin_unlock(&space_info->lock);
+ }
+
factor = div64_u64(used * 100, total);
return factor >= fs_info->bg_reclaim_threshold;
}
On Mon, Jan 22, 2024 at 02:43:03PM +0000, Johannes Thumshirn wrote:
> On 22.01.24 15:39, Naohiro Aota wrote:
> >>>
> >>> Also, looking into btrfs_zoned_should_reclaim(), it sums device->bytes_used
> >>> for each fs_devices->devices. And, device->bytes_used is set at
> >>> create_chunk() or at btrfs_remove_chunk(). Isn't it feasible to do the
> >>> calculation only there?
> >>
> >> Oh sh*t! Right we should check bytes_used from all space_infos in
> >> btrfs_zoned_should_reclaim() and compare that to the disk total bytes.
> >
> > You mean device->bytes_used? space_info->bytes_used does not count free
> > space and zone_unusable in BGs, so using that changes the behavior. Even,
> > it won't kick the thread if there are many zone_unusable but small used
> > space.
> >
>
> I did mean btrfs_space_info_used():
>
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index b7e7b5a5a6fa..d5242c96c97c 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -2414,6 +2414,7 @@ bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
> {
> struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> struct btrfs_device *device;
> + struct btrfs_space_info *space_info;
> u64 used = 0;
> u64 total = 0;
> u64 factor;
> @@ -2429,10 +2430,15 @@ bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
> continue;
>
> total += device->disk_total_bytes;
> - used += device->bytes_used;
> }
> rcu_read_unlock();
>
> + list_for_each_entry(space_info, &fs_info->space_info, list) {
> + spin_lock(&space_info->lock);
> + used += btrfs_space_info_used(space_info, true);
> + spin_unlock(&space_info->lock);
> + }
> +
> factor = div64_u64(used * 100, total);
> return factor >= fs_info->bg_reclaim_threshold;
> }
>
This changes the behavior. btrfs_space_info_used() excludes unallocated
space.
Also, if we calculate it with device->disk_total_bytes, it screws up on
DUP/RAID* profiles, because btrfs_space_info_used() counts logical space
while disk_total_bytes counts physical space.
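The DUP skew is easy to see with numbers. A sketch (our own illustration, not kernel code): 10 units of logical data on a DUP profile occupy 20 units on disk, so a factor computed from logical used against physical disk_total_bytes reports half the real utilization:

```c
#include <stdint.h>

/* Utilization percentage, as computed in btrfs_zoned_should_reclaim(). */
static uint64_t factor(uint64_t used, uint64_t total)
{
	return used * 100 / total;
}

/* DUP stores every logical byte twice on disk. */
static uint64_t physical_used(uint64_t logical_used, uint64_t copies)
{
	return logical_used * copies;
}
```

On a 40-unit device holding 10 logical units as DUP, factor(10, 40) yields 25 although the device is physically 50% utilized.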
On 22.01.24 11:51, Johannes Thumshirn wrote:
> We had a report from the field where filling a zoned drive with one file
> to 60% of the drive's capacity and then overwriting this file results in
> ENOSPC.
>
> If said drive is fast and small enough, the problem can be easily
> triggered, as both reclaim of dirty block-groups and deletion of unused
> block-groups only happen at transaction commit time. But if the whole test
> runs faster than we can commit transactions, we unnecessarily run out of
> usable space on a zoned drive.
>
> This can easily be reproduced by the following fio snippet:
> fio --name=foo --filename=$TEST/foo --size=$60_PERCENT_OF_DRIVE --rw=write\
> --loops=2
>
> A fstests testcase for this issue will be sent as well.
Please disregard this series. I'm stupid and had lockdep active during
testing, so the timing was messed up and reclaim had a chance to kick in.
On Mon, Jan 22, 2024 at 02:51:03AM -0800, Johannes Thumshirn wrote:
> As btrfs_zoned_should_reclaim() only iterates the device list to
> collect stats on the devices' total and used bytes, we don't need to
> take the full-blown mutex, but can iterate the device list in an RCU
> read-side context.
>
> Signed-off-by: Johannes Thumshirn <[email protected]>
> ---
> fs/btrfs/zoned.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 168af9d000d1..b7e7b5a5a6fa 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -2423,15 +2423,15 @@ bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
> if (fs_info->bg_reclaim_threshold == 0)
> return false;
>
> - mutex_lock(&fs_devices->device_list_mutex);
> - list_for_each_entry(device, &fs_devices->devices, dev_list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
> if (!device->bdev)
> continue;
>
> total += device->disk_total_bytes;
> used += device->bytes_used;
> }
> - mutex_unlock(&fs_devices->device_list_mutex);
> + rcu_read_unlock();
This is basically only a hint and inaccuracies in the total or used
values would be transient, right? The sum is calculated each time the
function is called, not stored anywhere, so in the unlikely case of
device removal it may skip reclaim once, but then pick it up later.
Any actual removal of the block groups is verified again and properly
locked in btrfs_reclaim_bgs_work().
On Mon, Jan 22, 2024 at 02:51:04AM -0800, Johannes Thumshirn wrote:
> On very fast but small devices, waiting for a transaction commit can be
> too long a wait before waking up the cleaner kthread to remove unused
> and reclaimable block-groups.
>
> Check every time we add free space back to a block group whether we
> need to wake up the cleaner kthread.
>
> Signed-off-by: Johannes Thumshirn <[email protected]>
> ---
> fs/btrfs/free-space-cache.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index d372c7ce0e6b..2d98b9ca0e83 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -30,6 +30,7 @@
> #include "file-item.h"
> #include "file.h"
> #include "super.h"
> +#include "zoned.h"
>
> #define BITS_PER_BITMAP (PAGE_SIZE * 8UL)
> #define MAX_CACHE_BYTES_PER_GIG SZ_64K
> @@ -2694,6 +2695,7 @@ int __btrfs_add_free_space(struct btrfs_block_group *block_group,
> static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
> u64 bytenr, u64 size, bool used)
> {
I thought the add_free_space helpers are only called from various
error/backout conditions and then for real from unpin_extent_range,
which is also in the transaction commit.
The normal reclaim/unused decision is made in btrfs_update_block_group
for that reason.
OTOH, I assume you had some repro that was performing poorly and this
patch fixed it so, I am very likely missing something.
Thanks,
Boris
> + struct btrfs_fs_info *fs_info = block_group->fs_info;
> struct btrfs_space_info *sinfo = block_group->space_info;
> struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> u64 offset = bytenr - block_group->start;
> @@ -2745,6 +2747,10 @@ static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
> btrfs_mark_bg_to_reclaim(block_group);
> }
>
> + if (btrfs_zoned_should_reclaim(fs_info) &&
> + !test_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags))
> + wake_up_process(fs_info->cleaner_kthread);
> +
> return 0;
> }
>
>
> --
> 2.43.0
>
On 22.01.24 22:35, David Sterba wrote:
> On Mon, Jan 22, 2024 at 02:51:03AM -0800, Johannes Thumshirn wrote:
>> As btrfs_zoned_should_reclaim() only iterates the device list to
>> collect stats on the devices' total and used bytes, we don't need to
>> take the full-blown mutex, but can iterate the device list in an RCU
>> read-side context.
>>
>> Signed-off-by: Johannes Thumshirn <[email protected]>
>> ---
>> fs/btrfs/zoned.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
>> index 168af9d000d1..b7e7b5a5a6fa 100644
>> --- a/fs/btrfs/zoned.c
>> +++ b/fs/btrfs/zoned.c
>> @@ -2423,15 +2423,15 @@ bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
>> if (fs_info->bg_reclaim_threshold == 0)
>> return false;
>>
>> - mutex_lock(&fs_devices->device_list_mutex);
>> - list_for_each_entry(device, &fs_devices->devices, dev_list) {
>> + rcu_read_lock();
>> + list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
>> if (!device->bdev)
>> continue;
>>
>> total += device->disk_total_bytes;
>> used += device->bytes_used;
>> }
>> - mutex_unlock(&fs_devices->device_list_mutex);
>> + rcu_read_unlock();
>
> This is basically only a hint and inaccuracies in the total or used
> values would be transient, right? The sum is calculated each time the
> function is called, not stored anywhere, so in the unlikely case of
> device removal it may skip reclaim once, but then pick it up later.
> Any actual removal of the block groups is verified again and properly
> locked in btrfs_reclaim_bgs_work().
>
Yes.
On Tue, Jan 23, 2024 at 07:49:22AM +0000, Johannes Thumshirn wrote:
> On 22.01.24 22:35, David Sterba wrote:
> > On Mon, Jan 22, 2024 at 02:51:03AM -0800, Johannes Thumshirn wrote:
> >> As btrfs_zoned_should_reclaim() only iterates the device list to
> >> collect stats on the devices' total and used bytes, we don't need to
> >> take the full-blown mutex, but can iterate the device list in an RCU
> >> read-side context.
> >>
> >> Signed-off-by: Johannes Thumshirn <[email protected]>
> >> ---
> >> fs/btrfs/zoned.c | 6 +++---
> >> 1 file changed, 3 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> >> index 168af9d000d1..b7e7b5a5a6fa 100644
> >> --- a/fs/btrfs/zoned.c
> >> +++ b/fs/btrfs/zoned.c
> >> @@ -2423,15 +2423,15 @@ bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
> >> if (fs_info->bg_reclaim_threshold == 0)
> >> return false;
> >>
> >> - mutex_lock(&fs_devices->device_list_mutex);
> >> - list_for_each_entry(device, &fs_devices->devices, dev_list) {
> >> + rcu_read_lock();
> >> + list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
> >> if (!device->bdev)
> >> continue;
> >>
> >> total += device->disk_total_bytes;
> >> used += device->bytes_used;
> >> }
> >> - mutex_unlock(&fs_devices->device_list_mutex);
> >> + rcu_read_unlock();
> >
> > This is basically only a hint and inaccuracies in the total or used
> > values would be transient, right? The sum is calculated each time the
> > function is called, not stored anywhere, so in the unlikely case of
> > device removal it may skip reclaim once, but then pick it up later.
> > Any actual removal of the block groups is verified again and properly
> > locked in btrfs_reclaim_bgs_work().
> >
>
> Yes.
So please add it to the changelog as an explanation of why the mutex ->
RCU switch is safe, thanks.