Zoned devices require sequential writing within the same zone. That means
if there are two write requests for the same zone, the request with the
lower position needs to be dispatched to the device first.
However, since each priority class has its own tree & list, a request
with a higher priority will be dispatched first.
So if requestA & requestB target the same zone, requestA is BE with
position X+0 and requestB is RT with position X+1, requestB will be
dispatched before requestA, and the zoned device returns an ERROR.

This was found in a real-world scenario when using F2FS on a zoned
device, and it is very easy to reproduce:
1. Use fsstress to run 8 test processes
2. Use ionice to change 4 of the 8 processes to RT priority
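For example (the commands below are illustrative, not the exact
invocation from the original test; mount point and flags are made up):

  fsstress -d /mnt/f2fs -p 8 -n 10000 &
  # move 4 of the 8 workers to the RT I/O class
  for pid in $(pgrep fsstress | head -n 4); do
          ionice -c 1 -p "$pid"
  done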
Fixes: c807ab520fc3 ("block/mq-deadline: Add I/O priority support")
Cc: <[email protected]>
Signed-off-by: Wu Bo <[email protected]>
---
block/mq-deadline.c | 31 +++++++++++++++++++++++++++++++
include/linux/blk-mq.h | 15 +++++++++++++++
2 files changed, 46 insertions(+)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 02a916ba62ee..6a05dd86e8ca 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -539,6 +539,37 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
if (started_after(dd, rq, latest_start))
return NULL;
+ if (!blk_rq_is_seq_zoned_write(rq))
+ goto skip_check;
+ /*
+ * To ensure sequential writing, check the lower priority classes to see
+ * if there is a request on the same zone that needs to be dispatched
+ * first.
+ */
+ ioprio_class = dd_rq_ioclass(rq);
+ prio = ioprio_class_to_prio[ioprio_class];
+ prio++;
+ for (; prio <= DD_PRIO_MAX; prio++) {
+ struct request *temp_rq;
+ unsigned long flags;
+ bool can_dispatch;
+
+ if (!dd_queued(dd, prio))
+ continue;
+
+ temp_rq = deadline_from_pos(&dd->per_prio[prio], data_dir, blk_rq_pos(rq));
+ if (temp_rq && blk_req_zone_in_one(temp_rq, rq) &&
+ blk_rq_pos(temp_rq) < blk_rq_pos(rq)) {
+ spin_lock_irqsave(&dd->zone_lock, flags);
+ can_dispatch = blk_req_can_dispatch_to_zone(temp_rq);
+ spin_unlock_irqrestore(&dd->zone_lock, flags);
+ if (!can_dispatch)
+ return NULL;
+ rq = temp_rq;
+ per_prio = &dd->per_prio[prio];
+ }
+ }
+skip_check:
/*
* rq is the selected appropriate request.
*/
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d3d8fd8e229b..bca1e639e0f3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1202,6 +1202,15 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
return true;
return !blk_req_zone_is_write_locked(rq);
}
+
+static inline bool blk_req_zone_in_one(struct request *rq_a,
+ struct request *rq_b)
+{
+ unsigned int zone_sectors = rq_a->q->limits.chunk_sectors;
+
+ return round_down(blk_rq_pos(rq_a), zone_sectors) ==
+ round_down(blk_rq_pos(rq_b), zone_sectors);
+}
#else /* CONFIG_BLK_DEV_ZONED */
static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
{
@@ -1229,6 +1238,12 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
{
return true;
}
+
+static inline bool blk_req_zone_in_one(struct request *rq_a,
+ struct request *rq_b)
+{
+ return false;
+}
#endif /* CONFIG_BLK_DEV_ZONED */
#endif /* BLK_MQ_H */
--
2.35.3
On 5/16/24 03:28, Wu Bo wrote:
> Zoned devices require sequential writing within the same zone. That means
> if there are two write requests for the same zone, the request with the
> lower position needs to be dispatched to the device first.
> However, since each priority class has its own tree & list, a request
> with a higher priority will be dispatched first.
> So if requestA & requestB target the same zone, requestA is BE with
> position X+0 and requestB is RT with position X+1, requestB will be
> dispatched before requestA, and the zoned device returns an ERROR.
>
> This was found in a real-world scenario when using F2FS on a zoned
> device, and it is very easy to reproduce:
> 1. Use fsstress to run 8 test processes
> 2. Use ionice to change 4 of the 8 processes to RT priority
Hi Wu,
I agree that there is a problem related to the interaction of I/O
priority and zoned storage. A solution with a lower runtime overhead
is available here:
https://lore.kernel.org/linux-block/[email protected]/T/#me97b088c535278fe3d1dc5846b388ed58aa53f46
Are you OK with that alternative solution?
Thanks,
Bart.
On Thu, May 16, 2024 at 07:45:21AM -0600, Bart Van Assche wrote:
> On 5/16/24 03:28, Wu Bo wrote:
> > Zoned devices require sequential writing within the same zone. That means
> > if there are two write requests for the same zone, the request with the
> > lower position needs to be dispatched to the device first.
> > However, since each priority class has its own tree & list, a request
> > with a higher priority will be dispatched first.
> > So if requestA & requestB target the same zone, requestA is BE with
> > position X+0 and requestB is RT with position X+1, requestB will be
> > dispatched before requestA, and the zoned device returns an ERROR.
> >
> > This was found in a real-world scenario when using F2FS on a zoned
> > device, and it is very easy to reproduce:
> > 1. Use fsstress to run 8 test processes
> > 2. Use ionice to change 4 of the 8 processes to RT priority
>
> Hi Wu,
>
> I agree that there is a problem related to the interaction of I/O
> priority and zoned storage. A solution with a lower runtime overhead
> is available here:
> https://lore.kernel.org/linux-block/[email protected]/T/#me97b088c535278fe3d1dc5846b388ed58aa53f46
Hi Bart,
I have tried setting all sequential write requests to the same priority:
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 6a05dd86e8ca..b560846c63cb 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -841,7 +841,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
*/
blk_req_zone_write_unlock(rq);
- prio = ioprio_class_to_prio[ioprio_class];
+ if (blk_rq_is_seq_zoned_write(rq))
+ prio = DD_BE_PRIO;
+ else
+ prio = ioprio_class_to_prio[ioprio_class];
per_prio = &dd->per_prio[prio];
if (!rq->elv.priv[0]) {
per_prio->stats.inserted++;
I think this has the same effect as the patch you mentioned here. Unfortunately,
this fix causes another issue.
Since all write requests are set to the same priority while read requests still
have different priorities, f2fs is prone to hang under stress testing:
[129412.105440][T1100129] vkhungtaskd: INFO: task "f2fs_ckpt-254:5":769 blocked for more than 193 seconds.
[129412.106629][T1100129] vkhungtaskd: 6.1.25-android14-11-maybe-dirty #1
[129412.107624][T1100129] vkhungtaskd: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[129412.108873][T1100129] vkhungtaskd: task:f2fs_ckpt-254:5 state:D stack:10496 pid:769 ppid:2 flags:0x00000408
[129412.110194][T1100129] vkhungtaskd: Call trace:
[129412.110769][T1100129] vkhungtaskd: __switch_to+0x174/0x338
[129412.111566][T1100129] vkhungtaskd: __schedule+0x604/0x9e4
[129412.112275][T1100129] vkhungtaskd: schedule+0x7c/0xe8
[129412.112938][T1100129] vkhungtaskd: rwsem_down_write_slowpath+0x4cc/0xf98
[129412.113813][T1100129] vkhungtaskd: down_write+0x38/0x40
[129412.114500][T1100129] vkhungtaskd: __write_checkpoint_sync+0x8c/0x11c
[129412.115409][T1100129] vkhungtaskd: __checkpoint_and_complete_reqs+0x54/0x1dc
[129412.116323][T1100129] vkhungtaskd: issue_checkpoint_thread+0x8c/0xec
[129412.117148][T1100129] vkhungtaskd: kthread+0x110/0x224
[129412.117826][T1100129] vkhungtaskd: ret_from_fork+0x10/0x20
[129412.484027][T1700129] vkhungtaskd: task:f2fs_gc-254:55 state:D stack:10832 pid:771 ppid:2 flags:0x00000408
[129412.485337][T1700129] vkhungtaskd: Call trace:
[129412.485906][T1700129] vkhungtaskd: __switch_to+0x174/0x338
[129412.486618][T1700129] vkhungtaskd: __schedule+0x604/0x9e4
[129412.487327][T1700129] vkhungtaskd: schedule+0x7c/0xe8
[129412.487985][T1700129] vkhungtaskd: io_schedule+0x38/0xc4
[129412.488675][T1700129] vkhungtaskd: folio_wait_bit_common+0x3d8/0x4f8
[129412.489496][T1700129] vkhungtaskd: __folio_lock+0x1c/0x2c
[129412.490196][T1700129] vkhungtaskd: __folio_lock_io+0x24/0x44
[129412.490936][T1700129] vkhungtaskd: __filemap_get_folio+0x190/0x400
[129412.491736][T1700129] vkhungtaskd: pagecache_get_page+0x1c/0x5c
[129412.492501][T1700129] vkhungtaskd: f2fs_wait_on_block_writeback+0x60/0xf8
[129412.493376][T1700129] vkhungtaskd: do_garbage_collect+0x1100/0x223c
[129412.494185][T1700129] vkhungtaskd: f2fs_gc+0x284/0x778
[129412.494858][T1700129] vkhungtaskd: gc_thread_func+0x304/0x838
[129412.495603][T1700129] vkhungtaskd: kthread+0x110/0x224
[129412.496271][T1700129] vkhungtaskd: ret_from_fork+0x10/0x20
I think this is because f2fs is a CoW filesystem: some threads need to do a lot
of reading & writing while holding locks. Giving the reads and writes of such a
thread different priorities makes the whole operation take very long, and other
FS operations get blocked behind it.
So I came up with this solution to fix the priority issue on zoned devices. It
does raise the overhead, but it does fix the problem.
Thanks,
Wu Bo
>
> Are you OK with that alternative solution?
>
> Thanks,
>
> Bart.
On 5/16/24 18:44, Wu Bo wrote:
> So I came up with this solution to fix the priority issue on zoned devices. It
> does raise the overhead, but it does fix the problem.
Something I should have realized earlier is that this patch is not
necessary with the latest upstream kernel (v6.10-rc1). Damien's zoned
write plugging patch series has been merged. Hence, I/O schedulers,
including the mq-deadline I/O scheduler, will only see a single
zoned write at a time per zone. So it is no longer possible that
zoned writes are reordered by the I/O scheduler because of their I/O
priorities.
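To make that concrete, here is a rough model of the idea (illustrative
only, not Damien's actual implementation; all names below are made up):

#include <linux/list.h>
#include <linux/blk-mq.h>

/*
 * Illustrative per-zone write plug: zoned writes are queued in
 * submission order and at most one of them is released to the I/O
 * scheduler at a time, so two writes for the same zone can never
 * meet inside the scheduler and be reordered by priority.
 */
struct zone_wplug_model {
	struct list_head pending;	/* zoned writes, submission order */
	bool write_in_flight;		/* at most one write per zone */
};

static struct request *zone_wplug_model_next(struct zone_wplug_model *zwp)
{
	if (zwp->write_in_flight || list_empty(&zwp->pending))
		return NULL;	/* scheduler sees nothing else for this zone */
	zwp->write_in_flight = true;
	return list_first_entry(&zwp->pending, struct request, queuelist);
}

static void zone_wplug_model_complete(struct zone_wplug_model *zwp,
				      struct request *rq)
{
	list_del(&rq->queuelist);
	zwp->write_in_flight = false;
	/* the next same-zone write can be released now, still in order */
}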
Thanks,
Bart.
On 2024/5/18 01:53, Bart Van Assche wrote:
> On 5/16/24 18:44, Wu Bo wrote:
>> So I came up with this solution to fix the priority issue on zoned
>> devices. It does raise the overhead, but it does fix the problem.
>
> Something I should have realized earlier is that this patch is not
> necessary with the latest upstream kernel (v6.10-rc1). Damien's zoned
> write plugging patch series has been merged. Hence, I/O schedulers,
> including the mq-deadline I/O scheduler, will only see a single
> zoned write at a time per zone. So it is no longer possible that
> zoned writes are reordered by the I/O scheduler because of their I/O
> priorities.
Hi Bart,
Yes, I noticed that 'zone write plugging' has been merged into the latest
branch. But it seems hard to backport to older versions into which the
mq-deadline priority feature has been merged. So is it possible to apply
this fix to those older versions?
Thanks,
Wu Bo
>
> Thanks,
>
> Bart.
On 5/17/24 18:52, Wu Bo wrote:
> Yes, I noticed that 'zone write plugging' has been merged into the latest
> branch. But it seems hard to backport to older versions into which the
> mq-deadline priority feature has been merged. So is it possible to apply
> this fix to those older versions?
If you need this change in the Android kernel, please either submit a CL
to the Android kernel repository or submit a request to the Android Partner
Engineering program.
Thanks,
Bart.