From: Zhaoyang Huang <[email protected]>
Currently, request's ioprio are set via task's schedule priority(when no
blkcg configured), which has high priority tasks possess the privilege on
both of CPU and IO scheduling.
This commit works as a hint of original policy by promoting the request ioprio
based on the page/folio's activity. The original idea comes from LRU_GEN
which provides more precised folio activity than before. This commit try
to adjust the request's ioprio when certain part of its folios are hot,
which indicate that this request carry important contents and need be
scheduled ealier.
This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
by changing the bio_add_page/folio API in ext4 and f2fs.
Case 1:
script[a] which get significant improved fault time as expected[b]
where dd's cost also shrink from 55s to 40s.
(1). fault_latency.bin is an ebpf based test tool which measure all task's
iowait latency during page fault when scheduled out/in.
(2). costmem generate page fault by mmaping a file and access the VA.
(3). dd generate concurrent vfs io.
[a]
/fault_latency.bin 1 5 > /data/dd_costmem &
costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
dd if=/dev/block/sda of=/data/ddtest bs=1024 count=2048000 &
dd if=/dev/block/sda of=/data/ddtest1 bs=1024 count=2048000 &
dd if=/dev/block/sda of=/data/ddtest2 bs=1024 count=2048000 &
dd if=/dev/block/sda of=/data/ddtest3 bs=1024 count=2048000
[b]
mainline commit
io wait 836us 156us
Case 2:
fio -filename=/dev/block/by-name/userdata -rw=randread -direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest
mainline: 513MiB/s
READ: bw=531MiB/s (557MB/s), 531MiB/s-531MiB/s (557MB/s-557MB/s), io=15.6GiB (16.8GB), run=30137-30137msec
READ: bw=543MiB/s (569MB/s), 543MiB/s-543MiB/s (569MB/s-569MB/s), io=15.6GiB (16.8GB), run=29469-29469msec
READ: bw=474MiB/s (497MB/s), 474MiB/s-474MiB/s (497MB/s-497MB/s), io=15.6GiB (16.8GB), run=33724-33724msec
READ: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=15.6GiB (16.8GB), run=29928-29928msec
READ: bw=523MiB/s (548MB/s), 523MiB/s-523MiB/s (548MB/s-548MB/s), io=15.6GiB (16.8GB), run=30617-30617msec
READ: bw=492MiB/s (516MB/s), 492MiB/s-492MiB/s (516MB/s-516MB/s), io=15.6GiB (16.8GB), run=32518-32518msec
READ: bw=533MiB/s (559MB/s), 533MiB/s-533MiB/s (559MB/s-559MB/s), io=15.6GiB (16.8GB), run=29993-29993msec
READ: bw=524MiB/s (550MB/s), 524MiB/s-524MiB/s (550MB/s-550MB/s), io=15.6GiB (16.8GB), run=30526-30526msec
READ: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=15.6GiB (16.8GB), run=30269-30269msec
READ: bw=449MiB/s (471MB/s), 449MiB/s-449MiB/s (471MB/s-471MB/s), io=15.6GiB (16.8GB), run=35629-35629msec
commit: 633MiB/s
READ: bw=668MiB/s (700MB/s), 668MiB/s-668MiB/s (700MB/s-700MB/s), io=15.6GiB (16.8GB), run=23952-23952msec
READ: bw=589MiB/s (618MB/s), 589MiB/s-589MiB/s (618MB/s-618MB/s), io=15.6GiB (16.8GB), run=27164-27164msec
READ: bw=638MiB/s (669MB/s), 638MiB/s-638MiB/s (669MB/s-669MB/s), io=15.6GiB (16.8GB), run=25071-25071msec
READ: bw=714MiB/s (749MB/s), 714MiB/s-714MiB/s (749MB/s-749MB/s), io=15.6GiB (16.8GB), run=22409-22409msec
READ: bw=600MiB/s (629MB/s), 600MiB/s-600MiB/s (629MB/s-629MB/s), io=15.6GiB (16.8GB), run=26669-26669msec
READ: bw=592MiB/s (621MB/s), 592MiB/s-592MiB/s (621MB/s-621MB/s), io=15.6GiB (16.8GB), run=27036-27036msec
READ: bw=691MiB/s (725MB/s), 691MiB/s-691MiB/s (725MB/s-725MB/s), io=15.6GiB (16.8GB), run=23150-23150msec
READ: bw=569MiB/s (596MB/s), 569MiB/s-569MiB/s (596MB/s-596MB/s), io=15.6GiB (16.8GB), run=28142-28142msec
READ: bw=563MiB/s (590MB/s), 563MiB/s-563MiB/s (590MB/s-590MB/s), io=15.6GiB (16.8GB), run=28429-28429msec
READ: bw=712MiB/s (746MB/s), 712MiB/s-712MiB/s (746MB/s-746MB/s), io=15.6GiB (16.8GB), run=22478-22478msec
Case 3:
This commit is also verified by the case of launching camera APP which is
usually considered as heavy working load on both of memory and IO, which
shows 12%-24% improvement.
ttl = 0 ttl = 50 ttl = 100
mainline 2267ms 2420ms 2316ms
commit 1992ms 1806ms 1998ms
case 4:
androbench has no improvment as well as regression which supposed to be
its test time is short which MGLRU hasn't take effect yet.
Signed-off-by: Zhaoyang Huang <[email protected]>
---
change of v2: calculate page's activity via helper function
change of v3: solve layer violation by move API into mm
change of v4: keep block clean by removing the page related API
---
---
include/linux/act_ioprio.h | 62 +++++++++++++++++++++++++++++++++++++
include/uapi/linux/ioprio.h | 44 +++++++++++++++++++++++---
mm/Kconfig | 8 +++++
3 files changed, 110 insertions(+), 4 deletions(-)
create mode 100644 include/linux/act_ioprio.h
diff --git a/include/linux/act_ioprio.h b/include/linux/act_ioprio.h
new file mode 100644
index 000000000000..8cfb3df270bd
--- /dev/null
+++ b/include/linux/act_ioprio.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _ACT_IOPRIO_H
+#define _ACT_IOPRIO_H
+
+#include <linux/bio.h>
+
+#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
+bool BIO_ADD_FOLIO(struct bio *bio, struct folio *folio, size_t len,
+ size_t off)
+{
+ int class, level, hint, activity;
+
+ if (len > UINT_MAX || off > UINT_MAX)
+ return false;
+
+ class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
+ level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
+ hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
+ activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
+
+ activity += (bio->bi_vcnt + 1 <= IOPRIO_NR_ACTIVITY &&
+ PageWorkingset(&folio->page)) ? 1 : 0;
+ if (activity >= bio->bi_vcnt / 2)
+ class = IOPRIO_CLASS_RT;
+ else if (activity >= bio->bi_vcnt / 4)
+ class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
+
+ bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
+
+ return bio_add_page(bio, &folio->page, len, off) > 0;
+}
+
+int BIO_ADD_PAGE(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
+{
+ int class, level, hint, activity;
+
+ if (bio_add_page(bio, page, len, offset) > 0) {
+ class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
+ level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
+ hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
+ activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
+ activity += (bio->bi_vcnt <= IOPRIO_NR_ACTIVITY && PageWorkingset(page)) ? 1 : 0;
+ bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
+ }
+
+ return len;
+}
+#else
+bool BIO_ADD_FOLIO(struct bio *bio, struct folio *folio, size_t len,
+ size_t off)
+{
+ return bio_add_folio(bio, folio, len, off);
+}
+
+int BIO_ADD_PAGE(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
+{
+ return bio_add_page(bio, page, len, offset);
+}
+#endif
+#endif
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index bee2bdb0eedb..f933af54d71e 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -71,12 +71,24 @@ enum {
* class and level.
*/
#define IOPRIO_HINT_SHIFT IOPRIO_LEVEL_NR_BITS
+#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
+#define IOPRIO_HINT_NR_BITS 3
+#else
#define IOPRIO_HINT_NR_BITS 10
+#endif
#define IOPRIO_NR_HINTS (1 << IOPRIO_HINT_NR_BITS)
#define IOPRIO_HINT_MASK (IOPRIO_NR_HINTS - 1)
#define IOPRIO_PRIO_HINT(ioprio) \
(((ioprio) >> IOPRIO_HINT_SHIFT) & IOPRIO_HINT_MASK)
+#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
+#define IOPRIO_ACTIVITY_SHIFT (IOPRIO_HINT_NR_BITS + IOPRIO_LEVEL_NR_BITS)
+#define IOPRIO_ACTIVITY_NR_BITS 7
+#define IOPRIO_NR_ACTIVITY (1 << IOPRIO_ACTIVITY_NR_BITS)
+#define IOPRIO_ACTIVITY_MASK (IOPRIO_NR_ACTIVITY - 1)
+#define IOPRIO_PRIO_ACTIVITY(ioprio) \
+ (((ioprio) >> IOPRIO_ACTIVITY_SHIFT) & IOPRIO_ACTIVITY_MASK)
+#endif
/*
* I/O hints.
*/
@@ -104,24 +116,48 @@ enum {
#define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))
+#ifndef CONFIG_CONTENT_ACT_BASED_IOPRIO
/*
* Return an I/O priority value based on a class, a level and a hint.
*/
static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
- int priohint)
+ int priohint)
{
if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
- IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
- IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS))
+ IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
+ IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS))
return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
return (prioclass << IOPRIO_CLASS_SHIFT) |
(priohint << IOPRIO_HINT_SHIFT) | priolevel;
}
-
#define IOPRIO_PRIO_VALUE(prioclass, priolevel) \
ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE)
#define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint) \
ioprio_value(prioclass, priolevel, priohint)
+#else
+/*
+ * Return an I/O priority value based on a class, a level and a hint.
+ */
+static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
+ int priohint, int activity)
+{
+ if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
+ IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
+ IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS) ||
+ IOPRIO_BAD_VALUE(activity, IOPRIO_NR_ACTIVITY))
+ return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
+ return (prioclass << IOPRIO_CLASS_SHIFT) |
+ (activity << IOPRIO_ACTIVITY_SHIFT) |
+ (priohint << IOPRIO_HINT_SHIFT) | priolevel;
+}
+
+#define IOPRIO_PRIO_VALUE(prioclass, priolevel) \
+ ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE, 0)
+#define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint) \
+ ioprio_value(prioclass, priolevel, priohint, 0)
+#define IOPRIO_PRIO_VALUE_ACTIVITY(prioclass, priolevel, priohint, activity) \
+ ioprio_value(prioclass, priolevel, priohint, activity)
+#endif
#endif /* _UAPI_LINUX_IOPRIO_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5..e0e5a5a44ded 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1240,6 +1240,14 @@ config LRU_GEN_STATS
from evicted generations for debugging purpose.
This option has a per-memcg and per-node memory overhead.
+
+config CONTENT_ACT_BASED_IOPRIO
+ bool "Enable content activity based ioprio"
+ depends on LRU_GEN
+ default n
+ help
+ This item enable the feature of adjust bio's priority by
+ calculating its content's activity.
# }
config ARCH_SUPPORTS_PER_VMA_LOCK
--
2.25.1
On 1/25/24 16:19, zhaoyang.huang wrote:
> From: Zhaoyang Huang <[email protected]>
>
> Currently, request's ioprio are set via task's schedule priority(when no
> blkcg configured), which has high priority tasks possess the privilege on
> both of CPU and IO scheduling.
> This commit works as a hint of original policy by promoting the request ioprio
> based on the page/folio's activity. The original idea comes from LRU_GEN
> which provides more precised folio activity than before. This commit try
> to adjust the request's ioprio when certain part of its folios are hot,
> which indicate that this request carry important contents and need be
> scheduled ealier.
>
> This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
> by changing the bio_add_page/folio API in ext4 and f2fs.
And as mentioned already by Chrisoph and Jens, why don't you just simply set
bio->bi_ioprio to the value you want before calling submit_bio() in these file
systems ? Why all the hacking of the priority code for that ? That is not
justified at all.
Furthermore, the activity things reduces the ioprio hint bits to the bare
minimum 3 bits necessary for command duration limits. Not great. But if you
simply set the prio class based on your activity algorithm, you do not need to
change all that.
Note: Your patch is full of whitespace changes.
--
Damien Le Moal
Western Digital Research
On Thu, Jan 25, 2024 at 3:40 PM Damien Le Moal <[email protected]> wrote:
>
> On 1/25/24 16:19, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <[email protected]>
> >
> > Currently, request's ioprio are set via task's schedule priority(when no
> > blkcg configured), which has high priority tasks possess the privilege on
> > both of CPU and IO scheduling.
> > This commit works as a hint of original policy by promoting the request ioprio
> > based on the page/folio's activity. The original idea comes from LRU_GEN
> > which provides more precised folio activity than before. This commit try
> > to adjust the request's ioprio when certain part of its folios are hot,
> > which indicate that this request carry important contents and need be
> > scheduled ealier.
> >
> > This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
> > by changing the bio_add_page/folio API in ext4 and f2fs.
>
> And as mentioned already by Chrisoph and Jens, why don't you just simply set
> bio->bi_ioprio to the value you want before calling submit_bio() in these file
> systems ? Why all the hacking of the priority code for that ? That is not
> justified at all.
>
> Furthermore, the activity things reduces the ioprio hint bits to the bare
> minimum 3 bits necessary for command duration limits. Not great. But if you
> simply set the prio class based on your activity algorithm, you do not need to
> change all that.
That is because bio->io_prio changes during bio grows with adding
different activity pages in. I have to wrap these into an API which
has both of fs and block be transparent to the process.
>
> Note: Your patch is full of whitespace changes.
Sorry, I double checked the source code and the patch with
checkpatch.pl again, no whitespace found, don't know why
>
> --
> Damien Le Moal
> Western Digital Research
>
On 1/25/24 16:52, Zhaoyang Huang wrote:
> On Thu, Jan 25, 2024 at 3:40 PM Damien Le Moal <[email protected]> wrote:
>>
>> On 1/25/24 16:19, zhaoyang.huang wrote:
>>> From: Zhaoyang Huang <[email protected]>
>>>
>>> Currently, request's ioprio are set via task's schedule priority(when no
>>> blkcg configured), which has high priority tasks possess the privilege on
>>> both of CPU and IO scheduling.
>>> This commit works as a hint of original policy by promoting the request ioprio
>>> based on the page/folio's activity. The original idea comes from LRU_GEN
>>> which provides more precised folio activity than before. This commit try
>>> to adjust the request's ioprio when certain part of its folios are hot,
>>> which indicate that this request carry important contents and need be
>>> scheduled ealier.
>>>
>>> This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
>>> by changing the bio_add_page/folio API in ext4 and f2fs.
>>
>> And as mentioned already by Chrisoph and Jens, why don't you just simply set
>> bio->bi_ioprio to the value you want before calling submit_bio() in these file
>> systems ? Why all the hacking of the priority code for that ? That is not
>> justified at all.
>>
>> Furthermore, the activity things reduces the ioprio hint bits to the bare
>> minimum 3 bits necessary for command duration limits. Not great. But if you
>> simply set the prio class based on your activity algorithm, you do not need to
>> change all that.
> That is because bio->io_prio changes during bio grows with adding
> different activity pages in. I have to wrap these into an API which
> has both of fs and block be transparent to the process.
Pages are not added to BIOs on the fly. The FS does bio_add_page() or similar
(it can be a get user pages for direct IOs) and then calls bio_submit(). Between
these 2, you can set your IO priority according to how many pages you have.
You can even likely do all of this based on the iocb (and use iocb->ki_ioprio to
set the prio), so before one starts allocating and setting up BIOs to process
the user IO.
--
Damien Le Moal
Western Digital Research
On Thu, Jan 25, 2024 at 4:26 PM Damien Le Moal <[email protected]> wrote:
>
> On 1/25/24 16:52, Zhaoyang Huang wrote:
> > On Thu, Jan 25, 2024 at 3:40 PM Damien Le Moal <[email protected]> wrote:
> >>
> >> On 1/25/24 16:19, zhaoyang.huang wrote:
> >>> From: Zhaoyang Huang <[email protected]>
> >>>
> >>> Currently, request's ioprio are set via task's schedule priority(when no
> >>> blkcg configured), which has high priority tasks possess the privilege on
> >>> both of CPU and IO scheduling.
> >>> This commit works as a hint of original policy by promoting the request ioprio
> >>> based on the page/folio's activity. The original idea comes from LRU_GEN
> >>> which provides more precised folio activity than before. This commit try
> >>> to adjust the request's ioprio when certain part of its folios are hot,
> >>> which indicate that this request carry important contents and need be
> >>> scheduled ealier.
> >>>
> >>> This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
> >>> by changing the bio_add_page/folio API in ext4 and f2fs.
> >>
> >> And as mentioned already by Chrisoph and Jens, why don't you just simply set
> >> bio->bi_ioprio to the value you want before calling submit_bio() in these file
> >> systems ? Why all the hacking of the priority code for that ? That is not
> >> justified at all.
> >>
> >> Furthermore, the activity things reduces the ioprio hint bits to the bare
> >> minimum 3 bits necessary for command duration limits. Not great. But if you
> >> simply set the prio class based on your activity algorithm, you do not need to
> >> change all that.
> > That is because bio->io_prio changes during bio grows with adding
> > different activity pages in. I have to wrap these into an API which
> > has both of fs and block be transparent to the process.
>
> Pages are not added to BIOs on the fly. The FS does bio_add_page() or similar
> (it can be a get user pages for direct IOs) and then calls bio_submit(). Between
> these 2, you can set your IO priority according to how many pages you have.
Please correct me if I am wrong. So you suggest iterating the
request->bios->bvecs(pages) before final submit_bio? Is it too costly
and introduces too many modifications on each fs.
>
> You can even likely do all of this based on the iocb (and use iocb->ki_ioprio to
> set the prio), so before one starts allocating and setting up BIOs to process
> the user IO.
Actually, the activity information comes from page's history (recorded
at page cache's slot) instead of user space in step(1) and can be
associate with bio in step(2) or iterate the bio in step(3)
page fault \
(1)
(2) (3)
allocate_pages==>add_page_to_page_cache(get
activity information)==>xxx_readpage==>bio_add_page==>submit_bio
vfs read/write /
>
> --
> Damien Le Moal
> Western Digital Research
>
On Thu, Jan 25, 2024 at 5:32 PM Zhaoyang Huang <huangzhaoyang@gmailcom> wrote:
>
> On Thu, Jan 25, 2024 at 4:26 PM Damien Le Moal <[email protected]> wrote:
> >
> > On 1/25/24 16:52, Zhaoyang Huang wrote:
> > > On Thu, Jan 25, 2024 at 3:40 PM Damien Le Moal <[email protected]> wrote:
> > >>
> > >> On 1/25/24 16:19, zhaoyang.huang wrote:
> > >>> From: Zhaoyang Huang <[email protected]>
> > >>>
> > >>> Currently, request's ioprio are set via task's schedule priority(when no
> > >>> blkcg configured), which has high priority tasks possess the privilege on
> > >>> both of CPU and IO scheduling.
> > >>> This commit works as a hint of original policy by promoting the request ioprio
> > >>> based on the page/folio's activity. The original idea comes from LRU_GEN
> > >>> which provides more precised folio activity than before. This commit try
> > >>> to adjust the request's ioprio when certain part of its folios are hot,
> > >>> which indicate that this request carry important contents and need be
> > >>> scheduled ealier.
> > >>>
> > >>> This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
> > >>> by changing the bio_add_page/folio API in ext4 and f2fs.
> > >>
> > >> And as mentioned already by Chrisoph and Jens, why don't you just simply set
> > >> bio->bi_ioprio to the value you want before calling submit_bio() in these file
> > >> systems ? Why all the hacking of the priority code for that ? That is not
> > >> justified at all.
> > >>
> > >> Furthermore, the activity things reduces the ioprio hint bits to the bare
> > >> minimum 3 bits necessary for command duration limits. Not great. But if you
> > >> simply set the prio class based on your activity algorithm, you do not need to
> > >> change all that.
> > > That is because bio->io_prio changes during bio grows with adding
> > > different activity pages in. I have to wrap these into an API which
> > > has both of fs and block be transparent to the process.
> >
> > Pages are not added to BIOs on the fly. The FS does bio_add_page() or similar
> > (it can be a get user pages for direct IOs) and then calls bio_submit() Between
> > these 2, you can set your IO priority according to how many pages you have.
> Please correct me if I am wrong. So you suggest iterating the
> request->bios->bvecs(pages) before final submit_bio? Is it too costly
> and introduces too many modifications on each fs.
> >
> > You can even likely do all of this based on the iocb (and use iocb->ki_ioprio to
> > set the prio), so before one starts allocating and setting up BIOs to process
> > the user IO.
> Actually, the activity information comes from page's history (recorded
> at page cache's slot) instead of user space in step(1) and can be
> associate with bio in step(2) or iterate the bio in step(3)
>
> page fault \
> (1)
> (2) (3)
> allocate_pages==>add_page_to_page_cache(get
> activity information)==>xxx_readpage==>bio_add_page==>submit_bio
> vfs read/write /
The chart goes wrong again:( change it to vertical mode
Actually, the activity information comes from page's history (recorded
at page cache's slot) instead of user space in step(1) and can be
associate with bio in step(2) or iterate the bio in step(3)
page fault(or vfs)(1)
|
alloc_pages
|
add_page_to_pagecache(get the page's activity information)
|
fs_readpage
|
bio_add_page(2)
|
submit_bio(3)
> >
> > --
> > Damien Le Moal
> > Western Digital Research
> >
On 1/25/24 18:37, Zhaoyang Huang wrote:
> Actually, the activity information comes from page's history (recorded
> at page cache's slot) instead of user space in step(1) and can be
> associate with bio in step(2) or iterate the bio in step(3)
> page fault(or vfs)(1)
> |
> alloc_pages
> |
> add_page_to_pagecache(get the page's activity information)
> |
> fs_readpage
> |
> bio_add_page(2)
> |
set your prio here. No change to the block layer needed. That is FS land.
> submit_bio(3)
--
Damien Le Moal
Western Digital Research
On Thu, Jan 25, 2024 at 5:54 PM Damien Le Moal <[email protected]> wrote:
>
> On 1/25/24 18:37, Zhaoyang Huang wrote:
> > Actually, the activity information comes from page's history (recorded
> > at page cache's slot) instead of user space in step(1) and can be
> > associate with bio in step(2) or iterate the bio in step(3)
> > page fault(or vfs)(1)
> > |
> > alloc_pages
> > |
> > add_page_to_pagecache(get the page's activity information)
> > |
> > fs_readpage
> > |
> > bio_add_page(2)
> > |
>
> set your prio here. No change to the block layer needed. That is FS land.
Yes, that is what I suggested in patchv3 which keeps the block layer
clean by introducing a set of helper functions which wrap up the
activity things for FS utilization.
>
> > submit_bio(3)
>
>
> --
> Damien Le Moal
> Western Digital Research
>
On Thu, Jan 25, 2024 at 6:10 PM Zhaoyang Huang <huangzhaoyang@gmailcom> wrote:
>
> On Thu, Jan 25, 2024 at 5:54 PM Damien Le Moal <[email protected]> wrote:
> >
> > On 1/25/24 18:37, Zhaoyang Huang wrote:
> > > Actually, the activity information comes from page's history (recorded
> > > at page cache's slot) instead of user space in step(1) and can be
> > > associate with bio in step(2) or iterate the bio in step(3)
> > > page fault(or vfs)(1)
> > > |
> > > alloc_pages
> > > |
> > > add_page_to_pagecache(get the page's activity information)
> > > |
> > > fs_readpage
> > > |
> > > bio_add_page(2)
> > > |
> >
> > set your prio here. No change to the block layer needed. That is FS land.
> Yes, that is what I suggested in patchv3 which keeps the block layer
> clean by introducing a set of helper functions which wrap up the
> activity things for FS utilization.
Maybe I misunderstand Damien's opinion which suggests iterating the
bio->bvec here via a new API. I think it is more costly than counting
activities when adding pages to bio and need more modifications on
each fs.
> >
> > > submit_bio(3)
> >
> >
> > --
> > Damien Le Moal
> > Western Digital Research
> >
loop more mm and fs guys for more comments
On Thu, Jan 25, 2024 at 3:22 PM zhaoyang.huang
<[email protected]> wrote:
>
> From: Zhaoyang Huang <[email protected]>
>
> Currently, request's ioprio are set via task's schedule priority(when no
> blkcg configured), which has high priority tasks possess the privilege on
> both of CPU and IO scheduling.
> This commit works as a hint of original policy by promoting the request ioprio
> based on the page/folio's activity. The original idea comes from LRU_GEN
> which provides more precised folio activity than before. This commit try
> to adjust the request's ioprio when certain part of its folios are hot,
> which indicate that this request carry important contents and need be
> scheduled ealier.
>
> This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
> by changing the bio_add_page/folio API in ext4 and f2fs.
>
> Case 1:
> script[a] which get significant improved fault time as expected[b]
> where dd's cost also shrink from 55s to 40s.
> (1). fault_latency.bin is an ebpf based test tool which measure all task's
> iowait latency during page fault when scheduled out/in.
> (2). costmem generate page fault by mmaping a file and access the VA.
> (3). dd generate concurrent vfs io.
>
> [a]
> ./fault_latency.bin 1 5 > /data/dd_costmem &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> dd if=/dev/block/sda of=/data/ddtest bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest1 bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest2 bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest3 bs=1024 count=2048000
> [b]
> mainline commit
> io wait 836us 156us
>
> Case 2:
> fio -filename=/dev/block/by-name/userdata -rw=randread -direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest
> mainline: 513MiB/s
> READ: bw=531MiB/s (557MB/s), 531MiB/s-531MiB/s (557MB/s-557MB/s), io=15.6GiB (16.8GB), run=30137-30137msec
> READ: bw=543MiB/s (569MB/s), 543MiB/s-543MiB/s (569MB/s-569MB/s), io=15.6GiB (16.8GB), run=29469-29469msec
> READ: bw=474MiB/s (497MB/s), 474MiB/s-474MiB/s (497MB/s-497MB/s), io=15.6GiB (16.8GB), run=33724-33724msec
> READ: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=15.6GiB (16.8GB), run=29928-29928msec
> READ: bw=523MiB/s (548MB/s), 523MiB/s-523MiB/s (548MB/s-548MB/s), io=15.6GiB (16.8GB), run=30617-30617msec
> READ: bw=492MiB/s (516MB/s), 492MiB/s-492MiB/s (516MB/s-516MB/s), io=15.6GiB (16.8GB), run=32518-32518msec
> READ: bw=533MiB/s (559MB/s), 533MiB/s-533MiB/s (559MB/s-559MB/s), io=15.6GiB (16.8GB), run=29993-29993msec
> READ: bw=524MiB/s (550MB/s), 524MiB/s-524MiB/s (550MB/s-550MB/s), io=15.6GiB (16.8GB), run=30526-30526msec
> READ: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=15.6GiB (16.8GB), run=30269-30269msec
> READ: bw=449MiB/s (471MB/s), 449MiB/s-449MiB/s (471MB/s-471MB/s), io=15.6GiB (16.8GB), run=35629-35629msec
>
> commit: 633MiB/s
> READ: bw=668MiB/s (700MB/s), 668MiB/s-668MiB/s (700MB/s-700MB/s), io=15.6GiB (16.8GB), run=23952-23952msec
> READ: bw=589MiB/s (618MB/s), 589MiB/s-589MiB/s (618MB/s-618MB/s), io=15.6GiB (16.8GB), run=27164-27164msec
> READ: bw=638MiB/s (669MB/s), 638MiB/s-638MiB/s (669MB/s-669MB/s), io=15.6GiB (16.8GB), run=25071-25071msec
> READ: bw=714MiB/s (749MB/s), 714MiB/s-714MiB/s (749MB/s-749MB/s), io=15.6GiB (16.8GB), run=22409-22409msec
> READ: bw=600MiB/s (629MB/s), 600MiB/s-600MiB/s (629MB/s-629MB/s), io=15.6GiB (16.8GB), run=26669-26669msec
> READ: bw=592MiB/s (621MB/s), 592MiB/s-592MiB/s (621MB/s-621MB/s), io=15.6GiB (16.8GB), run=27036-27036msec
> READ: bw=691MiB/s (725MB/s), 691MiB/s-691MiB/s (725MB/s-725MB/s), io=15.6GiB (16.8GB), run=23150-23150msec
> READ: bw=569MiB/s (596MB/s), 569MiB/s-569MiB/s (596MB/s-596MB/s), io=15.6GiB (16.8GB), run=28142-28142msec
> READ: bw=563MiB/s (590MB/s), 563MiB/s-563MiB/s (590MB/s-590MB/s), io=15.6GiB (16.8GB), run=28429-28429msec
> READ: bw=712MiB/s (746MB/s), 712MiB/s-712MiB/s (746MB/s-746MB/s), io=15.6GiB (16.8GB), run=22478-22478msec
>
> Case 3:
> This commit is also verified by the case of launching camera APP which is
> usually considered as heavy working load on both of memory and IO, which
> shows 12%-24% improvement.
>
> ttl = 0 ttl = 50 ttl = 100
> mainline 2267ms 2420ms 2316ms
> commit 1992ms 1806ms 1998ms
>
> case 4:
> androbench has no improvment as well as regression which supposed to be
> its test time is short which MGLRU hasn't take effect yet.
>
> Signed-off-by: Zhaoyang Huang <[email protected]>
> ---
> change of v2: calculate page's activity via helper function
> change of v3: solve layer violation by move API into mm
> change of v4: keep block clean by removing the page related API
> ---
> ---
> include/linux/act_ioprio.h | 62 +++++++++++++++++++++++++++++++++++++
> include/uapi/linux/ioprio.h | 44 +++++++++++++++++++++++---
> mm/Kconfig | 8 +++++
> 3 files changed, 110 insertions(+), 4 deletions(-)
> create mode 100644 include/linux/act_ioprio.h
>
> diff --git a/include/linux/act_ioprio.h b/include/linux/act_ioprio.h
> new file mode 100644
> index 000000000000..8cfb3df270bd
> --- /dev/null
> +++ b/include/linux/act_ioprio.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _ACT_IOPRIO_H
> +#define _ACT_IOPRIO_H
> +
> +#include <linux/bio.h>
> +
> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> +bool BIO_ADD_FOLIO(struct bio *bio, struct folio *folio, size_t len,
> + size_t off)
> +{
> + int class, level, hint, activity;
> +
> + if (len > UINT_MAX || off > UINT_MAX)
> + return false;
> +
> + class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> + level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> + hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> + activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> +
> + activity += (bio->bi_vcnt + 1 <= IOPRIO_NR_ACTIVITY &&
> + PageWorkingset(&folio->page)) ? 1 : 0;
> + if (activity >= bio->bi_vcnt / 2)
> + class = IOPRIO_CLASS_RT;
> + else if (activity >= bio->bi_vcnt / 4)
> + class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> +
> + bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> +
> + return bio_add_page(bio, &folio->page, len, off) > 0;
> +}
> +
> +int BIO_ADD_PAGE(struct bio *bio, struct page *page,
> + unsigned int len, unsigned int offset)
> +{
> + int class, level, hint, activity;
> +
> + if (bio_add_page(bio, page, len, offset) > 0) {
> + class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> + level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> + hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> + activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> + activity += (bio->bi_vcnt <= IOPRIO_NR_ACTIVITY && PageWorkingset(page)) ? 1 : 0;
> + bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> + }
> +
> + return len;
> +}
> +#else
> +bool BIO_ADD_FOLIO(struct bio *bio, struct folio *folio, size_t len,
> + size_t off)
> +{
> + return bio_add_folio(bio, folio, len, off);
> +}
> +
> +int BIO_ADD_PAGE(struct bio *bio, struct page *page,
> + unsigned int len, unsigned int offset)
> +{
> + return bio_add_page(bio, page, len, offset);
> +}
> +#endif
> +#endif
> diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
> index bee2bdb0eedb..f933af54d71e 100644
> --- a/include/uapi/linux/ioprio.h
> +++ b/include/uapi/linux/ioprio.h
> @@ -71,12 +71,24 @@ enum {
> * class and level.
> */
> #define IOPRIO_HINT_SHIFT IOPRIO_LEVEL_NR_BITS
> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> +#define IOPRIO_HINT_NR_BITS 3
> +#else
> #define IOPRIO_HINT_NR_BITS 10
> +#endif
> #define IOPRIO_NR_HINTS (1 << IOPRIO_HINT_NR_BITS)
> #define IOPRIO_HINT_MASK (IOPRIO_NR_HINTS - 1)
> #define IOPRIO_PRIO_HINT(ioprio) \
> (((ioprio) >> IOPRIO_HINT_SHIFT) & IOPRIO_HINT_MASK)
>
> +#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
> +#define IOPRIO_ACTIVITY_SHIFT (IOPRIO_HINT_NR_BITS + IOPRIO_LEVEL_NR_BITS)
> +#define IOPRIO_ACTIVITY_NR_BITS 7
> +#define IOPRIO_NR_ACTIVITY (1 << IOPRIO_ACTIVITY_NR_BITS)
> +#define IOPRIO_ACTIVITY_MASK (IOPRIO_NR_ACTIVITY - 1)
> +#define IOPRIO_PRIO_ACTIVITY(ioprio) \
> + (((ioprio) >> IOPRIO_ACTIVITY_SHIFT) & IOPRIO_ACTIVITY_MASK)
> +#endif
> /*
> * I/O hints.
> */
> @@ -104,24 +116,48 @@ enum {
>
> #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))
>
> +#ifndef CONFIG_CONTENT_ACT_BASED_IOPRIO
> /*
> * Return an I/O priority value based on a class, a level and a hint.
> */
> static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
> - int priohint)
> + int priohint)
> {
> if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
> - IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
> - IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS))
> + IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
> + IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS))
> return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
>
> return (prioclass << IOPRIO_CLASS_SHIFT) |
> (priohint << IOPRIO_HINT_SHIFT) | priolevel;
> }
> -
> #define IOPRIO_PRIO_VALUE(prioclass, priolevel) \
> ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE)
> #define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint) \
> ioprio_value(prioclass, priolevel, priohint)
> +#else
> +/*
> + * Return an I/O priority value based on a class, a level and a hint.
> + */
> +static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
> + int priohint, int activity)
> +{
> + if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
> + IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
> + IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS) ||
> + IOPRIO_BAD_VALUE(activity, IOPRIO_NR_ACTIVITY))
> + return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
>
> + return (prioclass << IOPRIO_CLASS_SHIFT) |
> + (activity << IOPRIO_ACTIVITY_SHIFT) |
> + (priohint << IOPRIO_HINT_SHIFT) | priolevel;
> +}
> +
> +#define IOPRIO_PRIO_VALUE(prioclass, priolevel) \
> + ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE, 0)
> +#define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint) \
> + ioprio_value(prioclass, priolevel, priohint, 0)
> +#define IOPRIO_PRIO_VALUE_ACTIVITY(prioclass, priolevel, priohint, activity) \
> + ioprio_value(prioclass, priolevel, priohint, activity)
> +#endif
> #endif /* _UAPI_LINUX_IOPRIO_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 264a2df5ecf5..e0e5a5a44ded 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1240,6 +1240,14 @@ config LRU_GEN_STATS
> from evicted generations for debugging purpose.
>
> This option has a per-memcg and per-node memory overhead.
> +
> +config CONTENT_ACT_BASED_IOPRIO
> + bool "Enable content activity based ioprio"
> + depends on LRU_GEN
> + default n
> + help
> + This item enable the feature of adjust bio's priority by
> + calculating its content's activity.
> # }
>
> config ARCH_SUPPORTS_PER_VMA_LOCK
> --
> 2.25.1
>
On Fri, Jan 26, 2024 at 5:36 PM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Jan 26, 2024 at 05:28:58PM +0800, Zhaoyang Huang wrote:
> > On Fri, Jan 26, 2024 at 4:55 PM Matthew Wilcox <willy@infradeadorg> wrote:
> > >
> > > On Fri, Jan 26, 2024 at 03:59:48PM +0800, Zhaoyang Huang wrote:
> > > > loop more mm and fs guys for more comments
> > >
> > > I agree with everything Damien said. But also ...
> > ok, I will find a way to solve this problem.
> > >
> > > > > +bool BIO_ADD_FOLIO(struct bio *bio, struct folio *folio, size_t len,
> > > > > + size_t off)
> > >
> > > You don't add any users of these functions. It's hard to assess whether
> > > this is the right API when there are no example users.
> > Actually, the code has been tested on ext4 and f2fs by patchv2 on a
> > v6.6 6GB android system where I get the test result posted on the
> > commit message. These APIs is to keep block layer clean and wrap
> > things up for fs.
>
> well, where's patch v2? i don't see it in my inbox. i'm not going
> to go hunting around the email lists for it. this is not good enough.
>
> > > why are BIO_ADD_PAGE and BIO_ADD_FOLIO so very different from each
> > > other?
> > These two API just repeat the same thing that bio_add_page and
> > bio_add_folio do.
>
> what?
>
> here's the patch you sent. these two functions do wildly different
> things:
>
> +bool BIO_ADD_FOLIO(struct bio *bio, struct folio *folio, size_t len,
> + size_t off)
> +{
> + int class, level, hint, activity;
> +
> + if (len > UINT_MAX || off > UINT_MAX)
> + return false;
> +
> + class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> + level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> + hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> + activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> +
> + activity += (bio->bi_vcnt + 1 <= IOPRIO_NR_ACTIVITY &&
> + PageWorkingset(&folio->page)) ? 1 : 0;
> + if (activity >= bio->bi_vcnt / 2)
> + class = IOPRIO_CLASS_RT;
> + else if (activity >= bio->bi_vcnt / 4)
> + class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> +
> + bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> +
> + return bio_add_page(bio, &folio->page, len, off) > 0;
> +}
> +
> +int BIO_ADD_PAGE(struct bio *bio, struct page *page,
> + unsigned int len, unsigned int offset)
> +{
> + int class, level, hint, activity;
> +
> + if (bio_add_page(bio, page, len, offset) > 0) {
> + class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> + level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> + hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> + activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> + activity += (bio->bi_vcnt <= IOPRIO_NR_ACTIVITY && PageWorkingset(page)) ? 1 : 0;
> + bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> + }
> +
> + return len;
> +}
>
> did you change one and forget to change the other?
Sorry for missing you in the list. Please find below patchv2 where all
activity calculation is located within _bio_add_page which aims at
avoiding iterating the bio->bvec before submit_bio. This is rejected
by Jens as it introduces page operation in the block layer.
block/Kconfig | 8 ++++++++
block/bio.c | 10 ++++++++++
block/blk-mq.c | 21 +++++++++++++++++++++
fs/buffer.c | 6 ++++++
include/linux/buffer_head.h | 1 +
include/uapi/linux/ioprio.h | 20 +++++++++++++++-----
6 files changed, 61 insertions(+), 5 deletions(-)
diff --git a/block/Kconfig b/block/Kconfig
index f1364d1c0d93..8d6075575eae 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -228,6 +228,14 @@ config BLOCK_HOLDER_DEPRECATED
config BLK_MQ_STACKING
bool
+config CONTENT_ACT_BASED_IOPRIO
+ bool "Enable content activity based ioprio"
+ depends on LRU_GEN
+ default y
+ help
+ This item enable the feature of adjust bio's priority by
+ calculating its content's activity.
+
source "block/Kconfig.iosched"
endif # BLOCK
diff --git a/block/bio.c b/block/bio.c
index 816d412c06e9..1228e2a4940f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -24,6 +24,7 @@
#include "blk.h"
#include "blk-rq-qos.h"
#include "blk-cgroup.h"
+#include "blk-ioprio.h"
#define ALLOC_CACHE_THRESHOLD 16
#define ALLOC_CACHE_MAX 256
@@ -1069,12 +1070,21 @@ EXPORT_SYMBOL_GPL(bio_add_zone_append_page);
void __bio_add_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int off)
{
+ int class, level, hint, activity;
+
+ class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
+ level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
+ hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
+ activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
+
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
WARN_ON_ONCE(bio_full(bio, len));
bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt], page, len, off);
bio->bi_iter.bi_size += len;
bio->bi_vcnt++;
+ activity += bio_page_if_active(bio, page, IOPRIO_NR_ACTIVITY);
+ bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level,
hint, activity);
}
EXPORT_SYMBOL_GPL(__bio_add_page);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1fafd54dce3c..05cdd3adde94 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2939,6 +2939,26 @@ static inline struct request
*blk_mq_get_cached_request(struct request_queue *q,
return rq;
}
+#ifdef CONFIG_CONTENT_ACT_BASED_IOPRIO
+static void bio_set_ioprio(struct bio *bio)
+{
+ int class, level, hint, activity;
+
+ class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
+ level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
+ hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
+ activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
+
+ if (activity >= bio->bi_vcnt / 2)
+ class = IOPRIO_CLASS_RT;
+ else if (activity >= bio->bi_vcnt / 4)
+ class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()),
IOPRIO_CLASS_BE);
+
+ bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level,
hint, activity);
+
+ blkcg_set_ioprio(bio);
+}
+#else
static void bio_set_ioprio(struct bio *bio)
{
/* Nobody set ioprio so far? Initialize it based on task's nice value */
@@ -2946,6 +2966,7 @@ static void bio_set_ioprio(struct bio *bio)
bio->bi_ioprio = get_current_ioprio();
blkcg_set_ioprio(bio);
}
+#endif
/**
* blk_mq_submit_bio - Create and send a request to block device.
diff --git a/fs/buffer.c b/fs/buffer.c
index 12e9a71c693d..b15bff481706 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2832,6 +2832,12 @@ void submit_bh(blk_opf_t opf, struct buffer_head *bh)
}
EXPORT_SYMBOL(submit_bh);
+int bio_page_if_active(struct bio *bio, struct page *page, unsigned
short limit)
+{
+ return (bio->bi_vcnt <= limit && PageWorkingset(page)) ? 1 : 0;
+}
+EXPORT_SYMBOL(bio_page_if_active);
+
void write_dirty_buffer(struct buffer_head *bh, blk_opf_t op_flags)
{
lock_buffer(bh);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 44e9de51eedf..9a374f5965ec 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -248,6 +248,7 @@ int bh_uptodate_or_lock(struct buffer_head *bh);
int __bh_read(struct buffer_head *bh, blk_opf_t op_flags, bool wait);
void __bh_read_batch(int nr, struct buffer_head *bhs[],
blk_opf_t op_flags, bool force_lock);
+int bio_page_if_active(struct bio *bio, struct page *page, unsigned
short limit);
/*
* Generic address_space_operations implementations for buffer_head-backed
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index bee2bdb0eedb..d1c6081e796b 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -71,12 +71,18 @@ enum {
* class and level.
*/
#define IOPRIO_HINT_SHIFT IOPRIO_LEVEL_NR_BITS
-#define IOPRIO_HINT_NR_BITS 10
+#define IOPRIO_HINT_NR_BITS 3
#define IOPRIO_NR_HINTS (1 << IOPRIO_HINT_NR_BITS)
#define IOPRIO_HINT_MASK (IOPRIO_NR_HINTS - 1)
#define IOPRIO_PRIO_HINT(ioprio) \
(((ioprio) >> IOPRIO_HINT_SHIFT) & IOPRIO_HINT_MASK)
+#define IOPRIO_ACTIVITY_SHIFT (IOPRIO_HINT_NR_BITS +
IOPRIO_LEVEL_NR_BITS)
+#define IOPRIO_ACTIVITY_NR_BITS 7
+#define IOPRIO_NR_ACTIVITY (1 << IOPRIO_ACTIVITY_NR_BITS)
+#define IOPRIO_ACTIVITY_MASK (IOPRIO_NR_ACTIVITY - 1)
+#define IOPRIO_PRIO_ACTIVITY(ioprio) \
+ (((ioprio) >> IOPRIO_ACTIVITY_SHIFT) & IOPRIO_ACTIVITY_MASK)
/*
* I/O hints.
*/
@@ -108,20 +114,24 @@ enum {
* Return an I/O priority value based on a class, a level and a hint.
*/
static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
- int priohint)
+ int priohint, int activity)
{
if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
- IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS))
+ IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS) ||
+ IOPRIO_BAD_VALUE(activity, IOPRIO_NR_ACTIVITY))
return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
return (prioclass << IOPRIO_CLASS_SHIFT) |
+ (activity << IOPRIO_ACTIVITY_SHIFT) |
(priohint << IOPRIO_HINT_SHIFT) | priolevel;
}
#define IOPRIO_PRIO_VALUE(prioclass, priolevel) \
- ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE)
+ ioprio_value(prioclass, priolevel, IOPRIO_HINT_NONE, 0)
#define IOPRIO_PRIO_VALUE_HINT(prioclass, priolevel, priohint) \
- ioprio_value(prioclass, priolevel, priohint)
+ ioprio_value(prioclass, priolevel, priohint, 0)
+#define IOPRIO_PRIO_VALUE_ACTIVITY(prioclass, priolevel, priohint,
activity) \
+ ioprio_value(prioclass, priolevel, priohint, activity)
#endif /* _UAPI_LINUX_IOPRIO_H */
>
> > These white spaces are trimmed by vim, I will change them back in next version.
>
> vim doesn't do that by default.
>
On Fri, Jan 26, 2024 at 03:59:48PM +0800, Zhaoyang Huang wrote:
> loop more mm and fs guys for more comments
I agree with everything Damien said. But also ...
> > +bool BIO_ADD_FOLIO(struct bio *bio, struct folio *folio, size_t len,
> > + size_t off)
You don't add any users of these functions. It's hard to assess whether
this is the right API when there are no example users.
> > + activity += (bio->bi_vcnt + 1 <= IOPRIO_NR_ACTIVITY &&
> > + PageWorkingset(&folio->page)) ? 1 : 0;
folio_test_workingset().
> > + return bio_add_page(bio, &folio->page, len, off) > 0;
bio_add_folio().
> > +int BIO_ADD_PAGE(struct bio *bio, struct page *page,
> > + unsigned int len, unsigned int offset)
> > +{
> > + int class, level, hint, activity;
> > +
> > + if (bio_add_page(bio, page, len, offset) > 0) {
> > + class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> > + level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> > + hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> > + activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> > + activity += (bio->bi_vcnt <= IOPRIO_NR_ACTIVITY && PageWorkingset(page)) ? 1 : 0;
> > + bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> > + }
why are BIO_ADD_PAGE and BIO_ADD_FOLIO so very different from each
other?
> > static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
> > - int priohint)
> > + int priohint)
why did you change this whitespace?
> > {
> > if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
> > - IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
> > - IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS))
> > + IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
> > + IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS))
ditto
> > return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
> >
> > return (prioclass << IOPRIO_CLASS_SHIFT) |
> > (priohint << IOPRIO_HINT_SHIFT) | priolevel;
> > }
> > -
more gratuitous whitespace change
On Fri, Jan 26, 2024 at 4:55 PM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Jan 26, 2024 at 03:59:48PM +0800, Zhaoyang Huang wrote:
> > loop more mm and fs guys for more comments
>
> I agree with everything Damien said. But also ...
ok, I will find a way to solve this problem.
>
> > > +bool BIO_ADD_FOLIO(struct bio *bio, struct folio *folio, size_t len,
> > > + size_t off)
>
> You don't add any users of these functions. It's hard to assess whether
> this is the right API when there are no example users.
Actually, the code has been tested on ext4 and f2fs by patchv2 on a
v6.6 6GB android system where I get the test result posted on the
commit message. These APIs is to keep block layer clean and wrap
things up for fs.
>
> > > + activity += (bio->bi_vcnt + 1 <= IOPRIO_NR_ACTIVITY &&
> > > + PageWorkingset(&folio->page)) ? 1 : 0;
>
> folio_test_workingset().
>
> > > + return bio_add_page(bio, &folio->page, len, off) > 0;
>
> bio_add_folio().
>
> > > +int BIO_ADD_PAGE(struct bio *bio, struct page *page,
> > > + unsigned int len, unsigned int offset)
> > > +{
> > > + int class, level, hint, activity;
> > > +
> > > + if (bio_add_page(bio, page, len, offset) > 0) {
> > > + class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> > > + level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> > > + hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> > > + activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
> > > + activity += (bio->bi_vcnt <= IOPRIO_NR_ACTIVITY && PageWorkingset(page)) ? 1 : 0;
> > > + bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
> > > + }
>
> why are BIO_ADD_PAGE and BIO_ADD_FOLIO so very different from each
> other?
These two API just repeat the same thing that bio_add_page and
bio_add_folio do.
>
> > > static __always_inline __u16 ioprio_value(int prioclass, int priolevel,
> > > - int priohint)
> > > + int priohint)
>
> why did you change this whitespace?
>
> > > {
> > > if (IOPRIO_BAD_VALUE(prioclass, IOPRIO_NR_CLASSES) ||
> > > - IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
> > > - IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS))
> > > + IOPRIO_BAD_VALUE(priolevel, IOPRIO_NR_LEVELS) ||
> > > + IOPRIO_BAD_VALUE(priohint, IOPRIO_NR_HINTS))
>
> ditto
These white spaces are trimmed by vim, I will change them back in next version.
>
> > > return IOPRIO_CLASS_INVALID << IOPRIO_CLASS_SHIFT;
> > >
> > > return (prioclass << IOPRIO_CLASS_SHIFT) |
> > > (priohint << IOPRIO_HINT_SHIFT) | priolevel;
> > > }
> > > -
>
> more gratuitous whitespace change
>
On Fri, Jan 26, 2024 at 05:28:58PM +0800, Zhaoyang Huang wrote:
> On Fri, Jan 26, 2024 at 4:55 PM Matthew Wilcox <[email protected]> wrote:
> >
> > On Fri, Jan 26, 2024 at 03:59:48PM +0800, Zhaoyang Huang wrote:
> > > loop more mm and fs guys for more comments
> >
> > I agree with everything Damien said. But also ...
> ok, I will find a way to solve this problem.
> >
> > > > +bool BIO_ADD_FOLIO(struct bio *bio, struct folio *folio, size_t len,
> > > > + size_t off)
> >
> > You don't add any users of these functions. It's hard to assess whether
> > this is the right API when there are no example users.
> Actually, the code has been tested on ext4 and f2fs by patchv2 on a
> v6.6 6GB android system where I get the test result posted on the
> commit message. These APIs is to keep block layer clean and wrap
> things up for fs.
well, where's patch v2? i don't see it in my inbox. i'm not going
to go hunting around the email lists for it. this is not good enough.
> > why are BIO_ADD_PAGE and BIO_ADD_FOLIO so very different from each
> > other?
> These two API just repeat the same thing that bio_add_page and
> bio_add_folio do.
what?
here's the patch you sent. these two functions do wildly different
things:
+bool BIO_ADD_FOLIO(struct bio *bio, struct folio *folio, size_t len,
+ size_t off)
+{
+ int class, level, hint, activity;
+
+ if (len > UINT_MAX || off > UINT_MAX)
+ return false;
+
+ class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
+ level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
+ hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
+ activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
+
+ activity += (bio->bi_vcnt + 1 <= IOPRIO_NR_ACTIVITY &&
+ PageWorkingset(&folio->page)) ? 1 : 0;
+ if (activity >= bio->bi_vcnt / 2)
+ class = IOPRIO_CLASS_RT;
+ else if (activity >= bio->bi_vcnt / 4)
+ class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
+
+ bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
+
+ return bio_add_page(bio, &folio->page, len, off) > 0;
+}
+
+int BIO_ADD_PAGE(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
+{
+ int class, level, hint, activity;
+
+ if (bio_add_page(bio, page, len, offset) > 0) {
+ class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
+ level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
+ hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
+ activity = IOPRIO_PRIO_ACTIVITY(bio->bi_ioprio);
+ activity += (bio->bi_vcnt <= IOPRIO_NR_ACTIVITY && PageWorkingset(page)) ? 1 : 0;
+ bio->bi_ioprio = IOPRIO_PRIO_VALUE_ACTIVITY(class, level, hint, activity);
+ }
+
+ return len;
+}
did you change one and forget to change the other?
> These white spaces are trimmed by vim, I will change them back in next version.
vim doesn't do that by default.