2021-04-21 10:29:00

by Changheun Lee

Subject: [PATCH v8] bio: limit bio max size

bio size can grow up to 4GB when multi-page bvec is enabled, but this
sometimes leads to inefficient behavior. In the case of large chunk
direct I/O - a 32MB chunk read in user space - all pages for the 32MB
are merged into a single bio structure if the pages' physical addresses
are contiguous. This delays submission until the merge is complete, so
the bio max size should be limited to a proper size.

When a 32MB chunk read with the direct I/O option comes from userspace,
the current kernel behavior in the do_direct_IO() loop is shown in the
timeline below.

| bio merge for 32MB. a total of 8,192 pages are merged.
| total elapsed time is over 2ms.
|------------------ ... ----------------------->|
| 8,192 pages are merged into one bio.
| at this point, the first bio submit is done.
| 1 bio is split into 32 read requests and issued.
|--------------->
|--------------->
|--------------->
......
|--------------->
|--------------->|
total 19ms elapsed to complete the 32MB read from the device. |

If the bio max size is limited to 1MB, the behavior changes as shown below.

| bio merge for 1MB. 256 pages are merged for each bio.
| a total of 32 bios will be made.
| total elapsed time is still over 2ms.
| but the first bio submit happens much earlier, after about 100us.
|--->|--->|--->|---> ... -->|--->|--->|--->|--->|
| 256 pages are merged into one bio.
| at this point, the first bio submit is done.
| and 1 read request is issued for that bio.
|--------------->
|--------------->
|--------------->
......
|--------------->
|--------------->|
total 17ms elapsed to complete the 32MB read from the device. |

As a result, read requests are issued earlier when the bio max size is
limited. With the current kernel behavior and multi-page bvec, a very
large bio can be created, which delays issuing the first I/O request.
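
For example, the behavior above can be exercised from user space with a single
large O_DIRECT read. This is an illustrative sketch only; the device path,
alignment and 32MB size are assumptions matching the timelines above, not part
of the patch:

#define _GNU_SOURCE			/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const size_t chunk = 32 << 20;	/* 32MB chunk, as in the timelines */
	ssize_t ret;
	void *buf;
	int fd;

	/* O_DIRECT needs an aligned buffer; 4096 matches a typical logical block size */
	if (posix_memalign(&buf, 4096, chunk))
		return 1;

	fd = open("/dev/sda", O_RDONLY | O_DIRECT);	/* device path is an assumption */
	if (fd < 0)
		return 1;

	/* one large read drives the direct I/O page-merging loop described above */
	ret = read(fd, buf, chunk);

	close(fd);
	free(buf);
	return ret < 0;
}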

Signed-off-by: Changheun Lee <[email protected]>
---
block/bio.c | 9 ++++++++-
block/blk-settings.c | 5 +++++
include/linux/bio.h | 4 +++-
include/linux/blkdev.h | 2 ++
4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 50e579088aca..9e5061ecc317 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -255,6 +255,13 @@ void bio_init(struct bio *bio, struct bio_vec *table,
}
EXPORT_SYMBOL(bio_init);

+unsigned int bio_max_size(struct bio *bio)
+{
+ struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+
+ return q->limits.bio_max_bytes;
+}
+
/**
* bio_reset - reinitialize a bio
* @bio: bio to reset
@@ -866,7 +873,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];

if (page_is_mergeable(bv, page, len, off, same_page)) {
- if (bio->bi_iter.bi_size > UINT_MAX - len) {
+ if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
*same_page = false;
return false;
}
diff --git a/block/blk-settings.c b/block/blk-settings.c
index b4aa2f37fab6..cd3dcb5afe50 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -37,6 +37,7 @@ EXPORT_SYMBOL_GPL(blk_queue_rq_timeout);
*/
void blk_set_default_limits(struct queue_limits *lim)
{
+ lim->bio_max_bytes = UINT_MAX;
lim->max_segments = BLK_MAX_SEGMENTS;
lim->max_discard_segments = 1;
lim->max_integrity_segments = 0;
@@ -168,6 +169,10 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto
limits->logical_block_size >> SECTOR_SHIFT);
limits->max_sectors = max_sectors;

+ if (check_shl_overflow(max_sectors, SECTOR_SHIFT,
+ &limits->bio_max_bytes))
+ limits->bio_max_bytes = UINT_MAX;
+
q->backing_dev_info->io_pages = max_sectors >> (PAGE_SHIFT - 9);
}
EXPORT_SYMBOL(blk_queue_max_hw_sectors);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d0246c92a6e8..e5add63da3af 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -106,6 +106,8 @@ static inline void *bio_data(struct bio *bio)
return NULL;
}

+extern unsigned int bio_max_size(struct bio *bio);
+
/**
* bio_full - check if the bio is full
* @bio: bio to check
@@ -119,7 +121,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
if (bio->bi_vcnt >= bio->bi_max_vecs)
return true;

- if (bio->bi_iter.bi_size > UINT_MAX - len)
+ if (bio->bi_iter.bi_size > bio_max_size(bio) - len)
return true;

return false;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 158aefae1030..c205d60ac611 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -312,6 +312,8 @@ enum blk_zoned_model {
};

struct queue_limits {
+ unsigned int bio_max_bytes;
+
unsigned long bounce_pfn;
unsigned long seg_boundary_mask;
unsigned long virt_boundary_mask;
--
2.29.0


2021-04-22 01:29:45

by Changheun Lee

Subject: Re: [PATCH v8] bio: limit bio max size

> On 4/21/21 2:47 AM, Changheun Lee wrote:
> > bio size can grow up to 4GB when muli-page bvec is enabled.
> > but sometimes it would lead to inefficient behaviors.
> > in case of large chunk direct I/O, - 32MB chunk read in user space -
>
> This patch looks good to me, hence:
>
> Reviewed-by: Bart Van Assche <[email protected]>
>
> Do you plan to submit a dm-crypt patch that limits the bio size when
> using dm-crypt to the bio size of the underling device?
>
> Thanks,
>
> Bart.

Yes, I'm preparing a dm-crypt patch. I'll send it upstream after this one is applied.
Thank you for your review and advice. :)

Thanks,

Changheun Lee.
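
For context, a purely hypothetical sketch of how such a dm-crypt limit might be
wired up on top of this patch, using the device-mapper .io_hints hook. The
crypt_config/dev fields and the exact policy are assumptions; the actual
dm-crypt patch may look quite different:

/*
 * Hypothetical sketch, not the actual dm-crypt patch: propagate a bio
 * size cap from the underlying device to the dm-crypt queue limits.
 */
static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
	struct crypt_config *cc = ti->private;
	struct request_queue *q = bdev_get_queue(cc->dev->bdev);

	limits->bio_max_bytes = min(limits->bio_max_bytes,
				    q->limits.bio_max_bytes);
}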

2021-04-22 01:48:07

by Bart Van Assche

Subject: Re: [PATCH v8] bio: limit bio max size

On 4/21/21 2:47 AM, Changheun Lee wrote:
> bio size can grow up to 4GB when muli-page bvec is enabled.
> but sometimes it would lead to inefficient behaviors.
> in case of large chunk direct I/O, - 32MB chunk read in user space -

This patch looks good to me, hence:

Reviewed-by: Bart Van Assche <[email protected]>

Do you plan to submit a dm-crypt patch that limits the bio size when
using dm-crypt to the bio size of the underlying device?

Thanks,

Bart.

2021-04-22 20:34:37

by Bart Van Assche

Subject: Re: [PATCH v8] bio: limit bio max size

On 4/21/21 2:47 AM, Changheun Lee wrote:
> bio size can grow up to 4GB when muli-page bvec is enabled.
> but sometimes it would lead to inefficient behaviors.
> in case of large chunk direct I/O, - 32MB chunk read in user space -
> all pages for 32MB would be merged to a bio structure if the pages
> physical addresses are contiguous. it makes some delay to submit
> until merge complete. bio max size should be limited to a proper size.

Hi Christoph, Ming and Damien,

Can one of you take a look at this patch?

Thanks,

Bart.

2021-04-23 11:39:53

by Damien Le Moal

Subject: Re: [PATCH v8] bio: limit bio max size

On 2021/04/21 19:05, Changheun Lee wrote:
> bio size can grow up to 4GB when muli-page bvec is enabled.
> but sometimes it would lead to inefficient behaviors.
> in case of large chunk direct I/O, - 32MB chunk read in user space -
> all pages for 32MB would be merged to a bio structure if the pages
> physical addresses are contiguous. it makes some delay to submit
> until merge complete. bio max size should be limited to a proper size.

And what is the "proper size"? You should be specific here and mention
max_sectors or max_hw_sectors.

>
> When 32MB chunk read with direct I/O option is coming from userspace,
> kernel behavior is below now in do_direct_IO() loop. it's timeline.
>
> | bio merge for 32MB. total 8,192 pages are merged.
> | total elapsed time is over 2ms.
> |------------------ ... ----------------------->|
> | 8,192 pages merged a bio.
> | at this time, first bio submit is done.
> | 1 bio is split to 32 read request and issue.
> |--------------->
> |--------------->
> |--------------->
> ......
> |--------------->
> |--------------->|
> total 19ms elapsed to complete 32MB read done from device. |
>
> If bio max size is limited with 1MB, behavior is changed below.
>
> | bio merge for 1MB. 256 pages are merged for each bio.
> | total 32 bio will be made.
> | total elapsed time is over 2ms. it's same.
> | but, first bio submit timing is fast. about 100us.
> |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
> | 256 pages merged a bio.
> | at this time, first bio submit is done.
> | and 1 read request is issued for 1 bio.
> |--------------->
> |--------------->
> |--------------->
> ......
> |--------------->
> |--------------->|
> total 17ms elapsed to complete 32MB read done from device. |
>
> As a result, read request issue timing is faster if bio max size is limited.
> Current kernel behavior with multipage bvec, super large bio can be created.
> And it lead to delay first I/O request issue.

The above message describes the problem very well, but there is no explanation
of the solution implemented. So one cannot check if the implementation matches
the intent. Please add a description of the solution, more detailed than just
saying "limit bio size".

>
> Signed-off-by: Changheun Lee <[email protected]>
> ---
> block/bio.c | 9 ++++++++-
> block/blk-settings.c | 5 +++++
> include/linux/bio.h | 4 +++-
> include/linux/blkdev.h | 2 ++
> 4 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 50e579088aca..9e5061ecc317 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -255,6 +255,13 @@ void bio_init(struct bio *bio, struct bio_vec *table,
> }
> EXPORT_SYMBOL(bio_init);
>
> +unsigned int bio_max_size(struct bio *bio)
> +{
> + struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +
> + return q->limits.bio_max_bytes;
> +}
> +
> /**
> * bio_reset - reinitialize a bio
> * @bio: bio to reset
> @@ -866,7 +873,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
> struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>
> if (page_is_mergeable(bv, page, len, off, same_page)) {
> - if (bio->bi_iter.bi_size > UINT_MAX - len) {
> + if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
> *same_page = false;
> return false;
> }
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index b4aa2f37fab6..cd3dcb5afe50 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -37,6 +37,7 @@ EXPORT_SYMBOL_GPL(blk_queue_rq_timeout);
> */
> void blk_set_default_limits(struct queue_limits *lim)
> {
> + lim->bio_max_bytes = UINT_MAX;
> lim->max_segments = BLK_MAX_SEGMENTS;
> lim->max_discard_segments = 1;
> lim->max_integrity_segments = 0;
> @@ -168,6 +169,10 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto
> limits->logical_block_size >> SECTOR_SHIFT);
> limits->max_sectors = max_sectors;
>
> + if (check_shl_overflow(max_sectors, SECTOR_SHIFT,
> + &limits->bio_max_bytes))
> + limits->bio_max_bytes = UINT_MAX;

limits->bio_max_bytes = min_t(sector_t,
(sector_t)max_sectors << SECTOR_SHIFT, UINT_MAX);

is easier to understand in my opinion.

> +
> q->backing_dev_info->io_pages = max_sectors >> (PAGE_SHIFT - 9);
> }
> EXPORT_SYMBOL(blk_queue_max_hw_sectors);
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index d0246c92a6e8..e5add63da3af 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -106,6 +106,8 @@ static inline void *bio_data(struct bio *bio)
> return NULL;
> }
>
> +extern unsigned int bio_max_size(struct bio *bio);

No need for extern.

> +
> /**
> * bio_full - check if the bio is full
> * @bio: bio to check
> @@ -119,7 +121,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
> if (bio->bi_vcnt >= bio->bi_max_vecs)
> return true;
>
> - if (bio->bi_iter.bi_size > UINT_MAX - len)
> + if (bio->bi_iter.bi_size > bio_max_size(bio) - len)
> return true;
>
> return false;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 158aefae1030..c205d60ac611 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -312,6 +312,8 @@ enum blk_zoned_model {
> };
>
> struct queue_limits {
> + unsigned int bio_max_bytes;

Please move this field further down in the structure, together with the
other fields that are unsigned int.

> +
> unsigned long bounce_pfn;
> unsigned long seg_boundary_mask;
> unsigned long virt_boundary_mask;
>

As far as I can tell, bio_max_bytes is UINT_MAX by default, becomes equal to
max_hw_sectors when that limit is set by a driver, and cannot take any other
value. So why introduce this new limit at all? Why not use max_hw_sectors
directly as a bio size limit?
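
For illustration, a minimal sketch of that alternative - deriving the cap from
max_hw_sectors on the fly instead of storing a separate bio_max_bytes field.
The clamp to UINT_MAX is needed because bi_iter.bi_size is an unsigned int
(sketch only, not a submitted patch):

unsigned int bio_max_size(struct bio *bio)
{
	struct request_queue *q = bio->bi_bdev->bd_disk->queue;

	/* max_hw_sectors is in 512-byte units; clamp since bi_size is a u32 */
	return min_t(u64, (u64)queue_max_hw_sectors(q) << SECTOR_SHIFT,
		     UINT_MAX);
}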



--
Damien Le Moal
Western Digital Research

2021-04-23 14:40:15

by Jens Axboe

Subject: Re: [PATCH v8] bio: limit bio max size

On 4/21/21 3:47 AM, Changheun Lee wrote:
> bio size can grow up to 4GB when muli-page bvec is enabled.
> but sometimes it would lead to inefficient behaviors.
> in case of large chunk direct I/O, - 32MB chunk read in user space -
> all pages for 32MB would be merged to a bio structure if the pages
> physical addresses are contiguous. it makes some delay to submit
> until merge complete. bio max size should be limited to a proper size.
>
> When 32MB chunk read with direct I/O option is coming from userspace,
> kernel behavior is below now in do_direct_IO() loop. it's timeline.
>
> | bio merge for 32MB. total 8,192 pages are merged.
> | total elapsed time is over 2ms.
> |------------------ ... ----------------------->|
> | 8,192 pages merged a bio.
> | at this time, first bio submit is done.
> | 1 bio is split to 32 read request and issue.
> |--------------->
> |--------------->
> |--------------->
> ......
> |--------------->
> |--------------->|
> total 19ms elapsed to complete 32MB read done from device. |
>
> If bio max size is limited with 1MB, behavior is changed below.
>
> | bio merge for 1MB. 256 pages are merged for each bio.
> | total 32 bio will be made.
> | total elapsed time is over 2ms. it's same.
> | but, first bio submit timing is fast. about 100us.
> |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
> | 256 pages merged a bio.
> | at this time, first bio submit is done.
> | and 1 read request is issued for 1 bio.
> |--------------->
> |--------------->
> |--------------->
> ......
> |--------------->
> |--------------->|
> total 17ms elapsed to complete 32MB read done from device. |
>
> As a result, read request issue timing is faster if bio max size is limited.
> Current kernel behavior with multipage bvec, super large bio can be created.
> And it lead to delay first I/O request issue.

Applied, thanks.

--
Jens Axboe

2021-04-25 02:07:10

by Oliver Sang

Subject: [bio] 803f54ef52: BUG:kernel_NULL_pointer_dereference,address



Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 803f54ef52fc0eec23aa58fa64f2b6fcf67dd466 ("[PATCH v8] bio: limit bio max size")
url: https://github.com/0day-ci/linux/commits/Changheun-Lee/bio-limit-bio-max-size/20210421-180805
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 1fe5501ba1abf2b7e78295df73675423bd6899a0

in testcase: kernel-builtin
version:
with following parameters:

sleep: 10



on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+---------------------------------------------+------------+------------+
| | 1fe5501ba1 | 803f54ef52 |
+---------------------------------------------+------------+------------+
| boot_successes | 7 | 0 |
| boot_failures | 0 | 10 |
| BUG:kernel_NULL_pointer_dereference,address | 0 | 10 |
| Oops:#[##] | 0 | 10 |
| RIP:bio_add_hw_page | 0 | 10 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 10 |
+---------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 7.411064] BUG: kernel NULL pointer dereference, address: 0000000000000368
[ 7.411687] #PF: supervisor read access in kernel mode
[ 7.412167] #PF: error_code(0x0000) - not-present page
[ 7.412649] PGD 0 P4D 0
[ 7.412930] Oops: 0000 [#1] SMP PTI
[ 7.413278] CPU: 0 PID: 173 Comm: kworker/u4:2 Not tainted 5.12.0-rc8-00005-g803f54ef52fc #1
[ 7.414041] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 7.414791] Workqueue: events_unbound async_run_entry_fn
[ 7.415280] RIP: 0010:bio_add_hw_page (kbuild/src/consumer/block/bio.c:260 kbuild/src/consumer/include/linux/bio.h:124 kbuild/src/consumer/include/linux/bio.h:119 kbuild/src/consumer/block/bio.c:778 kbuild/src/consumer/block/bio.c:753)
[ 7.415717] Code: 09 44 39 c8 0f 87 f5 00 00 00 0f b7 46 60 49 89 fc 49 89 d6 45 89 c5 66 85 c0 75 60 66 39 43 62 0f 86 d9 00 00 00 48 8b 53 08 <48> 8b 92 68 03 00 00 48 8b 52 50 8b 92 08 04 00 00 29 ea 39 53 28
All code
========
0: 09 44 39 c8 or %eax,-0x38(%rcx,%rdi,1)
4: 0f 87 f5 00 00 00 ja 0xff
a: 0f b7 46 60 movzwl 0x60(%rsi),%eax
e: 49 89 fc mov %rdi,%r12
11: 49 89 d6 mov %rdx,%r14
14: 45 89 c5 mov %r8d,%r13d
17: 66 85 c0 test %ax,%ax
1a: 75 60 jne 0x7c
1c: 66 39 43 62 cmp %ax,0x62(%rbx)
20: 0f 86 d9 00 00 00 jbe 0xff
26: 48 8b 53 08 mov 0x8(%rbx),%rdx
2a:* 48 8b 92 68 03 00 00 mov 0x368(%rdx),%rdx <-- trapping instruction
31: 48 8b 52 50 mov 0x50(%rdx),%rdx
35: 8b 92 08 04 00 00 mov 0x408(%rdx),%edx
3b: 29 ea sub %ebp,%edx
3d: 39 53 28 cmp %edx,0x28(%rbx)

Code starting with the faulting instruction
===========================================
0: 48 8b 92 68 03 00 00 mov 0x368(%rdx),%rdx
7: 48 8b 52 50 mov 0x50(%rdx),%rdx
b: 8b 92 08 04 00 00 mov 0x408(%rdx),%edx
11: 29 ea sub %ebp,%edx
13: 39 53 28 cmp %edx,0x28(%rbx)
[ 7.417280] RSP: 0000:ffffaef600247c00 EFLAGS: 00010202
[ 7.417757] RAX: 0000000000000000 RBX: ffff9144f8624cc0 RCX: 0000000000000024
[ 7.418378] RDX: 0000000000000000 RSI: ffff9144f8624cc0 RDI: ffff9144af936d60
[ 7.418998] RBP: 0000000000000024 R08: 0000000000000200 R09: 0000000000000200
[ 7.419615] R10: 0000000000000002 R11: ffff9144f8619c77 R12: ffff9144af936d60
[ 7.420233] R13: 0000000000000200 R14: ffffdf6144d7ab40 R15: 0000000000000024
[ 7.420861] FS: 0000000000000000(0000) GS:ffff9147afc00000(0000) knlGS:0000000000000000
[ 7.421595] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.422108] CR2: 0000000000000368 CR3: 0000000135e8a000 CR4: 00000000000406f0
[ 7.422728] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7.423350] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 7.423971] Call Trace:
[ 7.424256] bio_add_pc_page (kbuild/src/consumer/block/bio.c:812)
[ 7.424633] blk_rq_map_kern (kbuild/src/consumer/block/blk-map.c:414 kbuild/src/consumer/block/blk-map.c:698)
[ 7.425017] __scsi_execute (kbuild/src/consumer/drivers/scsi/scsi_lib.c:258 (discriminator 1))
[ 7.425395] scsi_probe_and_add_lun (kbuild/src/consumer/include/scsi/scsi_device.h:461 kbuild/src/consumer/drivers/scsi/scsi_scan.c:592 kbuild/src/consumer/drivers/scsi/scsi_scan.c:1086)
[ 7.425821] ? __pm_runtime_resume (kbuild/src/consumer/drivers/base/power/runtime.c:1114)
[ 7.426229] __scsi_add_device (kbuild/src/consumer/drivers/scsi/scsi_scan.c:1480)
[ 7.426619] ata_scsi_scan_host (kbuild/src/consumer/drivers/ata/libata-scsi.c:4336) libata
[ 7.427087] async_run_entry_fn (kbuild/src/consumer/kernel/async.c:124)
[ 7.427485] process_one_work (kbuild/src/consumer/arch/x86/include/asm/jump_label.h:25 kbuild/src/consumer/include/linux/jump_label.h:200 kbuild/src/consumer/include/trace/events/workqueue.h:108 kbuild/src/consumer/kernel/workqueue.c:2280)
[ 7.427872] ? process_one_work (kbuild/src/consumer/kernel/workqueue.c:2364)
[ 7.428274] worker_thread (kbuild/src/consumer/include/linux/list.h:282 kbuild/src/consumer/kernel/workqueue.c:2422)
[ 7.428644] ? process_one_work (kbuild/src/consumer/kernel/workqueue.c:2364)
[ 7.429047] kthread (kbuild/src/consumer/kernel/kthread.c:292)
[ 7.429376] ? kthread_park (kbuild/src/consumer/kernel/kthread.c:245)
[ 7.429738] ret_from_fork (kbuild/src/consumer/arch/x86/entry/entry_64.S:300)
[ 7.430097] Modules linked in: syscopyarea sysfillrect sysimgblt fb_sys_fops drm intel_rapl_msr ppdev intel_rapl_common crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel rapl joydev ata_piix libata serio_raw i2c_piix4 ipmi_devintf ipmi_msghandler parport_pc parport ip_tables
[ 7.432217] CR2: 0000000000000368
[ 7.432563] ---[ end trace da8ba044c8e60dc6 ]---
[ 7.432992] RIP: 0010:bio_add_hw_page (kbuild/src/consumer/block/bio.c:260 kbuild/src/consumer/include/linux/bio.h:124 kbuild/src/consumer/include/linux/bio.h:119 kbuild/src/consumer/block/bio.c:778 kbuild/src/consumer/block/bio.c:753)
[ 7.433424] Code: 09 44 39 c8 0f 87 f5 00 00 00 0f b7 46 60 49 89 fc 49 89 d6 45 89 c5 66 85 c0 75 60 66 39 43 62 0f 86 d9 00 00 00 48 8b 53 08 <48> 8b 92 68 03 00 00 48 8b 52 50 8b 92 08 04 00 00 29 ea 39 53 28
All code
========
0: 09 44 39 c8 or %eax,-0x38(%rcx,%rdi,1)
4: 0f 87 f5 00 00 00 ja 0xff
a: 0f b7 46 60 movzwl 0x60(%rsi),%eax
e: 49 89 fc mov %rdi,%r12
11: 49 89 d6 mov %rdx,%r14
14: 45 89 c5 mov %r8d,%r13d
17: 66 85 c0 test %ax,%ax
1a: 75 60 jne 0x7c
1c: 66 39 43 62 cmp %ax,0x62(%rbx)
20: 0f 86 d9 00 00 00 jbe 0xff
26: 48 8b 53 08 mov 0x8(%rbx),%rdx
2a:* 48 8b 92 68 03 00 00 mov 0x368(%rdx),%rdx <-- trapping instruction
31: 48 8b 52 50 mov 0x50(%rdx),%rdx
35: 8b 92 08 04 00 00 mov 0x408(%rdx),%edx
3b: 29 ea sub %ebp,%edx
3d: 39 53 28 cmp %edx,0x28(%rbx)

Code starting with the faulting instruction
===========================================
0: 48 8b 92 68 03 00 00 mov 0x368(%rdx),%rdx
7: 48 8b 52 50 mov 0x50(%rdx),%rdx
b: 8b 92 08 04 00 00 mov 0x408(%rdx),%edx
11: 29 ea sub %ebp,%edx
13: 39 53 28 cmp %edx,0x28(%rbx)


To reproduce:

# build kernel
cd linux
cp config-5.12.0-rc8-00005-g803f54ef52fc .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (8.50 kB)
config-5.12.0-rc8-00005-g803f54ef52fc (172.81 kB)
job-script (4.50 kB)
dmesg.xz (13.93 kB)

2021-04-26 13:19:38

by Marek Szyprowski

Subject: Re: [PATCH v8] bio: limit bio max size

Hi Changheun,

On 21.04.2021 11:47, Changheun Lee wrote:
> bio size can grow up to 4GB when muli-page bvec is enabled.
> but sometimes it would lead to inefficient behaviors.
> in case of large chunk direct I/O, - 32MB chunk read in user space -
> all pages for 32MB would be merged to a bio structure if the pages
> physical addresses are contiguous. it makes some delay to submit
> until merge complete. bio max size should be limited to a proper size.
>
> When 32MB chunk read with direct I/O option is coming from userspace,
> kernel behavior is below now in do_direct_IO() loop. it's timeline.
>
> | bio merge for 32MB. total 8,192 pages are merged.
> | total elapsed time is over 2ms.
> |------------------ ... ----------------------->|
> | 8,192 pages merged a bio.
> | at this time, first bio submit is done.
> | 1 bio is split to 32 read request and issue.
> |--------------->
> |--------------->
> |--------------->
> ......
> |--------------->
> |--------------->|
> total 19ms elapsed to complete 32MB read done from device. |
>
> If bio max size is limited with 1MB, behavior is changed below.
>
> | bio merge for 1MB. 256 pages are merged for each bio.
> | total 32 bio will be made.
> | total elapsed time is over 2ms. it's same.
> | but, first bio submit timing is fast. about 100us.
> |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
> | 256 pages merged a bio.
> | at this time, first bio submit is done.
> | and 1 read request is issued for 1 bio.
> |--------------->
> |--------------->
> |--------------->
> ......
> |--------------->
> |--------------->|
> total 17ms elapsed to complete 32MB read done from device. |
>
> As a result, read request issue timing is faster if bio max size is limited.
> Current kernel behavior with multipage bvec, super large bio can be created.
> And it lead to delay first I/O request issue.
>
> Signed-off-by: Changheun Lee <[email protected]>

This patch landed in linux-next 20210426 as commit 42fb54fbc707 ("bio:
limit bio max size"). Sadly it causes the following regression during
boot on my test systems:

8<--- cut here ---
Unable to handle kernel NULL pointer dereference at virtual address 0000023c
pgd = (ptrval)
[0000023c] *pgd=00000000
Internal error: Oops: 5 [#2] SMP ARM
Modules linked in:
CPU: 0 PID: 186 Comm: systemd-udevd Tainted: G      D
5.12.0-next-20210426 #3045
Hardware name: Generic DT based system
PC is at bio_add_hw_page+0x58/0x1fc
LR is at bio_add_pc_page+0x40/0x5c
pc : [<c06c5bf0>]    lr : [<c06c5dd4>]    psr: 20000013
sp : c3cc7de0  ip : ffffffff  fp : 00000000
r10: 00000cc0  r9 : c20b2000  r8 : c21b5680
r7 : dbc51b80  r6 : c30d0540  r5 : 00000014  r4 : c21b5680
r3 : 00000000  r2 : 00000000  r1 : 00000000  r0 : c30d0540
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 43ccc06a  DAC: 00000051
Register r0 information: slab request_queue start c30d0540 pointer offset 0
Register r1 information: NULL pointer
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: slab kmalloc-128 start c21b5680 pointer offset
0 size 128
Register r5 information: non-paged memory
Register r6 information: slab request_queue start c30d0540 pointer offset 0
Register r7 information: non-slab/vmalloc memory
Register r8 information: slab kmalloc-128 start c21b5680 pointer offset
0 size 128
Register r9 information: slab kmalloc-4k start c20b2000 pointer offset 0
size 4096
Register r10 information: non-paged memory
Register r11 information: NULL pointer
Register r12 information: non-paged memory
Process systemd-udevd (pid: 186, stack limit = 0x(ptrval))
Stack: (0xc3cc7de0 to 0xc3cc8000)
...
[<c06c5bf0>] (bio_add_hw_page) from [<c06c5dd4>] (bio_add_pc_page+0x40/0x5c)
[<c06c5dd4>] (bio_add_pc_page) from [<c06cf0ac>]
(blk_rq_map_kern+0x234/0x304)
[<c06cf0ac>] (blk_rq_map_kern) from [<c0a54634>] (serial_show+0x64/0xd4)
[<c0a54634>] (serial_show) from [<c0a228ac>] (dev_attr_show+0x18/0x48)
[<c0a228ac>] (dev_attr_show) from [<c054721c>] (sysfs_kf_seq_show+0x88/0xf4)
[<c054721c>] (sysfs_kf_seq_show) from [<c04d7a44>]
(seq_read_iter+0x10c/0x4bc)
[<c04d7a44>] (seq_read_iter) from [<c04adf60>] (vfs_read+0x1d4/0x2e0)
[<c04adf60>] (vfs_read) from [<c04ae47c>] (ksys_read+0x5c/0xd0)
[<c04ae47c>] (ksys_read) from [<c03000c0>] (ret_fast_syscall+0x0/0x58)
Exception stack(0xc3cc7fa8 to 0xc3cc7ff0)
...
Code: e1520003 9a000021 e5942004 e5941020 (e592223c)
---[ end trace 51c4d8003ec70244 ]---

It can also be reproduced with QEMU and an ARM 32-bit virt machine.
Reverting it on top of linux-next 20210426 fixes the issue. If you need
more information, let me know.
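
For what it's worth, the faulting offset in both reports points at the
bio->bi_bdev dereference in bio_max_size(): bios built through
blk_rq_map_kern() during device probing have no block device attached yet. A
minimal, hypothetical guard - not necessarily the eventual fix - would fall
back to the old cap:

unsigned int bio_max_size(struct bio *bio)
{
	struct block_device *bdev = bio->bi_bdev;

	/* e.g. blk_rq_map_kern() bios during SCSI probing have no bdev yet */
	if (!bdev)
		return UINT_MAX;

	return bdev->bd_disk->queue->limits.bio_max_bytes;
}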

> ---
> block/bio.c | 9 ++++++++-
> block/blk-settings.c | 5 +++++
> include/linux/bio.h | 4 +++-
> include/linux/blkdev.h | 2 ++
> 4 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 50e579088aca..9e5061ecc317 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -255,6 +255,13 @@ void bio_init(struct bio *bio, struct bio_vec *table,
> }
> EXPORT_SYMBOL(bio_init);
>
> +unsigned int bio_max_size(struct bio *bio)
> +{
> + struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +
> + return q->limits.bio_max_bytes;
> +}
> +
> /**
> * bio_reset - reinitialize a bio
> * @bio: bio to reset
> @@ -866,7 +873,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
> struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>
> if (page_is_mergeable(bv, page, len, off, same_page)) {
> - if (bio->bi_iter.bi_size > UINT_MAX - len) {
> + if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
> *same_page = false;
> return false;
> }
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index b4aa2f37fab6..cd3dcb5afe50 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -37,6 +37,7 @@ EXPORT_SYMBOL_GPL(blk_queue_rq_timeout);
> */
> void blk_set_default_limits(struct queue_limits *lim)
> {
> + lim->bio_max_bytes = UINT_MAX;
> lim->max_segments = BLK_MAX_SEGMENTS;
> lim->max_discard_segments = 1;
> lim->max_integrity_segments = 0;
> @@ -168,6 +169,10 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto
> limits->logical_block_size >> SECTOR_SHIFT);
> limits->max_sectors = max_sectors;
>
> + if (check_shl_overflow(max_sectors, SECTOR_SHIFT,
> + &limits->bio_max_bytes))
> + limits->bio_max_bytes = UINT_MAX;
> +
> q->backing_dev_info->io_pages = max_sectors >> (PAGE_SHIFT - 9);
> }
> EXPORT_SYMBOL(blk_queue_max_hw_sectors);
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index d0246c92a6e8..e5add63da3af 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -106,6 +106,8 @@ static inline void *bio_data(struct bio *bio)
> return NULL;
> }
>
> +extern unsigned int bio_max_size(struct bio *bio);
> +
> /**
> * bio_full - check if the bio is full
> * @bio: bio to check
> @@ -119,7 +121,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
> if (bio->bi_vcnt >= bio->bi_max_vecs)
> return true;
>
> - if (bio->bi_iter.bi_size > UINT_MAX - len)
> + if (bio->bi_iter.bi_size > bio_max_size(bio) - len)
> return true;
>
> return false;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 158aefae1030..c205d60ac611 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -312,6 +312,8 @@ enum blk_zoned_model {
> };
>
> struct queue_limits {
> + unsigned int bio_max_bytes;
> +
> unsigned long bounce_pfn;
> unsigned long seg_boundary_mask;
> unsigned long virt_boundary_mask;

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

2021-04-26 20:13:25

by Jon Hunter

Subject: Re: [PATCH v8] bio: limit bio max size


On 26/04/2021 14:18, Marek Szyprowski wrote:

...

> This patch landed in linux-next 20210426 as commit 42fb54fbc707 ("bio:
> limit bio max size"). Sadly it causes the following regression during
> boot on my test systems:
>
> 8<--- cut here ---
> Unable to handle kernel NULL pointer dereference at virtual address 0000023c
> pgd = (ptrval)
> [0000023c] *pgd=00000000
> Internal error: Oops: 5 [#2] SMP ARM
> Modules linked in:
> CPU: 0 PID: 186 Comm: systemd-udevd Tainted: G      D
> 5.12.0-next-20210426 #3045
> Hardware name: Generic DT based system
> PC is at bio_add_hw_page+0x58/0x1fc
> LR is at bio_add_pc_page+0x40/0x5c
> pc : [<c06c5bf0>]    lr : [<c06c5dd4>]    psr: 20000013
> sp : c3cc7de0  ip : ffffffff  fp : 00000000
> r10: 00000cc0  r9 : c20b2000  r8 : c21b5680
> r7 : dbc51b80  r6 : c30d0540  r5 : 00000014  r4 : c21b5680
> r3 : 00000000  r2 : 00000000  r1 : 00000000  r0 : c30d0540
> Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> Control: 10c5387d  Table: 43ccc06a  DAC: 00000051
> Register r0 information: slab request_queue start c30d0540 pointer offset 0
> Register r1 information: NULL pointer
> Register r2 information: NULL pointer
> Register r3 information: NULL pointer
> Register r4 information: slab kmalloc-128 start c21b5680 pointer offset
> 0 size 128
> Register r5 information: non-paged memory
> Register r6 information: slab request_queue start c30d0540 pointer offset 0
> Register r7 information: non-slab/vmalloc memory
> Register r8 information: slab kmalloc-128 start c21b5680 pointer offset
> 0 size 128
> Register r9 information: slab kmalloc-4k start c20b2000 pointer offset 0
> size 4096
> Register r10 information: non-paged memory
> Register r11 information: NULL pointer
> Register r12 information: non-paged memory
> Process systemd-udevd (pid: 186, stack limit = 0x(ptrval))
> Stack: (0xc3cc7de0 to 0xc3cc8000)
> ...
> [<c06c5bf0>] (bio_add_hw_page) from [<c06c5dd4>] (bio_add_pc_page+0x40/0x5c)
> [<c06c5dd4>] (bio_add_pc_page) from [<c06cf0ac>]
> (blk_rq_map_kern+0x234/0x304)
> [<c06cf0ac>] (blk_rq_map_kern) from [<c0a54634>] (serial_show+0x64/0xd4)
> [<c0a54634>] (serial_show) from [<c0a228ac>] (dev_attr_show+0x18/0x48)
> [<c0a228ac>] (dev_attr_show) from [<c054721c>] (sysfs_kf_seq_show+0x88/0xf4)
> [<c054721c>] (sysfs_kf_seq_show) from [<c04d7a44>]
> (seq_read_iter+0x10c/0x4bc)
> [<c04d7a44>] (seq_read_iter) from [<c04adf60>] (vfs_read+0x1d4/0x2e0)
> [<c04adf60>] (vfs_read) from [<c04ae47c>] (ksys_read+0x5c/0xd0)
> [<c04ae47c>] (ksys_read) from [<c03000c0>] (ret_fast_syscall+0x0/0x58)
> Exception stack(0xc3cc7fa8 to 0xc3cc7ff0)
> ...
> Code: e1520003 9a000021 e5942004 e5941020 (e592223c)
> ---[ end trace 51c4d8003ec70244 ]---


I have also noticed that an eMMC test of ours started failing today,
and bisect points to this commit. Reverting this change fixes it. The
signature is a bit different from the above, but the commit nonetheless
seems to be causing problems ...

[ 76.675488] ------------[ cut here ]------------
[ 76.680147] WARNING: CPU: 1 PID: 705 at /dvs/git/dirty/git-master_l4t-upstream/kernel/block/bio.c:1033 bio_iov_iter_get_pages+0x480/0x490
[ 76.692518] Modules linked in: snd_soc_tegra30_i2s snd_soc_tegra_pcm snd_hda_codec_hdmi snd_soc_rt5640 snd_soc_tegra_rt5640 snd_soc_rl6231 snd_soc_tegra_utils snd_soc_core ac97_bus snd_pcm_dmaengine snd_soc_tegra30_ahub snd_hda_tegra snd_hda_codec snd_hda_core snd_pcm tegra_soctherm xhci_tegra snd_timer snd soundcore nouveau drm_ttm_helper ttm tegra30_devfreq tegra_wdt
[ 76.725279] CPU: 1 PID: 705 Comm: dd Not tainted 5.12.0-next-20210426-g3f1fee3e7237 #1
[ 76.733192] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
[ 76.739457] [<c0311628>] (unwind_backtrace) from [<c030bdd4>] (show_stack+0x10/0x14)
[ 76.747203] [<c030bdd4>] (show_stack) from [<c0fe550c>] (dump_stack+0xc8/0xdc)
[ 76.754423] [<c0fe550c>] (dump_stack) from [<c0345900>] (__warn+0x104/0x108)
[ 76.761466] [<c0345900>] (__warn) from [<c03459b8>] (warn_slowpath_fmt+0xb4/0xbc)
[ 76.768950] [<c03459b8>] (warn_slowpath_fmt) from [<c06cbb58>] (bio_iov_iter_get_pages+0x480/0x490)
[ 76.777996] [<c06cbb58>] (bio_iov_iter_get_pages) from [<c0533864>] (iomap_dio_bio_actor+0x278/0x528)
[ 76.787216] [<c0533864>] (iomap_dio_bio_actor) from [<c052f2d0>] (iomap_apply+0x170/0x440)
[ 76.795476] [<c052f2d0>] (iomap_apply) from [<c053433c>] (__iomap_dio_rw+0x3f0/0x638)
[ 76.803297] [<c053433c>] (__iomap_dio_rw) from [<c0534598>] (iomap_dio_rw+0x14/0x3c)
[ 76.811043] [<c0534598>] (iomap_dio_rw) from [<c056347c>] (ext4_file_write_iter+0x550/0xa78)
[ 76.819483] [<c056347c>] (ext4_file_write_iter) from [<c04b1700>] (vfs_write+0x2ec/0x3bc)
[ 76.827662] [<c04b1700>] (vfs_write) from [<c04b1954>] (ksys_write+0xa8/0xd8)
[ 76.834792] [<c04b1954>] (ksys_write) from [<c03000c0>] (ret_fast_syscall+0x0/0x58)
[ 76.842435] Exception stack(0xc5471fa8 to 0xc5471ff0)
[ 76.847485] 1fa0: 0000006c 06400000 00000001 b0a2f000 06400000 00000000
[ 76.855653] 1fc0: 0000006c 06400000 0050d0b8 00000004 0050d0b8 0050d0b8 00000000 00000000
[ 76.863823] 1fe0: 00000004 beb0c9d0 b6ebdc0b b6e48206
[ 76.868917] ---[ end trace d33cae3bcbc64fcb ]---


Cheers
Jon

--
nvpublic