Date: Sat, 4 Nov 2023 00:20:51 +0800
From: Ming Lei <ming.lei@redhat.com>
To: Ed Tsai (蔡宗軒) <ed.tsai@mediatek.com>
Cc: matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
    axboe@kernel.dk, Will Shiu, Peter Wang, linux-block@vger.kernel.org,
    linux-kernel@vger.kernel.org, Alice Chao,
    linux-mediatek@lists.infradead.org, wsd_upstream, Casper Li,
    Chun-Hung Wu, Powen Kao, Naomi Chu,
    linux-arm-kernel@lists.infradead.org, Stanley Chu,
    ming.lei@redhat.com
Subject: Re: [PATCH 1/1] block: Check the queue limit before bio submitting
References: <20231025092255.27930-1-ed.tsai@mediatek.com>
 <64db8f5406571c2f89b70f852eb411320201abe6.camel@mediatek.com>
In-Reply-To: <64db8f5406571c2f89b70f852eb411320201abe6.camel@mediatek.com>

On Wed, Nov 01, 2023 at 02:23:26AM +0000, Ed Tsai (蔡宗軒) wrote:
> On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@mediatek.com wrote:
> > From: Ed Tsai <ed.tsai@mediatek.com>
> >
> > Referring to commit 07173c3ec276 ("block: enable multipage bvecs"),
> > each bio_vec can now hold more than one page, so a bio may exceed
> > 1MB in size and become misaligned with the queue limit.
> >
> > In a sequential read/write scenario, the file system maximizes the
> > bio's capacity before submitting it. However, misalignment with the
> > queue limit can result in the bio being split into smaller I/O
> > operations.
> >
> > For instance, assume the maximum I/O size is set to 512KB and
> > memory is highly fragmented, so that each bio contains only 2-page
> > bio_vecs (i.e., bi_size = 1028KB). Such a bio would be split into
> > two 512KB portions and one 4KB portion. As a result, the originally
> > expected stream of large contiguous I/O operations is interspersed
> > with many small I/O operations.
> >
> > To address this issue, this patch adds a check against max_sectors
> > before submitting the bio. This allows the upper layers to
> > proactively detect and handle the alignment issue.
> >
> > I performed the Antutu V10 Storage Test on a UFS 4.0 device, which
> > showed a significant improvement in the Sequential test:
> >
> > Sequential Read (average of 5 rounds):
> > Original: 3033.7 MB/sec
> > Patched: 3520.9 MB/sec
> >
> > Sequential Write (average of 5 rounds):
> > Original: 2225.4 MB/sec
> > Patched: 2800.3 MB/sec
> >
> > Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
> > ---
> >  block/bio.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index 816d412c06e9..a4a1f775b9ea 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1227,6 +1227,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> >  	iov_iter_extraction_t extraction_flags = 0;
> >  	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> >  	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> > +	struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)->limits;
> >  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> >  	struct page **pages = (struct page **)bv;
> >  	ssize_t size, left;
> > @@ -1275,6 +1276,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> >  		struct page *page = pages[i];
> >
> >  		len = min_t(size_t, PAGE_SIZE - offset, left);
> > +		if (bio->bi_iter.bi_size + len >
> > +		    lim->max_sectors << SECTOR_SHIFT) {
> > +			ret = left;
> > +			break;
> > +		}
> >  		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> >  			ret = bio_iov_add_zone_append_page(bio, page, len,
> >  					offset);
> > --
> > 2.18.0
> >
>
> Hi Jens,
>
> Just to clarify any potential confusion, I would like to provide
> further details based on the assumed scenario mentioned above.
>
> When the upper layer continuously sends 1028KB full-sized bios for
> sequential reads, the block layer sees the following sequence:
>
> submit bio: size = 1028KB, start LBA = n
> submit bio: size = 1028KB, start LBA = n + 1028KB
> submit bio: size = 1028KB, start LBA = n + 2056KB
> ...
>
> However, because the queue limit restricts the I/O size to a maximum
> of 512KB, the block layer splits this into the following sequence:
>
> submit bio: size = 512KB, start LBA = n
> submit bio: size = 512KB, start LBA = n + 512KB
> submit bio: size = 4KB,   start LBA = n + 1024KB
> submit bio: size = 512KB, start LBA = n + 1028KB
> submit bio: size = 512KB, start LBA = n + 1540KB
> submit bio: size = 4KB,   start LBA = n + 2052KB
> submit bio: size = 512KB, start LBA = n + 2056KB
> submit bio: size = 512KB, start LBA = n + 2568KB
> submit bio: size = 4KB,   start LBA = n + 3080KB
> ...
>
> The original expectation was for the storage to receive large,
> contiguous requests. However, due to the misalignment, many small I/O
> requests are generated. This problem is easily visible because the
> user pages passed in are often order-0 pages allocated by the buddy
> system during page faults, resulting in highly non-contiguous memory.

If an order-0 page is added to the bio, the multipage bvec is basically
a no-op (256 bvecs hold 256 pages), so how can it make a difference for
you?
>
> As observed in the Antutu Sequential Read test below, it is similar
> to the description above: the splitting caused by the queue limit
> leaves small requests sandwiched in between:
>
> block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51]
> block_split: 8,32 R 86925864 / 86926888 [Thread-51]
> block_split: 8,32 R 86926888 / 86927912 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51]
> block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51]
> block_split: 8,32 R 86928008 / 86929032 [Thread-51]
> block_split: 8,32 R 86929032 / 86930056 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51]
> block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51]
> block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51]
> block_split: 8,32 R 86930152 / 86931176 [Thread-51]
> block_split: 8,32 R 86931176 / 86932200 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51]
> block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51]
> block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51]
> block_split: 8,32 R 86932264 / 86933288 [Thread-51]
> block_split: 8,32 R 86933288 / 86934312 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51]
> block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51]
>
> I simply prevent non-aligned situations in bio_iov_iter_get_pages.

But there is still a 4KB I/O left if you limit the max bio size to
512KB, so how does this 4KB I/O finally go in the 1028KB case?

> Besides making the upper layer application aware of the queue limit,
> I would appreciate any other directions or suggestions you may have.

The problem is related to the I/O size from the application.
If you send unaligned I/O, you cannot avoid a small-sized last I/O, no
matter whether block layer bio splitting is involved or not.

Your patch just lets __bio_iov_iter_get_pages() split the bio, and you
still have 4KB left when the application submits 1028KB, right? Then I
don't understand why your patch improves sequential I/O performance.

Thanks,
Ming