Date: Sat, 22 Aug 2020 07:53:58 +1000
From: Dave Chinner <david@fromorbit.com>
To: Ritesh Harjani
Cc: Anju T Sudhakar, hch@infradead.org, darrick.wong@oracle.com,
        linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
        linux-kernel@vger.kernel.org, willy@infradead.org
Subject: Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend().
Message-ID: <20200821215358.GG7941@dread.disaster.area>
In-Reply-To: <20200821044533.BBFD1A405F@d06av23.portsmouth.uk.ibm.com>
References: <20200819102841.481461-1-anju@linux.vnet.ibm.com>
 <20200820231140.GE7941@dread.disaster.area>
 <20200821044533.BBFD1A405F@d06av23.portsmouth.uk.ibm.com>

On Fri, Aug 21, 2020 at 10:15:33AM +0530, Ritesh Harjani wrote:
> Hello Dave,
> 
> Thanks for reviewing this.
> 
> On 8/21/20 4:41 AM, Dave Chinner wrote:
> > On Wed, Aug 19, 2020 at 03:58:41PM +0530, Anju T Sudhakar wrote:
> > > From: Ritesh Harjani
> > > 
> > > __bio_try_merge_page() may return same_page = 1 and merged = 0.
> > > This could happen when bio->bi_iter.bi_size + len > UINT_MAX.
> > 
> > Ummm, silly question, but exactly how are we getting a bio that
> > large built in ->writepages? Even with 64kB pages, that's a bio
> > with 2^16 pages attached to it. We shouldn't be building single
> > bios in writeback that large - what storage hardware is allowing
> > such huge bios to be built? (i.e. can you dump all the values in
> > /sys/block/<dev>/queue/* for that device for us?)
> 
> Please correct me here, but as I see it, a bio has only these two
> limits that it checks before adding a page. It doesn't check the
> limits in /sys/block/<dev>/queue/*, no? I guess those could then be
> checked by the block layer below before submitting the bio?
> 
> static inline bool bio_full(struct bio *bio, unsigned len)
> {
>         if (bio->bi_vcnt >= bio->bi_max_vecs)
>                 return true;
> 
>         if (bio->bi_iter.bi_size > UINT_MAX - len)
>                 return true;
> 
>         return false;
> }

but iomap only allows BIO_MAX_PAGES when creating the bio. And:

#define BIO_MAX_PAGES           256

So even on a 64k page machine, we should not be building a bio with
more than 16MB of data in it. So how are we getting 4GB of data into
it?

Further, the writeback code is designed around bios having a bound
size that is relatively small, to keep IO submission occurring as we
pack pages into bios. This keeps IO latency down and minimises the
individual IO completion overhead of each IO. This is especially
important as the writeback path is critical for memory reclaim to
make progress, because we do not want to trap gigabytes of dirty
memory in the writeback IO path.

IOWs, seeing huge bios being built by writeback is indicative of
design assumptions and constraints being violated - huge bios on the
buffered writeback path like this are not a good thing to see.

FWIW, we've also recently had reports of hard lockups in IO
completion of overwrites because our ioend bio chains have grown to
almost 3 million pages and all the writeback pages get processed as
a single completion. This is a similar situation to this bug report
in that the bio chains are unbound in length, and I'm betting the
cause is the same: overwrite a 10GB file in memory (with dirty
limits turned up), then run fsync so we do a single writepages call
that tries to write 10GB of dirty pages....
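To put rough numbers on those two limits - the BIO_MAX_PAGES cap
iomap applies when it allocates a bio, and the UINT_MAX cap on
bi_size that bio_full() enforces - here is a minimal user-space
sketch of the arithmetic. It is illustrative only, not kernel code;
the constants are the ones quoted in this thread:

#include <stdio.h>
#include <limits.h>

#define BIO_MAX_PAGES   256             /* iomap's per-bio page cap, as quoted above */
#define PAGE_SIZE_64K   (64ULL * 1024)  /* 64k page machine */

int main(void)
{
        /* Largest bio iomap should ever build on a 64k page machine. */
        unsigned long long iomap_cap = BIO_MAX_PAGES * PAGE_SIZE_64K;

        /* 64k pages needed before bi_size can trip the UINT_MAX check in bio_full(). */
        unsigned long long overflow_pages =
                ((unsigned long long)UINT_MAX + 1) / PAGE_SIZE_64K;

        printf("BIO_MAX_PAGES cap: %llu MB\n", iomap_cap >> 20);
        printf("pages to overflow bi_size: %llu (%llu GB of data)\n",
               overflow_pages, (overflow_pages * PAGE_SIZE_64K) >> 30);
        return 0;
}

That is, a single bio would have to hold some 65536 64k pages before
the bi_size overflow check fires - roughly 256 times larger than the
16MB that the BIO_MAX_PAGES cap should ever allow.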
The only reason we don't normally see this is that background
writeback caps the number of pages written per writepages call to
1024. i.e. it caps writeback IO sizes to a small amount so that IO
latency, writeback fairness across dirty inodes, etc. can be
maintained for background writeback - no one dirty file can
monopolise the available writeback bandwidth and starve writeback
to other dirty inodes.

So combine the two, and we've got a problem: writeback IO sizes are
not being bound to sane IO sizes. I have no problem with building
individual bios that are 4MB or even 16MB in size - that allows the
block layer to work efficiently. Problems at a system level start to
occur, however, when individual bios or bio chains built by
writeback end up being orders of magnitude larger than this....

i.e. I'm not looking at this as a "bio overflow bug" - I'm
commenting on what this overflow implies from an architectural point
of view. That is, uncapped bio sizes and bio chain lengths in
writeback are actually a bad thing and something we've always tried
to avoid doing....

.....

> /sys/block/<dev>/queue/*
> ========================
> 
> setup:/run/perf$ cat /sys/block/loop1/queue/max_segments
> 128
> setup:/run/perf$ cat /sys/block/loop1/queue/max_segment_size
> 65536

A maximally sized bio (16MB) will get split into two bios for this
hardware based on this (8MB max size).

> setup:/run/perf$ cat /sys/block/loop1/queue/max_hw_sectors_kb
> 1280

Except this says 1280kB is the max size, so it will actually get
split into 14 bios. So a stream of 16MB bios from writeback will be
more than large enough to keep this hardware's pipeline full....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
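For reference, a minimal user-space sketch of that splitting
arithmetic, using the queue limits quoted in this thread. It is
illustrative only - the block layer's real splitting logic also has
to honour segment boundaries, so the actual bio count can be
slightly higher than pure division suggests:

#include <stdio.h>

int main(void)
{
        unsigned long long bio_bytes    = 16ULL << 20;  /* one 16MB writeback bio */
        unsigned long long max_segments = 128;          /* queue/max_segments */
        unsigned long long max_seg_size = 65536;        /* queue/max_segment_size */
        unsigned long long max_hw_kb    = 1280;         /* queue/max_hw_sectors_kb */

        unsigned long long seg_limit = max_segments * max_seg_size;    /* 8MB per bio */
        unsigned long long hw_limit  = max_hw_kb * 1024;               /* 1280kB per bio */

        /* The smaller of the two limits governs how finely the bio is split. */
        unsigned long long limit = seg_limit < hw_limit ? seg_limit : hw_limit;

        printf("segment limit: %llu MB, hw limit: %llu kB\n",
               seg_limit >> 20, hw_limit >> 10);
        printf("a 16MB bio needs at least %llu split bios at this limit\n",
               (bio_bytes + limit - 1) / limit);
        return 0;
}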