by Dmitry Monakhov


Subject: Re: [PATCH -v2 6/6] ext4: use bio layer instead of buffer layer in mpage_da_submit_io

On Mon, 25 Oct 2010 08:33:53 -0400, Ted Ts'o <[email protected]> wrote:
> On Mon, Oct 25, 2010 at 09:16:16AM +0400, Dmitry wrote:
> > > + if (bio) {
> > > + bio_get(io->io_bio);
> > > + submit_bio(io->io_op, io->io_bio);
> > > + BUG_ON(bio_flagged(io->io_bio, BIO_EOPNOTSUPP));
> > Definitely this BUG_ON should be converted to ext4_error or something
> > similar, otherwise a writeback attempt to a removed usb-stick will be fatal
> > for the whole system. IMHO it is reasonable to skip this check altogether,
> > because all work will be done in ext4_end_bio() anyway.
> > > + bio_put(io->io_bio);
>
> Cut and pasted from XFS. From what I could tell from the block I/O
> layer, the only time the buffer I/O layer should return BIO_EOPNOTSUPP
> is if we pass it a discard or barrier request, and we're doing neither
> here. So I don't think it should trigger on a removed usb-stick.
>
> At the same time, it's not clear what good the BUG_ON() is doing here,
> either. So perhaps we could drop the BUG_ON, at which point we
> could drop the bio_get() and bio_put() calls, too. To be honest I'm
> not entirely sure why the XFS code does this.
There are a number of reasons why this can happen, for example:
    submit_bio()
      ->__generic_make_request()
        ->bio_check_eod() /* For a virtual device, the size may become
                             zero after some error */
or if the device has a fancy ->make_request_fn() callback.
Of course this is very unlikely (but I have seen it a couple of times),
and bio->bi_end_io() will be called in any case, so we can drop that
extra safety logic, because a sane bi_end_io(-EIO) implementation must
result in journal_abort. The only difference is the number of bios
we can issue before journal_abort is triggered.
So there is no ambiguity here; you can just drop that extra check.
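
The reasoning above can be modelled in user space. This is only a minimal
sketch, not kernel code: struct mock_bio, struct mock_dev, mock_submit_bio(),
mock_end_io() and write_one() are hypothetical stand-ins for struct bio, the
block device, submit_bio(), the filesystem's bi_end_io handler and the caller.
The one assumption it encodes is that a failed end-of-device check is reported
through the completion callback with -EIO, so the handler sees the error
without any check after submission.

```c
#include <errno.h>

/* Hypothetical stand-in for struct bio: just enough state for the model. */
struct mock_bio {
	long sector;                                   /* first sector of the request */
	int error;                                     /* filled in by the completion */
	void (*end_io)(struct mock_bio *bio, int error);
};

/* Hypothetical stand-in for the device; nr_sectors may drop to zero. */
struct mock_dev {
	long nr_sectors;
};

/* Models the journal state that a sane completion handler would abort. */
static int journal_aborted;

/* Models a "sane" bi_end_io: any error aborts the journal. */
static void mock_end_io(struct mock_bio *bio, int error)
{
	bio->error = error;
	if (error)
		journal_aborted = 1;
}

/* Models submission: a request past the end of the device is not
 * silently dropped, it is completed with -EIO via the callback. */
static void mock_submit_bio(struct mock_dev *dev, struct mock_bio *bio)
{
	if (bio->sector >= dev->nr_sectors) {
		bio->end_io(bio, -EIO);
		return;
	}
	bio->end_io(bio, 0);
}

/* Caller with no post-submit check: the error still reaches end_io. */
static int write_one(struct mock_dev *dev, long sector)
{
	struct mock_bio bio = { .sector = sector, .error = 0,
				.end_io = mock_end_io };

	mock_submit_bio(dev, &bio);
	return bio.error;
}
```

Calling write_one() against a device whose nr_sectors is zero returns -EIO and
sets journal_aborted, which is the behaviour the extra check was guarding
against.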
>
> Jens? Any reason why I shouldn't just remove the bio_get(), the
> BUG_ON() check, and bio_put() calls?
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2010-10-30 19:10:48

by Eric Whitney


Subject: Re: [PATCH -v2 0/6] ext4: use the bio layer directly

On 10/23/2010 04:40 PM, Theodore Ts'o wrote:
> This set of patches passes xfstests for both 1k and 4k block sizes. For
> streaming I/O writes, it reduces the number of block I/O queue
> submissions by a factor of 128 in the ideal case. (i.e., instead of
> submitting 4k requests at a time, we can now submit up to 512k writes at
> a time, a factor of 128 improvement.)
>
> Lockstat measurements by Eric Whitney show that the block I/O request
> queue lock is the top cause of scalability problems in ext4:
>
> http://free.linux.hp.com/~enw/ext4/2.6.35/
>
> This patch should resolve these issues, as well as reducing ext4's CPU
> overhead for large buffered streaming writes by a significant amount.
>
> - Ted
>
> P.S. In a recent e-mail to me, akpm commented that it was a little sad
> that most modern filesystems don't like the core functions offered by
> the VFS and "go it alone". I'm of the strong belief that the fact that
> ext4 was using as much of the "core functions" as it did was responsible
> for why we lagged some of the other modern file systems on the FFSB
> benchmark scores. I wonder if it might be useful to consider taking
> parts of fs/ext4/page-io.c and trying to make a higher level interface
> that could be easily adopted by other basic filesystems to improve
> their performance.
>
> To play devil's advocate for a moment, the fact that btrfs has special
> needs because of its fs-level snapshots probably rules it out, and I'm
> not sure this is something that would ever be of interest to XFS, since
> they have something much more sophisticated. And perhaps it doesn't
> matter that much whether filesystems that exist primarily for
> compatibility (hfs, vfat, etc.) need to have high
> performance/scalability characteristics.
>
> On the other hand, one nice thing about the fs/ext4/page-io.c interface
> is that it should be relatively easy to take something which calls
> block_write_full_page(), and change it to call what is today named
> ext4_bio_write_page(). All it needs to do is pass an ext4_io_submit
> structure to successive calls to ext4_bio_write_page(), and then call
> (what today is named) ext4_io_submit() when it is done. So minimal
> changes to client file system code, and hopefully impressive
> improvements in performance.
>
> Just a thought....
>
>
> Theodore Ts'o (6):
>   ext4: call mpage_da_submit_io() from mpage_da_map_blocks()
>   ext4: simplify ext4_writepage()
>   ext4: inline ext4_writepage() into mpage_da_submit_io()
>   ext4: inline walk_page_buffers() into mpage_da_submit_io
>   ext4: move mpage_put_bnr_to_bhs()'s functionality to
>     mpage_da_submit_io()
>   ext4: use bio layer instead of buffer layer in mpage_da_submit_io
>
>  fs/ext4/Makefile  |    2 +-
>  fs/ext4/ext4.h    |   36 +++++-
>  fs/ext4/extents.c |    4 +-
>  fs/ext4/inode.c   |  432 +++++++++++++++++++----------------------------------
>  fs/ext4/page-io.c |  426 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/ext4/super.c   |    8 +-
>  6 files changed, 624 insertions(+), 284 deletions(-)
>  create mode 100644 fs/ext4/page-io.c

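The calling pattern described in the quoted cover letter (hand one submit
structure to successive page writes, then flush once at the end) and its
4k versus 512k arithmetic can be sketched in user space. This is a rough
model under stated assumptions, not the code in fs/ext4/page-io.c: struct
io_batch, batch_add_page(), batch_flush() and count_submissions() are
hypothetical names, and the 4k page size and 512k request limit are simply the
figures from the cover letter.

```c
#include <stddef.h>

#define PAGE_BYTES      4096u             /* assumed page size (4k) */
#define MAX_BATCH_BYTES (512u * 1024u)    /* assumed largest request (512k) */

/* Hypothetical analogue of the submit structure passed between calls. */
struct io_batch {
	long     next_block;    /* block that would extend the pending request */
	size_t   pending;       /* bytes accumulated but not yet submitted */
	unsigned submissions;   /* how many requests reached the "queue" */
};

/* Analogue of the final submit call: send whatever is pending. */
static void batch_flush(struct io_batch *b)
{
	if (b->pending) {
		b->submissions++;
		b->pending = 0;
	}
}

/* Analogue of the per-page write call: merge the page into the pending
 * request when it is contiguous and fits, otherwise submit first. */
static void batch_add_page(struct io_batch *b, long block)
{
	if (b->pending &&
	    (block != b->next_block ||
	     b->pending + PAGE_BYTES > MAX_BATCH_BYTES))
		batch_flush(b);

	b->pending += PAGE_BYTES;
	b->next_block = block + 1;
}

/* Write nr_pages contiguous pages and report the number of submissions. */
static unsigned count_submissions(long first_block, long nr_pages)
{
	struct io_batch b = { 0, 0, 0 };
	long i;

	for (i = 0; i < nr_pages; i++)
		batch_add_page(&b, first_block + i);
	batch_flush(&b);
	return b.submissions;
}
```

In this model 128 contiguous 4k pages collapse into a single submission, since
512k / 4k = 128, while a page-at-a-time path would have issued 128 of them.
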
My 48-core test results for these patches as applied to 2.6.36-rc6 can
be found at:

http://free.linux.hp.com/~enw/ext4/2.6.36-rc6

The Boxacle large_file_creates workload showed a large and consistent
scalability improvement with these patches on ext4 filesystems both with
and without a journal. (large_file_creates is effectively a sequential
I/O write workload in this case.) The random_writes and mail_server
workloads benefited as well, but to a much smaller degree.

Unmodified 2.6.36-rc6 ext4, ext4 without a journal, ext3, and xfs data
are also at that URL for comparison. In addition, I've supplied lock
stats and more detailed performance data for reference.

The storage system used for this work differed from that used in earlier
experiments. It delivered much better random I/O throughput, allowing
us to avoid becoming as thoroughly disk-bound as in the earlier work.

The data were taken on 2.6.36-rc6 because that's where Ted was
developing the patches. They've since gone into 2.6.37.

Thanks, Ted!
Eric