2012-11-19 00:33:26

by Jeff Chua

Subject: Recent kernel "mount" slow

It seems the recent kernel is slower at mounting hard disks than older
kernels. I haven't bisected down to exactly when this started, as it might
already have been reported or solved. I'm on the latest commit, but it
doesn't seem to be fixed yet.

commit 3587b1b097d70c2eb9fee95ea7995d13c05f66e5
Author: Al Viro <[email protected]>
Date: Sun Nov 18 19:19:00 2012 +0000

fanotify: fix FAN_Q_OVERFLOW case of fanotify_read()


Thanks,
Jeff


2012-11-20 18:10:05

by Jan Kara

Subject: Re: Recent kernel "mount" slow

On Mon 19-11-12 08:33:25, Jeff Chua wrote:
> It seems the recent kernel is slower mounting hard disk than older
> kernels. I've not bisect down to exact when this happen as it might
> already been reported or solved. I'm on the latest commit, but it
> doesn't seems to be fixed yet.
>
> commit 3587b1b097d70c2eb9fee95ea7995d13c05f66e5
> Author: Al Viro <[email protected]>
> Date: Sun Nov 18 19:19:00 2012 +0000
>
> fanotify: fix FAN_Q_OVERFLOW case of fanotify_read()
I haven't heard about such a problem so far. What filesystem are you using?
Can you quantify 'is slower'? Bisecting would be welcome, of course.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-11-21 15:46:50

by Jeff Chua

Subject: Re: Recent kernel "mount" slow

On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara <[email protected]> wrote:
> I haven't heard about such problem so far. What filesystem are you using?

I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seem to be slower
than before. It seems to be fs independent.

> Can you quantify 'is slower'? Bisecting would be welcome of course.

I haven't measured precisely, but it seems to take about 1 sec instead of
0.3 sec to mount. I'll start bisecting :(
Since I have 6 mounts on start up, it now takes about 4 seconds longer to boot up.
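
A simple way to put numbers on it is to time the mount from a shell; a
sketch along these lines reproduces the measurement (/dev/sda1 and /mnt
are the device and mount point used later in this thread):

# for i in 1 2 3; do time mount /dev/sda1 /mnt; umount /mnt; done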

Thanks,
Jeff

2012-11-22 19:22:07

by Linus Torvalds

Subject: Re: Recent kernel "mount" slow

Doesn't sound like a fsdevel issue since it seems to be independent of
filesystems. More like some generic block layer thing. Adding Jens
(and quoting the whole thing)

Jens, any ideas? Most of your stuff came in after -rc2, which would
fit with the fact that most of the slowdown seems to be after -rc2
according to Jeff.

Jeff, more bisecting would be good, though.

Linus

On Thu, Nov 22, 2012 at 4:30 AM, Jeff Chua <[email protected]> wrote:
> On Wed, Nov 21, 2012 at 11:46 PM, Jeff Chua <[email protected]> wrote:
>> On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara <[email protected]> wrote:
>>> I haven't heard about such problem so far. What filesystem are you using?
>>
>> I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower
>> than before. Seems to be fs independent.
>>
>>> Can you quantify 'is slower'? Bisecting would be welcome of course.
>>
>> Haven't measure, but it seems to be 1 sec instead of .3 sec to mount.
>> I'll start bisecting:(
>> So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot up.
>
> I started bisecting, and it seems to have multiple points of slowing
> down. The latest kernel is really slow mounting /dev/sda1 (fs
> independent) ... 0.529sec. Kernel 3.6.0 took 0.012sec, Kernel
> 3.7.0-rc2 took 0.168sec. Again, these are very early results. I'm
> verifying them again.
>
>
> Jeff

2012-11-22 23:03:11

by Jeff Chua

Subject: Re: Recent kernel "mount" slow

On Wed, Nov 21, 2012 at 11:46 PM, Jeff Chua <[email protected]> wrote:
> On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara <[email protected]> wrote:
>> I haven't heard about such problem so far. What filesystem are you using?
>
> I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower
> than before. Seems to be fs independent.
>
>> Can you quantify 'is slower'? Bisecting would be welcome of course.
>
> Haven't measure, but it seems to be 1 sec instead of .3 sec to mount.
> I'll start bisecting:(
> So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot up.

I started bisecting, and it seems there are multiple points of slowing
down. The latest kernel is really slow mounting /dev/sda1 (fs
independent): 0.529 sec. Kernel 3.6.0 took 0.012 sec, kernel
3.7.0-rc2 took 0.168 sec. These are very early results; I'm
verifying them again.
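
For completeness, the bisect itself is the usual git workflow (a sketch;
taking v3.6 as the known-good point and the current head as bad, based on
the timings above):

# git bisect start
# git bisect bad HEAD
# git bisect good v3.6

git then checks out a midpoint; build and boot it, time a mount, and mark
the result with "git bisect good" or "git bisect bad" until the offending
commit is isolated.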


Jeff

2012-11-23 13:25:30

by Jens Axboe

Subject: Re: Recent kernel "mount" slow

On 2012-11-22 20:21, Linus Torvalds wrote:
> Doesn't sound like a fsdevel issue since it seems to be independent of
> filesystems. More like some generic block layer thing. Adding Jens
> (and quoting the whole thing)
>
> Jens, any ideas? Most of your stuff came in after -rc2, which would
> fit with the fact that most of the slowdown seems to be after -rc2
> according to Jeff.

No ideas. Looking at what went in from my side, only the rq plug sorting
is a core change, and that should not cause any change in behaviour for
a single device. That's commit 975927b9.

> Jeff, more bisecting would be good, though.

Probably required, yes...

--
Jens Axboe

2012-11-23 22:21:42

by Jeff Chua

Subject: Re: Recent kernel "mount" slow

On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe <[email protected]> wrote:
> On 2012-11-22 20:21, Linus Torvalds wrote:
>> Doesn't sound like a fsdevel issue since it seems to be independent of
>> filesystems. More like some generic block layer thing. Adding Jens
>> (and quoting the whole thing)
>>
>> Jens, any ideas? Most of your stuff came in after -rc2, which would
>> fit with the fact that most of the slowdown seems to be after -rc2
>> according to Jeff.
>
> No ideas. Looking at what went in from my side, only the rq plug sorting
> is a core change, and that should not cause any change in behaviour for
> a single device. That's commit 975927b9.
>
>> Jeff, more bisecting would be good, though.
>
> Probably required, yes...


This one slows mount from 0.012s to 0.168s.

commit 62ac665ff9fc07497ca524bd20d6a96893d11071
Author: Mikulas Patocka <[email protected]>
Date: Wed Sep 26 07:46:43 2012 +0200

blockdev: turn a rw semaphore into a percpu rw semaphore


There were a couple more changes to percpu-rw-semaphores after
3.7.0-rc2, and those slow mount further from 0.168s to 0.500s. I don't
know for sure, but I suspect these. Still bisecting.

commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab
Author: Mikulas Patocka <[email protected]>
Date: Mon Oct 22 19:37:47 2012 -0400

percpu-rw-semaphores: use light/heavy barriers


commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d
Author: Mikulas Patocka <[email protected]>
Date: Mon Oct 22 19:39:16 2012 -0400

percpu-rw-semaphores: use rcu_read_lock_sched


commit 1a25b1c4ce189e3926f2981f3302352a930086db
Author: Mikulas Patocka <[email protected]>
Date: Mon Oct 15 17:20:17 2012 -0400

Lock splice_read and splice_write functions



I couldn't revert them cleanly on the latest kernel, but I'm still trying.
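
For reference, reverting the three suspect commits would look roughly like
this (a sketch; the reverts may well not apply cleanly on the current tree,
which matches the difficulty described above):

# git revert --no-edit 1bf11c53535a 5c1eabe68501 62ac665ff9fc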

Jeff

2012-11-23 23:31:31

by Jeff Chua

Subject: Re: Recent kernel "mount" slow

On Sat, Nov 24, 2012 at 6:21 AM, Jeff Chua <[email protected]> wrote:
> On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe <[email protected]> wrote:
>> On 2012-11-22 20:21, Linus Torvalds wrote:
>>> Doesn't sound like a fsdevel issue since it seems to be independent of
>>> filesystems. More like some generic block layer thing. Adding Jens
>>> (and quoting the whole thing)
>>>
>>> Jens, any ideas? Most of your stuff came in after -rc2, which would
>>> fit with the fact that most of the slowdown seems to be after -rc2
>>> according to Jeff.
>>
>> No ideas. Looking at what went in from my side, only the rq plug sorting
>> is a core change, and that should not cause any change in behaviour for
>> a single device. That's commit 975927b9.
>>
>>> Jeff, more bisecting would be good, though.
>>
>> Probably required, yes...
>
>
> This one slows mount from 0.012s to 0.168s.
>
> commit 62ac665ff9fc07497ca524bd20d6a96893d11071
> Author: Mikulas Patocka <[email protected]>
> Date: Wed Sep 26 07:46:43 2012 +0200
>
> blockdev: turn a rw semaphore into a percpu rw semaphore
>
>
> There were couple of more changes to percpu-rw-semaphores after
> 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't
> really know, but I'm suspecting these. Still bisecting.
>
> commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab
> Author: Mikulas Patocka <[email protected]>
> Date: Mon Oct 22 19:37:47 2012 -0400
>
> percpu-rw-semaphores: use light/heavy barriers
>
>
> commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d
> Author: Mikulas Patocka <[email protected]>
> Date: Mon Oct 22 19:39:16 2012 -0400
>
> percpu-rw-semaphores: use rcu_read_lock_sched


I reverted these 3 patches and mount time is now 0.012s.

# time mount /dev/sda1 /mnt; sync; sync; umount /mnt


> commit 1a25b1c4ce189e3926f2981f3302352a930086db
> Author: Mikulas Patocka <[email protected]>
> Date: Mon Oct 15 17:20:17 2012 -0400
>
> Lock splice_read and splice_write functions

Looks like this one is not the problem, but I reverted it anyway
because it's part of the same series.


Happy Thanksgiving!

Jeff

2012-11-23 23:48:05

by Jeff Chua

Subject: Re: Recent kernel "mount" slow

On Sat, Nov 24, 2012 at 7:31 AM, Jeff Chua <[email protected]> wrote:
> On Sat, Nov 24, 2012 at 6:21 AM, Jeff Chua <[email protected]> wrote:
>> On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe <[email protected]> wrote:
>>> On 2012-11-22 20:21, Linus Torvalds wrote:
>>>> Doesn't sound like a fsdevel issue since it seems to be independent of
>>>> filesystems. More like some generic block layer thing. Adding Jens
>>>> (and quoting the whole thing)
>>>>
>>>> Jens, any ideas? Most of your stuff came in after -rc2, which would
>>>> fit with the fact that most of the slowdown seems to be after -rc2
>>>> according to Jeff.
>>>
>>> No ideas. Looking at what went in from my side, only the rq plug sorting
>>> is a core change, and that should not cause any change in behaviour for
>>> a single device. That's commit 975927b9.
>>>
>>>> Jeff, more bisecting would be good, though.
>>>
>>> Probably required, yes...
>>
>>
>> This one slows mount from 0.012s to 0.168s.
>>
>> commit 62ac665ff9fc07497ca524bd20d6a96893d11071
>> Author: Mikulas Patocka <[email protected]>
>> Date: Wed Sep 26 07:46:43 2012 +0200
>>
>> blockdev: turn a rw semaphore into a percpu rw semaphore
>>
>>
>> There were couple of more changes to percpu-rw-semaphores after
>> 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't
>> really know, but I'm suspecting these. Still bisecting.
>>
>> commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab
>> Author: Mikulas Patocka <[email protected]>
>> Date: Mon Oct 22 19:37:47 2012 -0400
>>
>> percpu-rw-semaphores: use light/heavy barriers
>>
>>
>> commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d
>> Author: Mikulas Patocka <[email protected]>
>> Date: Mon Oct 22 19:39:16 2012 -0400
>>
>> percpu-rw-semaphores: use rcu_read_lock_sched
>
>
> I reverted these 3 patches and mount time is now 0.012s.
>
> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>
>
>> commit 1a25b1c4ce189e3926f2981f3302352a930086db
>> Author: Mikulas Patocka <[email protected]>
>> Date: Mon Oct 15 17:20:17 2012 -0400
>>
>> Lock splice_read and splice_write functions
>
> Looks like this one is not the problem. But I reverted it anyway
> because it's part of the same chunk.
>
>
> Happy Thanksgiving!

I'm on the latest git, 3.7.0-rc6 (5e351cdc998db82935d1248a053a1be37d1160fd),
and with the above patches reverted, mount time is now 0.012s.

Jeff

2012-11-24 21:10:36

by Mikulas Patocka

Subject: Re: Recent kernel "mount" slow



On Sat, 24 Nov 2012, Jeff Chua wrote:

> On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe <[email protected]> wrote:
> > On 2012-11-22 20:21, Linus Torvalds wrote:
> >> Doesn't sound like a fsdevel issue since it seems to be independent of
> >> filesystems. More like some generic block layer thing. Adding Jens
> >> (and quoting the whole thing)
> >>
> >> Jens, any ideas? Most of your stuff came in after -rc2, which would
> >> fit with the fact that most of the slowdown seems to be after -rc2
> >> according to Jeff.
> >
> > No ideas. Looking at what went in from my side, only the rq plug sorting
> > is a core change, and that should not cause any change in behaviour for
> > a single device. That's commit 975927b9.
> >
> >> Jeff, more bisecting would be good, though.
> >
> > Probably required, yes...
>
>
> This one slows mount from 0.012s to 0.168s.
>
> commit 62ac665ff9fc07497ca524bd20d6a96893d11071
> Author: Mikulas Patocka <[email protected]>
> Date: Wed Sep 26 07:46:43 2012 +0200
>
> blockdev: turn a rw semaphore into a percpu rw semaphore
>
>
> There were couple of more changes to percpu-rw-semaphores after
> 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't
> really know, but I'm suspecting these. Still bisecting.

The problem there is that you either use normal semaphores and slow down
I/O, or you use percpu semaphores, which don't slow down I/O but do slow
down mount.

So it's better to slow down mount.

(If you don't use any semaphore at all, as in the 3.6 kernel and before,
there is a race condition that can crash the kernel if someone does a
mount and a direct I/O read on the same device at the same time.)

You can improve mount time if you change all occurrences of
synchronize_sched() in include/linux/percpu-rwsem.h to
synchronize_sched_expedited().
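
A quick way to try that without editing the header by hand (a sketch,
assuming you are at the top of a kernel source tree):

# sed -i 's/synchronize_sched()/synchronize_sched_expedited()/g' include/linux/percpu-rwsem.h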

But some people say that synchronize_sched_expedited() is bad for
real-time latency. (Could there be something like: if (realtime)
synchronize_sched(); else synchronize_sched_expedited(); ?)

Mikulas

2012-11-24 23:23:16

by Jeff Chua

Subject: Re: Recent kernel "mount" slow

On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
> So it's better to slow down mount.

I am quite proud of the Linux boot time when pitted against other OSes.
Even with 10 partitions, Linux can boot up in just a few seconds, but now
you're saying that we need to do this semaphore check at boot up. Doing
so adds an additional 4 seconds to boot up.

What about moving the locking mechanism to the "mount" program itself?
Won't that be more feasible?

As for the case of simultaneous mounts, it's usually the administrator
doing something bad. I would say this is not a kernel issue.

Jeff

2012-11-27 05:57:20

by Jeff Chua

Subject: Re: Recent kernel "mount" slow

On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>> So it's better to slow down mount.
>
> I am quite proud of the linux boot time pitting against other OS. Even
> with 10 partitions. Linux can boot up in just a few seconds, but now
> you're saying that we need to do this semaphore check at boot up. By
> doing so, it's inducing additional 4 seconds during boot up.

By the way, I'm using a pretty fast SSD (Samsung PM830) and a fast CPU
(2.8GHz). For those on a slower hard disk or a slower CPU, I wonder what
kind of degradation this would cause, or would it be just the same?

Thanks,
Jeff

2012-11-27 07:40:16

by Jens Axboe

Subject: Re: Recent kernel "mount" slow

On 2012-11-27 06:57, Jeff Chua wrote:
> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>> So it's better to slow down mount.
>>
>> I am quite proud of the linux boot time pitting against other OS. Even
>> with 10 partitions. Linux can boot up in just a few seconds, but now
>> you're saying that we need to do this semaphore check at boot up. By
>> doing so, it's inducing additional 4 seconds during boot up.
>
> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
> kind of degradation would this cause or just the same?

It'd likely be the same slowdown time-wise, but as a percentage it
would appear smaller on a slower disk.

Could you please test Mikulas' suggestion of changing
synchronize_sched() in include/linux/percpu-rwsem.h to
synchronize_sched_expedited()?

linux-next also has a rewrite of the per-cpu rw sems, out of Andrew's
tree. It would be a good data point if you could test that, too.

In any case, the slowdown definitely isn't acceptable. Fixing an
obscure issue like block sizes changing while O_DIRECT is in flight
definitely does NOT warrant a mount slowdown.
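
If it helps, pulling the linux-next version for a test build is just (a
sketch; the URL is the usual linux-next location, adjust if the tree has
moved):

# git remote add linux-next git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
# git fetch linux-next
# git checkout linux-next/master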

--
Jens Axboe

2012-11-27 07:46:32

by Jens Axboe

Subject: Re: Recent kernel "mount" slow

On 2012-11-27 08:38, Jens Axboe wrote:
> On 2012-11-27 06:57, Jeff Chua wrote:
>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>>> So it's better to slow down mount.
>>>
>>> I am quite proud of the linux boot time pitting against other OS. Even
>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>> you're saying that we need to do this semaphore check at boot up. By
>>> doing so, it's inducing additional 4 seconds during boot up.
>>
>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>> kind of degradation would this cause or just the same?
>
> It'd likely be the same slow down time wise, but as a percentage it
> would appear smaller on a slower disk.
>
> Could you please test Mikulas' suggestion of changing
> synchronize_sched() in include/linux/percpu-rwsem.h to
> synchronize_sched_expedited()?
>
> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
> tree. It would be a good data point it you could test that, too.
>
> In any case, the slow down definitely isn't acceptable. Fixing an
> obscure issue like block sizes changing while O_DIRECT is in flight
> definitely does NOT warrant a mount slow down.

Here's Oleg's patch; it might be easier for you than switching to
linux-next. Please try that.

From: Oleg Nesterov <[email protected]>
Subject: percpu_rw_semaphore: reimplement to not block the readers unnecessarily

Currently the writer does msleep() plus synchronize_sched() 3 times to
acquire/release the semaphore, and during this time the readers are
blocked completely. Even if the "write" section was not actually started
or if it was already finished.

With this patch down_write/up_write does synchronize_sched() twice and
down_read/up_read are still possible during this time, just they use the
slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Ananth N Mavinakayanahalli <[email protected]>
Cc: Anton Arapov <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

include/linux/percpu-rwsem.h | 85 +++-------------------
lib/Makefile | 2
lib/percpu-rwsem.c | 123 +++++++++++++++++++++++++++++++++
3 files changed, 138 insertions(+), 72 deletions(-)

diff -puN include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily include/linux/percpu-rwsem.h
--- a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
+++ a/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
#define _LINUX_PERCPU_RWSEM_H

#include <linux/mutex.h>
+#include <linux/rwsem.h>
#include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>

struct percpu_rw_semaphore {
- unsigned __percpu *counters;
- bool locked;
- struct mutex mtx;
+ unsigned int __percpu *fast_read_ctr;
+ struct mutex writer_mutex;
+ struct rw_semaphore rw_sem;
+ atomic_t slow_read_ctr;
+ wait_queue_head_t write_waitq;
};

-#define light_mb() barrier()
-#define heavy_mb() synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);

-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
- rcu_read_lock_sched();
- if (unlikely(p->locked)) {
- rcu_read_unlock_sched();
- mutex_lock(&p->mtx);
- this_cpu_inc(*p->counters);
- mutex_unlock(&p->mtx);
- return;
- }
- this_cpu_inc(*p->counters);
- rcu_read_unlock_sched();
- light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
-
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
- light_mb(); /* B, between read of the data and write to p->counter, paired with C */
- this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
- unsigned total = 0;
- int cpu;
-
- for_each_possible_cpu(cpu)
- total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
- return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
- mutex_lock(&p->mtx);
- p->locked = true;
- synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
- while (__percpu_count(p->counters))
- msleep(1);
- heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
- heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
- p->locked = false;
- mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
- p->counters = alloc_percpu(unsigned);
- if (unlikely(!p->counters))
- return -ENOMEM;
- p->locked = false;
- mutex_init(&p->mtx);
- return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
- free_percpu(p->counters);
- p->counters = NULL; /* catch use after free bugs */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
+
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);

#endif
diff -puN lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily lib/Makefile
--- a/lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
+++ a/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmd
idr.o int_sqrt.o extable.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o earlycpio.o
+ is_single_threaded.o plist.o decompress.o earlycpio.o percpu-rwsem.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff -puN /dev/null lib/percpu-rwsem.c
--- /dev/null
+++ a/lib/percpu-rwsem.c
@@ -0,0 +1,123 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+ brw->fast_read_ctr = alloc_percpu(int);
+ if (unlikely(!brw->fast_read_ctr))
+ return -ENOMEM;
+
+ mutex_init(&brw->writer_mutex);
+ init_rwsem(&brw->rw_sem);
+ atomic_set(&brw->slow_read_ctr, 0);
+ init_waitqueue_head(&brw->write_waitq);
+ return 0;
+}
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+ free_percpu(brw->fast_read_ctr);
+ brw->fast_read_ctr = NULL; /* catch use after free bugs */
+}
+
+static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
+{
+ bool success = false;
+
+ preempt_disable();
+ if (likely(!mutex_is_locked(&brw->writer_mutex))) {
+ __this_cpu_add(*brw->fast_read_ctr, val);
+ success = true;
+ }
+ preempt_enable();
+
+ return success;
+}
+
+/*
+ * Like the normal down_read() this is not recursive, the writer can
+ * come after the first percpu_down_read() and create the deadlock.
+ */
+void percpu_down_read(struct percpu_rw_semaphore *brw)
+{
+ if (likely(update_fast_ctr(brw, +1)))
+ return;
+
+ down_read(&brw->rw_sem);
+ atomic_inc(&brw->slow_read_ctr);
+ up_read(&brw->rw_sem);
+}
+
+void percpu_up_read(struct percpu_rw_semaphore *brw)
+{
+ if (likely(update_fast_ctr(brw, -1)))
+ return;
+
+ /* false-positive is possible but harmless */
+ if (atomic_dec_and_test(&brw->slow_read_ctr))
+ wake_up_all(&brw->write_waitq);
+}
+
+static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
+{
+ unsigned int sum = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ sum += per_cpu(*brw->fast_read_ctr, cpu);
+ per_cpu(*brw->fast_read_ctr, cpu) = 0;
+ }
+
+ return sum;
+}
+
+/*
+ * A writer takes ->writer_mutex to exclude other writers and to force the
+ * readers to switch to the slow mode, note the mutex_is_locked() check in
+ * update_fast_ctr().
+ *
+ * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
+ * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
+ * counter it represents the number of active readers.
+ *
+ * Finally the writer takes ->rw_sem for writing and blocks the new readers,
+ * then waits until the slow counter becomes zero.
+ */
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+ /* also blocks update_fast_ctr() which checks mutex_is_locked() */
+ mutex_lock(&brw->writer_mutex);
+
+ /*
+ * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
+ * so that update_fast_ctr() can't succeed.
+ *
+ * 2. Ensures we see the result of every previous this_cpu_add() in
+ * update_fast_ctr().
+ *
+ * 3. Ensures that if any reader has exited its critical section via
+ * fast-path, it executes a full memory barrier before we return.
+ */
+ synchronize_sched();
+
+ /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
+ atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
+
+ /* block the new readers completely */
+ down_write(&brw->rw_sem);
+
+ /* wait for all readers to complete their percpu_up_read() */
+ wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
+}
+
+void percpu_up_write(struct percpu_rw_semaphore *brw)
+{
+ /* allow the new readers, but only the slow-path */
+ up_write(&brw->rw_sem);
+
+ /* insert the barrier before the next fast-path in down_read */
+ synchronize_sched();
+
+ mutex_unlock(&brw->writer_mutex);
+}

--
Jens Axboe

2012-11-27 08:45:42

by Jeff Chua

Subject: Re: Recent kernel "mount" slow

Jens,

Limited access now at Incheon Airport. Will try the patch out when I arrive.

Thanks,
Jeff

On 11/27/12, Jens Axboe <[email protected]> wrote:
> On 2012-11-27 08:38, Jens Axboe wrote:
>> On 2012-11-27 06:57, Jeff Chua wrote:
>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]>
>>> wrote:
>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]>
>>>> wrote:
>>>>> So it's better to slow down mount.
>>>>
>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>> you're saying that we need to do this semaphore check at boot up. By
>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>
>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>> kind of degradation would this cause or just the same?
>>
>> It'd likely be the same slow down time wise, but as a percentage it
>> would appear smaller on a slower disk.
>>
>> Could you please test Mikulas' suggestion of changing
>> synchronize_sched() in include/linux/percpu-rwsem.h to
>> synchronize_sched_expedited()?
>>
>> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
>> tree. It would be a good data point it you could test that, too.
>>
>> In any case, the slow down definitely isn't acceptable. Fixing an
>> obscure issue like block sizes changing while O_DIRECT is in flight
>> definitely does NOT warrant a mount slow down.
>
> Here's Olegs patch, might be easier for you than switching to
> linux-next. Please try that.
>
> From: Oleg Nesterov <[email protected]>
> Subject: percpu_rw_semaphore: reimplement to not block the readers
> unnecessarily
>
> Currently the writer does msleep() plus synchronize_sched() 3 times to
> acquire/release the semaphore, and during this time the readers are
> blocked completely. Even if the "write" section was not actually started
> or if it was already finished.
>
> With this patch down_write/up_write does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use the
> slow path.
>
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
>
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> Reviewed-by: Paul E. McKenney <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Mikulas Patocka <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Srikar Dronamraju <[email protected]>
> Cc: Ananth N Mavinakayanahalli <[email protected]>
> Cc: Anton Arapov <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> include/linux/percpu-rwsem.h | 85 +++-------------------
> lib/Makefile | 2
> lib/percpu-rwsem.c | 123 +++++++++++++++++++++++++++++++++
> 3 files changed, 138 insertions(+), 72 deletions(-)
>
> diff -puN
> include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> include/linux/percpu-rwsem.h
> ---
> a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> +++ a/include/linux/percpu-rwsem.h
> @@ -2,82 +2,25 @@
> #define _LINUX_PERCPU_RWSEM_H
>
> #include <linux/mutex.h>
> +#include <linux/rwsem.h>
> #include <linux/percpu.h>
> -#include <linux/rcupdate.h>
> -#include <linux/delay.h>
> +#include <linux/wait.h>
>
> struct percpu_rw_semaphore {
> - unsigned __percpu *counters;
> - bool locked;
> - struct mutex mtx;
> + unsigned int __percpu *fast_read_ctr;
> + struct mutex writer_mutex;
> + struct rw_semaphore rw_sem;
> + atomic_t slow_read_ctr;
> + wait_queue_head_t write_waitq;
> };
>
> -#define light_mb() barrier()
> -#define heavy_mb() synchronize_sched()
> +extern void percpu_down_read(struct percpu_rw_semaphore *);
> +extern void percpu_up_read(struct percpu_rw_semaphore *);
>
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> - rcu_read_lock_sched();
> - if (unlikely(p->locked)) {
> - rcu_read_unlock_sched();
> - mutex_lock(&p->mtx);
> - this_cpu_inc(*p->counters);
> - mutex_unlock(&p->mtx);
> - return;
> - }
> - this_cpu_inc(*p->counters);
> - rcu_read_unlock_sched();
> - light_mb(); /* A, between read of p->locked and read of data, paired with
> D */
> -}
> -
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> - light_mb(); /* B, between read of the data and write to p->counter, paired
> with C */
> - this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> - unsigned total = 0;
> - int cpu;
> -
> - for_each_possible_cpu(cpu)
> - total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> -
> - return total;
> -}
> -
> -static inline void percpu_down_write(struct percpu_rw_semaphore *p)
> -{
> - mutex_lock(&p->mtx);
> - p->locked = true;
> - synchronize_sched(); /* make sure that all readers exit the
> rcu_read_lock_sched region */
> - while (__percpu_count(p->counters))
> - msleep(1);
> - heavy_mb(); /* C, between read of p->counter and write to data, paired
> with B */
> -}
> -
> -static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> -{
> - heavy_mb(); /* D, between write to data and write to p->locked, paired
> with A */
> - p->locked = false;
> - mutex_unlock(&p->mtx);
> -}
> -
> -static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
> -{
> - p->counters = alloc_percpu(unsigned);
> - if (unlikely(!p->counters))
> - return -ENOMEM;
> - p->locked = false;
> - mutex_init(&p->mtx);
> - return 0;
> -}
> -
> -static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
> -{
> - free_percpu(p->counters);
> - p->counters = NULL; /* catch use after free bugs */
> -}
> +extern void percpu_down_write(struct percpu_rw_semaphore *);
> +extern void percpu_up_write(struct percpu_rw_semaphore *);
> +
> +extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
> +extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
>
> #endif
> diff -puN
> lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> lib/Makefile
> ---
> a/lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> +++ a/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmd
> idr.o int_sqrt.o extable.o \
> sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
> proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> - is_single_threaded.o plist.o decompress.o earlycpio.o
> + is_single_threaded.o plist.o decompress.o earlycpio.o percpu-rwsem.o
>
> lib-$(CONFIG_MMU) += ioremap.o
> lib-$(CONFIG_SMP) += cpumask.o
> diff -puN /dev/null lib/percpu-rwsem.c
> --- /dev/null
> +++ a/lib/percpu-rwsem.c
> @@ -0,0 +1,123 @@
> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +
> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> + brw->fast_read_ctr = alloc_percpu(int);
> + if (unlikely(!brw->fast_read_ctr))
> + return -ENOMEM;
> +
> + mutex_init(&brw->writer_mutex);
> + init_rwsem(&brw->rw_sem);
> + atomic_set(&brw->slow_read_ctr, 0);
> + init_waitqueue_head(&brw->write_waitq);
> + return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> + free_percpu(brw->fast_read_ctr);
> + brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int
> val)
> +{
> + bool success = false;
> +
> + preempt_disable();
> + if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> + __this_cpu_add(*brw->fast_read_ctr, val);
> + success = true;
> + }
> + preempt_enable();
> +
> + return success;
> +}
> +
> +/*
> + * Like the normal down_read() this is not recursive, the writer can
> + * come after the first percpu_down_read() and create the deadlock.
> + */
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> + if (likely(update_fast_ctr(brw, +1)))
> + return;
> +
> + down_read(&brw->rw_sem);
> + atomic_inc(&brw->slow_read_ctr);
> + up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> + if (likely(update_fast_ctr(brw, -1)))
> + return;
> +
> + /* false-positive is possible but harmless */
> + if (atomic_dec_and_test(&brw->slow_read_ctr))
> + wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> +{
> + unsigned int sum = 0;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + sum += per_cpu(*brw->fast_read_ctr, cpu);
> + per_cpu(*brw->fast_read_ctr, cpu) = 0;
> + }
> +
> + return sum;
> +}
> +
> +/*
> + * A writer takes ->writer_mutex to exclude other writers and to force the
> + * readers to switch to the slow mode, note the mutex_is_locked() check in
> + * update_fast_ctr().
> + *
> + * After that the readers can only inc/dec the slow ->slow_read_ctr
> counter,
> + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> + * counter it represents the number of active readers.
> + *
> + * Finally the writer takes ->rw_sem for writing and blocks the new
> readers,
> + * then waits until the slow counter becomes zero.
> + */
> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> + /* also blocks update_fast_ctr() which checks mutex_is_locked() */
> + mutex_lock(&brw->writer_mutex);
> +
> + /*
> + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> + * so that update_fast_ctr() can't succeed.
> + *
> + * 2. Ensures we see the result of every previous this_cpu_add() in
> + * update_fast_ctr().
> + *
> + * 3. Ensures that if any reader has exited its critical section via
> + * fast-path, it executes a full memory barrier before we return.
> + */
> + synchronize_sched();
> +
> + /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> + atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> +
> + /* block the new readers completely */
> + down_write(&brw->rw_sem);
> +
> + /* wait for all readers to complete their percpu_up_read() */
> + wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> +}
> +
> +void percpu_up_write(struct percpu_rw_semaphore *brw)
> +{
> + /* allow the new readers, but only the slow-path */
> + up_write(&brw->rw_sem);
> +
> + /* insert the barrier before the next fast-path in down_read */
> + synchronize_sched();
> +
> + mutex_unlock(&brw->writer_mutex);
> +}
>
> --
> Jens Axboe
>
>

2012-11-27 10:06:17

by Jeff Chua

Subject: Re: Recent kernel "mount" slow

On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
> On 2012-11-27 06:57, Jeff Chua wrote:
>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>>> So it's better to slow down mount.
>>>
>>> I am quite proud of the linux boot time pitting against other OS. Even
>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>> you're saying that we need to do this semaphore check at boot up. By
>>> doing so, it's inducing additional 4 seconds during boot up.
>>
>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>> kind of degradation would this cause or just the same?
>
> It'd likely be the same slow down time wise, but as a percentage it
> would appear smaller on a slower disk.
>
> Could you please test Mikulas' suggestion of changing
> synchronize_sched() in include/linux/percpu-rwsem.h to
> synchronize_sched_expedited()?

Tested. It seems as fast as before, but maybe a "tick" slower; just
perception. I was getting pretty much 0.012s with everything reverted.
With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
So, it's good.


> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
> tree. It would be a good data point it you could test that, too.

Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.

# time mount /dev/sda1 /mnt; sync; sync; umount /mnt


So, here's the comparison ...

0.500s 3.7.0-rc7
0.168s 3.7.0-rc2
0.012s 3.6.0
0.013s 3.7.0-rc7 + synchronize_sched_expedited()
0.350s 3.7.0-rc7 + Oleg's patch.


Thanks,
Jeff.

2012-11-27 12:34:52

by Jens Axboe

Subject: Re: Recent kernel "mount" slow

On 2012-11-27 11:06, Jeff Chua wrote:
> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
>> On 2012-11-27 06:57, Jeff Chua wrote:
>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>>>> So it's better to slow down mount.
>>>>
>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>> you're saying that we need to do this semaphore check at boot up. By
>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>
>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>> kind of degradation would this cause or just the same?
>>
>> It'd likely be the same slow down time wise, but as a percentage it
>> would appear smaller on a slower disk.
>>
>> Could you please test Mikulas' suggestion of changing
>> synchronize_sched() in include/linux/percpu-rwsem.h to
>> synchronize_sched_expedited()?
>
> Tested. It seems as fast as before, but may be a "tick" slower. Just
> perception. I was getting pretty much 0.012s with everything reverted.
> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
> So, it's good.

Excellent

>> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
>> tree. It would be a good data point it you could test that, too.
>
> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.

Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
the real issue, which is having to do synchronize_sched() in the first
place.

> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>
>
> So, here's the comparison ...
>
> 0.500s 3.7.0-rc7
> 0.168s 3.7.0-rc2
> 0.012s 3.6.0
> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
> 0.350s 3.7.0-rc7 + Oleg's patch.

I wonder how many of them are due to changing to the same block size.
Does the below patch make a difference?

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..f041c56 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size)
if (size < bdev_logical_block_size(bdev))
return -EINVAL;

- /* Prevent starting I/O or mapping the device */
- percpu_down_write(&bdev->bd_block_size_semaphore);
-
/* Check that the block device is not memory mapped */
mapping = bdev->bd_inode->i_mapping;
mutex_lock(&mapping->i_mmap_mutex);
if (mapping_mapped(mapping)) {
mutex_unlock(&mapping->i_mmap_mutex);
- percpu_up_write(&bdev->bd_block_size_semaphore);
return -EBUSY;
}
mutex_unlock(&mapping->i_mmap_mutex);

/* Don't change the size if it is same as current */
if (bdev->bd_block_size != size) {
- sync_blockdev(bdev);
- bdev->bd_block_size = size;
- bdev->bd_inode->i_blkbits = blksize_bits(size);
- kill_bdev(bdev);
+ /* Prevent starting I/O */
+ percpu_down_write(&bdev->bd_block_size_semaphore);
+ if (bdev->bd_block_size != size) {
+ sync_blockdev(bdev);
+ bdev->bd_block_size = size;
+ bdev->bd_inode->i_blkbits = blksize_bits(size);
+ kill_bdev(bdev);
+ }
+ percpu_up_write(&bdev->bd_block_size_semaphore);
}

- percpu_up_write(&bdev->bd_block_size_semaphore);
-
return 0;
}

@@ -1649,14 +1648,12 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);

static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
+ struct address_space *mapping = file->f_mapping;
int ret;
- struct block_device *bdev = I_BDEV(file->f_mapping->host);
-
- percpu_down_read(&bdev->bd_block_size_semaphore);

+ mutex_lock(&mapping->i_mmap_mutex);
ret = generic_file_mmap(file, vma);
-
- percpu_up_read(&bdev->bd_block_size_semaphore);
+ mutex_unlock(&mapping->i_mmap_mutex);

return ret;
}

--
Jens Axboe

2012-11-28 03:58:16

by Mikulas Patocka

Subject: Re: Recent kernel "mount" slow



On Tue, 27 Nov 2012, Jens Axboe wrote:

> On 2012-11-27 11:06, Jeff Chua wrote:
> > On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
> >> On 2012-11-27 06:57, Jeff Chua wrote:
> >>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
> >>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
> >>>>> So it's better to slow down mount.
> >>>>
> >>>> I am quite proud of the linux boot time pitting against other OS. Even
> >>>> with 10 partitions. Linux can boot up in just a few seconds, but now
> >>>> you're saying that we need to do this semaphore check at boot up. By
> >>>> doing so, it's inducing additional 4 seconds during boot up.
> >>>
> >>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
> >>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
> >>> kind of degradation would this cause or just the same?
> >>
> >> It'd likely be the same slow down time wise, but as a percentage it
> >> would appear smaller on a slower disk.
> >>
> >> Could you please test Mikulas' suggestion of changing
> >> synchronize_sched() in include/linux/percpu-rwsem.h to
> >> synchronize_sched_expedited()?
> >
> > Tested. It seems as fast as before, but may be a "tick" slower. Just
> > perception. I was getting pretty much 0.012s with everything reverted.
> > With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
> > So, it's good.
>
> Excellent
>
> >> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
> >> tree. It would be a good data point it you could test that, too.
> >
> > Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>
> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
> the real issue, which is having to do synchronize_sched() in the first
> place.
>
> > # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
> >
> >
> > So, here's the comparison ...
> >
> > 0.500s 3.7.0-rc7
> > 0.168s 3.7.0-rc2
> > 0.012s 3.6.0
> > 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
> > 0.350s 3.7.0-rc7 + Oleg's patch.
>
> I wonder how many of them are due to changing to the same block size.
> Does the below patch make a difference?

This patch is wrong because you must check if the device is mapped while
holding bdev->bd_block_size_semaphore (because
bdev->bd_block_size_semaphore prevents new mappings from being created).

I'm sending another patch that has the same effect.


Note that ext[234] filesystems set the blocksize to 1024 temporarily during
mount, so it doesn't help much (it only helps for other filesystems, such
as jfs). For ext[234], you have a device with a default block size of 4096;
the filesystem sets the block size to 1024 during mount, reads the super
block, and sets it back to 4096.

If you want, you can fix ext[234] to avoid this behavior.

Mikulas


> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 1a1e5e3..f041c56 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size)
> if (size < bdev_logical_block_size(bdev))
> return -EINVAL;
>
> - /* Prevent starting I/O or mapping the device */
> - percpu_down_write(&bdev->bd_block_size_semaphore);
> -
> /* Check that the block device is not memory mapped */
> mapping = bdev->bd_inode->i_mapping;
> mutex_lock(&mapping->i_mmap_mutex);
> if (mapping_mapped(mapping)) {
> mutex_unlock(&mapping->i_mmap_mutex);
> - percpu_up_write(&bdev->bd_block_size_semaphore);
> return -EBUSY;
> }
> mutex_unlock(&mapping->i_mmap_mutex);
>
> /* Don't change the size if it is same as current */
> if (bdev->bd_block_size != size) {
> - sync_blockdev(bdev);
> - bdev->bd_block_size = size;
> - bdev->bd_inode->i_blkbits = blksize_bits(size);
> - kill_bdev(bdev);
> + /* Prevent starting I/O */
> + percpu_down_write(&bdev->bd_block_size_semaphore);
> + if (bdev->bd_block_size != size) {
> + sync_blockdev(bdev);
> + bdev->bd_block_size = size;
> + bdev->bd_inode->i_blkbits = blksize_bits(size);
> + kill_bdev(bdev);
> + }
> + percpu_up_write(&bdev->bd_block_size_semaphore);
> }
>
> - percpu_up_write(&bdev->bd_block_size_semaphore);
> -
> return 0;
> }
>
> @@ -1649,14 +1648,12 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);
>
> static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
> {
> + struct address_space *mapping = file->f_mapping;
> int ret;
> - struct block_device *bdev = I_BDEV(file->f_mapping->host);
> -
> - percpu_down_read(&bdev->bd_block_size_semaphore);
>
> + mutex_lock(&mapping->i_mmap_mutex);
> ret = generic_file_mmap(file, vma);
> -
> - percpu_up_read(&bdev->bd_block_size_semaphore);
> + mutex_unlock(&mapping->i_mmap_mutex);
>
> return ret;
> }
>
> --
> Jens Axboe
>

2012-11-28 04:02:54

by Mikulas Patocka

Subject: [PATCH 2/2] block_dev: don't take the write lock if block size doesn't change

block_dev: don't take the write lock if block size doesn't change

Taking the write lock has a big performance impact on the whole system
(because of synchronize_sched_expedited). This patch avoids taking the
write lock if the block size doesn't change (i.e. when mounting
a filesystem with a block size equal to the default block size).

The logic to test if the block device is mapped was moved to a separate
function is_bdev_mapped to avoid code duplication.

Signed-off-by: Mikulas Patocka <[email protected]>

---
fs/block_dev.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)

Index: linux-3.7-rc7/fs/block_dev.c
===================================================================
--- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 04:09:01.000000000 +0100
+++ linux-3.7-rc7/fs/block_dev.c 2012-11-28 04:13:53.000000000 +0100
@@ -114,10 +114,18 @@ void invalidate_bdev(struct block_device
}
EXPORT_SYMBOL(invalidate_bdev);

-int set_blocksize(struct block_device *bdev, int size)
+static int is_bdev_mapped(struct block_device *bdev)
{
- struct address_space *mapping;
+ int ret_val;
+ struct address_space *mapping = bdev->bd_inode->i_mapping;
+ mutex_lock(&mapping->i_mmap_mutex);
+ ret_val = mapping_mapped(mapping);
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return ret_val;
+}

+int set_blocksize(struct block_device *bdev, int size)
+{
/* Size must be a power of two, and between 512 and PAGE_SIZE */
if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
return -EINVAL;
@@ -126,18 +134,21 @@ int set_blocksize(struct block_device *b
if (size < bdev_logical_block_size(bdev))
return -EINVAL;

+ /*
+ * If the block size doesn't change, don't take the write lock.
+ * We check for is_bdev_mapped anyway, for consistent behavior.
+ */
+ if (size == bdev->bd_block_size)
+ return is_bdev_mapped(bdev) ? -EBUSY : 0;
+
/* Prevent starting I/O or mapping the device */
percpu_down_write(&bdev->bd_block_size_semaphore);

/* Check that the block device is not memory mapped */
- mapping = bdev->bd_inode->i_mapping;
- mutex_lock(&mapping->i_mmap_mutex);
- if (mapping_mapped(mapping)) {
- mutex_unlock(&mapping->i_mmap_mutex);
+ if (is_bdev_mapped(bdev)) {
percpu_up_write(&bdev->bd_block_size_semaphore);
return -EBUSY;
}
- mutex_unlock(&mapping->i_mmap_mutex);

/* Don't change the size if it is same as current */
if (bdev->bd_block_size != size) {

2012-11-28 04:38:10

by Mikulas Patocka

Subject: [PATCH 1/2] percpu-rwsem: use synchronize_sched_expedited



On Tue, 27 Nov 2012, Jeff Chua wrote:

> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
> > On 2012-11-27 06:57, Jeff Chua wrote:
> >> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
> >>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
> >>>> So it's better to slow down mount.
> >>>
> >>> I am quite proud of the linux boot time pitting against other OS. Even
> >>> with 10 partitions. Linux can boot up in just a few seconds, but now
> >>> you're saying that we need to do this semaphore check at boot up. By
> >>> doing so, it's inducing additional 4 seconds during boot up.
> >>
> >> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
> >> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
> >> kind of degradation would this cause or just the same?
> >
> > It'd likely be the same slow down time wise, but as a percentage it
> > would appear smaller on a slower disk.
> >
> > Could you please test Mikulas' suggestion of changing
> > synchronize_sched() in include/linux/percpu-rwsem.h to
> > synchronize_sched_expedited()?
>
> Tested. It seems as fast as before, but may be a "tick" slower. Just
> perception. I was getting pretty much 0.012s with everything reverted.
> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
> So, it's good.
>
>
> > linux-next also has a re-write of the per-cpu rw sems, out of Andrews
> > tree. It would be a good data point it you could test that, too.
>
> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>
> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>
>
> So, here's the comparison ...
>
> 0.500s 3.7.0-rc7
> 0.168s 3.7.0-rc2
> 0.012s 3.6.0
> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
> 0.350s 3.7.0-rc7 + Oleg's patch.
>
>
> Thanks,
> Jeff.

OK, I'm sending two patches to reduce mount times. If it is possible to
put them into 3.7.0, put them there.

Mikulas

---

percpu-rwsem: use synchronize_sched_expedited

Use synchronize_sched_expedited() instead of synchronize_sched()
to improve mount speed.

This patch improves mount time from 0.500s to 0.013s.

Note: if realtime people complain about the use of
synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
that they introduce an option CONFIG_REALTIME or
/proc/sys/kernel/realtime and turn off these *_expedited functions if
the option is enabled (i.e. turn synchronize_sched_expedited into
synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).

Signed-off-by: Mikulas Patocka <[email protected]>

---
include/linux/percpu-rwsem.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-3.7-rc7/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.7-rc7.orig/include/linux/percpu-rwsem.h 2012-11-28 02:41:03.000000000 +0100
+++ linux-3.7-rc7/include/linux/percpu-rwsem.h 2012-11-28 02:41:15.000000000 +0100
@@ -13,7 +13,7 @@ struct percpu_rw_semaphore {
};

#define light_mb() barrier()
-#define heavy_mb() synchronize_sched()
+#define heavy_mb() synchronize_sched_expedited()

static inline void percpu_down_read(struct percpu_rw_semaphore *p)
{
@@ -51,7 +51,7 @@ static inline void percpu_down_write(str
{
mutex_lock(&p->mtx);
p->locked = true;
- synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
+ synchronize_sched_expedited(); /* make sure that all readers exit the rcu_read_lock_sched region */
while (__percpu_count(p->counters))
msleep(1);
heavy_mb(); /* C, between read of p->counter and write to data, paired with B */

2012-11-28 08:35:05

by Jens Axboe

Subject: Re: Recent kernel "mount" slow

On 2012-11-28 04:57, Mikulas Patocka wrote:
>
>
> On Tue, 27 Nov 2012, Jens Axboe wrote:
>
>> On 2012-11-27 11:06, Jeff Chua wrote:
>>> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
>>>> On 2012-11-27 06:57, Jeff Chua wrote:
>>>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>>>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>>>>>> So it's better to slow down mount.
>>>>>>
>>>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>>>> you're saying that we need to do this semaphore check at boot up. By
>>>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>>>
>>>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>>>> kind of degradation would this cause or just the same?
>>>>
>>>> It'd likely be the same slow down time wise, but as a percentage it
>>>> would appear smaller on a slower disk.
>>>>
>>>> Could you please test Mikulas' suggestion of changing
>>>> synchronize_sched() in include/linux/percpu-rwsem.h to
>>>> synchronize_sched_expedited()?
>>>
>>> Tested. It seems as fast as before, but may be a "tick" slower. Just
>>> perception. I was getting pretty much 0.012s with everything reverted.
>>> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
>>> So, it's good.
>>
>> Excellent
>>
>>>> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
>>>> tree. It would be a good data point it you could test that, too.
>>>
>>> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>>
>> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
>> the real issue, which is having to do synchronize_sched() in the first
>> place.
>>
>>> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>>>
>>>
>>> So, here's the comparison ...
>>>
>>> 0.500s 3.7.0-rc7
>>> 0.168s 3.7.0-rc2
>>> 0.012s 3.6.0
>>> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
>>> 0.350s 3.7.0-rc7 + Oleg's patch.
>>
>> I wonder how many of them are due to changing to the same block size.
>> Does the below patch make a difference?
>
> This patch is wrong because you must check if the device is mapped while
> holding bdev->bd_block_size_semaphore (because
> bdev->bd_block_size_semaphore prevents new mappings from being created)

No it doesn't. If you read the patch, that was moved to i_mmap_mutex.

> I'm sending another patch that has the same effect.
>
>
> Note that ext[234] filesystems set blocksize to 1024 temporarily during
> mount, so it doesn't help much (it only helps for other filesystems, such
> as jfs). For ext[234], you have a device with default block size 4096, the
> filesystem sets block size to 1024 during mount, reads the super block and
> sets it back to 4096.

That is true, hence I was hesitant to think it'll actually help. In any
case, basically any block device will have at least one blocksize
transition when being mounted for the first time. I wonder if we just
shouldn't default to having a 4kb soft block size to avoid that one,
though it is working around the issue to some degree.

--
Jens Axboe
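
To make the ext[234] point concrete: those filesystems read their superblock at a fixed 1024-byte offset, so mount first forces a 1024-byte block size and then switches to the size recorded on disk, which means the "block size unchanged" fast path of the second patch can never apply to them. A rough sketch of the mount-time sequence (simplified and illustrative; the real ext*_fill_super code is more involved):

static int extN_fill_super_sketch(struct super_block *sb)
{
        int blocksize;

        /* First set_blocksize() call: force a block size of (at least) 1024
         * bytes so the superblock at offset 1024 can be read on a device
         * whose current soft block size is 4096. */
        blocksize = sb_min_blocksize(sb, 1024);
        if (!blocksize)
                return -EINVAL;

        /* ... read and validate the on-disk superblock here ... */

        /* Second set_blocksize() call: switch to the block size recorded in
         * the superblock (typically 4096 again). */
        if (!sb_set_blocksize(sb, 4096 /* value taken from the superblock */))
                return -EINVAL;

        return 0;
}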

2012-11-28 13:05:30

by Jeff Chua

[permalink] [raw]
Subject: Re: Recent kernel "mount" slow

On Wed, Nov 28, 2012 at 4:33 PM, Jens Axboe <[email protected]> wrote:
> On 2012-11-28 04:57, Mikulas Patocka wrote:
>>
>>
>> On Tue, 27 Nov 2012, Jens Axboe wrote:
>>
>>> On 2012-11-27 11:06, Jeff Chua wrote:
>>>> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
>>>>> On 2012-11-27 06:57, Jeff Chua wrote:
>>>>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>>>>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>>>>>>>> So it's better to slow down mount.
>>>>>>>
>>>>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>>>>> you're saying that we need to do this semaphore check at boot up. By
>>>>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>>>>
>>>>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>>>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>>>>> kind of degradation would this cause or just the same?
>>>>>
>>>>> It'd likely be the same slow down time wise, but as a percentage it
>>>>> would appear smaller on a slower disk.
>>>>>
>>>>> Could you please test Mikulas' suggestion of changing
>>>>> synchronize_sched() in include/linux/percpu-rwsem.h to
>>>>> synchronize_sched_expedited()?
>>>>
>>>> Tested. It seems as fast as before, but may be a "tick" slower. Just
>>>> perception. I was getting pretty much 0.012s with everything reverted.
>>>> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
>>>> So, it's good.
>>>
>>> Excellent
>>>
>>>>> linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
>>>>> tree. It would be a good data point if you could test that, too.
>>>>
>>>> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>>>
>>> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
>>> the real issue, which is having to do synchronize_sched() in the first
>>> place.
>>>
>>>> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>>>>
>>>>
>>>> So, here's the comparison ...
>>>>
>>>> 0.500s 3.7.0-rc7
>>>> 0.168s 3.7.0-rc2
>>>> 0.012s 3.6.0
>>>> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
>>>> 0.350s 3.7.0-rc7 + Oleg's patch.
>>>
>>> I wonder how many of them are due to changing to the same block size.
>>> Does the below patch make a difference?
>>
>> This patch is wrong because you must check if the device is mapped while
>> holding bdev->bd_block_size_semaphore (because
>> bdev->bd_block_size_semaphore prevents new mappings from being created)
>
> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.
>
>> I'm sending another patch that has the same effect.
>>
>>
>> Note that ext[234] filesystems set blocksize to 1024 temporarily during
>> mount, so it doesn't help much (it only helps for other filesystems, such
>> as jfs). For ext[234], you have a device with default block size 4096, the
>> filesystem sets block size to 1024 during mount, reads the super block and
>> sets it back to 4096.
>
> That is true, hence I was hesitant to think it'll actually help. In any
> case, basically any block device will have at least one blocksize
> transition when being mounted for the first time. I wonder if we just
> shouldn't default to having a 4kb soft block size to avoid that one,
> though it is working around the issue to some degree.

I tested on reiserfs. It helped. 0.012s as in 3.6.0, but as Mikulas
mentioned, it didn't really improve much for ext2.

Jeff.

2012-11-28 14:19:08

by Jeff Chua

[permalink] [raw]
Subject: Re: [PATCH 1/2] percpu-rwsem: use synchronize_sched_expedited

On Wed, Nov 28, 2012 at 11:59 AM, Mikulas Patocka <[email protected]> wrote:
>
>
> On Tue, 27 Nov 2012, Jeff Chua wrote:
>
>> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <[email protected]> wrote:
>> > On 2012-11-27 06:57, Jeff Chua wrote:
>> >> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <[email protected]> wrote:
>> >>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <[email protected]> wrote:
>> >>>> So it's better to slow down mount.
>> >>>
>> >>> I am quite proud of the linux boot time pitting against other OS. Even
>> >>> with 10 partitions. Linux can boot up in just a few seconds, but now
>> >>> you're saying that we need to do this semaphore check at boot up. By
>> >>> doing so, it's inducing additional 4 seconds during boot up.
>> >>
>> >> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>> >> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>> >> kind of degradation would this cause or just the same?
>> >
>> > It'd likely be the same slow down time wise, but as a percentage it
>> > would appear smaller on a slower disk.
>> >
>> > Could you please test Mikulas' suggestion of changing
>> > synchronize_sched() in include/linux/percpu-rwsem.h to
>> > synchronize_sched_expedited()?
>>
>> Tested. It seems as fast as before, but may be a "tick" slower. Just
>> perception. I was getting pretty much 0.012s with everything reverted.
>> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
>> So, it's good.
>>
>>
>> > linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
>> > tree. It would be a good data point if you could test that, too.
>>
>> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>>
>> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>>
>>
>> So, here's the comparison ...
>>
>> 0.500s 3.7.0-rc7
>> 0.168s 3.7.0-rc2
>> 0.012s 3.6.0
>> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
>> 0.350s 3.7.0-rc7 + Oleg's patch.
>>
>>
>> Thanks,
>> Jeff.
>
> OK, I'm sending two patches to reduce mount times. If it is possible to
> put them to 3.7.0, put them there.
>
> Mikulas
>
> ---
>
> percpu-rwsem: use synchronize_sched_expedited
>
> Use synchronize_sched_expedited() instead of synchronize_sched()
> to improve mount speed.
>
> This patch improves mount time from 0.500s to 0.013s.
>
> Note: if realtime people complain about the use of
> synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
> that they introduce an option CONFIG_REALTIME or
> /proc/sys/kernel/realtime and turn off these *_expedited functions if
> the option is enabled (i.e. turn synchronize_sched_expedited into
> synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> include/linux/percpu-rwsem.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux-3.7-rc7/include/linux/percpu-rwsem.h
> ===================================================================
> --- linux-3.7-rc7.orig/include/linux/percpu-rwsem.h 2012-11-28 02:41:03.000000000 +0100
> +++ linux-3.7-rc7/include/linux/percpu-rwsem.h 2012-11-28 02:41:15.000000000 +0100
> @@ -13,7 +13,7 @@ struct percpu_rw_semaphore {
> };
>
> #define light_mb() barrier()
> -#define heavy_mb() synchronize_sched()
> +#define heavy_mb() synchronize_sched_expedited()
>
> static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> {
> @@ -51,7 +51,7 @@ static inline void percpu_down_write(str
> {
> mutex_lock(&p->mtx);
> p->locked = true;
> - synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
> + synchronize_sched_expedited(); /* make sure that all readers exit the rcu_read_lock_sched region */
> while (__percpu_count(p->counters))
> msleep(1);
> heavy_mb(); /* C, between read of p->counter and write to data, paired with B */


Mikulas,

Tested this one and this is good! Back to 3.6.0 behavior.

As for the 2nd patch (block_dev.c), it didn't really make any
difference for ext2/3/4, but for reiserfs, it does. So, won't just the
patch above (synchronize_sched_expedited) be good enough?


Thanks,
Jeff

2012-11-28 14:24:11

by Jeff Chua

[permalink] [raw]
Subject: Re: [PATCH 2/2] block_dev: don't take the write lock if block size doesn't change

On Wed, Nov 28, 2012 at 12:01 PM, Mikulas Patocka <[email protected]> wrote:
> block_dev: don't take the write lock if block size doesn't change
>
> Taking the write lock has a big performance impact on the whole system
> (because of synchronize_sched_expedited). This patch avoids taking the
> write lock if the block size doesn't change (i.e. when mounting
> filesystem with block size equal to the default block size).
>
> The logic to test if the block device is mapped was moved to a separate
> function is_bdev_mapped to avoid code duplication.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> fs/block_dev.c | 25 ++++++++++++++++++-------
> 1 file changed, 18 insertions(+), 7 deletions(-)
>
> Index: linux-3.7-rc7/fs/block_dev.c
> ===================================================================
> --- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 04:09:01.000000000 +0100
> +++ linux-3.7-rc7/fs/block_dev.c 2012-11-28 04:13:53.000000000 +0100
> @@ -114,10 +114,18 @@ void invalidate_bdev(struct block_device
> }
> EXPORT_SYMBOL(invalidate_bdev);
>
> -int set_blocksize(struct block_device *bdev, int size)
> +static int is_bdev_mapped(struct block_device *bdev)
> {
> - struct address_space *mapping;
> + int ret_val;
> + struct address_space *mapping = bdev->bd_inode->i_mapping;
> + mutex_lock(&mapping->i_mmap_mutex);
> + ret_val = mapping_mapped(mapping);
> + mutex_unlock(&mapping->i_mmap_mutex);
> + return ret_val;
> +}
>
> +int set_blocksize(struct block_device *bdev, int size)
> +{
> /* Size must be a power of two, and between 512 and PAGE_SIZE */
> if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
> return -EINVAL;
> @@ -126,18 +134,21 @@ int set_blocksize(struct block_device *b
> if (size < bdev_logical_block_size(bdev))
> return -EINVAL;
>
> + /*
> + * If the block size doesn't change, don't take the write lock.
> + * We check for is_bdev_mapped anyway, for consistent behavior.
> + */
> + if (size == bdev->bd_block_size)
> + return is_bdev_mapped(bdev) ? -EBUSY : 0;
> +
> /* Prevent starting I/O or mapping the device */
> percpu_down_write(&bdev->bd_block_size_semaphore);
>
> /* Check that the block device is not memory mapped */
> - mapping = bdev->bd_inode->i_mapping;
> - mutex_lock(&mapping->i_mmap_mutex);
> - if (mapping_mapped(mapping)) {
> - mutex_unlock(&mapping->i_mmap_mutex);
> + if (is_bdev_mapped(bdev)) {
> percpu_up_write(&bdev->bd_block_size_semaphore);
> return -EBUSY;
> }
> - mutex_unlock(&mapping->i_mmap_mutex);
>
> /* Don't change the size if it is same as current */
> if (bdev->bd_block_size != size) {


This patch didn't really make any difference for ext2/3/4 but for
reiserfs it does.

With the synchronize_sched_expedited() patch applied, it didn't make
any difference.


Thanks,
Jeff

2012-11-28 17:26:36

by Mikulas Patocka

[permalink] [raw]
Subject: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)



On Wed, 28 Nov 2012, Jens Axboe wrote:

> On 2012-11-28 04:57, Mikulas Patocka wrote:
> >
> > This patch is wrong because you must check if the device is mapped while
> > holding bdev->bd_block_size_semaphore (because
> > bdev->bd_block_size_semaphore prevents new mappings from being created)
>
> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.

Hmm, it was wrong before the patch and it is wrong after the patch too.

The problem is that ->mmap method doesn't do the actual mapping, the
caller of ->mmap (mmap_region) does it. So we must actually catch
mmap_region and protect it with the lock, not ->mmap.

Mikulas

---

Introduce a method to catch mmap_region

For block devices, we must prevent the device from being mapped while
block size is being changed.

This was implemented by catching the mmap method and taking
bd_block_size_semaphore in it.

The problem is that the generic_file_mmap method does almost nothing: it
only sets vma->vm_ops = &generic_file_vm_ops; all the hard work is done
by the caller, mmap_region. Consequently, locking the mmap method is
insufficient to prevent the race condition. The race condition can
happen for example this way:

process 1:
Starts mapping a block device, it goes to mmap_region, calls
file->f_op->mmap (blkdev_mmap - that acquires and releases
bd_block_size_semaphore), and reschedule happens just after blkdev_mmap
returns.

process 2:
Changes block device size, goes into set_blocksize, acquires
percpu_down_write, calls mapping_mapped to test if the device is mapped
(it still isn't). Then, it reschedules.

process 1:
Continues in mmap_region, successfully finishes the mapping.

process 2:
Continues in set_blocksize, changes the block size while the block
device is mapped. (which is incorrect and may result in a crash or
misbehavior).

To fix this possible race condition, we need to catch the function that
does real mapping - mmap_region. This patch adds a new method,
file_operations->mmap_region. mmap_region calls the method
file_operations->mmap_region if it is non-NULL, otherwise, it calls
generic_mmap_region.

Signed-off-by: Mikulas Patocka <[email protected]>

---
fs/block_dev.c | 12 ++++++++----
include/linux/fs.h | 4 ++++
include/linux/mm.h | 3 +++
mm/mmap.c | 10 ++++++++++
4 files changed, 25 insertions(+), 4 deletions(-)

Index: linux-3.7-rc7/include/linux/fs.h
===================================================================
--- linux-3.7-rc7.orig/include/linux/fs.h 2012-11-28 17:45:55.000000000 +0100
+++ linux-3.7-rc7/include/linux/fs.h 2012-11-28 18:02:45.000000000 +0100
@@ -27,6 +27,7 @@
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/blk_types.h>
+#include <linux/mm_types.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1528,6 +1529,9 @@ struct file_operations {
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+ unsigned long (*mmap_region)(struct file *, unsigned long,
+ unsigned long, unsigned long, vm_flags_t,
+ unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
Index: linux-3.7-rc7/include/linux/mm.h
===================================================================
--- linux-3.7-rc7.orig/include/linux/mm.h 2012-11-28 17:45:55.000000000 +0100
+++ linux-3.7-rc7/include/linux/mm.h 2012-11-28 17:47:12.000000000 +0100
@@ -1444,6 +1444,9 @@ extern unsigned long get_unmapped_area(s
extern unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff);
+extern unsigned long generic_mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long flags,
+ vm_flags_t vm_flags, unsigned long pgoff);
extern unsigned long do_mmap_pgoff(struct file *, unsigned long,
unsigned long, unsigned long,
unsigned long, unsigned long);
Index: linux-3.7-rc7/mm/mmap.c
===================================================================
--- linux-3.7-rc7.orig/mm/mmap.c 2012-11-28 17:45:55.000000000 +0100
+++ linux-3.7-rc7/mm/mmap.c 2012-11-28 17:57:40.000000000 +0100
@@ -1244,6 +1244,16 @@ unsigned long mmap_region(struct file *f
unsigned long len, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff)
{
+ if (file && file->f_op->mmap_region)
+ return file->f_op->mmap_region(file, addr, len, flags, vm_flags,
+ pgoff);
+ return generic_mmap_region(file, addr, len, flags, vm_flags, pgoff);
+}
+
+unsigned long generic_mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long flags,
+ vm_flags_t vm_flags, unsigned long pgoff)
+{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
int correct_wcount = 0;
Index: linux-3.7-rc7/fs/block_dev.c
===================================================================
--- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 17:45:56.000000000 +0100
+++ linux-3.7-rc7/fs/block_dev.c 2012-11-28 17:47:12.000000000 +0100
@@ -1658,14 +1658,17 @@ ssize_t blkdev_aio_write(struct kiocb *i
}
EXPORT_SYMBOL_GPL(blkdev_aio_write);

-static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+static unsigned long blkdev_mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long flags,
+ vm_flags_t vm_flags,
+ unsigned long pgoff)
{
- int ret;
+ unsigned long ret;
struct block_device *bdev = I_BDEV(file->f_mapping->host);

percpu_down_read(&bdev->bd_block_size_semaphore);

- ret = generic_file_mmap(file, vma);
+ ret = generic_mmap_region(file, addr, len, flags, vm_flags, pgoff);

percpu_up_read(&bdev->bd_block_size_semaphore);

@@ -1737,7 +1740,8 @@ const struct file_operations def_blk_fop
.write = do_sync_write,
.aio_read = blkdev_aio_read,
.aio_write = blkdev_aio_write,
- .mmap = blkdev_mmap,
+ .mmap_region = blkdev_mmap_region,
+ .mmap = generic_file_mmap,
.fsync = blkdev_fsync,
.unlocked_ioctl = block_ioctl,
#ifdef CONFIG_COMPAT

2012-11-28 19:15:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

No, this is crap.

We don't introduce random hooks like this just because the block layer
has shit-for-brains and cannot be bothered to do things right.

The fact is, the whole locking in the block layer open routine is
total and utter crap. It doesn't lock the right thing, even with your
change *anyway* (or with the change Jens had). Absolutely nothing in
"mmap_region()" cares at all about the block-size anywhere - it's
generic, after all - so locking around it is f*cking pointless. There
is no way in hell that the caller of ->mmap can *ever* care about the
block size, since it never even looks at it.

Don't do random crap like this.

Why does the code think that mmap matters so much anyway? As you say,
the mmap itself does *nothing*. It has no impact for the block size.

Linus

On Wed, Nov 28, 2012 at 9:25 AM, Mikulas Patocka <[email protected]> wrote:
>
>
> On Wed, 28 Nov 2012, Jens Axboe wrote:
>
>> On 2012-11-28 04:57, Mikulas Patocka wrote:
>> >
>> > This patch is wrong because you must check if the device is mapped while
>> > holding bdev->bd_block_size_semaphore (because
>> > bdev->bd_block_size_semaphore prevents new mappings from being created)
>>
>> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.
>
> Hmm, it was wrong before the patch and it is wrong after the patch too.
>
> The problem is that ->mmap method doesn't do the actual mapping, the
> caller of ->mmap (mmap_region) does it. So we must actually catch
> mmap_region and protect it with the lock, not ->mmap.

2012-11-28 19:43:27

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 11:15:12AM -0800, Linus Torvalds wrote:
> No, this is crap.
>
> We don't introduce random hooks like this just because the block layer
> has shit-for-brains and cannot be bothered to do things right.
>
> The fact is, the whole locking in the block layer open routine is
> total and utter crap. It doesn't lock the right thing, even with your
> change *anyway* (or with the change Jens had). Absolutely nothing in
> "mmap_region()" cares at all about the block-size anywhere - it's
> generic, after all - so locking around it is f*cking pointless. There
> is no way in hell that the caller of ->mmap can *ever* care about the
> block size, since it never even looks at it.
>
> Don't do random crap like this.
>
> Why does the code think that mmap matters so much anyway? As you say,
> the mmap itself does *nothing*. It has no impact for the block size.

... and here I was, trying to massage a reply into form that would be
at least borderline printable... Anyway, this is certainly the wrong
way to go. We want to catch if the damn thing is mapped and mapping_mapped()
leaves a race? Fine, so let's not use mapping_mapped() at all. Have a
private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
added to it. Incrementing/decrementing a per-block_device atomic counter,
with increment side done under your rwsem, to make sure that 0->positive
transition doesn't change in critical section of set_blocksize(). And don't
use generic_file_mmap(), just do file_accessed() and set ->vm_ops to that
local copy.
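
In outline, the counter-based scheme Al describes would look something like the sketch below (names and details illustrative only); Mikulas posts a complete patch along these lines later in the thread:

static void blkdev_vm_open(struct vm_area_struct *vma)
{
        atomic_inc(&I_BDEV(vma->vm_file->f_mapping->host)->bd_mmap_count);
}

static void blkdev_vm_close(struct vm_area_struct *vma)
{
        atomic_dec(&I_BDEV(vma->vm_file->f_mapping->host)->bd_mmap_count);
}

static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct block_device *bdev = I_BDEV(file->f_mapping->host);

        file_accessed(file);
        /* Take the rwsem for read so the 0 -> 1 transition of the counter
         * cannot slip past set_blocksize() checking it under the write lock. */
        percpu_down_read(&bdev->bd_block_size_semaphore);
        atomic_inc(&bdev->bd_mmap_count);
        vma->vm_ops = &blkdev_vm_ops;   /* copy of generic_file_vm_ops plus the open/close above */
        percpu_up_read(&bdev->bd_block_size_semaphore);
        return 0;
}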

2012-11-28 19:51:50

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)



On Wed, 28 Nov 2012, Linus Torvalds wrote:

> No, this is crap.
>
> We don't introduce random hooks like this just because the block layer
> has shit-for-brains and cannot be bothered to do things right.
>
> The fact is, the whole locking in the block layer open routine is
> total and utter crap. It doesn't lock the right thing, even with your
> change *anyway* (or with the change Jens had). Absolutely nothing in
> "mmap_region()" cares at all about the block-size anywhere - it's
> generic, after all - so locking around it is f*cking pointless. There
> is no way in hell that the caller of ->mmap can *ever* care about the
> block size, since it never even looks at it.
>
> Don't do random crap like this.
>
> Why does the code think that mmap matters so much anyway? As you say,
> the mmap itself does *nothing*. It has no impact for the block size.
>
> Linus

mmap_region() doesn't care about the block size. But a lot of
page-in/page-out code does.

The problem is that once the block device is mapped, page faults or page
writeback can happen anytime - so the simplest solution is to not allow
the block device to be mapped while we change the block size.

The function set_blocksize takes bd_block_size_semaphore for write (that
blocks read/write/mmap), then it calls sync_blockdev (now we are sure that
there is no more writeback), then it changes the block size, then it calls
kill_bdev (now we are sure that there are no pages left with buffers
with the old blocksize).

If you want to allow to change block size while a block device is mapped,
you'd have to add explicit locks around all mm callbacks (so that the
block size can't change while the callback is in progress) - and even
then, there are some unsolvable cases - i.e. what are you going to do if
the user mlocks a mapped block device and you change block size of that
device? - you can't drop the pages (that would violate mlock semantics)
and you can't leave them there (because they have buffers with the wrong size).

If you don't like what I sent, propose a different solution.

Mikulas
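
For reference, the set_blocksize() flow being defended above has roughly this shape (a simplified sketch of the 3.7-rc code with the pending patches, not a verbatim copy):

int set_blocksize(struct block_device *bdev, int size)
{
        /* validity checks (power of two, >= logical block size) omitted */

        /* Block read/write/mmap on this device for the duration. */
        percpu_down_write(&bdev->bd_block_size_semaphore);

        /* Refuse if the device is currently memory mapped. */
        if (device_is_mapped(bdev)) {            /* however that test ends up implemented */
                percpu_up_write(&bdev->bd_block_size_semaphore);
                return -EBUSY;
        }

        if (bdev->bd_block_size != size) {
                sync_blockdev(bdev);             /* no writeback with the old size remains */
                bdev->bd_block_size = size;
                bdev->bd_inode->i_blkbits = blksize_bits(size);
                kill_bdev(bdev);                 /* drop pages carrying old-size buffers */
        }

        percpu_up_write(&bdev->bd_block_size_semaphore);
        return 0;
}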

2012-11-28 19:53:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 11:43 AM, Al Viro <[email protected]> wrote:
> Have a
> private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
> added to it.

That sounds more reasonable.

However, I suspect the *most* reasonable thing to do is to just remove
the whole damn thing. We really shouldn't care about mmap. If somebody
does a mmap on a block device, and somebody else then changes the
block size, why-ever should we bother to go through any contortions at
*all* to make that kind of insane behavior do anything sane at all.

Just let people mmap things. Then just let the normal page cache
invalidation work right. In fact, it is entirely possible that we
could/should just not even invalidate the page cache at all, just make
sure that the buffer heads attached to any pages get disconnected. No?

Linus

2012-11-28 20:03:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 11:50 AM, Mikulas Patocka <[email protected]> wrote:
>
> mmap_region() doesn't care about the block size. But a lot of
> page-in/page-out code does.

That seems a bogus argument.

mmap() is in *no* way special. The exact same thing happens for
regular read/write. Yet somehow the mmap code is special-cased, while
the normal read-write code is not.

I suspect it might be *easier* to trigger some issues with mmap, but
that still isn't a good enough reason to special-case it. We don't add
locking to one place just because that one place shows some race
condition more easily. We fix the locking.

So for example, maybe the code that *actually* cares about the buffer
size (the stuff that allocates buffers in fs/buffer.c) needs to take
that new percpu read lock. Basically, any caller of
"alloc_page_buffers()/create_empty_buffers()" or whatever.

I also wonder whether we need it *at*all*. I suspect that we could
easily have multiple block-sizes these days for the same block device.
It *used* to be (millions of years ago, when dinosaurs roamed the
earth) that the block buffers were global and shared with all users of
a partition. But that hasn't been true since we started using the page
cache, and I suspect that some of the block size changing issues are
simply entirely stale.

Yeah, yeah, there could be some coherency issues if people write to
the block device through different block sizes, but I think we have
those coherency issues anyway. The page-cache is not coherent across
different mapping inodes anyway.

So I really suspect that some of this is "legacy logic". Or at least
perhaps _should_ be.

Linus

2012-11-28 20:13:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds
<[email protected]> wrote:
>
> mmap() is in *no* way special. The exact same thing happens for
> regular read/write. Yet somehow the mmap code is special-cased, while
> the normal read-write code is not.

I just double-checked, because it's been a long time since I actually
looked at the code.

But yeah, block device read/write uses the pure page cache functions.
IOW, it has the *exact* same IO engine as mmap() would have.

So here's my suggestion:

- get rid of *all* the locking in aio_read/write and the splice paths
- get rid of all the stupid mmap games

- instead, add them to the functions that actually use
"blkdev_get_block()" and "blkdev_get_blocks()" and nowhere else.

That's a fairly limited number of functions:
blkdev_{read,write}page(), blkdev_direct_IO() and
blkdev_write_{begin,end}()

Doesn't that sounds simpler? And more logical: it protects the actual
places that use the block size of the device.

I dunno. Maybe there is some fundamental reason why the above is
broken, but it seems to be a much simpler approach. Sure, you need to
guarantee that the people who get the write-lock cannot possibly cause
IO while holding it, but since the only reason to get the write lock
would be to change the block size, that should be pretty simple, no?

Yeah, yeah, I'm probably missing something fundamental, but the above
sounds like the simple approach to fixing things. Aiming for having
the block size read-lock be taken by the things that pass in the
block-size itself.

It would be nice for things to be logical and straightforward.

Linus
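
Concretely, the read-lock side of what Linus is proposing would look something like this (an illustrative sketch, not his actual patch):

static int blkdev_readpage(struct file *file, struct page *page)
{
        struct block_device *bdev = I_BDEV(page->mapping->host);
        int ret;

        percpu_down_read(&bdev->bd_block_size_semaphore);
        ret = block_read_full_page(page, blkdev_get_block);
        percpu_up_read(&bdev->bd_block_size_semaphore);
        return ret;
}

static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
{
        struct block_device *bdev = I_BDEV(page->mapping->host);
        int ret;

        percpu_down_read(&bdev->bd_block_size_semaphore);
        ret = block_write_full_page(page, blkdev_get_block, wbc);
        percpu_up_read(&bdev->bd_block_size_semaphore);
        return ret;
}

with blkdev_write_begin()/blkdev_write_end() and blkdev_direct_IO() wrapped the same way, and the aio_read/aio_write/splice/mmap wrappers dropping the lock entirely.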

2012-11-28 20:33:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 12:13 PM, Linus Torvalds
<[email protected]> wrote:
>
> I dunno. Maybe there is some fundamental reason why the above is
> broken, but it seems to be a much simpler approach. Sure, you need to
> guarantee that the people who get the write-lock cannot possibly cause
> IO while holding it, but since the only reason to get the write lock
> would be to change the block size, that should be pretty simple, no?

Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
do unspeakable things to your family and pets.

Linus


Attachments:
patch.diff (7.41 kB)

2012-11-28 20:48:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds
<[email protected]> wrote:
>
> Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
> do unspeakable things to your family and pets.

Btw, *if* this approach works, I suspect we could just switch the
bd_block_size_semaphore semaphore to be a regular rw-sem.

Why? Because now it's no longer ever gotten in the cached IO paths, we
only get it when we're doing much more expensive things (ie actual IO,
and buffer head allocations etc etc). As long as we just work with the
page cache, we never get to the whole lock at all.

Which means that the whole percpu-optimized thing is likely no longer
all that relevant.

But that's an independent thing, and it's only true *if* my patch
works. It looks fine on paper, but maybe there's something
fundamentally broken about it.

One big change my patch does is to move the sync_bdev/kill_bdev to
*after* changing the block size. It does that so that it can guarantee
that any old data (which didn't see the new block size) will be
sync'ed even if there is new IO coming in as we change the block size.

The old code locked the whole sync() region, which doesn't work with
my approach, since the sync will do IO and would thus cause potential
deadlocks while holding the rwsem for writing.

So with this patch, as the block size changes, you can actually have
some old pages with the old block size *and* some different new pages
with the new block size all at the same time. It should all be
perfectly fine, but it's worth pointing out.

(It probably won't trigger in practice, though, since doing IO while
somebody else is changing the blocksize is fundamentally an odd thing
to do, but whatever. I also suspect that we *should* perhaps use the
inode->i_sem thing to serialize concurrent block size changes, but
that's again an independent issue)

Linus
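
The reordering Linus describes amounts to doing the flush after the size flip, outside the write lock (a sketch of the idea only):

        /* Flip the size under the write lock; this path must not issue I/O. */
        percpu_down_write(&bdev->bd_block_size_semaphore);
        bdev->bd_block_size = size;
        bdev->bd_inode->i_blkbits = blksize_bits(size);
        percpu_up_write(&bdev->bd_block_size_semaphore);

        /* Flush and drop whatever was created with the old size; this does
         * issue I/O, but the I/O paths only ever take the rwsem for read,
         * so there is no deadlock. Old-size and new-size pages may briefly
         * coexist, as noted above. */
        sync_blockdev(bdev);
        kill_bdev(bdev);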

2012-11-28 21:30:52

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)



On Wed, 28 Nov 2012, Linus Torvalds wrote:

> On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > mmap() is in *no* way special. The exact same thing happens for
> > regular read/write. Yet somehow the mmap code is special-cased, while
> > the normal read-write code is not.
>
> I just double-checked, because it's been a long time since I actually
> looked at the code.
>
> But yeah, block device read/write uses the pure page cache functions.
> IOW, it has the *exact* same IO engine as mmap() would have.
>
> So here's my suggestion:
>
> - get rid of *all* the locking in aio_read/write and the splice paths
> - get rid of all the stupid mmap games
>
> - instead, add them to the functions that actually use
> "blkdev_get_block()" and "blkdev_get_blocks()" and nowhere else.
>
> That's a fairly limited number of functions:
> blkdev_{read,write}page(), blkdev_direct_IO() and
> blkdev_write_{begin,end}()
>
> Doesn't that sounds simpler? And more logical: it protects the actual
> places that use the block size of the device.
>
> I dunno. Maybe there is some fundamental reason why the above is
> broken, but it seems to be a much simpler approach. Sure, you need to
> guarantee that the people who get the write-lock cannot possibly cause
> IO while holding it, but since the only reason to get the write lock
> would be to change the block size, that should be pretty simple, no?
>
> Yeah, yeah, I'm probably missing something fundamental, but the above
> sounds like the simple approach to fixing things. Aiming for having
> the block size read-lock be taken by the things that pass in the
> block-size itself.
>
> It would be nice for things to be logical and straightforward.
>
> Linus


The problem with this approach is that it is very easy to miss points
where it is assumed that the block size doesn't change - and if you miss a
point, it results in a hidden bug that has little chance of being
found.

For example, __block_write_full_page and __block_write_begin do
if (!page_has_buffers(page)) { create_empty_buffers... }
and then they do
WARN_ON(bh->b_size != blocksize)
err = get_block(inode, block, bh, 1)

... so if the buffers were left over from some previous call to
create_empty_buffers with a different blocksize, that WARN_ON is triggered.
And it's not only a harmless warning - now bh->b_size is left set to the
old block size, but bh->b_blocknr is set to a number that was calculated
according to the new block size - and when you submit that buffer with
submit_bh, it is written to the wrong place!

Now, prove that there are no more bugs like this.


Locking the whole read/write/mmap operations is crude, but at least it can
be done without thorough review of all the memory management code.

Mikulas
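
To spell out why the stale bh->b_size is more than a cosmetic warning, a worked example with illustrative numbers: suppose the buffers on a page were created while the block size was 4096 (so bh->b_size = 4096) and the device has since been switched to 512-byte blocks (i_blkbits = 9). For page->index = 1, __block_write_begin computes

        block = page->index << (PAGE_CACHE_SHIFT - bbits) = 1 << (12 - 9) = 8

using the new shift, and get_block() stores that in bh->b_blocknr. But submit_bh() converts the block number to a sector using the stale buffer size:

        bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9) = 8 * 8 = 64

so the write lands at sector 64 instead of the intended sector 8.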

2012-11-28 22:02:44

by Mikulas Patocka

[permalink] [raw]
Subject: [PATCH v2] Do a proper locking for mmap and block size change



On Wed, 28 Nov 2012, Al Viro wrote:

> On Wed, Nov 28, 2012 at 11:15:12AM -0800, Linus Torvalds wrote:
> > No, this is crap.
> >
> > We don't introduce random hooks like this just because the block layer
> > has shit-for-brains and cannot be bothered to do things right.
> >
> > The fact is, the whole locking in the block layer open routine is
> > total and utter crap. It doesn't lock the right thing, even with your
> > change *anyway* (or with the change Jens had). Absolutely nothing in
> > "mmap_region()" cares at all about the block-size anywhere - it's
> > generic, after all - so locking around it is f*cking pointless. There
> > is no way in hell that the caller of ->mmap can *ever* care about the
> > block size, since it never even looks at it.
> >
> > Don't do random crap like this.
> >
> > Why does the code think that mmap matters so much anyway? As you say,
> > the mmap itself does *nothing*. It has no impact for the block size.
>
> ... and here I was, trying to massage a reply into form that would be
> at least borderline printable... Anyway, this is certainly the wrong
> way to go. We want to catch if the damn thing is mapped and mapping_mapped()
> leaves a race? Fine, so let's not use mapping_mapped() at all. Have a
> private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
> added to it. Incrementing/decrementing a per-block_device atomic counter,
> with increment side done under your rwsem, to make sure that 0->positive
> transition doesn't change in critical section of set_blocksize(). And don't
> use generic_file_mmap(), just do file_accessed() and set ->vm_ops to that
> local copy.

This sounds sensible. I'm sending this patch.

Mikulas

---

Do a proper locking for mmap and block size change

For block devices, we must prevent the device from being mapped while
block size is being changed.

This was implemented by catching the mmap method and taking
bd_block_size_semaphore in it.

The problem is that the generic_file_mmap method does almost nothing: it
only sets vma->vm_ops = &generic_file_vm_ops; all the hard work is done
by the caller, mmap_region. Consequently, locking the mmap method is
insufficient to prevent the race condition. The race condition can
happen for example this way:

process 1:
Starts mapping a block device, it goes to mmap_region, calls
file->f_op->mmap (blkdev_mmap - that acquires and releases
bd_block_size_semaphore), and reschedule happens just after blkdev_mmap
returns.

process 2:
Changes block device size, goes into set_blocksize, acquires
percpu_down_write, calls mapping_mapped to test if the device is mapped
(it still isn't). Then, it reschedules.

process 1:
Continues in mmap_region, successfully finishes the mapping.

process 2:
Continues in set_blocksize, changes the block size while the block
device is mapped. (which is incorrect and may result in a crash or
misbehavior).

To fix this possible race condition, we introduce a counter
bd_mmap_count that counts the number of vmas that maps the block device.
bd_mmap_count is incremented in blkdev_mmap and in blkdev_vm_open, it is
decremented in blkdev_vm_close. We don't allow changing block size while
bd_mmap_count is non-zero.

Signed-off-by: Mikulas Patocka <[email protected]>

---
fs/block_dev.c | 49 ++++++++++++++++++++++++++++++++-----------------
include/linux/fs.h | 3 +++
2 files changed, 35 insertions(+), 17 deletions(-)

Index: linux-3.7-rc7/fs/block_dev.c
===================================================================
--- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 21:13:12.000000000 +0100
+++ linux-3.7-rc7/fs/block_dev.c 2012-11-28 21:33:23.000000000 +0100
@@ -35,6 +35,7 @@ struct bdev_inode {
struct inode vfs_inode;
};

+static const struct vm_operations_struct def_blk_vm_ops;
static const struct address_space_operations def_blk_aops;

static inline struct bdev_inode *BDEV_I(struct inode *inode)
@@ -114,16 +115,6 @@ void invalidate_bdev(struct block_device
}
EXPORT_SYMBOL(invalidate_bdev);

-static int is_bdev_mapped(struct block_device *bdev)
-{
- int ret_val;
- struct address_space *mapping = bdev->bd_inode->i_mapping;
- mutex_lock(&mapping->i_mmap_mutex);
- ret_val = mapping_mapped(mapping);
- mutex_unlock(&mapping->i_mmap_mutex);
- return ret_val;
-}
-
int set_blocksize(struct block_device *bdev, int size)
{
/* Size must be a power of two, and between 512 and PAGE_SIZE */
@@ -136,16 +127,16 @@ int set_blocksize(struct block_device *b

/*
* If the block size doesn't change, don't take the write lock.
- * We check for is_bdev_mapped anyway, for consistent behavior.
+ * We check for bd_mmap_count anyway, for consistent behavior.
*/
if (size == bdev->bd_block_size)
- return is_bdev_mapped(bdev) ? -EBUSY : 0;
+ return atomic_read(&bdev->bd_mmap_count) ? -EBUSY : 0;

/* Prevent starting I/O or mapping the device */
percpu_down_write(&bdev->bd_block_size_semaphore);

/* Check that the block device is not memory mapped */
- if (is_bdev_mapped(bdev)) {
+ if (atomic_read(&bdev->bd_mmap_count)) {
percpu_up_write(&bdev->bd_block_size_semaphore);
return -EBUSY;
}
@@ -491,6 +482,7 @@ static void bdev_i_callback(struct rcu_h

static void bdev_destroy_inode(struct inode *inode)
{
+ BUG_ON(atomic_read(&I_BDEV(inode)->bd_mmap_count));
call_rcu(&inode->i_rcu, bdev_i_callback);
}

@@ -507,6 +499,7 @@ static void init_once(void *foo)
INIT_LIST_HEAD(&bdev->bd_holder_disks);
#endif
inode_init_once(&ei->vfs_inode);
+ atomic_set(&bdev->bd_mmap_count, 0);
/* Initialize mutex for freeze. */
mutex_init(&bdev->bd_fsfreeze_mutex);
}
@@ -1660,16 +1653,28 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);

static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
- int ret;
struct block_device *bdev = I_BDEV(file->f_mapping->host);

+ file_accessed(file);
+ vma->vm_ops = &def_blk_vm_ops;
+
percpu_down_read(&bdev->bd_block_size_semaphore);
+ atomic_inc(&bdev->bd_mmap_count);
+ percpu_up_read(&bdev->bd_block_size_semaphore);

- ret = generic_file_mmap(file, vma);
+ return 0;
+}

- percpu_up_read(&bdev->bd_block_size_semaphore);
+static void blkdev_vm_open(struct vm_area_struct *area)
+{
+ struct block_device *bdev = I_BDEV(area->vm_file->f_mapping->host);
+ atomic_inc(&bdev->bd_mmap_count);
+}

- return ret;
+static void blkdev_vm_close(struct vm_area_struct *area)
+{
+ struct block_device *bdev = I_BDEV(area->vm_file->f_mapping->host);
+ atomic_dec(&bdev->bd_mmap_count);
}

static ssize_t blkdev_splice_read(struct file *file, loff_t *ppos,
@@ -1719,6 +1724,16 @@ static int blkdev_releasepage(struct pag
return try_to_free_buffers(page);
}

+static const struct vm_operations_struct def_blk_vm_ops = {
+ .open = blkdev_vm_open,
+ .close = blkdev_vm_close,
+#ifdef CONFIG_MMU
+ .fault = filemap_fault,
+ .page_mkwrite = filemap_page_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+#endif
+};
+
static const struct address_space_operations def_blk_aops = {
.readpage = blkdev_readpage,
.writepage = blkdev_writepage,
Index: linux-3.7-rc7/include/linux/fs.h
===================================================================
--- linux-3.7-rc7.orig/include/linux/fs.h 2012-11-28 21:15:42.000000000 +0100
+++ linux-3.7-rc7/include/linux/fs.h 2012-11-28 21:17:59.000000000 +0100
@@ -458,6 +458,9 @@ struct block_device {
*/
unsigned long bd_private;

+ /* The number of vmas */
+ atomic_t bd_mmap_count;
+
/* The counter of freeze processes */
int bd_fsfreeze_count;
/* Mutex for freeze */

2012-11-28 22:04:24

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH 2/2] block_dev: don't take the write lock if block size doesn't change



On Wed, 28 Nov 2012, Jeff Chua wrote:

> On Wed, Nov 28, 2012 at 12:01 PM, Mikulas Patocka <[email protected]> wrote:
> > block_dev: don't take the write lock if block size doesn't change
> >
> > Taking the write lock has a big performance impact on the whole system
> > (because of synchronize_sched_expedited). This patch avoids taking the
> > write lock if the block size doesn't change (i.e. when mounting
> > filesystem with block size equal to the default block size).
> >
> > The logic to test if the block device is mapped was moved to a separate
> > function is_bdev_mapped to avoid code duplication.
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
> >
> > ---
> > fs/block_dev.c | 25 ++++++++++++++++++-------
> > 1 file changed, 18 insertions(+), 7 deletions(-)
> >
> > Index: linux-3.7-rc7/fs/block_dev.c
> > ===================================================================
> > --- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 04:09:01.000000000 +0100
> > +++ linux-3.7-rc7/fs/block_dev.c 2012-11-28 04:13:53.000000000 +0100
> > @@ -114,10 +114,18 @@ void invalidate_bdev(struct block_device
> > }
> > EXPORT_SYMBOL(invalidate_bdev);
> >
> > -int set_blocksize(struct block_device *bdev, int size)
> > +static int is_bdev_mapped(struct block_device *bdev)
> > {
> > - struct address_space *mapping;
> > + int ret_val;
> > + struct address_space *mapping = bdev->bd_inode->i_mapping;
> > + mutex_lock(&mapping->i_mmap_mutex);
> > + ret_val = mapping_mapped(mapping);
> > + mutex_unlock(&mapping->i_mmap_mutex);
> > + return ret_val;
> > +}
> >
> > +int set_blocksize(struct block_device *bdev, int size)
> > +{
> > /* Size must be a power of two, and between 512 and PAGE_SIZE */
> > if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
> > return -EINVAL;
> > @@ -126,18 +134,21 @@ int set_blocksize(struct block_device *b
> > if (size < bdev_logical_block_size(bdev))
> > return -EINVAL;
> >
> > + /*
> > + * If the block size doesn't change, don't take the write lock.
> > + * We check for is_bdev_mapped anyway, for consistent behavior.
> > + */
> > + if (size == bdev->bd_block_size)
> > + return is_bdev_mapped(bdev) ? -EBUSY : 0;
> > +
> > /* Prevent starting I/O or mapping the device */
> > percpu_down_write(&bdev->bd_block_size_semaphore);
> >
> > /* Check that the block device is not memory mapped */
> > - mapping = bdev->bd_inode->i_mapping;
> > - mutex_lock(&mapping->i_mmap_mutex);
> > - if (mapping_mapped(mapping)) {
> > - mutex_unlock(&mapping->i_mmap_mutex);
> > + if (is_bdev_mapped(bdev)) {
> > percpu_up_write(&bdev->bd_block_size_semaphore);
> > return -EBUSY;
> > }
> > - mutex_unlock(&mapping->i_mmap_mutex);
> >
> > /* Don't change the size if it is same as current */
> > if (bdev->bd_block_size != size) {
>
>
> This patch didn't really make any difference for ext2/3/4 but for
> reiserfs it does.
>
> With the synchronize_sched_expedited() patch applied, it didn't make
> any difference.
>
> Thanks,
> Jeff

It could make a difference on computers with many cores -
synchronize_sched_expedited() is expensive there because it synchronizes
all active cores, so if we can avoid it, just do it.

Mikulas

2012-11-28 22:11:40

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)



On Wed, 28 Nov 2012, Linus Torvalds wrote:

> On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
> > do unspeakable things to your family and pets.
>
> Btw, *if* this approach works, I suspect we could just switch the
> bd_block_size_semaphore semaphore to be a regular rw-sem.
>
> Why? Because now it's no longer ever gotten in the cached IO paths, we
> only get it when we're doing much more expensive things (ie actual IO,
> and buffer head allocations etc etc). As long as we just work with the
> page cache, we never get to the whole lock at all.
>
> Which means that the whole percpu-optimized thing is likely no longer
> all that relevant.

Using a normal semaphore degrades direct-IO performance on raw devices; I
measured that (45.1s with normal rw-semaphore vs. 42.8s with
percpu-rw-semaphore). It could be measured on ramdisk or high-performance
SSDs.

Mikulas

2012-11-28 22:53:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 1:29 PM, Mikulas Patocka <[email protected]> wrote:
>
> The problem with this approach is that it is very easy to miss points
> where it is assumed that the block size doesn't change - and if you miss a
> point, it results in a hidden bug that has a little possibility of being
> found.

Umm. Mikulas, *your* approach has resulted in bugs. So let's not throw
stones in glass houses, shall we?

The whole reason for this long thread (and several threads before it)
is that your model isn't working and is causing problems. I already
pointed out how bogus your arguments about mmap() locking were, and
then you have the gall to talk about potential bugs, when I have
pointed you to *actual* bugs, and actual mistakes.

> For example, __block_write_full_page and __block_write_begin do
> if (!page_has_buffers(page)) { create_empty_buffers... }
> and then they do
> WARN_ON(bh->b_size != blocksize)
> err = get_block(inode, block, bh, 1)

Right. And none of this is new.

> ... so if the buffers were left over from some previous call to
> create_empty_buffers with a different blocksize, that WARN_ON is triggered.

None of this can happen.

> Locking the whole read/write/mmap operations is crude, but at least it can
> be done without thorough review of all the memory management code.

Umm. Which you clearly didn't do, and raised totally crap arguments for.

In contrast, I have a very simple argument for the correctness of my
patch: every single user of the "get_block[s]()" interface now takes
the lock for as long as get_block[s]() is passed off to somebody else.
And since get_block[s]() is the only way to create those empty
buffers, I think I pretty much proved exactly what you ask for.

And THAT is the whole point and advantage of making locking sane. Sane
locking you can actually *think* about!

In contrast, locking around "mmap()" is absolutely *guaranteed* to be
insane, because mmap() doesn't actually do any of the IO that the lock
is supposed to protect against!

So Mikulas, quite frankly, your arguments argue against you. When you
say "Locking the whole read/write/mmap operations is crude, but at
least it can
be done without thorough", you are doubly correct: it *is* crude, and
it clearly *was* done without thought, since it's a f*cking idiotic
AND INCORRECT thing to do.

Seriously. Locking around "mmap()" is insane. It leads to insane
semantics (the whole EBUSY thing is purely because of that problem)
and it leads to bad code (your "let's add a new "mmap_region" hook is
just disgusting, and while Al's idea of doing it in the existing
"->open" method is at least not nasty, it's definitely extra code and
complexity).

There are serious *CORRECTNESS* advantages to simplicity and
directness. And locking at the right point is definitely very much
part of that.

Anyway, as far as block size goes, we have exactly two cases:

- random IO that does not care about the block size, and will just do
whatever the current block size is (ie normal anonymous accesses to
the block device). This is the case that needs the locking - but it
only needs it around the individual page operations, ie exactly where
I put it. In fact, they can happily deal with different block sizes
for different pages, they don't really care.

- mounted filesystems etc that require a particular block size and
set it at mount time, and they have exclusivity rules

The second case is the case that actually calls set_blocksize(), and
if "kill_bdev()" doesn't get rid of the old blocksizes, then they have
always been in trouble, and would always _continue_ to be in trouble,
regardless of locking.

Linus

2012-11-28 23:14:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 2:52 PM, Linus Torvalds
<[email protected]> wrote:
>
>> For example, __block_write_full_page and __block_write_begin do
>> if (!page_has_buffers(page)) { create_empty_buffers... }
>> and then they do
>> WARN_ON(bh->b_size != blocksize)
>> err = get_block(inode, block, bh, 1)
>
> Right. And none of this is new.

.. which, btw, is not to say that *other* things aren't new. They are.

The change to actually change the block device buffer size before then
calling "sync_bdev()" is definitely a real change, and as mentioned, I
have not tested the patch in any way. If any block device driver were
to actually compare the IO size they get against the bdev->block_size
thing, they'd see very different behavior (ie they'd see the new block
size as they are asked to write out the old blocks with the old block
size).

So it does change semantics, no question about that. I don't think any
block device does it, though.

A bigger issue is for things that emulate what blkdev.c does, and
doesn't do the locking. I see code in md/bitmap.c that seems a bit
suspicious, for example. That said, it's not *new* breakage, and the
"lock at mmap/read/write() time" approach doesn't fix it either (since
the mapping will be different for the underlying MD device). So I do
think that we should take a look at all the users of
"alloc_page_buffers()" and "create_empty_buffers()" to see what *they*
do to protect the block-size, but I think that's an independent issue
from the raw device access case in fs/block_dev.c..

I guess I have to actually test my patch. I don't have very
interesting test-cases, though.

Linus

2012-11-29 00:39:30

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)



On Wed, 28 Nov 2012, Linus Torvalds wrote:

> > For example, __block_write_full_page and __block_write_begin do
> > if (!page_has_buffers(page)) { create_empty_buffers... }
> > and then they do
> > WARN_ON(bh->b_size != blocksize)
> > err = get_block(inode, block, bh, 1)
>
> Right. And none of this is new.
>
> > ... so if the buffers were left over from some previous call to
> > create_empty_buffers with a different blocksize, that WARN_ON is triggered.
>
> None of this can happen.

It can happen. Take your patch (the one that moves bd_block_size_semaphore
into blkdev_readpage, blkdev_writepage and blkdev_write_begin).

Insert msleep(1000) into set_blocksize, just before sync_blockdev.

Run this program:
#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

static char array[4096];

int main(void)
{
        int h;
        system("dmsetup remove test 2>/dev/null");
        if (system("dmsetup create test --table '0 1 zero'")) exit(1);
        h = open("/dev/mapper/test", O_RDWR);
        if (h < 0) perror("open"), exit(1);
        if (pread(h, array, 512, 0) != 512) perror("pread"), exit(1);
        if (system("dmsetup load test --table '0 8 zero'")) exit(1);
        if (system("dmsetup suspend test")) exit(1);
        if (system("dmsetup resume test")) exit(1);
        if (system("blockdev --setbsz 2048 /dev/mapper/test &")) exit(1);
        usleep(500000);
        if (pwrite(h, array, 4096, 0) != 4096) perror("pwrite"), exit(1);
        return 0;
}

--- it triggers WARNING: at fs/buffer.c:1830 in __block_write_begin
[ 1243.300000] Backtrace:
[ 1243.330000] [<0000000040230ba8>] block_write_begin+0x70/0xd0
[ 1243.400000] [<00000000402350cc>] blkdev_write_begin+0xb4/0x208
[ 1243.480000] [<00000000401a9f10>] generic_file_buffered_write+0x248/0x348
[ 1243.570000] [<00000000401ac8c4>] __generic_file_aio_write+0x1fc/0x388
[ 1243.660000] [<0000000040235e74>] blkdev_aio_write+0x64/0xf0
[ 1243.740000] [<00000000401f2108>] do_sync_write+0xd0/0x128
[ 1243.810000] [<00000000401f2930>] vfs_write+0xa0/0x180
[ 1243.880000] [<00000000401f2ecc>] sys_pwrite64+0xb4/0xd8
[ 1243.950000] [<0000000040122104>] parisc_pwrite64+0x1c/0x28
[ 1244.020000] [<0000000040106060>] syscall_exit+0x0/0x14


I'm not saying that your approach is wrong, you just have to carefuly
review all memory management code for assumptions that block size
doesn't change.

Mikulas

2012-11-29 02:04:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

[ Sorry, I was offline for a while driving kids around ]

On Wed, Nov 28, 2012 at 4:38 PM, Mikulas Patocka <[email protected]> wrote:
>
> It can happen. Take your patch (the one that moves bd_block_size_semaphore
> into blkdev_readpage, blkdev_writepage and blkdev_write_begin).

Interesting. The code *has* the block size (it's in "bh->b_size"), but
it actually then uses the inode blocksize instead, and verifies the
two against each other. It could just have used the block size
directly (and then used the inode i_blkbits only when no buffers
existed), avoiding that dependency entirely..

It actually does the same thing (with the same verification) in
__block_write_full_page() and (_without_ the verification) in
__block_commit_write().

Ho humm. All of those places actually do hold the rwsem for reading,
it's just that I don't want to hold it for writing over the sync..

Need to think on this,

Linus

2012-11-29 02:13:54

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)



On Wed, 28 Nov 2012, Linus Torvalds wrote:

> A bigger issue is for things that emulate what blkdev.c does, and
> doesn't do the locking. I see code in md/bitmap.c that seems a bit
> suspicious, for example. That said, it's not *new* breakage, and the
> "lock at mmap/read/write() time" approach doesn't fix it either (since
> the mapping will be different for the underlying MD device). So I do
> think that we should take a look at all the users of
> "alloc_page_buffers()" and "create_empty_buffers()" to see what *they*
> do to protect the block-size, but I think that's an independent issue
> from the raw device access case in fs/block_dev.c..
>
> I guess I have to actually test my patch. I don't have very
> interesting test-cases, though.
>
> Linus

Yes, the md code looks suspicious. It disables write access in
deny_bitmap_write_access (that function looks buggy on its own - what
happens if i_writecount == 0 or if it is already negative on entry?).

So we could disallow changing the block size if i_writecount < 0; that should
do the trick.

Mikulas

2012-11-29 02:59:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 6:04 PM, Linus Torvalds
<[email protected]> wrote:
>
> Interesting. The code *has* the block size (it's in "bh->b_size"), but
> it actually then uses the inode blocksize instead, and verifies the
> two against each other. It could just have used the block size
> directly (and then used the inode i_blkbits only when no buffers
> existed), avoiding that dependency entirely..

Looking more at this code, that really would be the nicest solution.

There's two cases for the whole get_block() thing:

- filesystems. The block size will not change randomly, and
"get_block()" seriously depends on the block size.

- the raw device. The block size *will* change, but to simplify the
problem, "get_block()" is a 1:1 mapping, so it doesn't even care about
the block size because it will always return "bh->b_blocknr = nr".

So we *could* just say that all the fs/buffer.c code should use
"inode->i_blkbits" for creating buffers (because that's the size new
buffers should always use), but use "bh->b_size" for any *existing*
buffer use.

And looking at it, it's even simple. Except for one *very* annoying
thing: several users really don't want the size of the buffer, they
really do want the *shift* of the buffer size.

In fact, that single issue seems to be the reason why
"inode->i_blkbits" is really used in fs/buffer.c.

Otherwise it would be fairly trivial to just make the pattern be just a simple

    if (!page_has_buffers(page))
            create_empty_buffers(page, 1 << inode->i_blkbits, 0);
    head = page_buffers(page);
    blocksize = head->b_size;

and just use the blocksize that way, without any other games. All
done, no silly WARN_ON() to verify against some global block-size, and
the fs/buffer.c code would be perfectly simple, and would have no
problem at all with multiple different blocksizes in different pages
(the page lock serializes the buffers and thus the blocksize at the
per-page level).

But the fact that the code wants to do things like

block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);

seriously seems to be the main thing that keeps us using
'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
enough to hurt (not everywhere, but on some machines).

Very annoying.

Linus

2012-11-29 06:16:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
<[email protected]> wrote:
>
> But the fact that the code wants to do things like
>
> block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
>
> seriously seems to be the main thing that keeps us using
> 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> enough to hurt (not everywhere, but on some machines).
>
> Very annoying.

Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
whole ilog2 thing, but at the same time, in other cases it actually
seems to improve code generation (ie gets rid of the whole unnecessary
two dereferences through page->mapping->host just to get the block
size, when we have it in the buffer-head that we have to touch
*anyway*).

Comments? Again, untested.
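
The attached patch isn't inlined here, but a minimal sketch of the direction
being described (block_size_bits() and page_first_block() are illustrative
names, not necessarily what the patch itself uses) looks something like:

#include <linux/buffer_head.h>
#include <linux/log2.h>
#include <linux/pagemap.h>

/* derive the shift from the per-page buffer size -- the ilog2() in question */
static inline unsigned int block_size_bits(unsigned int blocksize)
{
        return ilog2(blocksize);
}

/*
 * Compute the first block covered by @page from the page's own buffers,
 * falling back to inode->i_blkbits only when no buffers exist yet.  The
 * page lock serializes the buffers, and therefore the block size, per page.
 */
static sector_t page_first_block(struct inode *inode, struct page *page)
{
        unsigned int bbits;

        if (!page_has_buffers(page))
                create_empty_buffers(page, 1 << inode->i_blkbits, 0);

        bbits = block_size_bits(page_buffers(page)->b_size);
        return (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
}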

And I notice that Al Viro hasn't been cc'd, which is sad, since he's
been involved in much of fs/block_dev.c.

Al - this is an independent patch to fs/buffer.c to make
fs/block_dev.c able to change the block size of a block device while
there is IO in progress that may still use the old block size. The
discussion has been on fsdevel and lkml, but you may have missed it...

Linus


Attachments:
patch.diff (4.95 kB)

2012-11-29 06:25:29

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 10:16:21PM -0800, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > But the fact that the code wants to do things like
> >
> > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> >
> > seriously seems to be the main thing that keeps us using
> > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > enough to hurt (not everywhere, but on some machines).
> >
> > Very annoying.
>
> Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> whole ilog2 thing, but at the same time, in other cases it actually
> seems to improve code generation (ie gets rid of the whole unnecessary
> two dereferences through page->mapping->host just to get the block
> size, when we have it in the buffer-head that we have to touch
> *anyway*).
>
> Comments? Again, untested.
>
> And I notice that Al Viro hasn't been cc'd, which is sad, since he's
> been involved in much of fs/block_dev.c.
>
> Al - this is an independent patch to fs/buffer.c to make
> fs/block_dev.c able to change the block size of a block device while
> there is IO in progress that may still use the old block size. The
> discussion has been on fsdevel and lkml, but you may have missed it...

Umm... set_blocksize() is calling kill_bdev(), which does
truncate_inode_pages(mapping, 0). What's going to happen to data in
the dirty pages? IO in progress is not the only thing to worry about...

2012-11-29 06:30:49

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Thu, Nov 29, 2012 at 06:25:19AM +0000, Al Viro wrote:

> Umm... set_blocksize() is calling kill_bdev(), which does
> truncate_inode_pages(mapping, 0). What's going to happen to data in
> the dirty pages? IO in progress is not the only thing to worry about...

Note that sync_blockdev() a few lines prior to that is good only if we
have no other processes doing write(2) (or dirtying the mmapped pages,
for that matter). The window isn't too wide, but...

2012-11-29 06:34:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 10:25 PM, Al Viro <[email protected]> wrote:
>
> Umm... set_blocksize() is calling kill_bdev(), which does
> truncate_inode_pages(mapping, 0). What's going to happen to data in
> the dirty pages? IO in progress is not the only thing to worry about...

Hmm. Yes. I think it works by virtue of "if you change the blocksize
while there is active IO, you're insane and you deserve whatever you
get".

It shouldn't even be fundamentally hard to make it work, although I
suspect it would be more code than it would be worth. The sane model
would be to not use truncate_inode_pages(), but instead just walk the
pages and get rid of the buffer heads with the wrong size. Preferably
*combining* that with the sync_blockdev().

We have no real reason to even invalidate the page cache, it's just
the buffers we want to get rid of.
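
A rough sketch of that narrower invalidation (purely illustrative, not
something mainline does; it assumes dirty buffers were already written back,
e.g. by the sync_blockdev() call):

#include <linux/buffer_head.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>

/* walk the bdev page cache and detach only buffers of the wrong size */
static void drop_wrong_size_buffers(struct block_device *bdev, unsigned int size)
{
        struct address_space *mapping = bdev->bd_inode->i_mapping;
        struct pagevec pvec;
        pgoff_t index = 0;
        unsigned int i;

        pagevec_init(&pvec, 0);
        while (pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE)) {
                for (i = 0; i < pagevec_count(&pvec); i++) {
                        struct page *page = pvec.pages[i];

                        index = page->index + 1;
                        lock_page(page);
                        if (page_has_buffers(page) &&
                            page_buffers(page)->b_size != size)
                                try_to_free_buffers(page);
                        unlock_page(page);
                }
                pagevec_release(&pvec);
        }
}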

But I suspect it's true that none of that is really *worth* it,
considering that nobody likely wants to do any concurrent IO. We don't
want to crash, or corrupt the data structures, but I suspect "you get
what you deserve" might actually be the right model ;)

So the current "sync_blockdev()+kill_bdev()" takes care of the *sane*
case (we flush any data that happened *before* the block size change),
and any concurrent writes with block-size changes are "good luck with
that".

Linus

2012-11-29 06:37:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 10:30 PM, Al Viro <[email protected]> wrote:
>
> Note that sync_blockdev() a few lines prior to that is good only if we
> have no other processes doing write(2) (or dirtying the mmapped pages,
> for that matter). The window isn't too wide, but...

So with Mikulas' patches, the write actually would block (at write
level) due to the locking. The mmap'ed pages may be around and
flushed, but the logic to not allow currently *active* mmaps (with the
rather nasty random -EBUSY return value) should mean that there is no
race.

Or rather, there's a race, but it results in that EBUSY thing.

With my simplified locking, the sync_blockdev() is right before the
kill_bdev() (not a few lines prior), and in a perfect world they'd
actually be one single operation ("write back and invalidate pages
with the wrong block-size"). But they aren't.

Linus

2012-11-29 06:45:55

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 10:37:27PM -0800, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 10:30 PM, Al Viro <[email protected]> wrote:
> >
> > Note that sync_blockdev() a few lines prior to that is good only if we
> > have no other processes doing write(2) (or dirtying the mmapped pages,
> > for that matter). The window isn't too wide, but...
>
> So with Mikulas' patches, the write actually would block (at write
> level) due to the locking. The mmap'ed patches may be around and
> flushed, but the logic to not allow currently *active* mmaps (with the
> rather nasty random -EBUSY return value) should mean that there is no
> race.
>
> Or rather, there's a race, but it results in that EBUSY thing.

Same as with fs mounted on it, or the sucker having been claimed for
RAID array, etc. Frankly, I'm more than slightly tempted to make
bdev mmap() just claim the sodding device exclusive for as long as
it's mmapped...

In principle, I agree, but... I still have nightmares from mmap/truncate
races way back. You are stepping into what used to be a really nasty
minefield. I'll look into that, but it's *definitely* not -rc8 fodder.

2012-11-29 10:57:48

by Jeff Chua

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Thu, Nov 29, 2012 at 2:45 PM, Al Viro <[email protected]> wrote:
> On Wed, Nov 28, 2012 at 10:37:27PM -0800, Linus Torvalds wrote:
>> On Wed, Nov 28, 2012 at 10:30 PM, Al Viro <[email protected]> wrote:
>> >
>> > Note that sync_blockdev() a few lines prior to that is good only if we
>> > have no other processes doing write(2) (or dirtying the mmapped pages,
>> > for that matter). The window isn't too wide, but...
>>
>> So with Mikulas' patches, the write actually would block (at write
>> level) due to the locking. The mmap'ed patches may be around and
>> flushed, but the logic to not allow currently *active* mmaps (with the
>> rather nasty random -EBUSY return value) should mean that there is no
>> race.
>>
>> Or rather, there's a race, but it results in that EBUSY thing.
>
> Same as with fs mounted on it, or the sucker having been claimed for
> RAID array, etc. Frankly, I'm more than slightly tempted to make
> bdev mmap() just claim the sodding device exclusive for as long as
> it's mmapped...
>
> In principle, I agree, but... I still have nightmares from mmap/truncate
> races way back. You are stepping into what used to be a really nasty
> minefield. I'll look into that, but it's *definitely* not -rc8 fodder.

Just let me know which relevant patch(es) you want me to test or break.

Thanks,
Jeff

2012-11-29 14:12:56

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > But the fact that the code wants to do things like
> >
> > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> >
> > seriously seems to be the main thing that keeps us using
> > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > enough to hurt (not everywhere, but on some machines).
> >
> > Very annoying.
>
> Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> whole ilog2 thing, but at the same time, in other cases it actually
> seems to improve code generation (ie gets rid of the whole unnecessary
> two dereferences through page->mapping->host just to get the block
> size, when we have it in the buffer-head that we have to touch
> *anyway*).
>
> Comments? Again, untested.

Jumping in based on Linus original patch, which is doing something like
this:

set_blocksize() {
        block new calls to writepage, prepare/commit_write
        set the block size
        unblock

        < --- can race in here and find bad buffers --->

        sync_blockdev()
        kill_bdev()

        < --- now we're safe --- >
}

We could add a second semaphore and a page_mkwrite call:

set_blocksize() {

        block new calls to prepare/commit_write and page_mkwrite(), but
        leave writepage unblocked.

        sync_blockdev()

        <--- now we're safe. There are no dirty pages and no ways to
        make new ones --->

        block new calls to readpage (writepage too for good luck?)

        kill_bdev()
        set the block size

        unblock readpage/writepage
        unblock prepare/commit_write and page_mkwrite

}

Another way to look at things:

As Linus said in a different email, we don't need to drop the pages, just
the buffers. Once we've blocked prepare/commit_write,
there is no way to make a partially up to date page with dirty data.
We may make fully uptodate dirty pages, but for those we can
just create dirty buffers for the whole page.

As long as we had prepare/commit write blocked while we ran
sync_blockdev, we can blindly detach any buffers that are the wrong size
and just make new ones.

This may or may not apply to loop.c, I'd have to read that more
carefully.

-chris

2012-11-29 17:19:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Wed, Nov 28, 2012 at 2:01 PM, Mikulas Patocka <[email protected]> wrote:
>
> This sounds sensible. I'm sending this patch.

This looks much better.

I think I'll apply this for 3.7 (since it's too late to do anything
fancier), and then for 3.8 I will rip out all the locking entirely,
because looking at the fs/buffer.c patch I wrote up, it's all totally
unnecessary.

Adding an ACCESS_ONCE() to the read of the i_blkbits value (when
creating new buffers) simply makes the whole locking thing pointless.
Just make the page lock protect the block size, and make it per-page,
and we're done.

No RCU grace period crap, no expedited mess, no nothing.
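
A sketch of that single-read idea (the create_page_buffers() name here is an
assumption, not a reference to existing code):

#include <linux/buffer_head.h>

/*
 * Read i_blkbits exactly once when a page first gets buffers; after that
 * the page lock pins the block size for this page, so the I/O paths need
 * no extra locking at all.
 */
static struct buffer_head *create_page_buffers(struct page *page,
                                               struct inode *inode,
                                               unsigned int b_state)
{
        BUG_ON(!PageLocked(page));

        if (!page_has_buffers(page))
                create_empty_buffers(page,
                                     1 << ACCESS_ONCE(inode->i_blkbits),
                                     b_state);
        return page_buffers(page);
}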

Linus

2012-11-29 17:26:44

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Thu, Nov 29, 2012 at 07:12:49AM -0700, Chris Mason wrote:
> On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
> > On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > But the fact that the code wants to do things like
> > >
> > > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> > >
> > > seriously seems to be the main thing that keeps us using
> > > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > > enough to hurt (not everywhere, but on some machines).
> > >
> > > Very annoying.
> >
> > Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> > whole ilog2 thing, but at the same time, in other cases it actually
> > seems to improve code generation (ie gets rid of the whole unnecessary
> > two dereferences through page->mapping->host just to get the block
> > size, when we have it in the buffer-head that we have to touch
> > *anyway*).
> >
> > Comments? Again, untested.
>
> Jumping in based on Linus original patch, which is doing something like
> this:
>
> set_blocksize() {
> block new calls to writepage, prepare/commit_write
> set the block size
> unblock
>
> < --- can race in here and find bad buffers --->
>
> sync_blockdev()
> kill_bdev()
>
> < --- now we're safe --- >
> }
>
> We could add a second semaphore and a page_mkwrite call:
>
> set_blocksize() {
>
> block new calls to prepare/commit_write and page_mkwrite(), but
> leave writepage unblocked.
>
> sync_blockev()
>
> <--- now we're safe. There are no dirty pages and no ways to
> make new ones --->
>
> block new calls to readpage (writepage too for good luck?)
>
> kill_bdev()

Whoops, kill_bdev needs the page lock, which sends us into ABBA when
readpage does the down_read. So, slight modification: unblock
readpage/writepage before the kill_bdev. The risk is that readpage can
then find buffers with the wrong size, so it would need to be changed to
discard them.

The patch below is based on Linus' original and doesn't deal with the
readpage race. But it does get the rest of the idea across. It boots
and survives banging on blockdev --setbsz with mkfs, but I definitely
wouldn't trust it.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..1377171 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -116,8 +116,6 @@ EXPORT_SYMBOL(invalidate_bdev);

int set_blocksize(struct block_device *bdev, int size)
{
- struct address_space *mapping;
-
/* Size must be a power of two, and between 512 and PAGE_SIZE */
if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
return -EINVAL;
@@ -126,28 +124,40 @@ int set_blocksize(struct block_device *bdev, int size)
if (size < bdev_logical_block_size(bdev))
return -EINVAL;

- /* Prevent starting I/O or mapping the device */
- percpu_down_write(&bdev->bd_block_size_semaphore);
-
- /* Check that the block device is not memory mapped */
- mapping = bdev->bd_inode->i_mapping;
- mutex_lock(&mapping->i_mmap_mutex);
- if (mapping_mapped(mapping)) {
- mutex_unlock(&mapping->i_mmap_mutex);
- percpu_up_write(&bdev->bd_block_size_semaphore);
- return -EBUSY;
- }
- mutex_unlock(&mapping->i_mmap_mutex);
-
/* Don't change the size if it is same as current */
if (bdev->bd_block_size != size) {
+ /* block all modifications via writing and page_mkwrite */
+ percpu_down_write(&bdev->bd_block_size_semaphore);
+
+ /* write everything that was dirty */
sync_blockdev(bdev);
+
+ /* block readpage and writepage */
+ percpu_down_write(&bdev->bd_page_semaphore);
+
bdev->bd_block_size = size;
bdev->bd_inode->i_blkbits = blksize_bits(size);
+
+ /* we can't call kill_bdev with the page_semaphore down
+ * because we'll deadlock against readpage.
+ * The block_size_semaphore should prevent any new
+ * pages from being dirty, but readpage can jump
+ * in once we up the bd_page_sem and find a
+ * page with buffers from the old size.
+ *
+ * The kill_bdev call below is going to get rid
+ * of those buffers, but we do have a race here.
+ * readpage needs to deal with it and verify
+ * any buffers on the page are the right size
+ */
+ percpu_up_write(&bdev->bd_page_semaphore);
+
+ /* drop all the pages and all the buffers */
kill_bdev(bdev);
- }

- percpu_up_write(&bdev->bd_block_size_semaphore);
+ /* open the gates and let everyone back in */
+ percpu_up_write(&bdev->bd_block_size_semaphore);
+ }

return 0;
}
@@ -233,9 +243,14 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
+ struct block_device *bdev = I_BDEV(inode);
+ ssize_t ret;

- return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
+ percpu_down_read(&bdev->bd_block_size_semaphore);
+ ret = __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
nr_segs, blkdev_get_blocks, NULL, NULL, 0);
+ percpu_up_read(&bdev->bd_block_size_semaphore);
+ return ret;
}

int __sync_blockdev(struct block_device *bdev, int wait)
@@ -358,20 +373,48 @@ EXPORT_SYMBOL(thaw_bdev);

static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
{
- return block_write_full_page(page, blkdev_get_block, wbc);
+ int ret;
+ struct inode *inode = page_mapping(page)->host;
+ struct block_device *bdev = I_BDEV(inode);
+
+ percpu_down_read(&bdev->bd_page_semaphore);
+ ret = block_write_full_page(page, blkdev_get_block, wbc);
+ percpu_up_read(&bdev->bd_page_semaphore);
+
+ return ret;
}

static int blkdev_readpage(struct file * file, struct page * page)
{
- return block_read_full_page(page, blkdev_get_block);
+ int ret;
+ struct inode *inode = page_mapping(page)->host;
+ struct block_device *bdev = I_BDEV(inode);
+
+ percpu_down_read(&bdev->bd_page_semaphore);
+ ret = block_read_full_page(page, blkdev_get_block);
+ percpu_up_read(&bdev->bd_page_semaphore);
+
+ return ret;
}

static int blkdev_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
- return block_write_begin(mapping, pos, len, flags, pagep,
+ int ret;
+ struct inode *inode = mapping->host;
+ struct block_device *bdev = I_BDEV(inode);
+
+ percpu_down_read(&bdev->bd_block_size_semaphore);
+ ret = block_write_begin(mapping, pos, len, flags, pagep,
blkdev_get_block);
+ /*
+ * If we had an error, we need to release the size
+ * semaphore, because write_end() will never be called.
+ */
+ if (ret)
+ percpu_up_read(&bdev->bd_block_size_semaphore);
+ return ret;
}

static int blkdev_write_end(struct file *file, struct address_space *mapping,
@@ -379,8 +422,11 @@ static int blkdev_write_end(struct file *file, struct address_space *mapping,
struct page *page, void *fsdata)
{
int ret;
- ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+ struct inode *inode = mapping->host;
+ struct block_device *bdev = I_BDEV(inode);

+ ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+ percpu_up_read(&bdev->bd_block_size_semaphore);
unlock_page(page);
page_cache_release(page);

@@ -464,6 +510,11 @@ static struct inode *bdev_alloc_inode(struct super_block *sb)
kmem_cache_free(bdev_cachep, ei);
return NULL;
}
+ if (unlikely(percpu_init_rwsem(&ei->bdev.bd_page_semaphore))) {
+ percpu_free_rwsem(&ei->bdev.bd_block_size_semaphore);
+ kmem_cache_free(bdev_cachep, ei);
+ return NULL;
+ }

return &ei->vfs_inode;
}
@@ -474,6 +525,7 @@ static void bdev_i_callback(struct rcu_head *head)
struct bdev_inode *bdi = BDEV_I(inode);

percpu_free_rwsem(&bdi->bdev.bd_block_size_semaphore);
+ percpu_free_rwsem(&bdi->bdev.bd_page_semaphore);

kmem_cache_free(bdev_cachep, bdi);
}
@@ -1596,16 +1648,7 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos)
{
- ssize_t ret;
- struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
-
- percpu_down_read(&bdev->bd_block_size_semaphore);
-
- ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
-
- percpu_up_read(&bdev->bd_block_size_semaphore);
-
- return ret;
+ return generic_file_aio_read(iocb, iov, nr_segs, pos);
}
EXPORT_SYMBOL_GPL(blkdev_aio_read);

@@ -1620,7 +1663,6 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos)
{
struct file *file = iocb->ki_filp;
- struct block_device *bdev = I_BDEV(file->f_mapping->host);
struct blk_plug plug;
ssize_t ret;

@@ -1628,8 +1670,6 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,

blk_start_plug(&plug);

- percpu_down_read(&bdev->bd_block_size_semaphore);
-
ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
if (ret > 0 || ret == -EIOCBQUEUED) {
ssize_t err;
@@ -1639,61 +1679,42 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
ret = err;
}

- percpu_up_read(&bdev->bd_block_size_semaphore);
-
blk_finish_plug(&plug);

return ret;
}
EXPORT_SYMBOL_GPL(blkdev_aio_write);

-static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+int blkdev_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
+ struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+ struct block_device *bdev = I_BDEV(inode);
int ret;
- struct block_device *bdev = I_BDEV(file->f_mapping->host);

percpu_down_read(&bdev->bd_block_size_semaphore);
-
- ret = generic_file_mmap(file, vma);
-
+ ret = filemap_page_mkwrite(vma, vmf);
percpu_up_read(&bdev->bd_block_size_semaphore);
-
return ret;
}

-static ssize_t blkdev_splice_read(struct file *file, loff_t *ppos,
- struct pipe_inode_info *pipe, size_t len,
- unsigned int flags)
-{
- ssize_t ret;
- struct block_device *bdev = I_BDEV(file->f_mapping->host);
-
- percpu_down_read(&bdev->bd_block_size_semaphore);
-
- ret = generic_file_splice_read(file, ppos, pipe, len, flags);
-
- percpu_up_read(&bdev->bd_block_size_semaphore);
-
- return ret;
-}
+static const struct vm_operations_struct blk_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = blkdev_page_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+};

-static ssize_t blkdev_splice_write(struct pipe_inode_info *pipe,
- struct file *file, loff_t *ppos, size_t len,
- unsigned int flags)
+static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
- ssize_t ret;
- struct block_device *bdev = I_BDEV(file->f_mapping->host);
-
- percpu_down_read(&bdev->bd_block_size_semaphore);
+ struct address_space *mapping = file->f_mapping;

- ret = generic_file_splice_write(pipe, file, ppos, len, flags);
-
- percpu_up_read(&bdev->bd_block_size_semaphore);
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;

- return ret;
+ file_accessed(file);
+ vma->vm_ops = &blk_vm_ops;
+ return 0;
}

-
/*
* Try to release a page associated with block device when the system
* is under memory pressure.
@@ -1708,6 +1729,8 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
return try_to_free_buffers(page);
}

+
+
static const struct address_space_operations def_blk_aops = {
.readpage = blkdev_readpage,
.writepage = blkdev_writepage,
@@ -1732,8 +1755,8 @@ const struct file_operations def_blk_fops = {
#ifdef CONFIG_COMPAT
.compat_ioctl = compat_blkdev_ioctl,
#endif
- .splice_read = blkdev_splice_read,
- .splice_write = blkdev_splice_write,
+ .splice_read = generic_file_splice_read,
+ .splice_write = generic_file_splice_write,
};

int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..4f0043d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -463,6 +463,7 @@ struct block_device {
/* Mutex for freeze */
struct mutex bd_fsfreeze_mutex;
/* A semaphore that prevents I/O while block size is being changed */
+ struct percpu_rw_semaphore bd_page_semaphore;
struct percpu_rw_semaphore bd_block_size_semaphore;
};

2012-11-29 17:27:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason <[email protected]> wrote:
>
> Jumping in based on Linus original patch, which is doing something like
> this:
>
> set_blocksize() {
> block new calls to writepage, prepare/commit_write
> set the block size
> unblock
>
> < --- can race in here and find bad buffers --->
>
> sync_blockdev()
> kill_bdev()
>
> < --- now we're safe --- >
> }
>
> We could add a second semaphore and a page_mkwrite call:

Yeah, we could be fancy, but the more I think about it, the less I can
say I care.

After all, the only things that do the whole set_blocksize() thing should be:

- filesystems at mount-time

- things like loop/md at block device init time.

and quite frankly, if there are any *concurrent* writes with either of
the above, I really *really* don't think we should care. I mean,
seriously.

So the _only_ real reason for the locking in the first place is to
make sure of internal kernel consistency. We do not want to oops or
corrupt memory if people do odd things. But we really *really* don't
care if somebody writes to a partition at the same time as somebody
else mounts it. Not enough to do extra work to please insane people.

It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST.
The whole sequence always used to be unlocked. The locking is entirely
new. There are certainly no legacy users that can possibly rely on
"I did writes at the same time as the mount with no serialization, and
it worked". It never has worked.

So I think this is a case of "perfect is the enemy of good".
Especially since I think that with the fs/buffer.c approach, we don't
actually need any locking at all at higher levels.

Linus

2012-11-29 17:51:08

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Thu, Nov 29, 2012 at 10:26:56AM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason <[email protected]> wrote:
> >
> > Jumping in based on Linus original patch, which is doing something like
> > this:
> >
> > set_blocksize() {
> > block new calls to writepage, prepare/commit_write
> > set the block size
> > unblock
> >
> > < --- can race in here and find bad buffers --->
> >
> > sync_blockdev()
> > kill_bdev()
> >
> > < --- now we're safe --- >
> > }
> >
> > We could add a second semaphore and a page_mkwrite call:
>
> Yeah, we could be fancy, but the more I think about it, the less I can
> say I care.
>
> After all, the only things that do the whole set_blocksize() thing should be:
>
> - filesystems at mount-time
>
> - things like loop/md at block device init time.
>
> and quite frankly, if there are any *concurrent* writes with either of
> the above, I really *really* don't think we should care. I mean,
> seriously.
>
> So the _only_ real reason for the locking in the first place is to
> make sure of internal kernel consistency. We do not want to oops or
> corrupt memory if people do odd things. But we really *really* don't
> care if somebody writes to a partition at the same time as somebody
> else mounts it. Not enough to do extra work to please insane people.
>
> It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST.
> The whole sequence always used to be unlocked. The locking is entirely
> new. There is certainly not any legacy users that can possibly rely on
> "I did writes at the same time as the mount with no serialization, and
> it worked". It never has worked.
>
> So I think this is a case of "perfect is the enemy of good".
> Especially since I think that with the fs/buffer.c approach, we don't
> actually need any locking at all at higher levels.

The bigger question is: do we have users that expect to be able to set
the blocksize after mmapping the block device (no writes required)? I
actually feel a little bad for taking up internet bandwidth asking, but
it is a change in behaviour.

Regardless, changing mmap for a race in the page cache is just backwards, and
with the current 3.7 code, we can still trigger the race with fadvise ->
readpage in the middle of set_blocksize().

Obviously nobody does any of this, otherwise we'd have tons of reports
from those handy WARN_ONs in fs/buffer.c. So it's definitely hard to be
worried one way or another.

-chris

2012-11-29 18:12:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

On Thu, Nov 29, 2012 at 9:51 AM, Chris Mason <[email protected]> wrote:
>
> The bigger question is do we have users that expect to be able to set
> the blocksize after mmaping the block device (no writes required)? I
> actually feel a little bad for taking up internet bandwidth asking, but
> it is a change in behaviour.

Yeah, it is. That said, I don't think people will really notice.
Nobody mmap's block devices outside of some databases, afaik, and
nobody sane mounts a partition at the same time a DB is using it. So I
think the new EBUSY check is *ugly*, but I don't realistically believe
that it is a problem. The ugliness of the locking is why I'm not a
huge fan of it, but if it works I can live with it.

But yes, the mmap tests are new with the locking, and could in theory
be problematic if somebody reports that it breaks anything.

And like the locking, they'd just go away if we just do the
fs/buffer.c approach instead. Because doing things in fs/buffer.c
simply means that we don't care (and serialization is provided by the
page lock on a per-page basis, which is what mmap relies on anyway).

So doing the per-page fs/buffer.c approach (along with the
"ACCESS_ONCE()" on inode->i_blkbits to make sure we get *one*
consistent value, even if we don't care *which* value it is) would
basically revert to all the old semantics. The only thing it would
change is that we wouldn't see oopses.

(And in theory, it would allow us to actively mix-and-match different
block sizes for a block device, but realistically I don't think there
are any actual users of that - although I could imagine that a
filesystem would use a smaller block size for file tail-blocks etc,
and still want to use the fs/buffer.c code, so it's *possible* that it
would be useful, but filesystems have been able to do things like that
by just doing their buffers by hand anyway, so it's not really
fundamentally new, just a possible generalization of code)

Linus

2012-11-29 18:23:55

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change



On Thu, 29 Nov 2012, Linus Torvalds wrote:

> On Wed, Nov 28, 2012 at 2:01 PM, Mikulas Patocka <[email protected]> wrote:
> >
> > This sounds sensible. I'm sending this patch.
>
> This looks much better.
>
> I think I'll apply this for 3.7 (since it's too late to do anything
> fancier), and then for 3.8 I will rip out all the locking entirely,
> because looking at the fs/buffer.c patch I wrote up, it's all totally
> unnecessary.
>
> Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
> creating new buffers) simply makes the whole locking thing pointless.
> Just make the page lock protect the block size, and make it per-page,
> and we're done.
>
> No RCU grace period crap, no expedited mess, no nothing.
>
> Linus

Yes.

If you remove that percpu rw lock, you also need to rewrite direct i/o
code.

In theory, block device direct i/o doesn't need buffer block size at all.
But in practice, it shares a lot of code with filesystem direct i/o, it
reads the block size multiple times and it crashes if it changes.

Mikulas

2012-11-29 18:46:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 10:23 AM, Mikulas Patocka <[email protected]> wrote:
>
>
> If you remove that percpu rw lock, you also need to rewrite direct i/o
> code.
>
> In theory, block device direct i/o doesn't need buffer block size at all.
> But in practice, it shares a lot of code with filesystem direct i/o, it
> reads the block size multiple times and it crashes if it changes.

If it's a filesystem, then the size will never change while it is mounted.

So only the direct-block-device case needs to be worried about, no?

And that uses __generic_file_aio_write() and friends, which in turn
use the readpage/writepage functions.

So for block devices, it should be sufficient to make
readpage/writepage (with the writing obviously having all the
"write_begin/write_end/fullpage" variants) be safe as far as I can
see.

Linus

2012-11-29 19:02:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds
<[email protected]> wrote:
>
> I think I'll apply this for 3.7 (since it's too late to do anything
> fancier), and then for 3.8 I will rip out all the locking entirely,
> because looking at the fs/buffer.c patch I wrote up, it's all totally
> unnecessary.
>
> Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
> creating new buffers) simply makes the whole locking thing pointless.
> Just make the page lock protect the block size, and make it per-page,
> and we're done.

There's a 'block-dev' branch in my git tree, if you guys want to play
around with it.

It actually reverts fs/block_dev.c back to the 3.6 state (except for
some whitespace damage that I refused to re-introduce), so that part
of the changes should be pretty safe and well tested.

The fs/buffer.c changes, of course, are new. It's largely the same
patch I already sent out, with a small helper function to simplify it,
and to keep the whole ACCESS_ONCE() thing in just a single place.

That branch may be re-based in case I get reports or acks or whatever,
so don't rely on it (or if you do, please let me know, and I'll stop
editing it).

The fact that I could just revert the fs/block_dev.c part to a known
state makes me wonder if this might be safe for 3.7 after all (the
fs/buffer.c changes all *look* safe). That way we'd not have to worry
about any new semantics (whether they be EBUSY or any possible locking
slowdowns or RT issues). But I'll think about it, and it would be good
for people to double-check my fs/buffer.c stuff.

Mikulas, does that pass your testing?

Linus

2012-11-29 19:15:09

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 12:02:17PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > I think I'll apply this for 3.7 (since it's too late to do anything
> > fancier), and then for 3.8 I will rip out all the locking entirely,
> > because looking at the fs/buffer.c patch I wrote up, it's all totally
> > unnecessary.
> >
> > Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
> > creating new buffers) simply makes the whole locking thing pointless.
> > Just make the page lock protect the block size, and make it per-page,
> > and we're done.
>
> There's a 'block-dev' branch in my git tree, if you guys want to play
> around with it.
>
> It actually reverts fs/block-dev.c back to the 3.6 state (except for
> some whitespace damage that I refused to re-introduce), so that part
> of the changes should be pretty safe and well tested.
>
> The fs/buffer.c changes, of course, are new. It's largely the same
> patch I already sent out, with a small helper function to simplify it,
> and to keep the whole ACCESS_ONCE() thing in just a single place.

The fs/buffer.c part makes sense during a quick read. But
fs/direct-io.c plays with i_blkbits too. The semaphore was fixing real
bugs there.

-chris

2012-11-29 19:26:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason <[email protected]> wrote:
>
> The fs/buffer.c part makes sense during a quick read. But
> fs/direct-io.c plays with i_blkbits too. The semaphore was fixing real
> bugs there.

Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be
incestuously playing games that should be in fs/buffer.c. I thought it
was doing the sane thing with the page cache.

(I now realize that Mikulas was talking about this mess, while I
thought he was talking about the AIO code which is largely sane).

Linus

2012-11-29 19:48:45

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 12:26:06PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason <[email protected]> wrote:
> >
> > The fs/buffer.c part makes sense during a quick read. But
> > fs/direct-io.c plays with i_blkbits too. The semaphore was fixing real
> > bugs there.
>
> Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be
> incestuously playing games that should be in fs/buffer.c. I thought it
> was doing the sane thing with the page cache.
>
> (I now realize that Mikulas was talking about this mess, while I
> thought he was talking about the AIO code which is largely sane).

It was all a trick to get you to say the AIO code was sane.

It looks like we could use the private copy of i_blkbits that DIO is
already recording.

blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I
really don't get why that isn't byte based instead. DIO is already
doing the shift & mask game.

I think only clean_blockdev_aliases is intentionally using the inode's
i_blkbits, but again that shouldn't be changing for filesystems so it
seems safe to use the DIO copy.
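
A minimal sketch of the "use a private copy" idea (names are illustrative,
not the actual dio/dio_submit layout):

#include <linux/compiler.h>
#include <linux/fs.h>

/*
 * Sample the inode block shift once at submission time and derive the
 * factor from the local copy, so a set_blocksize() racing on the raw
 * bdev can't change the value in the middle of a direct I/O.
 */
static void dio_sample_blkbits(struct inode *inode, unsigned int blkbits,
                               unsigned int *i_blkbits, unsigned int *blkfactor)
{
        *i_blkbits = ACCESS_ONCE(inode->i_blkbits);
        *blkfactor = *i_blkbits - blkbits;
}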

-chris

2012-11-29 19:51:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 11:26 AM, Linus Torvalds
<[email protected]> wrote:
>
> (I now realize that Mikulas was talking about this mess, while I
> thought he was talking about the AIO code which is largely sane).

Oh wow.

The direct-IO code really doesn't seem to care at all. I don't think
it needs locking either (it seems to do everything with a private
buffer-head), and the problem appears solely to be that it reads
i_blkbits multiple times, so changing it just happens to confuse the
direct-io code.

If it were to read it only once, and then use that value, it looks
like it should all JustWork(tm).

And the right thing to do would seem to just add it to the
"dio_submit" structure that we already have. And it already *has* a
blkbits field, but that's the "IO blocksize", not the "getblocks
blocksize", if I read that mess correctly.

Of course, it then *ALREADY* has that "blkfactor" thing, which is the
difference between i_blkbits and blkbits, so it effectively *does* have
i_blkbits already in the dio_submit structure. But despite it all, it
keeps re-reading i_blkbits.

Christ. That code is a mess.

Linus

2012-11-29 19:55:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason <[email protected]> wrote:
>
> blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I
> really don't get why that isn't byte based instead. DIO is already
> doing the shift & mask game.

The blkdev_get_blocks() thing is just sad.

The underlying data structure is actually byte-based (it's
"i_size_read(bdev->bd_inode)"), but we convert it to a block-based
number.

Oops.

Linus

2012-11-29 20:11:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 11:55 AM, Linus Torvalds
<[email protected]> wrote:
>
> The blkdev_get_blocks() this is just sad.
>
> The underlying data structure is actually byte-based (it's
> "i_size_read(bdev->bd_inode"), but we convert it to a block-based
> number.
>
> Oops.

Oh, it's even worse than that. The DIO code ends up passing in buffer
heads that have sizes bigger than the inode's block size, which can cause
problems at the end of the disk. So blkdev_get_blocks() knows about
it, and will then "fix" that and shrink them down. The games with
"max_block" are hilarious.

In a really sad way.

That whole blkdev_get_blocks() function is pure and utter shit.

Linus

2012-11-29 20:52:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason <[email protected]> wrote:
>
> It was all a trick to get you to say the AIO code was sane.

It's only sane compared to the DIO code.

That said, I hate AIO much less these days now that we've largely merged
the code with the regular IO. It's still a horrible interface, but at
least it is no longer a really disgusting separate implementation in
the kernel of that horrible interface.

So yeah, I guess AIO really is pretty sane these days.

> It looks like we could use the private copy of i_blkbits that DIO is
> already recording.

Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out.

I've pushed out two more commits to the 'block-dev' branch at

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev

in case anybody wants to take a look.

It is - as usual - entirely untested. It compiles, and I *think* that
blkdev_get_blocks() makes a whole lot more sense this way - as you
said, it should be byte-based (although it actually does the block
number conversion because I worried about overflow - probably
unnecessarily).

Comments?

Linus

2012-11-29 21:29:36

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 01:52:22PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason <[email protected]> wrote:
> >
> > It was all a trick to get you to say the AIO code was sane.
>
> It's only sane compared to the DIO code.
>
> That said, I hate AIO much less these days that we've largely merged
> the code with the regular IO. It's still a horrible interface, but at
> least it is no longer a really disgusting separate implementation in
> the kernel of that horrible interface.
>
> So yeah, I guess AIO really is pretty sane these days.
>
> > It looks like we could use the private copy of i_blkbits that DIO is
> > already recording.
>
> Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out.
>
> I've pushed out two more commits to the 'block-dev' branch at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev
>
> in case anybody wants to take a look.
>
> It is - as usual - entirely untested. It compiles, and I *think* that
> blkdev_get_blocks() makes a whole lot more sense this way - as you
> said, it should be byte-based (although it actually does the block
> number conversion because I worried about overflow - probably
> unnecessarily).
>
> Comments?

Your blkdev_get_blocks emails were great reading while at the dentist,
thanks for helping me pass the time.

Just reading the new blkdev_get_blocks, it looks like we're mixing
shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and
it has no relation at all to the actual block size of the device. The
interface is abusing b_size to ask for as large a mapping as possible.

Most importantly, it has no relation to the fs_startblk that we pass in,
which is based on inode->i_blkbits.

So your new check in blkdev_get_blocks:

if (iblock >= end_block) {

Is wrong because iblock and end_block are based on different sizes. I
think we have to do the eof checks inside fs/direct-io.c or change the
get_blocks interface completely.
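
To make the unit mismatch concrete, a toy user-space calculation (all numbers
made up):

#include <stdio.h>

int main(void)
{
        unsigned long long dev_bytes = 8ULL << 20;      /* 8 MiB device            */
        unsigned int i_blkbits = 9;                     /* 512-byte soft blocksize */
        unsigned long long b_size = 64 << 10;           /* DIO "map 64 KiB please" */
        unsigned long long offset = 4ULL << 20;         /* I/O at the 4 MiB mark   */

        unsigned long long iblock = offset >> i_blkbits;        /* 8192, in 512-byte units */
        unsigned long long end_block = dev_bytes / b_size;      /* 128, in 64 KiB units    */

        /* 8192 >= 128: a perfectly valid offset looks like it's past the end */
        printf("iblock=%llu end_block=%llu past_eof=%d\n",
               iblock, end_block, iblock >= end_block);
        return 0;
}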

I really thought fs/direct-io.c was already doing eof checks, but I'm
reading harder.

-chris

2012-11-29 22:17:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason <[email protected]> wrote:
>
> Just reading the new blkdev_get_blocks, it looks like we're mixing
> shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and
> it has no relation at all to the actual block size of the device. The
> interface is abusing b_size to ask for as large a mapping as possible.

Ugh. That's a big violation of how buffer-heads are supposed to work:
the block number is very much defined to be in multiples of b_size
(see for example "submit_bh()" that turns it into a sector number).

But you're right. The direct-IO code really *is* violating that, and
knows that get_block() ends up being defined in i_blkbits regardless
of b_size.

What a crock. That direct-IO code is hack-upon-hack. Whoever wrote it
should be shot.

I think the only sane way to fix is is to pass in the block size to
get_blocks(). Which we admittedly should have done long ago, so that's
not a bad fix, but without actually looking at what it involves, I
think it's going to be pretty big patch. All the filesystems that
support the interface need to update it, even if they can then ignore
it, because direct-IO does all these hacks only for the raw device.

And I think it will improve the interface, but damn, direct-IO is
still horrible for playing these kinds of games.

Linus

2012-11-29 22:37:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds
<[email protected]> wrote:
>
> But you're right. The direct-IO code really *is* violating that, and
> knows that get_block() ends up being defined in i_blkbits regardless
> of b_size.

It turns out fs/ioctl.c does the same - it fills in the buffer head
with some random bh->b_size too. I think it's not even a power of two
in that case.

And I guess it's understandable - they don't actually *use* the
buffer, they just want the offset. So the b_size field really is just
random crap to the users of the get_block interfaces, since they've
never cared before.

Ugh, this was definitely a dark and disgusting underbelly of the VFS
layer. We've not had to really touch it for a *looong* time..

Linus

2012-11-30 00:06:11

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/2] percpu-rwsem: use synchronize_sched_expedited

On Tue, 27 Nov 2012 22:59:52 -0500 (EST)
Mikulas Patocka <[email protected]> wrote:

> percpu-rwsem: use synchronize_sched_expedited
>
> Use synchronize_sched_expedited() instead of synchronize_sched()
> to improve mount speed.
>
> This patch improves mount time from 0.500s to 0.013s.
>
> Note: if realtime people complain about the use
> synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
> that they introduce an option CONFIG_REALTIME or
> /proc/sys/kernel/realtime and turn off these *_expedited functions if
> the option is enabled (i.e. turn synchronize_sched_expedited into
> synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).
>
> Signed-off-by: Mikulas Patocka <[email protected]>

So I read through this thread but I really didn't see a clear
description of why mount() got slower. The changelog for 4b05a1c74d1
is spectacularly awful :(


Apparently the slowdown occurred because a blockdev mount patch
62ac665ff9fc07497ca524 ("blockdev: turn a rw semaphore into a percpu rw
semaphore") newly uses percpu rwsems, and percpu rwsems are slow on the
down_write() path.

And using synchronize_sched_expedited() rather than synchronize_sched()
makes percpu_down_write() somewhat less slow. Correct?

Why is it OK to use synchronize_sched_expedited() here? If it's
faster, why can't we use synchronize_sched_expedited() everywhere and
zap synchronize_sched()?

Oleg, this has implications for your
percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.patch.
I've changed that to use synchronize_sched_expedited() everywhere -
please have a think and a retest. Note that elsewhere in this thread,
percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.patch
was found to slow mount down again somewhat, because it adds an
additional synchronize_sched[_expedited]() to the percpu_down_write()
path.



From: Oleg Nesterov <[email protected]>
Subject: percpu_rw_semaphore: reimplement to not block the readers unnecessarily

Currently the writer does msleep() plus synchronize_sched() 3 times to
acquire/release the semaphore, and during this time the readers are
blocked completely. Even if the "write" section was not actually started
or if it was already finished.

With this patch down_write/up_write does synchronize_sched() twice and
down_read/up_read are still possible during this time, just they use the
slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Ananth N Mavinakayanahalli <[email protected]>
Cc: Anton Arapov <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

include/linux/percpu-rwsem.h | 85 +++-------------------
lib/Makefile | 2
lib/percpu-rwsem.c | 123 +++++++++++++++++++++++++++++++++
3 files changed, 138 insertions(+), 72 deletions(-)

diff -puN include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily include/linux/percpu-rwsem.h
--- a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
+++ a/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
#define _LINUX_PERCPU_RWSEM_H

#include <linux/mutex.h>
+#include <linux/rwsem.h>
#include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>

struct percpu_rw_semaphore {
- unsigned __percpu *counters;
- bool locked;
- struct mutex mtx;
+ unsigned int __percpu *fast_read_ctr;
+ struct mutex writer_mutex;
+ struct rw_semaphore rw_sem;
+ atomic_t slow_read_ctr;
+ wait_queue_head_t write_waitq;
};

-#define light_mb() barrier()
-#define heavy_mb() synchronize_sched_expedited()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);

-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
- rcu_read_lock_sched();
- if (unlikely(p->locked)) {
- rcu_read_unlock_sched();
- mutex_lock(&p->mtx);
- this_cpu_inc(*p->counters);
- mutex_unlock(&p->mtx);
- return;
- }
- this_cpu_inc(*p->counters);
- rcu_read_unlock_sched();
- light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
-
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
- light_mb(); /* B, between read of the data and write to p->counter, paired with C */
- this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
- unsigned total = 0;
- int cpu;
-
- for_each_possible_cpu(cpu)
- total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
- return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
- mutex_lock(&p->mtx);
- p->locked = true;
- synchronize_sched_expedited(); /* make sure that all readers exit the rcu_read_lock_sched region */
- while (__percpu_count(p->counters))
- msleep(1);
- heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
- heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
- p->locked = false;
- mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
- p->counters = alloc_percpu(unsigned);
- if (unlikely(!p->counters))
- return -ENOMEM;
- p->locked = false;
- mutex_init(&p->mtx);
- return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
- free_percpu(p->counters);
- p->counters = NULL; /* catch use after free bugs */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
+
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);

#endif
diff -puN lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily lib/Makefile
--- a/lib/Makefile~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
+++ a/lib/Makefile
@@ -9,7 +9,7 @@ endif

lib-y := ctype.o string.o vsprintf.o cmdline.o \
rbtree.o radix-tree.o dump_stack.o timerqueue.o\
- idr.o int_sqrt.o extable.o \
+ idr.o int_sqrt.o extable.o percpu-rwsem.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o earlycpio.o kobject_uevent.o
diff -puN /dev/null lib/percpu-rwsem.c
--- /dev/null
+++ a/lib/percpu-rwsem.c
@@ -0,0 +1,123 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+ brw->fast_read_ctr = alloc_percpu(int);
+ if (unlikely(!brw->fast_read_ctr))
+ return -ENOMEM;
+
+ mutex_init(&brw->writer_mutex);
+ init_rwsem(&brw->rw_sem);
+ atomic_set(&brw->slow_read_ctr, 0);
+ init_waitqueue_head(&brw->write_waitq);
+ return 0;
+}
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+ free_percpu(brw->fast_read_ctr);
+ brw->fast_read_ctr = NULL; /* catch use after free bugs */
+}
+
+static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
+{
+ bool success = false;
+
+ preempt_disable();
+ if (likely(!mutex_is_locked(&brw->writer_mutex))) {
+ __this_cpu_add(*brw->fast_read_ctr, val);
+ success = true;
+ }
+ preempt_enable();
+
+ return success;
+}
+
+/*
+ * Like the normal down_read() this is not recursive, the writer can
+ * come after the first percpu_down_read() and create the deadlock.
+ */
+void percpu_down_read(struct percpu_rw_semaphore *brw)
+{
+ if (likely(update_fast_ctr(brw, +1)))
+ return;
+
+ down_read(&brw->rw_sem);
+ atomic_inc(&brw->slow_read_ctr);
+ up_read(&brw->rw_sem);
+}
+
+void percpu_up_read(struct percpu_rw_semaphore *brw)
+{
+ if (likely(update_fast_ctr(brw, -1)))
+ return;
+
+ /* false-positive is possible but harmless */
+ if (atomic_dec_and_test(&brw->slow_read_ctr))
+ wake_up_all(&brw->write_waitq);
+}
+
+static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
+{
+ unsigned int sum = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ sum += per_cpu(*brw->fast_read_ctr, cpu);
+ per_cpu(*brw->fast_read_ctr, cpu) = 0;
+ }
+
+ return sum;
+}
+
+/*
+ * A writer takes ->writer_mutex to exclude other writers and to force the
+ * readers to switch to the slow mode, note the mutex_is_locked() check in
+ * update_fast_ctr().
+ *
+ * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
+ * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
+ * counter it represents the number of active readers.
+ *
+ * Finally the writer takes ->rw_sem for writing and blocks the new readers,
+ * then waits until the slow counter becomes zero.
+ */
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+ /* also blocks update_fast_ctr() which checks mutex_is_locked() */
+ mutex_lock(&brw->writer_mutex);
+
+ /*
+ * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
+ * so that update_fast_ctr() can't succeed.
+ *
+ * 2. Ensures we see the result of every previous this_cpu_add() in
+ * update_fast_ctr().
+ *
+ * 3. Ensures that if any reader has exited its critical section via
+ * fast-path, it executes a full memory barrier before we return.
+ */
+ synchronize_sched_expedited();
+
+ /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
+ atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
+
+ /* block the new readers completely */
+ down_write(&brw->rw_sem);
+
+ /* wait for all readers to complete their percpu_up_read() */
+ wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
+}
+
+void percpu_up_write(struct percpu_rw_semaphore *brw)
+{
+ /* allow the new readers, but only the slow-path */
+ up_write(&brw->rw_sem);
+
+ /* insert the barrier before the next fast-path in down_read */
+ synchronize_sched_expedited();
+
+ mutex_unlock(&brw->writer_mutex);
+}
_
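
For readers following along, a minimal usage sketch of the interface added by the patch above. The function names come from the patch itself; the bd_sem variable and the example_* callers are made up purely for illustration:

#include <linux/percpu-rwsem.h>

static struct percpu_rw_semaphore bd_sem;	/* hypothetical, for illustration */

static int example_setup(void)
{
	/* allocates the per-cpu fast counters, can fail with -ENOMEM */
	return percpu_init_rwsem(&bd_sem);
}

static void example_reader(void)
{
	percpu_down_read(&bd_sem);	/* fast path: per-cpu increment, no shared atomics */
	/* ... read-side section, e.g. submitting I/O against the device ... */
	percpu_up_read(&bd_sem);
}

static void example_writer(void)
{
	percpu_down_write(&bd_sem);	/* slow path: synchronize_sched_expedited()
					 * plus down_write(), as in the code above */
	/* ... exclusive section, e.g. changing the block size ... */
	percpu_up_write(&bd_sem);
}

The point of the reimplementation is visible in that split: readers normally never touch a shared cacheline, while the writer pays for the grace periods in percpu_down_write() and percpu_up_write().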

2012-11-30 01:16:14

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 03:36:38PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > But you're right. The direct-IO code really *is* violating that, and
> > knows that get_block() ends up being defined in i_blkbits regardless
> > of b_size.
>
> It turns out fs/ioctl.c does the same - it fills in the buffer head
> with some random bh->b_size too. I think it's not even a power of two
> in that case.
>
> And I guess it's understandable - they don't actually *use* the
> buffer, they just want the offset. So the b_size field really is just
> random crap to the users of the get_block interfaces, since they've
> never cared before.
>
> Ugh, this was definitely a dark and disgusting underbelly of the VFS
> layer. We've not had to really touch it for a *looong* time..

I searched through filemap.c for the magic i_size check that would let
us get away with ignoring i_blkbits in get_blocks, but it's just not
there. The whole fallback-to-buffered scheme seems to rely on
get_blocks checking for i_size. I really hope I'm just missing
something.

If we're going to change this, I'd vote for something non-bh based. I
didn't check every single FS, but I don't think direct-IO really wants
or needs buffer heads at all.

One less wart in direct-io.c would really be nice, but I'm assuming
it'll take us at least one full release to hammer out a shiny new
get_blocks. Passing i_blkbits would be more mechanical, since all the
filesystems would just ignore it.

-chris
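
For context, this is the get_block prototype the thread keeps referring to, together with a sketch (purely illustrative, not taken from fs/direct-io.c) of the b_size abuse Chris describes: the caller passes a byte count in bh->b_size as a "map up to this much" hint, while the block index is still in units of inode->i_blkbits:

#include <linux/fs.h>
#include <linux/buffer_head.h>

/* The interface under discussion, as declared in include/linux/fs.h:
 *
 *   typedef int (get_block_t)(struct inode *inode, sector_t iblock,
 *                             struct buffer_head *bh_result, int create);
 */

/* Illustrative caller in the style the direct-IO code uses; not actual
 * fs/direct-io.c code. */
static int example_map_extent(struct inode *inode, loff_t pos, size_t want,
			      struct buffer_head *map_bh,
			      get_block_t *get_block)
{
	sector_t iblock = pos >> inode->i_blkbits;	/* index in fs blocks */

	map_bh->b_state = 0;
	map_bh->b_size = want;	/* "map up to this many bytes" hint; nothing
				 * to do with the device's block size */

	return get_block(inode, iblock, map_bh, 0 /* don't allocate */);
	/* on return, a mapped map_bh has b_blocknr set and b_size trimmed to
	 * the length that was actually mapped */
}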

2012-11-30 02:13:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason <[email protected]> wrote:
>
> I searched through filemap.c for the magic i_size check that would let
> us get away with ignoring i_blkbits in get_blocks, but its just not
> there. The whole fallback-to-buffered scheme seems to rely on
> get_blocks checking for i_size. I really hope I'm just missing
> something.

So generic_write_checks() limits the size to i_size for writes (and
for "isblk").

Sure, then it will do the buffered part after that, but that should
all be fine anyway, since by then we use the normal page cache.

For reads, generic_file_aio_read() will check pos < size, but doesn't
seem to actually limit the size of the iovec.

I'm not sure why it doesn't just do "iov_shorten()".

Anyway, having looked at actually passing in the block size to
get_block(), I can say that is a horrible idea. There are tons of
get_block functions (for various filesystems), and *none* of them
really want the block size, because they tend to work on block
indexes. And if they do want the block size, they'll just get it from
the inode or sb, since they are filesystems and it's all stable.

So the *only* one of the places that would want the block size is
fs/block_dev.c. And the callers really already seem to do the i_size
check, although they sometimes do it badly. And since there are fewer
callers than there are get_block() implementations, I think we should
just fix the callers and be done with it.

Those generic_file_aio_read/write() functions in fs/direct-io.c really
just seem to be badly written. The fact that they may depend on the
i_size check in get_blocks() is sad, but I think we should fix it and
just remove the check for block devices. That's going to simplify so
much..

I updated the 'block-dev' branch to have that simpler fs/block_dev.c
model instead. I'll look at the iovec shortening later. It's a
non-fast-forward thing, look out!

(I actually think we should just add the max-offset check to
rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing,
and we could make it do a max_offset check as well).

Linus
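
To make the last suggestion concrete, here is the kind of clamp being described, written as a standalone helper. This is not the real rw_copy_check_uvector() or iov_shorten() code, just a sketch of shortening an iovec so a request cannot extend past a maximum offset such as the device size:

#include <linux/uio.h>
#include <linux/fs.h>

/* Hypothetical helper: trim an iovec array so that a transfer starting at
 * pos never reaches past max_offset. Returns the total bytes remaining. */
static size_t example_clamp_iov(struct iovec *iov, unsigned long nr_segs,
				loff_t pos, loff_t max_offset)
{
	size_t allowed = pos < max_offset ? max_offset - pos : 0;
	size_t total = 0;
	unsigned long seg;

	for (seg = 0; seg < nr_segs; seg++) {
		if (iov[seg].iov_len > allowed - total)
			iov[seg].iov_len = allowed - total;	/* shorten */
		total += iov[seg].iov_len;
	}
	return total;
}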

2012-11-30 02:28:07

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 07:13:02PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason <[email protected]> wrote:
> >
> > I searched through filemap.c for the magic i_size check that would let
> > us get away with ignoring i_blkbits in get_blocks, but its just not
> > there. The whole fallback-to-buffered scheme seems to rely on
> > get_blocks checking for i_size. I really hope I'm just missing
> > something.
>
> So generic_write_checks() limits the size to i_size at for writes (and
> for "isblk").

Great, that's what I was missing.

>
> Sure, then it will do the buffered part after that, but that should
> all be fine anyway, since by then we use the normal page cache.
>
> For reads, generic_file_aio_read() will check pos < size, but doesn't
> seem to actually limit the size of the iovec.

I couldn't explain that either.

>
> I'm not sure why it doesn't just do "iov_shorten()".
>
> Anyway, having looked at actually passing in the block size to
> get_block(), I can say that is a horrible idea. There are tons of
> get_block functions (for various filesystems), and *none* of them
> really want the block size, because they tend to work on block
> indexes. And if they do want the block size, they'll just get it from
> the inode or sb, since they are filesystems and it's all stable.
>
> So the *only* of the places that would want the block size is
> fs/block_dev.c. And the callers really already seem to do the i_size
> check, although they sometimes do it badly. And since there are fewer
> callers than there are get_block() implementations, I think we should
> just fix the callers and be done with it.
>
> Those generic_file_aio_read/write() functions in fs/direct-io.c really
> just seem to be badly written. The fact that they may depend on the
> i_size check in get_blocks() is sad, but I think we should fix it and
> just remove the check for block devices. That's going to simplify so
> much..
>
> I updated the 'block-dev' branch to have that simpler fs/block_dev.c
> model instead. I'll look at the iovec shortening later. It's a
> non-fast-forward thing, look out!
>
> (I actually think we should just add the max-offset check to
> rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing,
> and we could make it do a max_offset check as well).

This is definitely easier, and I can't see any reason not to do it. I'm
used to get_block being expensive and so it didn't even cross my mind.

We can benchmark things just to make sure.

-chris

2012-11-30 02:49:15

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason <[email protected]> wrote:
> >
> > Just reading the new blkdev_get_blocks, it looks like we're mixing
> > shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and
> > it has no relation at all to the actual block size of the device. The
> > interface is abusing b_size to ask for as large a mapping as possible.
>
> Ugh. That's a big violation of how buffer-heads are supposed to work:
> the block number is very much defined to be in multiples of b_size
> (see for example "submit_bh()" that turns it into a sector number).
>
> But you're right. The direct-IO code really *is* violating that, and
> knows that get_block() ends up being defined in i_blkbits regardless
> of b_size.

Same with mpage_readpages(), so it's not just direct IO that has
this problem....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-11-30 03:01:59

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH 1/2] percpu-rwsem: use synchronize_sched_expedited



On Thu, 29 Nov 2012, Andrew Morton wrote:

> On Tue, 27 Nov 2012 22:59:52 -0500 (EST)
> Mikulas Patocka <[email protected]> wrote:
>
> > percpu-rwsem: use synchronize_sched_expedited
> >
> > Use synchronize_sched_expedited() instead of synchronize_sched()
> > to improve mount speed.
> >
> > This patch improves mount time from 0.500s to 0.013s.
> >
> > Note: if realtime people complain about the use
> > synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
> > that they introduce an option CONFIG_REALTIME or
> > /proc/sys/kernel/realtime and turn off these *_expedited functions if
> > the option is enabled (i.e. turn synchronize_sched_expedited into
> > synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
>
> So I read through this thread but I really didn't see a clear
> description of why mount() got slower. The changelog for 4b05a1c74d1
> is spectacularly awful :(
>
>
> Apparently the slowdown occurred because a blockdev mount patch
> 62ac665ff9fc07497ca524 ("blockdev: turn a rw semaphore into a percpu rw
> semaphore") newly uses percpu rwsems, and percpu rwsems are slow on the
> down_write() path.
>
> And using synchronize_sched_expedited() rather than synchronize_sched()
> makes percpu_down_write() somewhat less slow. Correct?

Yes.

> Why is it OK to use synchronize_sched_expedited() here? If it's
> faster, why can't we use synchronize_sched_expedited() everywhere and
> zap synchronize_sched()?

Because synchronize_sched_expedited sends interrupts to all processors and
it is bad for realtime workloads.

Peter Zijlstra once complained when I used synchronize_rcu_expedited in
bdi_remove_from_list (but he left it there).

I suggest that if it really hurts real-time response for someone, let them
introduce a switch to turn it into a non-expedited call.

Mikulas

2012-11-30 13:42:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 1/2] percpu-rwsem: use synchronize_sched_expedited

On Thu, Nov 29, 2012 at 10:00:53PM -0500, Mikulas Patocka wrote:
> On Thu, 29 Nov 2012, Andrew Morton wrote:
> > On Tue, 27 Nov 2012 22:59:52 -0500 (EST)
> > Mikulas Patocka <[email protected]> wrote:
> >
> > > percpu-rwsem: use synchronize_sched_expedited
> > >
> > > Use synchronize_sched_expedited() instead of synchronize_sched()
> > > to improve mount speed.
> > >
> > > This patch improves mount time from 0.500s to 0.013s.
> > >
> > > Note: if realtime people complain about the use
> > > synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
> > > that they introduce an option CONFIG_REALTIME or
> > > /proc/sys/kernel/realtime and turn off these *_expedited functions if
> > > the option is enabled (i.e. turn synchronize_sched_expedited into
> > > synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).
> > >
> > > Signed-off-by: Mikulas Patocka <[email protected]>
> >
> > So I read through this thread but I really didn't see a clear
> > description of why mount() got slower. The changelog for 4b05a1c74d1
> > is spectacularly awful :(
> >
> >
> > Apparently the slowdown occurred because a blockdev mount patch
> > 62ac665ff9fc07497ca524 ("blockdev: turn a rw semaphore into a percpu rw
> > semaphore") newly uses percpu rwsems, and percpu rwsems are slow on the
> > down_write() path.
> >
> > And using synchronize_sched_expedited() rather than synchronize_sched()
> > makes percpu_down_write() somewhat less slow. Correct?
>
> Yes.
>
> > Why is it OK to use synchronize_sched_expedited() here? If it's
> > faster, why can't we use synchronize_sched_expedited() everywhere and
> > zap synchronize_sched()?
>
> Because synchronize_sched_expedited sends interrupts to all processors and
> it is bad for realtime workloads.
>
> Peter Zijlstra once complained when I used synchronize_rcu_expedited in
> bdi_remove_from_list (but he left it there).
>
> I suggest that if it really hurts real time response for someone, let they
> introduce a switch to turn it into non-expedited call.

Once Frederic's adaptive-ticks work reaches mainline, it will be possible
to avoid the IPIs to CPUs that are executing in user mode, in addition to
the current code's avoiding sending IPIs to CPUs that are idle. That said,
it will still be necessary to send IPIs to CPUs that are executing in
the kernel.

So things will get better, but won't be perfect. Sort of like this was
real life or something. ;-)

Thanx, Paul

2012-11-30 14:31:15

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Thu, Nov 29, 2012 at 07:49:10PM -0700, Dave Chinner wrote:
> On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote:
> > On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason <[email protected]> wrote:
> > >
> > > Just reading the new blkdev_get_blocks, it looks like we're mixing
> > > shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and
> > > it has no relation at all to the actual block size of the device. The
> > > interface is abusing b_size to ask for as large a mapping as possible.
> >
> > Ugh. That's a big violation of how buffer-heads are supposed to work:
> > the block number is very much defined to be in multiples of b_size
> > (see for example "submit_bh()" that turns it into a sector number).
> >
> > But you're right. The direct-IO code really *is* violating that, and
> > knows that get_block() ends up being defined in i_blkbits regardless
> > of b_size.
>
> Same with mpage_readpages(), so it's not just direct IO that has
> this problem....

I guess the good news is that block devices don't have readpages. The
bad news would be that we can't put readpages in without much bigger
changes.

-chris

2012-11-30 16:36:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Fri, Nov 30, 2012 at 01:49:10PM +1100, Dave Chinner wrote:
> > Ugh. That's a big violation of how buffer-heads are supposed to work:
> > the block number is very much defined to be in multiples of b_size
> > (see for example "submit_bh()" that turns it into a sector number).
> >
> > But you're right. The direct-IO code really *is* violating that, and
> > knows that get_block() ends up being defined in i_blkbits regardless
> > of b_size.
>
> Same with mpage_readpages(), so it's not just direct IO that has
> this problem....

The mpage code may actually fall back to BHs.

I have a version of the direct I/O code that uses the iomap_ops from the
multi-page write code that you originally started. It uses the new op
as primary interface for direct I/O and provides a helper for
filesystems that still use buffer heads internally. I'll try to dust it
off and send out a version for the current kernel.

2012-11-30 16:43:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Fri, Nov 30, 2012 at 6:31 AM, Chris Mason <[email protected]> wrote:
> On Thu, Nov 29, 2012 at 07:49:10PM -0700, Dave Chinner wrote:
>>
>> Same with mpage_readpages(), so it's not just direct IO that has
>> this problem....
>
> I guess the good news is that block devices don't have readpages. The
> bad news would be that we can't put readpages in without much bigger
> changes.

Well, the new block-dev branch no longer cares. It basically says "ok,
we use inode->i_blkbits, but for raw device accesses we know it's
unstable, so we'll just not use it".

So both mpage_readpages and direct-IO should be happy. And it actually
removed redundant code, so it's all good.

Linus

2012-11-30 18:57:50

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/2] percpu-rwsem: use synchronize_sched_expedited

On 11/29, Andrew Morton wrote:
>
> On Tue, 27 Nov 2012 22:59:52 -0500 (EST)
> Mikulas Patocka <[email protected]> wrote:
>
> > percpu-rwsem: use synchronize_sched_expedited
> >
> > Use synchronize_sched_expedited() instead of synchronize_sched()
> > to improve mount speed.
> >
> > This patch improves mount time from 0.500s to 0.013s.
> >
> > Note: if realtime people complain about the use
> > synchronize_sched_expedited() and synchronize_rcu_expedited(), I suggest
> > that they introduce an option CONFIG_REALTIME or
> > /proc/sys/kernel/realtime and turn off these *_expedited functions if
> > the option is enabled (i.e. turn synchronize_sched_expedited into
> > synchronize_sched and synchronize_rcu_expedited into synchronize_rcu).
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
>
> So I read through this thread but I really didn't see a clear
> description of why mount() got slower. The changelog for 4b05a1c74d1
> is spectacularly awful :(
>
>
> Apparently the slowdown occurred because a blockdev mount patch
> 62ac665ff9fc07497ca524 ("blockdev: turn a rw semaphore into a percpu rw
> semaphore") newly uses percpu rwsems, and percpu rwsems are slow on the
> down_write() path.
>
> And using synchronize_sched_expedited() rather than synchronize_sched()
> makes percpu_down_write() somewhat less slow. Correct?
>
> Why is it OK to use synchronize_sched_expedited() here? If it's
> faster, why can't we use synchronize_sched_expedited() everywhere and
> zap synchronize_sched()?

synchronize_sched_expedited() is much faster, but it is heavyweight:
try_stop_cpus() affects the whole system.

We briefly discussed this possibility with Paul, personally I think it
would be better to add percpu_rw_semaphore->use_expedited or add the
percpu_down_write_expedited().

But this change was already applied. I hope we can do this later.

> Oleg, this has implications for your
> percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.patch.
> I've changed that to use synchronize_sched_expedited() everywhere -
> please have a think and a retest.

Thanks a lot Andrew. The fix looks obviously fine, but I'll try to test
anyway.

> Note that elsewhere in this thread,
> percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.patch
> was found to slow mount down again somewhat, because it adds an
> additional synchronize_sched[_expedited]() to the percpu_down_write()
> path.

Cough... this patch _removes_ one synchronize_sched[_expedited]() and
removes msleep ;) Although the main point was not-block-the-readers.

And I think we can remove another one, but this is not trivial and
percpu_free_rwsem() should be might_sleep().

Oleg.
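
For what it's worth, the two alternatives Oleg mentions could look roughly like this; neither exists in the tree, and the field and function names are hypothetical:

#include <linux/percpu-rwsem.h>
#include <linux/rcupdate.h>

/* (a) a per-semaphore flag consulted on the writer path */
static void example_writer_sync(bool use_expedited)
{
	if (use_expedited)
		synchronize_sched_expedited();	/* fast, but IPIs every CPU */
	else
		synchronize_sched();		/* slower, kinder to realtime */
}

/* (b) an explicit variant, so only callers that care (like the blockdev
 * mount path) ask for, and pay for, the expedited grace periods */
extern void percpu_down_write_expedited(struct percpu_rw_semaphore *brw);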

2012-11-30 22:40:47

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Fri, Nov 30, 2012 at 11:36:01AM -0500, Christoph Hellwig wrote:
> On Fri, Nov 30, 2012 at 01:49:10PM +1100, Dave Chinner wrote:
> > > Ugh. That's a big violation of how buffer-heads are supposed to work:
> > > the block number is very much defined to be in multiples of b_size
> > > (see for example "submit_bh()" that turns it into a sector number).
> > >
> > > But you're right. The direct-IO code really *is* violating that, and
> > > knows that get_block() ends up being defined in i_blkbits regardless
> > > of b_size.
> >
> > Same with mpage_readpages(), so it's not just direct IO that has
> > this problem....
>
> The mpage code may actually fall back to BHs.
>
> I have a version of the direct I/O code that uses the iomap_ops from the
> multi-page write code that you originally started. It uses the new op
> as primary interface for direct I/O and provides a helper for
> filesystems that still use buffer heads internally. I'll try to dust it
> off and send out a version for the current kernel.

So it was based on this interface?

(I went looking for this code on google a couple of days ago so I
could point at it and say "we should be using an iomap structure,
not buffer heads", but it looks like I never posted it to fsdevel or
the xfs lists...)


diff --git a/include/linux/fs.h b/include/linux/fs.h
index 090f0ea..e247d62 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -522,6 +522,7 @@ enum positive_aop_returns {
struct page;
struct address_space;
struct writeback_control;
+struct iomap;

struct iov_iter {
const struct iovec *iov;
@@ -614,6 +615,9 @@ struct address_space_operations {
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
int (*error_remove_page)(struct address_space *, struct page *);
+
+ int (*iomap)(struct address_space *mapping, loff_t pos,
+ ssize_t length, struct iomap *iomap, int cmd);
};

/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
new file mode 100644
index 0000000..7708614
--- /dev/null
+++ b/include/linux/iomap.h
@@ -0,0 +1,45 @@
+#ifndef _IOMAP_H
+#define _IOMAP_H
+
+/* ->iomap a_op command types */
+#define IOMAP_READ 0x01 /* read the current mapping starting at the
+ given position, trimmed to a maximum length.
+ FS's should use this to obtain and lock
+ resources within this range */
+#define IOMAP_RESERVE 0x02 /* reserve space for an allocation that spans
+ the given iomap */
+#define IOMAP_ALLOCATE 0x03 /* allocate space in a given iomap - must have
+ first been reserved */
+#define IOMAP_UNRESERVE 0x04 /* return unused reserved space for the given
+ iomap and used space. This will always be
+ called after a IOMAP_READ so as to allow the
+ FS to release held resources. */
+
+/* types of block ranges for multipage write mappings. */
+#define IOMAP_HOLE 0x01 /* no blocks allocated, need allocation */
+#define IOMAP_DELALLOC 0x02 /* delayed allocation blocks */
+#define IOMAP_MAPPED 0x03 /* blocks allocated @blkno */
+#define IOMAP_UNWRITTEN 0x04 /* blocks allocated @blkno in unwritten state */
+
+struct iomap {
+ sector_t blkno; /* first sector of mapping */
+ loff_t offset; /* file offset of mapping, bytes */
+ ssize_t length; /* length of mapping, bytes */
+ int type; /* type of mapping */
+ void *priv; /* fs private data associated with map */
+};
+
+static inline bool
+iomap_needs_allocation(struct iomap *iomap)
+{
+ return iomap->type == IOMAP_HOLE;
+}
+
+/* multipage write interfaces use iomaps */
+typedef int (*mpw_actor_t)(struct address_space *mapping, void *src,
+ loff_t pos, ssize_t len, struct iomap *iomap);
+
+ssize_t multipage_write_segment(struct address_space *mapping, void *src,
+ loff_t pos, ssize_t length, mpw_actor_t actor);
+
+#endif /* _IOMAP_H */

Cheers,

Dave.

--
Dave Chinner
[email protected]
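
A sketch of how a caller might drive the proposed ->iomap op for a read, using only the prototype, flags and helper from the header above; purely illustrative, none of this is merged code:

#include <linux/fs.h>
#include <linux/iomap.h>

static int example_map_for_read(struct address_space *mapping, loff_t pos,
				ssize_t len)
{
	struct iomap map;
	int ret;

	/* ask the fs for the current mapping, trimmed to at most len bytes */
	ret = mapping->a_ops->iomap(mapping, pos, len, &map, IOMAP_READ);
	if (ret)
		return ret;

	if (iomap_needs_allocation(&map)) {
		/* a hole: a read would zero-fill, a write would go through
		 * IOMAP_RESERVE/IOMAP_ALLOCATE first */
		return 0;
	}

	/* map.blkno, map.offset and map.length describe the extent to read,
	 * with no buffer_head anywhere in sight */
	return 0;
}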

2012-11-30 23:09:16

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2] Do a proper locking for mmap and block size change

On Sat, Dec 01, 2012 at 09:40:41AM +1100, Dave Chinner wrote:
> So it was based on this interface?

Based on that. I had dropped the inode operation as it's not really
a generic operation but a callback for either the buffered I/O code
or direct I/O and should be treated as such. I've also split the
single multiplexer function into individual ones, but the underlying
data structure and fundamental operations are the same.

> (I went looking for this code on google a couple of days ago so I
> could point at it and say "we should be using an iomap structure,
> not buffer heads", but it looks like I never posted it to fsdevel or
> the xfs lists...)

Your version definitely was up on your kernel.org XFS tree; that's what
I started from.

I'll have a long plane ride tonight; let's see if I can get the direct
I/O version updated.