As discussed on linux-arch@ recently, these fixes are necessary for
XFS to work on virtually indexed architectures since otherwise vmalloc'd
pages are not properly flushed from the cache.
Thanks!
Kyle
The following changes since commit 8bea8672edfca7ec5f661cafb218f1205863b343:
Stephen Rothwell (1):
mfd: compile fix for twl4030 renaming
are available in the git repository at:
hera.kernel.org:/pub/scm/linux/kernel/git/kyle/parisc-2.6.git coherence
James Bottomley (6):
mm: add coherence API for DMA to vmalloc/vmap areas
parisc: add mm API for DMA to vmalloc/vmap areas
arm: add mm API for DMA to vmalloc/vmap areas
sh: add mm API for DMA to vmalloc/vmap areas
block: permit I/O to vmalloc/vmap kernel pages
xfs: fix xfs to work with Virtually Indexed architectures
Documentation/cachetlb.txt | 24 ++++++++++++++++++++++++
arch/arm/include/asm/cacheflush.h | 10 ++++++++++
arch/parisc/include/asm/cacheflush.h | 8 ++++++++
arch/sh/include/asm/cacheflush.h | 8 ++++++++
fs/bio.c | 20 ++++++++++++++++++--
fs/xfs/linux-2.6/xfs_buf.c | 20 ++++++++++++++++++++
include/linux/highmem.h | 6 ++++++
7 files changed, 94 insertions(+), 2 deletions(-)
Linus, any word on these? I'd really love to get jejb and hch off
my back. :)
cheers, Kyle
On Wed, Dec 16, 2009 at 04:36:18AM +0000, Kyle McMartin wrote:
> As discussed on linux-arch@ recently, these fixes are necessary for
> XFS to work on virtually indexed architectures since otherwise vmalloc'd
> pages are not properly flushed from the cache.
>
> Thanks!
> Kyle
>
> The following changes since commit 8bea8672edfca7ec5f661cafb218f1205863b343:
> Stephen Rothwell (1):
> mfd: compile fix for twl4030 renaming
>
> are available in the git repository at:
>
> hera.kernel.org:/pub/scm/linux/kernel/git/kyle/parisc-2.6.git coherence
>
> James Bottomley (6):
> mm: add coherence API for DMA to vmalloc/vmap areas
> parisc: add mm API for DMA to vmalloc/vmap areas
> arm: add mm API for DMA to vmalloc/vmap areas
> sh: add mm API for DMA to vmalloc/vmap areas
> block: permit I/O to vmalloc/vmap kernel pages
> xfs: fix xfs to work with Virtually Indexed architectures
>
> Documentation/cachetlb.txt | 24 ++++++++++++++++++++++++
> arch/arm/include/asm/cacheflush.h | 10 ++++++++++
> arch/parisc/include/asm/cacheflush.h | 8 ++++++++
> arch/sh/include/asm/cacheflush.h | 8 ++++++++
> fs/bio.c | 20 ++++++++++++++++++--
> fs/xfs/linux-2.6/xfs_buf.c | 20 ++++++++++++++++++++
> include/linux/highmem.h | 6 ++++++
> 7 files changed, 94 insertions(+), 2 deletions(-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On Thu, Dec 17, 2009 at 08:22:56AM -0500, Kyle McMartin wrote:
> Linus, any word on these? I'd really love to get jejb and hch off
> my back. :)
Yeah, users have been pretty impatiently waiting for this.
On Thu, 17 Dec 2009, Kyle McMartin wrote:
>
> Linus, any word on these? I'd really love to get jejb and hch off
> my back. :)
I hate them.
I don't see what the point of allowing kernel virtual addresses in bio's
is. It's wrong. The fact that XFS does that sh*t is an XFS issue. Handle
it there.
Fix XFS. Or convince me with some really good arguments, and make sure
that Jens signs off on the cr*p too.
In other words, the thing I object to isn't even the new flushing thing,
it's this idiocy:
- save off virtual address:
..
bio->bi_private = data;
..
- do vmalloc_to_page:
+ if (is_vmalloc_addr(data)) {
+ flush_kernel_dcache_addr(data);
+ page = vmalloc_to_page(data);
+ } else
+ page = virt_to_page(data);
WTF? Why the hell should the block layer support this kind of absolute
crap? When has "use random kernel virtual address" ever been a valid thing
to do for IO?
Why aren't you doing this before you submit the vmalloc range for IO?
So no. Not a way in hell do I pull this for 33. And preferably never.
Linus
On Thu, Dec 17, 2009 at 08:16:12AM -0800, Linus Torvalds wrote:
>
> I hate them.
>
> I don't see what the point of allowing kernel virtual addresses in bio's
> is. It's wrong. The fact that XFS does that sh*t is an XFS issue. Handle
> it there.
>
> Fix XFS. Or convince me with some really good arguments, and make sure
> that Jens signs off on the cr*p too.
I have a somewhat similar issue that comes up for ext4; at the moment
occasionaly we need to clone a buffer head buffer; either because the
first four bytes match the magic JBD "escape sequence", and we need to
modify the block and escape it before writing it to the journal, or
because we need to make a copy of a allocation bitmap block so we can
write a fixed copy to disk while we modify the "real" block during a
commit. At the moment we allocate a full page, even if that means
allocating a 16k PPC page when the file system block size is 4k, or
allocating a 4k x86 page when the file system block size is 1k.
That's because apparently the iSCSI and DMA blocks assume that they
have Real Pages (tm) passed to block I/O requests, and apparently XFS
ran into problems when sending vmalloc'ed pages. I don't know if this
is a problem if we pass the bio layer addresses coming from the SLAB
allocator, but oral tradition seems to indicate this is problematic,
although no one has given me the full chapter and verse explanation
about why this is so.
Now that I see Linus's complaint, I'm wondering if the issue is really
about kernel virtual addresses (i.e., coming from vmalloc), and not a
requirement for Real Pages (i.e., coming from the SLAB allocator as
opposed to get_free_page). And can this be documented someplace? I
tried looking at the bio documentation, and couldn't find anything
definitive on the subject.
Thanks,
- Ted
On Thu, 17 Dec 2009, [email protected] wrote:
>
> That's because apparently the iSCSI and DMA blocks assume that they
> have Real Pages (tm) passed to block I/O requests, and apparently XFS
> ran into problems when sending vmalloc'ed pages. I don't know if this
> is a problem if we pass the bio layer addresses coming from the SLAB
> allocator, but oral tradition seems to indicate this is problematic,
> although no one has given me the full chapter and verse explanation
> about why this is so.
kmalloc() memory should be ok. It's backed by "real pages". Doing the DMA
translations for such pages is trivial and fundamental.
In contrast, vmalloc is pure and utter unadulterated CRAP. The pages
may be contiguous virtually, but it makes no difference for the block
layer, that has to be able to do IO by DMA anyway, so it has to look up
the page translations in the page tables etc crazy sh*t.
So passing vmalloc'ed page addresses around to something that will
eventually do a non-CPU-virtual thing on them is fundamentally insane. The
vmalloc space is about CPU virtual addresses. Such concepts simpyl do not
-exist- for some random block device.
> Now that I see Linus's complaint, I'm wondering if the issue is really
> about kernel virtual addresses (i.e., coming from vmalloc), and not a
> requirement for Real Pages (i.e., coming from the SLAB allocator as
> opposed to get_free_page). And can this be documented someplace? I
> tried looking at the bio documentation, and couldn't find anything
> definitive on the subject.
The whole "vmalloc is special" has always been true. If you want to
treat vmalloc as normal memory, you need to look up the pages yourself. We
have helpers for that (including helpers that populate vmalloc space from
a page array to begin with - so you can _start_ from some array of pages
and then lay them out virtually if you want to have a convenient CPU
access to the array).
And this whole "vmalloc is about CPU virtual addresses" is so obviously
and fundamentally true that I don't understand how anybody can ever be
confused about it. The "v" in vmalloc is for "virtual" as in virtual
memory.
Think of it like virtual user addresses. Does anybody really expect to be
able to pass a random user address to the BIO layer?
And if you do, I would suggest that you get out of kernel programming
pronto. You're a danger to society, and have a lukewarm IQ. I don't want
you touching kernel code.
And no, I do _not_ want the BIO layer having to walk page tables. Not for
vmalloc space, not for user virtual addresses.
(And don't tell me it already does. Maybe somebody sneaked it in past me,
without me ever noticing. That wouldn't be an excuse, that would be just
sad. Jesus wept)
Linus
On Thu, Dec 17, 2009 at 08:46:33AM -0800, Linus Torvalds wrote:
> The whole "vmalloc is special" has always been true. If you want to
> treat vmalloc as normal memory, you need to look up the pages yourself. We
> have helpers for that (including helpers that populate vmalloc space from
> a page array to begin with - so you can _start_ from some array of pages
> and then lay them out virtually if you want to have a convenient CPU
> access to the array).
Which is exactly what the XFS code does. Pages are allocated manually
and we store pointers to the page struct that later get added to the
bio. But we access them through vmap (which I added exactly for this
reason back in 2002) for kernel accesses. On all architectures with
sane caches things just work, but for parisc, arm and friends that have
virtually indexed caches we need to make sure to flush caches for this
different access. The vmalloc linear address is not used for I/O
everywhere.
On Thu, Dec 17, 2009 at 11:30:36AM -0500, [email protected] wrote:
> That's because apparently the iSCSI and DMA blocks assume that they
> have Real Pages (tm) passed to block I/O requests, and apparently XFS
> ran into problems when sending vmalloc'ed pages. I don't know if this
> is a problem if we pass the bio layer addresses coming from the SLAB
> allocator, but oral tradition seems to indicate this is problematic,
> although no one has given me the full chapter and verse explanation
> about why this is so.
Actually at least iscsi now has a workaround for that by checking for
PageSlab. Back when we deal with the XFS issue that check was only
available with debug options enabled. I tried to sort it out by
agreeing with the block and iscsi folks that either
a) we need to send down refcountable pages
b) block drivers need to accept kmalloced pages
I could not get any afreement, and thus we stopped using the kmalloced
pages in XFS which was easy enough. A bit later people fixed iscsi,
but we still don't have formal rules about what is acceptable to the
block layer.
On Thu, Dec 17, 2009 at 12:10:25PM -0500, Christoph Hellwig wrote:
> On Thu, Dec 17, 2009 at 11:30:36AM -0500, [email protected] wrote:
> > That's because apparently the iSCSI and DMA blocks assume that they
> > have Real Pages (tm) passed to block I/O requests, and apparently XFS
> > ran into problems when sending vmalloc'ed pages. I don't know if this
> > is a problem if we pass the bio layer addresses coming from the SLAB
> > allocator, but oral tradition seems to indicate this is problematic,
> > although no one has given me the full chapter and verse explanation
> > about why this is so.
>
> Actually at least iscsi now has a workaround for that by checking for
> PageSlab. Back when we deal with the XFS issue that check was only
> available with debug options enabled. I tried to sort it out by
> agreeing with the block and iscsi folks that either
>
> a) we need to send down refcountable pages
> b) block drivers need to accept kmalloced pages
>
> I could not get any afreement, and thus we stopped using the kmalloced
> pages in XFS which was easy enough. A bit later people fixed iscsi,
> but we still don't have formal rules about what is acceptable to the
> block layer.
It would be good to get some formal rules articulated. Someone has
asserted that the AoE (ATA over Ethernet) driver will barf on
SLAB/kmalloc allocated memory. True, false?
- Ted
On Thu, Dec 17, 2009 at 08:46:33AM -0800, Linus Torvalds wrote:
> kmalloc() memory should be ok. It's backed by "real pages". Doing the DMA
> translations for such pages is trivial and fundamental.
Sure, but there's some rumors/oral traditions going around that some
block devices want bio address which are page aligned, because they
want to play some kind of refcounting game, and if you pass them a
kmalloc() memory, they will explode in some interesting and
entertaining way. And it's Weird Shit(tm) (aka iSCSI, AoE) type
drivers, that most of us don't have access to, so just because it
works Just Fine on SATA doesn't mean anything.
And none of this is documented anywhere, which is frustrating as hell.
Just rumors that "if you do this, AoE/iSCSI will corrupt your file
systems".
- Ted
On Thu, 17 Dec 2009, Christoph Hellwig wrote:
>
> On Thu, Dec 17, 2009 at 08:46:33AM -0800, Linus Torvalds wrote:
> > The whole "vmalloc is special" has always been true. If you want to
> > treat vmalloc as normal memory, you need to look up the pages yourself. We
> > have helpers for that (including helpers that populate vmalloc space from
> > a page array to begin with - so you can _start_ from some array of pages
> > and then lay them out virtually if you want to have a convenient CPU
> > access to the array).
>
> Which is exactly what the XFS code does. Pages are allocated manually
> and we store pointers to the page struct that later get added to the
> bio.
Hmm. The BIO interface that the patch-series changes (bio_map_kern)
doesn't work that way. It takes a "buf, len" kind of thing. That's what
I'm complaining about.
> But we access them through vmap (which I added exactly for this
> reason back in 2002) for kernel accesses. On all architectures with
> sane caches things just work, but for parisc, arm and friends that have
> virtually indexed caches we need to make sure to flush caches for this
> different access. The vmalloc linear address is not used for I/O
> everywhere.
Well, they clearly are _after_ this series, since that's what all those
changes to __bio_map_kernel() and bio_map_kern_endio() are all about.
So I believe you when you say that XFS perhaps does everything right - I
just think that the patch series in question actually makes things worse,
exactly because it is starting to use virtual addresses.
I also think that the changes to bio_map_kernel() and bio_map_kern_endio()
are not just "fundamentally ugly", I think they are made worse by the fact
that it's not even done "right". You both flush the virtual caches before
the IO and invalidate after - when the real pattern should be that you
flush it before a write, and invalidate it after a read.
And I really think that would be all much more properly done at the
_caller_ level, not by the BIO layer.
You must have some locking and allocation etc logic at the caller anyway,
why doesn't _that_ level just do the flushing or invalidation?
I get the feeling that somebody decided that the whole "do DMA to/from
vmalloc space" was somehow a common generic pattern that should be
supported in general, and I violently disagree. Maybe XFS has good reasons
for doing it, but that does emphatically _not_ make it a good idea in
general, and that does _not_ mean that the BIO layer should make it easy
to do for other users and have a general interface for that kind of
crazyness.
IOW, I'm perfectly happy with the patch to fs/xfs/linux-2.6/xfs_buf.c.
That one still seems to use 'bio_add_page()' with a regular 'struct page'.
But the fs/bio.c patch looks like just total and utter crap to me, and is
the reason I refuse to pull this series.
Linus
On Thu, 17 Dec 2009, [email protected] wrote:
>
> Sure, but there's some rumors/oral traditions going around that some
> block devices want bio address which are page aligned, because they
> want to play some kind of refcounting game,
Yeah, you might be right at that.
> And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> don't have access to, so just because it works Just Fine on SATA doesn't
> mean anything.
>
> And none of this is documented anywhere, which is frustrating as hell.
> Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> systems".
ACK. Jens?
Linus
On Thu, Dec 17, 2009 at 09:42:15AM -0800, Linus Torvalds wrote:
> > Which is exactly what the XFS code does. Pages are allocated manually
> > and we store pointers to the page struct that later get added to the
> > bio.
>
> Hmm. The BIO interface that the patch-series changes (bio_map_kern)
> doesn't work that way. It takes a "buf, len" kind of thing. That's what
> I'm complaining about.
Indeed, the "block: permit I/O to vmalloc/vmap kernel pages" does what
you complain about. But the series doesn't actually add a user for
that. What it does in XFS is quite a bit of black magic, too - but only
with the new cache coherence calls that are noops on architectures with
physically indexed caches.
> Well, they clearly are _after_ this series, since that's what all those
> changes to __bio_map_kernel() and bio_map_kern_endio() are all about.
>
> So I believe you when you say that XFS perhaps does everything right - I
> just think that the patch series in question actually makes things worse,
> exactly because it is starting to use virtual addresses.
I'm not entirely sure why James added those, but XFS doesn't actually
use bio_map_kern.
> And I really think that would be all much more properly done at the
> _caller_ level, not by the BIO layer.
>
> You must have some locking and allocation etc logic at the caller anyway,
> why doesn't _that_ level just do the flushing or invalidation?
http://git.kernel.org/?p=linux/kernel/git/kyle/parisc-2.6.git;a=commitdiff;h=56c8214b842324e94aa88012010b0f1f9847daec
does it in the caller level. Not exactly in a beautiful way, but who
am I complain as I'm already lost in our mess of cache coherency APIs.
> IOW, I'm perfectly happy with the patch to fs/xfs/linux-2.6/xfs_buf.c.
> That one still seems to use 'bio_add_page()' with a regular 'struct page'.
>
> But the fs/bio.c patch looks like just total and utter crap to me, and is
> the reason I refuse to pull this series.
Kyle/James, can you regenerate the tree without that patch included?
On Thu, Dec 17, 2009 at 09:42:15AM -0800, Linus Torvalds wrote:
> You both flush the virtual caches before
> the IO and invalidate after - when the real pattern should be that you
> flush it before a write, and invalidate it after a read.
That's not entirely true. If you have write back caches which are not DMA
coherent, you need to as a minimum:
- on write, clean the cache to ensure that the page is up to date with data
held in cache.
- on read, you must ensure that there are no potential write-backs before
the read commenses and invalidate at some point.
The point at which you invalidate depends on whether the CPU speculatively
prefetches:
- If it doesn't, you can invalidate the cache before the read, thereby
destroying any potential writebacks, and the cache will remain
unallocated for that address range until explicitly accessed.
- If you do have a CPU which does prefetch speculatively, then you do
need to clean the cache before DMA starts, and then you must invalidate
after the DMA completes.
Invalidating after DMA completes for the non-speculative prefetch just
wastes performance, especially if you have to do so line by line over
a region.
With ARM architecture version 7, we now have ARM CPUs which fall into
both categories.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
On Thu, 17 Dec 2009, Russell King wrote:
>
> That's not entirely true. If you have write back caches which are not DMA
> coherent, you need to as a minimum:
>
> - on write, clean the cache to ensure that the page is up to date with data
> held in cache.
>
> - on read, you must ensure that there are no potential write-backs before
> the read commenses and invalidate at some point.
Right you are.
Linus
Christoph Hellwig <[email protected]> writes:
> On Thu, Dec 17, 2009 at 11:30:36AM -0500, [email protected] wrote:
>> That's because apparently the iSCSI and DMA blocks assume that they
>> have Real Pages (tm) passed to block I/O requests, and apparently XFS
>> ran into problems when sending vmalloc'ed pages. I don't know if this
>> is a problem if we pass the bio layer addresses coming from the SLAB
>> allocator, but oral tradition seems to indicate this is problematic,
>> although no one has given me the full chapter and verse explanation
>> about why this is so.
>
> Actually at least iscsi now has a workaround for that by checking for
> PageSlab. Back when we deal with the XFS issue that check was only
> available with debug options enabled. I tried to sort it out by
> agreeing with the block and iscsi folks that either
DRBD has the same workaround as well:
if (disable_sendpage || (page_count(page) < 1) || PageSlab(page))
But it seems like a gross hack to me. Perhaps this should be
passed as some sort of BIO attribute?
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Dec 17 2009, Linus Torvalds wrote:
>
>
> On Thu, 17 Dec 2009, [email protected] wrote:
> >
> > Sure, but there's some rumors/oral traditions going around that some
> > block devices want bio address which are page aligned, because they
> > want to play some kind of refcounting game,
>
> Yeah, you might be right at that.
>
> > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> > don't have access to, so just because it works Just Fine on SATA doesn't
> > mean anything.
> >
> > And none of this is documented anywhere, which is frustrating as hell.
> > Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> > systems".
>
> ACK. Jens?
I've heard those rumours too, and I don't even know if they are true.
Who has a pointer to such a bug report and/or issue? The block layer
itself doesn't not have any such requirements, and the only places where
we play page games is for bio's that were explicitly mapped with pages
by itself (like mapping user data).o
We fix driver crap like that, we don't work around it. It's a BUG.
--
Jens Axboe
On Thu, 2009-12-17 at 20:36 +0100, Jens Axboe wrote:
> On Thu, Dec 17 2009, Linus Torvalds wrote:
> >
> >
> > On Thu, 17 Dec 2009, [email protected] wrote:
> > >
> > > Sure, but there's some rumors/oral traditions going around that some
> > > block devices want bio address which are page aligned, because they
> > > want to play some kind of refcounting game,
> >
> > Yeah, you might be right at that.
> >
> > > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> > > don't have access to, so just because it works Just Fine on SATA doesn't
> > > mean anything.
> > >
> > > And none of this is documented anywhere, which is frustrating as hell.
> > > Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> > > systems".
> >
> > ACK. Jens?
>
> I've heard those rumours too, and I don't even know if they are true.
> Who has a pointer to such a bug report and/or issue? The block layer
> itself doesn't not have any such requirements, and the only places where
> we play page games is for bio's that were explicitly mapped with pages
> by itself (like mapping user data).o
OK, so what happened is that prior to the map single fix
commit df46b9a44ceb5af2ea2351ce8e28ae7bd840b00f
Author: Mike Christie <[email protected]>
Date: Mon Jun 20 14:04:44 2005 +0200
[PATCH] Add blk_rq_map_kern()
bio could only accept user space buffers, so we had a special path for
kernel allocated buffers. That commit unified the path (with a separate
block API) so we could now submit kmalloc'd buffers via block APIs.
So the rule now is we can accept any user mapped area via
blk_rq_map_user and any kmalloc'd area via blk_rq_map_kern(). We might
not be able to do a stack area (depending on how the arch maps the
stack) and we definitely cannot do a vmalloc'd area.
So it sounds like we only need a blk_rq_map_vmalloc() using the same
techniques as the patch set and we're good to go.
> We fix driver crap like that, we don't work around it. It's a BUG.
James
On Thu, 17 Dec 2009 12:39:57 -0500
[email protected] wrote:
> On Thu, Dec 17, 2009 at 08:46:33AM -0800, Linus Torvalds wrote:
> > kmalloc() memory should be ok. It's backed by "real pages". Doing the DMA
> > translations for such pages is trivial and fundamental.
>
> Sure, but there's some rumors/oral traditions going around that some
> block devices want bio address which are page aligned, because they
> want to play some kind of refcounting game, and if you pass them a
> kmalloc() memory, they will explode in some interesting and
> entertaining way. And it's Weird Shit(tm) (aka iSCSI, AoE) type
> drivers, that most of us don't have access to, so just because it
> works Just Fine on SATA doesn't mean anything.
iSCSI initiator driver should work with kmalloc'ed memory.
The reason why iSCSI didn't work with kmalloc'ed memory is that it
uses sendpage (which needs refcountable pages). We added a workaround
to not use sendpage with kmalloc'ed memory (it would be great if we
remove the workaround though).
On Fri, 18 Dec 2009 00:57:00 +0100
James Bottomley <[email protected]> wrote:
> On Thu, 2009-12-17 at 20:36 +0100, Jens Axboe wrote:
> > On Thu, Dec 17 2009, Linus Torvalds wrote:
> > >
> > >
> > > On Thu, 17 Dec 2009, [email protected] wrote:
> > > >
> > > > Sure, but there's some rumors/oral traditions going around that some
> > > > block devices want bio address which are page aligned, because they
> > > > want to play some kind of refcounting game,
> > >
> > > Yeah, you might be right at that.
> > >
> > > > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> > > > don't have access to, so just because it works Just Fine on SATA doesn't
> > > > mean anything.
> > > >
> > > > And none of this is documented anywhere, which is frustrating as hell.
> > > > Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> > > > systems".
> > >
> > > ACK. Jens?
> >
> > I've heard those rumours too, and I don't even know if they are true.
> > Who has a pointer to such a bug report and/or issue? The block layer
> > itself doesn't not have any such requirements, and the only places where
> > we play page games is for bio's that were explicitly mapped with pages
> > by itself (like mapping user data).o
>
> OK, so what happened is that prior to the map single fix
>
> commit df46b9a44ceb5af2ea2351ce8e28ae7bd840b00f
> Author: Mike Christie <[email protected]>
> Date: Mon Jun 20 14:04:44 2005 +0200
>
> [PATCH] Add blk_rq_map_kern()
>
>
> bio could only accept user space buffers, so we had a special path for
> kernel allocated buffers. That commit unified the path (with a separate
> block API) so we could now submit kmalloc'd buffers via block APIs.
>
> So the rule now is we can accept any user mapped area via
> blk_rq_map_user and any kmalloc'd area via blk_rq_map_kern(). We might
> not be able to do a stack area (depending on how the arch maps the
> stack) and we definitely cannot do a vmalloc'd area.
>
> So it sounds like we only need a blk_rq_map_vmalloc() using the same
> techniques as the patch set and we're good to go.
I'm not sure about it.
As I said before (when I was against this 'adding vmalloc support to
the block layer' stuff), are there potential users of this except for
XFS? Are there anyone who does such a thing now?
This API might be useful for only journaling file systems using log
formats that need large contiguous buffer. Sound like only XFS?
Even if we have some potential users, I'm not sure that supporting
vmalloc in the block layer officially is a good idea. IMO, it needs
too many tricks for generic code.
On Fri, Dec 18, 2009 at 10:00:21AM +0900, FUJITA Tomonori wrote:
> On Fri, 18 Dec 2009 00:57:00 +0100
> James Bottomley <[email protected]> wrote:
>
> > On Thu, 2009-12-17 at 20:36 +0100, Jens Axboe wrote:
> > > On Thu, Dec 17 2009, Linus Torvalds wrote:
> > > >
> > > >
> > > > On Thu, 17 Dec 2009, [email protected] wrote:
> > > > >
> > > > > Sure, but there's some rumors/oral traditions going around that some
> > > > > block devices want bio address which are page aligned, because they
> > > > > want to play some kind of refcounting game,
> > > >
> > > > Yeah, you might be right at that.
> > > >
> > > > > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> > > > > don't have access to, so just because it works Just Fine on SATA doesn't
> > > > > mean anything.
> > > > >
> > > > > And none of this is documented anywhere, which is frustrating as hell.
> > > > > Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> > > > > systems".
> > > >
> > > > ACK. Jens?
> > >
> > > I've heard those rumours too, and I don't even know if they are true.
> > > Who has a pointer to such a bug report and/or issue? The block layer
> > > itself doesn't not have any such requirements, and the only places where
> > > we play page games is for bio's that were explicitly mapped with pages
> > > by itself (like mapping user data).o
> >
> > OK, so what happened is that prior to the map single fix
> >
> > commit df46b9a44ceb5af2ea2351ce8e28ae7bd840b00f
> > Author: Mike Christie <[email protected]>
> > Date: Mon Jun 20 14:04:44 2005 +0200
> >
> > [PATCH] Add blk_rq_map_kern()
> >
> >
> > bio could only accept user space buffers, so we had a special path for
> > kernel allocated buffers. That commit unified the path (with a separate
> > block API) so we could now submit kmalloc'd buffers via block APIs.
> >
> > So the rule now is we can accept any user mapped area via
> > blk_rq_map_user and any kmalloc'd area via blk_rq_map_kern(). We might
> > not be able to do a stack area (depending on how the arch maps the
> > stack) and we definitely cannot do a vmalloc'd area.
> >
> > So it sounds like we only need a blk_rq_map_vmalloc() using the same
> > techniques as the patch set and we're good to go.
>
> I'm not sure about it.
>
> As I said before (when I was against this 'adding vmalloc support to
> the block layer' stuff), are there potential users of this except for
> XFS? Are there anyone who does such a thing now?
As Christoph already mentioned, XFS is not passing the vmalloc'd
range to the block layer - it passes the underlying pages to the
block layer. Hence I'm not sure there actually is anyone who is
passing vmalloc'd addresses to the block layer. Perhaps we should
put a WARN_ON() in the block layer to catch anyone doing such a
thing before considering supporting vmalloc'd addresses in the block
layer?
> This API might be useful for only journaling file systems using log
> formats that need large contiguous buffer. Sound like only XFS?
FWIW, mapped buffers larger than PAGE_SIZE are used for more than just log
recovery in XFS. e.g. filesystems with directory block size larger
than page size uses mapped buffers.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, 18 Dec 2009 13:44:40 +1100
Dave Chinner <[email protected]> wrote:
> > > So it sounds like we only need a blk_rq_map_vmalloc() using the same
> > > techniques as the patch set and we're good to go.
> >
> > I'm not sure about it.
> >
> > As I said before (when I was against this 'adding vmalloc support to
> > the block layer' stuff), are there potential users of this except for
> > XFS? Are there anyone who does such a thing now?
>
> As Christoph already mentioned, XFS is not passing the vmalloc'd
> range to the block layer
Oops, I should have said a vmalloc/vmap area.
> > This API might be useful for only journaling file systems using log
> > formats that need large contiguous buffer. Sound like only XFS?
>
> FWIW, mapped buffers larger than PAGE_SIZE are used for more than just log
> recovery in XFS. e.g. filesystems with directory block size larger
> than page size uses mapped buffers.
I see, thanks.
On Fri, 2009-12-18 at 10:00 +0900, FUJITA Tomonori wrote:
> On Fri, 18 Dec 2009 00:57:00 +0100
> James Bottomley <[email protected]> wrote:
>
> > On Thu, 2009-12-17 at 20:36 +0100, Jens Axboe wrote:
> > > On Thu, Dec 17 2009, Linus Torvalds wrote:
> > > >
> > > >
> > > > On Thu, 17 Dec 2009, [email protected] wrote:
> > > > >
> > > > > Sure, but there's some rumors/oral traditions going around that some
> > > > > block devices want bio address which are page aligned, because they
> > > > > want to play some kind of refcounting game,
> > > >
> > > > Yeah, you might be right at that.
> > > >
> > > > > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> > > > > don't have access to, so just because it works Just Fine on SATA doesn't
> > > > > mean anything.
> > > > >
> > > > > And none of this is documented anywhere, which is frustrating as hell.
> > > > > Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> > > > > systems".
> > > >
> > > > ACK. Jens?
> > >
> > > I've heard those rumours too, and I don't even know if they are true.
> > > Who has a pointer to such a bug report and/or issue? The block layer
> > > itself doesn't not have any such requirements, and the only places where
> > > we play page games is for bio's that were explicitly mapped with pages
> > > by itself (like mapping user data).o
> >
> > OK, so what happened is that prior to the map single fix
> >
> > commit df46b9a44ceb5af2ea2351ce8e28ae7bd840b00f
> > Author: Mike Christie <[email protected]>
> > Date: Mon Jun 20 14:04:44 2005 +0200
> >
> > [PATCH] Add blk_rq_map_kern()
> >
> >
> > bio could only accept user space buffers, so we had a special path for
> > kernel allocated buffers. That commit unified the path (with a separate
> > block API) so we could now submit kmalloc'd buffers via block APIs.
> >
> > So the rule now is we can accept any user mapped area via
> > blk_rq_map_user and any kmalloc'd area via blk_rq_map_kern(). We might
> > not be able to do a stack area (depending on how the arch maps the
> > stack) and we definitely cannot do a vmalloc'd area.
> >
> > So it sounds like we only need a blk_rq_map_vmalloc() using the same
> > techniques as the patch set and we're good to go.
>
> I'm not sure about it.
>
> As I said before (when I was against this 'adding vmalloc support to
> the block layer' stuff), are there potential users of this except for
> XFS? Are there anyone who does such a thing now?
>
> This API might be useful for only journaling file systems using log
> formats that need large contiguous buffer. Sound like only XFS?
>
> Even if we have some potential users, I'm not sure that supporting
> vmalloc in the block layer officially is a good idea. IMO, it needs
> too many tricks for generic code.
So far, there's only xfs that I know of.
Given the way journalling works, it's not an unusual requirement to use
a large buffer for operations. It's a bit of a waste of kernel
resources to have this physically contiguous, but it is a waste of
resources (and for buffers over our kmalloc max, it would even have to
be allocated at start of day), so I think large kernel virtual areas
(like vmap/vmalloc) have a part to play in fs operations.
As to the API, the specific problem is that the block and lower arch
layers are specifically programmed to think any kernel address has only
a single alias and that it's offset mapped, which is why we get the
failure.
An alternative proposal to modifying the block layer to do coherency,
might be simply to have the fs layer do a vunmap before doing I/O and
re-vmap when it's completed. That would ensure the architecturally
correct flushing of the aliases, and would satisfy the expectations of
blk_rq_map_kern(). The down side is that vmap/vmalloc set up and clear
page tables, which isn't necessary and might impact performance (xfs
people?)
If the performance impact of the above is too great, then we might
introduce a vmalloc sync API to do the flush before and the invalidate
after (would have to be called twice, once before I/O and once after).
This is sort of a violation of our architectural knowledge layering,
since the user of a vmap/vmalloc area has to know intrinsically how to
handle I/O instead of having it just work(tm), but since the users are
few and specialised, it's not going to lead to too many coding problems.
Any opinions?
James
On Fri, 2009-12-18 at 13:44 +1100, Dave Chinner wrote:
> On Fri, Dec 18, 2009 at 10:00:21AM +0900, FUJITA Tomonori wrote:
> > On Fri, 18 Dec 2009 00:57:00 +0100
> > James Bottomley <[email protected]> wrote:
> >
> > > On Thu, 2009-12-17 at 20:36 +0100, Jens Axboe wrote:
> > > > On Thu, Dec 17 2009, Linus Torvalds wrote:
> > > > >
> > > > >
> > > > > On Thu, 17 Dec 2009, [email protected] wrote:
> > > > > >
> > > > > > Sure, but there's some rumors/oral traditions going around that some
> > > > > > block devices want bio address which are page aligned, because they
> > > > > > want to play some kind of refcounting game,
> > > > >
> > > > > Yeah, you might be right at that.
> > > > >
> > > > > > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers, that most of us
> > > > > > don't have access to, so just because it works Just Fine on SATA doesn't
> > > > > > mean anything.
> > > > > >
> > > > > > And none of this is documented anywhere, which is frustrating as hell.
> > > > > > Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> > > > > > systems".
> > > > >
> > > > > ACK. Jens?
> > > >
> > > > I've heard those rumours too, and I don't even know if they are true.
> > > > Who has a pointer to such a bug report and/or issue? The block layer
> > > > itself doesn't not have any such requirements, and the only places where
> > > > we play page games is for bio's that were explicitly mapped with pages
> > > > by itself (like mapping user data).o
> > >
> > > OK, so what happened is that prior to the map single fix
> > >
> > > commit df46b9a44ceb5af2ea2351ce8e28ae7bd840b00f
> > > Author: Mike Christie <[email protected]>
> > > Date: Mon Jun 20 14:04:44 2005 +0200
> > >
> > > [PATCH] Add blk_rq_map_kern()
> > >
> > >
> > > bio could only accept user space buffers, so we had a special path for
> > > kernel allocated buffers. That commit unified the path (with a separate
> > > block API) so we could now submit kmalloc'd buffers via block APIs.
> > >
> > > So the rule now is we can accept any user mapped area via
> > > blk_rq_map_user and any kmalloc'd area via blk_rq_map_kern(). We might
> > > not be able to do a stack area (depending on how the arch maps the
> > > stack) and we definitely cannot do a vmalloc'd area.
> > >
> > > So it sounds like we only need a blk_rq_map_vmalloc() using the same
> > > techniques as the patch set and we're good to go.
> >
> > I'm not sure about it.
> >
> > As I said before (when I was against this 'adding vmalloc support to
> > the block layer' stuff), are there potential users of this except for
> > XFS? Are there anyone who does such a thing now?
>
> As Christoph already mentioned, XFS is not passing the vmalloc'd
> range to the block layer - it passes the underlying pages to the
> block layer. Hence I'm not sure there actually is anyone who is
> passing vmalloc'd addresses to the block layer. Perhaps we should
> put a WARN_ON() in the block layer to catch anyone doing such a
> thing before considering supporting vmalloc'd addresses in the block
> layer?
vmalloc is just an alias for vmap/vmalloc in the above statements
(basically anything with an additional kernel virtual mapping which
causes aliases). If we support vmap, we naturally support vmalloc as
well.
> > This API might be useful for only journaling file systems using log
> > formats that need large contiguous buffer. Sound like only XFS?
>
> FWIW, mapped buffers larger than PAGE_SIZE are used for more than just log
> recovery in XFS. e.g. filesystems with directory block size larger
> than page size uses mapped buffers.
However, XFS is the only fs that actually uses kernel virtual mapping to
solve this problem.
James
On Fri, 18 Dec 2009 08:08:48 +0100
James Bottomley <[email protected]> wrote:
> > Even if we have some potential users, I'm not sure that supporting
> > vmalloc in the block layer officially is a good idea. IMO, it needs
> > too many tricks for generic code.
>
> So far, there's only xfs that I know of.
>
> Given the way journalling works, it's not an unusual requirement to use
> a large buffer for operations. It's a bit of a waste of kernel
> resources to have this physically contiguous, but it is a waste of
> resources (and for buffers over our kmalloc max, it would even have to
> be allocated at start of day), so I think large kernel virtual areas
> (like vmap/vmalloc) have a part to play in fs operations.
Yeah, but now only XFS passes vmap'ed pages to the block layer. Isn't
it better to wait until we have real users of the API?
> As to the API, the specific problem is that the block and lower arch
> layers are specifically programmed to think any kernel address has only
> a single alias and that it's offset mapped, which is why we get the
> failure.
Yeah, however we can make a rule that you can't pass a vmap area
(including vmap'ed pages) to the block layer. We can't make the rule
effective for the past so XFS is the only exception.
> An alternative proposal to modifying the block layer to do coherency,
> might be simply to have the fs layer do a vunmap before doing I/O and
> re-vmap when it's completed.
I'm not sure it's worth making the whole block layer compatible to
vmap (adding complexity and possibly performance penalty).
If we really need to support this, I like helper APIs that the callers
must use before and after I/Os.
> That would ensure the architecturally
> correct flushing of the aliases, and would satisfy the expectations of
> blk_rq_map_kern(). The down side is that vmap/vmalloc set up and clear
> page tables, which isn't necessary and might impact performance (xfs
> people?)
btw, I'm not sure that the existing blk_rq_map_* API isn't fit well to
file systems since blk_rq_map_user and blk_rq_map_kern takes a request
structure.
On Fri, 2009-12-18 at 18:34 +0900, FUJITA Tomonori wrote:
> On Fri, 18 Dec 2009 08:08:48 +0100
> James Bottomley <[email protected]> wrote:
>
> > > Even if we have some potential users, I'm not sure that supporting
> > > vmalloc in the block layer officially is a good idea. IMO, it needs
> > > too many tricks for generic code.
> >
> > So far, there's only xfs that I know of.
> >
> > Given the way journalling works, it's not an unusual requirement to use
> > a large buffer for operations. It's a bit of a waste of kernel
> > resources to have this physically contiguous, but it is a waste of
> > resources (and for buffers over our kmalloc max, it would even have to
> > be allocated at start of day), so I think large kernel virtual areas
> > (like vmap/vmalloc) have a part to play in fs operations.
>
> Yeah, but now only XFS passes vmap'ed pages to the block layer. Isn't
> it better to wait until we have real users of the API?
XFS is a real user ... the XFS filesystem is our most trusted code base
that can break the 8TB limit, which hard disks are already at. Ext4 may
be ready, but it's not universally present in enterprise distros like
XFS.
> > As to the API, the specific problem is that the block and lower arch
> > layers are specifically programmed to think any kernel address has only
> > a single alias and that it's offset mapped, which is why we get the
> > failure.
>
> Yeah, however we can make a rule that you can't pass a vmap area
> (including vmap'ed pages) to the block layer. We can't make the rule
> effective for the past so XFS is the only exception.
We need something that works for XFS. The next proposal works for the
current block API because the vunmap makes the xfs pages look like
standard kernel pages, which blk_rq_map_kern() will process correctly.
But, in principle, I think whatever fix is chosen, we shouldn't
necessarily discourage others from using it.
> > An alternative proposal to modifying the block layer to do coherency,
> > might be simply to have the fs layer do a vunmap before doing I/O and
> > re-vmap when it's completed.
>
> I'm not sure it's worth making the whole block layer compatible to
> vmap (adding complexity and possibly performance penalty).
This proposal has no block layer changes. It just makes the XFS vmap
area look like a standard set of kernel pages ... with the overhead of
the page table manipulations on unmap and remap.
> If we really need to support this, I like helper APIs that the callers
> must use before and after I/Os.
If it's just this route, they already exist ... they're vmap and vunmap.
> > That would ensure the architecturally
> > correct flushing of the aliases, and would satisfy the expectations of
> > blk_rq_map_kern(). The down side is that vmap/vmalloc set up and clear
> > page tables, which isn't necessary and might impact performance (xfs
> > people?)
>
> btw, I'm not sure that the existing blk_rq_map_* API isn't fit well to
> file systems since blk_rq_map_user and blk_rq_map_kern takes a request
> structure.
OK, so that was illustrative. The meat of the change is at the bio
layer anyway (fss tend to speak bios). But the point of *this*
particular proposal is that it requires no changes either in the blk_ or
bio_ routines.
James
On Fri, 18 Dec 2009 11:01:29 +0100
James Bottomley <[email protected]> wrote:
> > Yeah, but now only XFS passes vmap'ed pages to the block layer. Isn't
> > it better to wait until we have real users of the API?
>
> XFS is a real user ... the XFS filesystem is our most trusted code base
> that can break the 8TB limit, which hard disks are already at. Ext4 may
> be ready, but it's not universally present in enterprise distros like
> XFS.
XFS already has the own code to handle that, which works fine (with
your patchset except for 5/6 for the block layer). Not much motivation
for XFS to move to the generic API?
> > > That would ensure the architecturally
> > > correct flushing of the aliases, and would satisfy the expectations of
> > > blk_rq_map_kern(). The down side is that vmap/vmalloc set up and clear
> > > page tables, which isn't necessary and might impact performance (xfs
> > > people?)
> >
> > btw, I'm not sure that the existing blk_rq_map_* API isn't fit well to
> > file systems since blk_rq_map_user and blk_rq_map_kern takes a request
> > structure.
>
> OK, so that was illustrative. The meat of the change is at the bio
> layer anyway (fss tend to speak bios).
Yeah, I think so, it's up to Jens to add new APIs for vmap there.
On Fri, 2009-12-18 at 19:24 +0900, FUJITA Tomonori wrote:
> On Fri, 18 Dec 2009 11:01:29 +0100
> James Bottomley <[email protected]> wrote:
>
> > > Yeah, but now only XFS passes vmap'ed pages to the block layer. Isn't
> > > it better to wait until we have real users of the API?
> >
> > XFS is a real user ... the XFS filesystem is our most trusted code base
> > that can break the 8TB limit, which hard disks are already at. Ext4 may
> > be ready, but it's not universally present in enterprise distros like
> > XFS.
>
> XFS already has the own code to handle that, which works fine (with
> your patchset except for 5/6 for the block layer). Not much motivation
> for XFS to move to the generic API?
Right, but it's for completeness. If we decide to allow vmap buffers,
then only supporting them on certain paths is a recipe for confusion in
a year's time when someone assumes we support vmap buffers on all block
paths; a bit like the current confusion over what we support ....
> > > > That would ensure the architecturally
> > > > correct flushing of the aliases, and would satisfy the expectations of
> > > > blk_rq_map_kern(). The down side is that vmap/vmalloc set up and clear
> > > > page tables, which isn't necessary and might impact performance (xfs
> > > > people?)
> > >
> > > btw, I'm not sure that the existing blk_rq_map_* API isn't fit well to
> > > file systems since blk_rq_map_user and blk_rq_map_kern takes a request
> > > structure.
> >
> > OK, so that was illustrative. The meat of the change is at the bio
> > layer anyway (fss tend to speak bios).
>
> Yeah, I think so, it's up to Jens to add new APIs for vmap there.
Agreed.
James
On Fri, Dec 18, 2009 at 08:08:48AM +0100, James Bottomley wrote:
> On Fri, 2009-12-18 at 10:00 +0900, FUJITA Tomonori wrote:
> An alternative proposal to modifying the block layer to do coherency,
> might be simply to have the fs layer do a vunmap before doing I/O and
> re-vmap when it's completed.
I don't think that works for XFS as it stands. Unmapping across
IO would have to guarantee we remap the buffer at the same
address after IO because we currently assume that mapped address
of the buffer can't change while references are held on the buffer.
That seems like an awful lot of complexity compared to the few lines
of code in XFS and the arches needed to support this.
> That would ensure the architecturally
> correct flushing of the aliases, and would satisfy the expectations of
> blk_rq_map_kern(). The down side is that vmap/vmalloc set up and clear
> page tables, which isn't necessary and might impact performance (xfs
> people?)
We play lots of tricks in XFS to avoid mapping buffers when we can
because of the performance impact it has. Nick Piggin's recent
work getting vmap to scale helped a lot, but it is still best to
avoid mapped buffers where possible.
> If the performance impact of the above is too great, then we might
> introduce a vmalloc sync API to do the flush before and the invalidate
> after (would have to be called twice, once before I/O and once after).
> This is sort of a violation of our architectural knowledge layering,
> since the user of a vmap/vmalloc area has to know intrinsically how to
> handle I/O instead of having it just work(tm), but since the users are
> few and specialised, it's not going to lead to too many coding problems.
>
> Any opinions?
Personally I see nothing wrong with the original patch series. If
the block layer mods are contentious, then just drop that patch
until a real need is brought to life.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Dec 18, 2009 at 09:21:30AM +0900, FUJITA Tomonori wrote:
>
> iSCSI initiator driver should work with kmalloc'ed memory.
>
> The reason why iSCSI didn't work with kmalloc'ed memory is that it
> uses sendpage (which needs refcountable pages). We added a workaround
> to not use sendpage with kmalloc'ed memory (it would be great if we
> remove the workaround though).
Well, with a patch that I plan to be pushing that we have general
agreement that it is a block device driver BUG not to accept
kmalloc'ed/SLAB allocated memory, is one where ext4 will use
kmalloc'ed/slab allocated memory on occasion when it needs to make
shadow copy of buffers for journalling purposes AND when the fs block
size is smaller than the page size. (i.e., no more allocating a 16k
page when the fs block size is 4k). So this won't happen all the
time; even if the case of a 16k Itanium system with 4k blocks, the
bulk of the data won't be sent via kmalloc'ed memory --- just some
critical metadata block and some data blocks that need to be escaped
when being written into the journal.
I do think we need to document that block device drivers are
_expected_ to be able to handle kmalloc'ed memory, and if they can't,
#1 they should do a BUG_ON instead of corrupting user's data, and #2,
if they do corrupt data, we should send the angry users with corrupted
file systems to bang at the doors of the block device authors.
- Ted
On Thu, Dec 17, 2009 at 09:42:15AM -0800, Linus Torvalds wrote:
>
> I also think that the changes to bio_map_kernel() and bio_map_kern_endio()
> are not just "fundamentally ugly", I think they are made worse by the fact
> that it's not even done "right". You both flush the virtual caches before
> the IO and invalidate after - when the real pattern should be that you
> flush it before a write, and invalidate it after a read.
>
> And I really think that would be all much more properly done at the
> _caller_ level, not by the BIO layer.
>
> You must have some locking and allocation etc logic at the caller anyway,
> why doesn't _that_ level just do the flushing or invalidation?
And then there are certain types of caches that need invalidation before
_and_ after a DMA transaction as a workaround for a processor being
grossly abused in a system that it should not be used in. Basically the
issue is that falsly speculated stores may dirty caches.
Ralf
On Fri, 18 Dec 2009 09:17:32 -0500
[email protected] wrote:
> On Fri, Dec 18, 2009 at 09:21:30AM +0900, FUJITA Tomonori wrote:
> >
> > iSCSI initiator driver should work with kmalloc'ed memory.
> >
> > The reason why iSCSI didn't work with kmalloc'ed memory is that it
> > uses sendpage (which needs refcountable pages). We added a workaround
> > to not use sendpage with kmalloc'ed memory (it would be great if we
> > remove the workaround though).
>
> Well, with a patch that I plan to be pushing that we have general
> agreement that it is a block device driver BUG not to accept
> kmalloc'ed/SLAB allocated memory, is one where ext4 will use
> kmalloc'ed/slab allocated memory on occasion when it needs to make
> shadow copy of buffers for journalling purposes AND when the fs block
> size is smaller than the page size. (i.e., no more allocating a 16k
> page when the fs block size is 4k). So this won't happen all the
> time; even if the case of a 16k Itanium system with 4k blocks, the
> bulk of the data won't be sent via kmalloc'ed memory --- just some
> critical metadata block and some data blocks that need to be escaped
> when being written into the journal.
Actually, ext3 (jbd) sent kmalloc'ed buffer to the block layer for
frozen data. xfs also used kmalloc'ed buffer. Neither do now (so, as
you said above, jbd wastes some memory when the block size is not
equal to page size, I think).
> I do think we need to document that block device drivers are
> _expected_ to be able to handle kmalloc'ed memory,
Agreed.
Note that network block drivers (iSCSI, drbd, something else?) doesn't
play with page ref-counting. They want to use sendpage. The network
layer (sendpage) can't handle non-ref-counted pages. The best solution
for fs and network block drivers might be modifying sendpage to handle
such pages.
On Sat, 2009-12-19 at 18:33 +0000, Ralf Baechle wrote:
> On Thu, Dec 17, 2009 at 09:42:15AM -0800, Linus Torvalds wrote:
>
> >
> > I also think that the changes to bio_map_kernel() and bio_map_kern_endio()
> > are not just "fundamentally ugly", I think they are made worse by the fact
> > that it's not even done "right". You both flush the virtual caches before
> > the IO and invalidate after - when the real pattern should be that you
> > flush it before a write, and invalidate it after a read.
> >
> > And I really think that would be all much more properly done at the
> > _caller_ level, not by the BIO layer.
> >
> > You must have some locking and allocation etc logic at the caller anyway,
> > why doesn't _that_ level just do the flushing or invalidation?
>
> And then there are certain types of caches that need invalidation before
> _and_ after a DMA transaction as a workaround for a processor being
> grossly abused in a system that it should not be used in. Basically the
> issue is that falsly speculated stores may dirty caches.
Um, so that's just so wrong it doesn't even seem possible. It would
mean that a speculation could make a line dirty while DMA was in
progress to the underlying memory. There's always going to be a window
between the DMA completion and the software invalidation where the dirty
line could be flushed, thus corrupting the DMA transfer.
Hopefully steps have been taken to see that whoever thought this was a
good idea isn't still contributing to the gene pool?
James