I had a talk with James Bottomley last night and it seems that there is an
established way of using the page structs of slab objects in the block
layer. Drivers may use the DMA interfaces to issue control commands. In
that case they may allocate a short structure via the slab allocator and
put the control commands into that slab object.
The driver will then perform a sg_init_one() on the slab object.
sg_init_one() calls sg_set_buf() which determines the page struct of a
page. In this case sg_set_buf() will determine the page struct of a slab
object. The dma layer may then perform operations on the "slab page". The
block layer folks seem to have spend some time to make this work right.
That usage is made a bit dangerous by the issue that the fields of the
page struct are used to varying degrees by slab allocators for their own
purposes. A particular problem arises with SLUB since SLUB needs to
maximizes the usage of the page struct fields to avoid having to put
control structures in the page itself and be able to completely fill up a
page with slab objects.
What is new in SLUB is the overloading of the mapping, index and mapcount
fields. The other slab allocators only overload the lru and private fields
(The version of SLOB in mm also overloads the index field like SLUB).
So it is in general not possible to call arbitrary page cache functions on
slab pages because fields are used in ways not conforming to page cache
use. The use of slab page structs on scatter gather lists currently only
works because a select number of page cache functions are called on them
that make restricted use of fields in the page structs.
The called functions are:
kmap / kunmap
This is not a problem since slab pages are never highmem
pages. The effect of kmap if used on the page struct of a slab page
is to determine the address of the page.
flush_dcache_page
This is a no op for most architectures. On architectures that
have a virtualized cache it is necessary to flush all the
page ranges in which this page is mapped. For that purpose
the mapping of a page must be determined. Since SLUB uses
the mapping for a pointer to the kmem_cache structure this
will result in using a kmem_cache address. If this address is
used like an address_space structure then the kernel will oops.
flush_dcache_page() must flush even for slab objects if
slab objects are put onto the scatter gather lists since the
dma engine must be able to read the information of the control
structure from memory. The block layer will take care of any objects
overlapping page boundaries and flush multiple pages if necessary.
We have only recently encountered this issue which is due to the following:
1. Architectures that need to flush virtual caches are rare and so
flush_dcache_page() typically does nothing or nothing harmful since
page->mapping, page->mapcount and page->index are not used.
2. It is not too frequent for a driver to perform a kmalloc in order to
generate a control structure. As a result tests on sparc64 and arm
did not show any problems.
Architectures affected seem to be only pa-risc and arm. Sparc64 may be
affected too but sparc64 does not use page_mapping to scan over uses of
the page (and xtensa has the virtual cache flushing commented out for some
reason). The issue is currently fixed in Linus' tree by Hugh's check for a
slab page in page_mapping.
We have here an established use that is somewhat dicey. It may be best to
ensure that the sort of use is restricted to the functions listed above.
The issue only occurs in the flush_dcache_page() of the two or three
listed platforms.
Make sure that no other page cache functions are called for such pages by
putting a VM_BUG_ON(PageSlab(page)) into the frequently used
page_mapping() inline function. It will trigger even on platforms that do
not implement a flush_dcache_page() function if page_mapping() is used on
any slab page so that we can detect potential issues early. This is useful
regardless of the slab allocator in use since in general a slab page
cannot be handled like a regular page cache page.
Modify the functions in the affected arches to check for PageSlab() and
use a NULL mapping if such a page is encountered. This may only be
necessary for parisc and arm since sparc64 and xtensa do not scan over
processes mapping a page but I have modified those two arches also for
correctnesses sake since they use page_mapping() in flush_dcache_page().
Still a better solution would be to not use the slab allocator at all for
the objects that are used to send commands to the devices. These are not
permanent and grabbing a page from the pcp lists and putting it back is
likely as fast as performing a kmalloc.
I have no access to the platforms discussed here. Testing was restricted
to running it on my laptop.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6/arch/arm/mm/flush.c
===================================================================
--- linux-2.6.orig/arch/arm/mm/flush.c 2007-06-28 18:57:18.000000000 -0700
+++ linux-2.6/arch/arm/mm/flush.c 2007-06-28 19:11:46.000000000 -0700
@@ -188,7 +188,17 @@
*/
void flush_dcache_page(struct page *page)
{
- struct address_space *mapping = page_mapping(page);
+ struct address_space *mapping;
+
+ /*
+ * This function is special in that a page struct obtained via
+ * virt_to_page from a slab object may be passed to it. However, slab
+ * allocators may use the mapping field for their own purposes. As a
+ * result mapping may be != NULL here although the page is not mapped.
+ * Slab objects are never mapped into user space so use NULL for that
+ * special case.
+ */
+ mapping = PageSlab(page) ? NULL : page_mapping(page);
#ifndef CONFIG_SMP
if (mapping && !mapping_mapped(mapping))
Index: linux-2.6/arch/parisc/kernel/cache.c
===================================================================
--- linux-2.6.orig/arch/parisc/kernel/cache.c 2007-06-28 18:57:18.000000000 -0700
+++ linux-2.6/arch/parisc/kernel/cache.c 2007-06-28 19:11:46.000000000 -0700
@@ -339,7 +339,7 @@
void flush_dcache_page(struct page *page)
{
- struct address_space *mapping = page_mapping(page);
+ struct address_space *mapping;
struct vm_area_struct *mpnt;
struct prio_tree_iter iter;
unsigned long offset;
@@ -347,6 +347,15 @@
pgoff_t pgoff;
unsigned long pfn = page_to_pfn(page);
+ /*
+ * This function is special in that a page struct obtained via
+ * virt_to_page from a slab object may be passed to it. However, slab
+ * allocators may use the mapping field for their own purposes. As a
+ * result mapping may be != NULL here although the page is not mapped.
+ * Slab objects are never mapped into user space so use NULL for that
+ * special case.
+ */
+ mapping = PageSlab(page) ? NULL : page_mapping(page);
if (mapping && !mapping_mapped(mapping)) {
set_bit(PG_dcache_dirty, &page->flags);
Index: linux-2.6/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/init.c 2007-06-28 18:57:18.000000000 -0700
+++ linux-2.6/arch/sparc64/mm/init.c 2007-06-28 19:11:46.000000000 -0700
@@ -339,7 +339,15 @@
this_cpu = get_cpu();
- mapping = page_mapping(page);
+ /*
+ * This function is special in that a page struct obtained via
+ * virt_to_page from a slab object may be passed to it. However, slab
+ * allocators may use the mapping field for their own purposes. As a
+ * result mapping may be != NULL here although the page is not mapped.
+ * Slab objects are never mapped into user space so use NULL for that
+ * special case.
+ */
+ mapping = PageSlab(page) ? NULL : page_mapping(page);
if (mapping && !mapping_mapped(mapping)) {
int dirty = test_bit(PG_dcache_dirty, &page->flags);
if (dirty) {
Index: linux-2.6/arch/xtensa/mm/init.c
===================================================================
--- linux-2.6.orig/arch/xtensa/mm/init.c 2007-06-28 18:57:18.000000000 -0700
+++ linux-2.6/arch/xtensa/mm/init.c 2007-06-28 19:11:46.000000000 -0700
@@ -433,7 +433,7 @@
void flush_dcache_page(struct page *page)
{
unsigned long addr = __pa(page_address(page));
- struct address_space *mapping = page_mapping(page);
+ struct address_space *mapping;
__flush_invalidate_dcache_page_phys(addr);
@@ -442,6 +442,15 @@
/* If this page hasn't been mapped, yet, handle I$/D$ coherency later.*/
#if 0
+ /*
+ * This function is special in that a page struct obtained via
+ * virt_to_page from a slab object may be passed to it. However, slab
+ * allocators may use the mapping field for their own purposes. As a
+ * result mapping may be != NULL although the page is not mapped.
+ * Slab objects are never mapped into user space so use NULL for that
+ * special case.
+ */
+ mapping = PageSlab(page) ? NULL : page_mapping(page);
if (mapping && !mapping_mapped(mapping))
clear_bit(PG_cache_clean, &page->flags);
else
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2007-06-28 18:57:18.000000000 -0700
+++ linux-2.6/include/linux/mm.h 2007-06-28 19:11:46.000000000 -0700
@@ -601,12 +601,16 @@
{
struct address_space *mapping = page->mapping;
+ /*
+ * The scatter gather layer currently may place the page struct
+ * of slab objects onto the scatter gather lists. The VM_BUG_ON
+ * here insures that we get notified if page cache functions are
+ * called for such a page.
+ */
+ VM_BUG_ON(PageSlab(page));
+
if (unlikely(PageSwapCache(page)))
mapping = &swapper_space;
-#ifdef CONFIG_SLUB
- else if (unlikely(PageSlab(page)))
- mapping = NULL;
-#endif
else if (unlikely((unsigned long)mapping & PAGE_MAPPING_ANON))
mapping = NULL;
return mapping;
Index: linux-2.6/include/linux/highmem.h
===================================================================
--- linux-2.6.orig/include/linux/highmem.h 2007-06-28 19:11:48.000000000 -0700
+++ linux-2.6/include/linux/highmem.h 2007-06-28 19:25:36.000000000 -0700
@@ -38,6 +38,11 @@
#ifndef ARCH_HAS_KMAP
static inline void *kmap(struct page *page)
{
+ /*
+ * WARNING: page may be a slab page and kmap may be called for such
+ * a page as a result of putting a slab object onto scatter gather
+ * lists by a device driver.
+ */
might_sleep();
return page_address(page);
}
@@ -48,6 +53,11 @@
static inline void *kmap_atomic(struct page *page, enum km_type idx)
{
+ /*
+ * WARNING: page may be a slab page and kmap_atomic may be called for
+ * such a page as a result of putting a slab object onto scatter gather
+ * lists by a device driver.
+ */
pagefault_disable();
return page_address(page);
}
From: Christoph Lameter <[email protected]>
Date: Thu, 28 Jun 2007 21:01:36 -0700 (PDT)
> Modify the functions in the affected arches to check for PageSlab() and
> use a NULL mapping if such a page is encountered. This may only be
> necessary for parisc and arm since sparc64 and xtensa do not scan over
> processes mapping a page but I have modified those two arches also for
> correctnesses sake since they use page_mapping() in flush_dcache_page().
>
> Still a better solution would be to not use the slab allocator at all for
> the objects that are used to send commands to the devices. These are not
> permanent and grabbing a page from the pcp lists and putting it back is
> likely as fast as performing a kmalloc.
Jens Axboe wants to get references to the page structs behind
kmalloc() allocated pages in his networking splice work.
We pass scatterlists around, but networking buffers are
composed of a kmalloc()'d data header area for packet headers
and some of the initial packet data, then a true scatterlist
of page/offset/len triplets.
Splice wants to work with pages everwhere.
This is a reocurring theme, we should provide some kind of
solution to these issues instead of wishing they would go
away.
We could make a special allocator in the networking that carves
chunks out of pages but I'm sure you'll find that about as stupid
and wasteful as I do :-)
On Thu, 28 Jun 2007, David Miller wrote:
> > Still a better solution would be to not use the slab allocator at all for
> > the objects that are used to send commands to the devices. These are not
> > permanent and grabbing a page from the pcp lists and putting it back is
> > likely as fast as performing a kmalloc.
>
> Jens Axboe wants to get references to the page structs behind
> kmalloc() allocated pages in his networking splice work.
You can get such a reference and then the slab page will be in limbo if
all objects are freed until that reference is given up. The reference
method is also use by kmem_cache_vacate() (but that is slab internal).
> We could make a special allocator in the networking that carves
> chunks out of pages but I'm sure you'll find that about as stupid
> and wasteful as I do :-)
Well then I guess this containment patch is as good as we can get.
It makes sure that things do not get out of hand.
From: Christoph Lameter <[email protected]>
Date: Thu, 28 Jun 2007 21:22:22 -0700 (PDT)
> On Thu, 28 Jun 2007, David Miller wrote:
>
> > > Still a better solution would be to not use the slab allocator at all for
> > > the objects that are used to send commands to the devices. These are not
> > > permanent and grabbing a page from the pcp lists and putting it back is
> > > likely as fast as performing a kmalloc.
> >
> > Jens Axboe wants to get references to the page structs behind
> > kmalloc() allocated pages in his networking splice work.
>
> You can get such a reference and then the slab page will be in limbo if
> all objects are freed until that reference is given up. The reference
> method is also use by kmem_cache_vacate() (but that is slab internal).
What about if someone kfree()'s that object meanwhile?
Can we bump the SLAB object count just like we can a page?
That's basically what's happening in the stuff Jens is
working on, he needs to grab a reference to a SLAB
object just like one can a page. Even if there is an
intervening kfree (like a put_page()) the SLAB object is
still live until all the references are put, and thus it
can't get reallocated and given to another client by SLAB.
On Thu, 28 Jun 2007, David Miller wrote:
> > You can get such a reference and then the slab page will be in limbo if
> > all objects are freed until that reference is given up. The reference
> > method is also use by kmem_cache_vacate() (but that is slab internal).
>
> What about if someone kfree()'s that object meanwhile?
If that is the last object in the slab then the page is deslabified and
it will stick around as a regular page until its refcount reaches zero.
There is slab functionality in SLUB that uses this trick to insure that
the slab does not go away.
> Can we bump the SLAB object count just like we can a page?
You can bump the slab page count. But you have to use virt_to_head_page()
instead of virt_to_page() since slab pages may be compound pages (in both
SLAB and SLUB). Incrementing the refcount of a tail page will cause an oops on
free.
> That's basically what's happening in the stuff Jens is
> working on, he needs to grab a reference to a SLAB
> object just like one can a page. Even if there is an
> intervening kfree (like a put_page()) the SLAB object is
> still live until all the references are put, and thus it
> can't get reallocated and given to another client by SLAB.
If you kfree the object then all slab allocators will feel free to
immediately alloc the object for other purposes.
If you want to prohibit further allocations from a slab then I can likely
give you a function call that removes a slab from the partial lists so
that the slab will not be allocated from. But this will only work with
SLUB.
At some point the slab will then need to be returned to the partial slab
lists.
Hmmmm... Maybe we are creating more of a mess with this. Isnt there some
other way to handle these object.
From: Christoph Lameter <[email protected]>
Date: Thu, 28 Jun 2007 21:39:01 -0700 (PDT)
> Hmmmm... Maybe we are creating more of a mess with this. Isnt there some
> other way to handle these object.
That's where I was going with the silly idea to use another
allocator :)
Really, it would be great if we could treat kmalloc() objects
just like real pages. Everything wants to do I/O on pages
but sometimes (like the networking) you have a kmalloc
chunk which is technically just a part of a page.
The fact that there is no easy way to make this work is
frustrating :-)
On Thu, 28 Jun 2007 22:06:06 -0700 (PDT) David Miller <[email protected]> wrote:
> From: Christoph Lameter <[email protected]>
> Date: Thu, 28 Jun 2007 21:39:01 -0700 (PDT)
>
> > Hmmmm... Maybe we are creating more of a mess with this. Isnt there some
> > other way to handle these object.
>
> That's where I was going with the silly idea to use another
> allocator :)
>
> Really, it would be great if we could treat kmalloc() objects
> just like real pages.
>From a high level, that seems like a bad idea. kmalloc() gives you a
virtual address and you really shouldn't be poking around at that memory's
underlying page's pageframe metadata.
However we can of course do tasteless and weird things if the benefit is
sufficient....
> Everything wants to do I/O on pages
> but sometimes (like the networking) you have a kmalloc
> chunk which is technically just a part of a page.
hm. So what happens when two quite different threads of control are doing
IO against two hunks of kmalloced memory which happen to come from the same
page? Either some (kernel-wide) locking is needed, or that pageframe needs
to be treated as readonly?
From: Andrew Morton <[email protected]>
Date: Thu, 28 Jun 2007 22:24:24 -0700
> So what happens when two quite different threads of control are doing
> IO against two hunks of kmalloced memory which happen to come from the same
> page? Either some (kernel-wide) locking is needed, or that pageframe needs
> to be treated as readonly?
Or you put an atomic_t at the beginning or tail of every SLAB
object. It's a space cost not a runtime cost for the common
case which is:
smp_rmb();
if (atomic_read(&slab_obj->count) == 1)
really_free_it();
else if (atomic_dec_and_test(...))
Note I don't like this variant either. :)
On Thu, 28 Jun 2007 22:37:34 -0700 (PDT) David Miller <[email protected]> wrote:
> From: Andrew Morton <[email protected]>
> Date: Thu, 28 Jun 2007 22:24:24 -0700
>
> > So what happens when two quite different threads of control are doing
> > IO against two hunks of kmalloced memory which happen to come from the same
> > page? Either some (kernel-wide) locking is needed, or that pageframe needs
> > to be treated as readonly?
>
> Or you put an atomic_t at the beginning or tail of every SLAB
> object. It's a space cost not a runtime cost for the common
> case which is:
>
> smp_rmb();
> if (atomic_read(&slab_obj->count) == 1)
> really_free_it();
> else if (atomic_dec_and_test(...))
>
> Note I don't like this variant either. :)
but, but... Christoph said 'The dma layer may then perform operations on
the "slab page'.
If those operations involve modifying that slab page's pageframe then what
stops concurrent dma'ers from stomping on each other's changes? As in:
why aren't we already buggy?
If those operations _don't_ involve modifying the pageframe (hopes this is
true) then we're read-only and things become much easier?
On Thu, 28 Jun 2007, David Miller wrote:
> From: Andrew Morton <[email protected]>
> Date: Thu, 28 Jun 2007 22:24:24 -0700
>
> > So what happens when two quite different threads of control are doing
> > IO against two hunks of kmalloced memory which happen to come from the same
> > page? Either some (kernel-wide) locking is needed, or that pageframe needs
> > to be treated as readonly?
>
> Or you put an atomic_t at the beginning or tail of every SLAB
> object. It's a space cost not a runtime cost for the common
> case which is:
Hmmm... We could do something like
kmem_cache_get(slab, object)
and
kmem_cache_put(slab, object)
kmem_cache_get would disable allocations from the slab and increment a
refcount.
kmem_cache_put would enable allocations again if the refcount reaches
one.
The problem is that freeing an object may cause writes to the object. F.e.
poisoning will overwrite the object on free. SLUB will put its free
pointer in the first words etc.
On Thu, 28 Jun 2007, Andrew Morton wrote:
> If those operations _don't_ involve modifying the pageframe (hopes this is
> true) then we're read-only and things become much easier?
This is true right now. We are way off topic ...
On Thu, 28 Jun 2007, David Miller wrote:
> Really, it would be great if we could treat kmalloc() objects
> just like real pages. Everything wants to do I/O on pages
> but sometimes (like the networking) you have a kmalloc
> chunk which is technically just a part of a page.
>
> The fact that there is no easy way to make this work is
> frustrating :-)
There is easy way: Allocate a page and just use the first N bytes. You can
specify the bytes to be used when putting the memory onto the scatter
gather list. This wastes memory but it works. You have real refcounting
since you got a real page.
How frequent are these objects?
From: Christoph Lameter <[email protected]>
Date: Fri, 29 Jun 2007 00:00:39 -0700 (PDT)
> On Thu, 28 Jun 2007, David Miller wrote:
>
> > Really, it would be great if we could treat kmalloc() objects
> > just like real pages. Everything wants to do I/O on pages
> > but sometimes (like the networking) you have a kmalloc
> > chunk which is technically just a part of a page.
> >
> > The fact that there is no easy way to make this work is
> > frustrating :-)
>
> There is easy way: Allocate a page and just use the first N bytes. You can
> specify the bytes to be used when putting the memory onto the scatter
> gather list. This wastes memory but it works. You have real refcounting
> since you got a real page.
>
> How frequent are these objects?
Every single network packet.
> If those operations involve modifying that slab page's pageframe then what
> stops concurrent dma'ers from stomping on each other's changes? As in:
> why aren't we already buggy?
Or DMA operations falling out with CPU operations in the same memory
area. Not all platforms have hardware consistency and some will blat the
entire page out of cache.
On Thu, 28 Jun 2007, Christoph Lameter wrote:
> I had a talk with James Bottomley last night and it seems that there is an
> established way of using the page structs of slab objects in the block
> layer. Drivers may use the DMA interfaces to issue control commands. In
> that case they may allocate a short structure via the slab allocator and
> put the control commands into that slab object.
>
> The driver will then perform a sg_init_one() on the slab object.
> sg_init_one() calls sg_set_buf() which determines the page struct of a
> page. In this case sg_set_buf() will determine the page struct of a slab
> object. The dma layer may then perform operations on the "slab page". The
> block layer folks seem to have spend some time to make this work right.
Yes, I don't see why this comes as such a surprise and horror to you,
so much in need of dire WARNINGs. kmalloc memory is not a different
kind of memory from what you get from the page allocators.
I stand by my page_mapping patch, and the remark I made before,
that page_mapping(page) is the correct place to check this. What is
page_mapping(page) for? Precisely to return the struct address_space*
from page->mapping when that's what's in there, and not when that field
has been reused for something else.
So lines like
> + mapping = PageSlab(page) ? NULL : page_mapping(page);
seem to miss the point.
I agree that the only clash found yet has been in flush_dcache_page,
so some bytes and branches can indeed be saved by just doing the
test in there. Oh, but your VM_BUG_ON cancels out that saving.
And if we were to try to save bytes and branches there, it's the
synthetic swapper_space business (only required in a couple of
places) I'd be wanting to cut out.
To me this all seems like a big fuss to excuse your surprise:
so please don't expect an Ack from me; but if others prefer
this, I won't be Nacking. (Though I'll probably whine about
it into eternity ;)
Hugh
On Fri, 29 Jun 2007, Hugh Dickins wrote:
> I stand by my page_mapping patch, and the remark I made before,
> that page_mapping(page) is the correct place to check this. What is
> page_mapping(page) for? Precisely to return the struct address_space*
> from page->mapping when that's what's in there, and not when that field
> has been reused for something else.
>
> So lines like
> > + mapping = PageSlab(page) ? NULL : page_mapping(page);
> seem to miss the point.
They certainly point out to the reader that one can expect a slab page
here where one may not expect one since flush_dcache_page is a page
cache function.
> I agree that the only clash found yet has been in flush_dcache_page,
> so some bytes and branches can indeed be saved by just doing the
> test in there. Oh, but your VM_BUG_ON cancels out that saving.
> And if we were to try to save bytes and branches there, it's the
> synthetic swapper_space business (only required in a couple of
> places) I'd be wanting to cut out.
VM_BUG_ON is different from BUG_ON. BUG_ON is always checked. VM_BUG_ON
depends on a debug config option.
> To me this all seems like a big fuss to excuse your surprise:
I am weirded out by the use of terms like "accusations" and "excuses" in
these discussions. But if it makes you feel better....
Others seemed to have encountered the same surprises before me. So it is
better to point out these issues in the sources. There is the danger of
other pagecache functions getting called for slab pages in the I/O
layer and the check in page_mapping provides some protection from such
changes. The checks/comments in the functions where we allow slab page use
help the ones enhancing the code to keep these issues in mind and help
them to not be surprised in turn.
On Fri, 29 Jun 2007 13:16:57 +0100
Alan Cox <[email protected]> wrote:
> > If those operations involve modifying that slab page's pageframe then what
> > stops concurrent dma'ers from stomping on each other's changes? As in:
> > why aren't we already buggy?
>
> Or DMA operations falling out with CPU operations in the same memory
> area. Not all platforms have hardware consistency and some will blat the
> entire page out of cache.
Is that just a performance problem, or can data be lost here? It depends
on the meaning of "blat": writeback? invalidate? More details, please.
I'm dyin here and nobody will talk to me. If the kernel is already doing
these things, why aren't we already buggy? Is it because we don't actually
modify the pageframes of these dma-to-from-kmalloced pages? But we were
thinking of doing so in the future?
On Fri, Jun 29, 2007 at 01:45:29PM -0700, Andrew Morton wrote:
> On Fri, 29 Jun 2007 13:16:57 +0100
> Alan Cox <[email protected]> wrote:
>
> > > If those operations involve modifying that slab page's pageframe then what
> > > stops concurrent dma'ers from stomping on each other's changes? As in:
> > > why aren't we already buggy?
> >
> > Or DMA operations falling out with CPU operations in the same memory
> > area. Not all platforms have hardware consistency and some will blat the
> > entire page out of cache.
>
> Is that just a performance problem, or can data be lost here? It depends
> on the meaning of "blat": writeback? invalidate? More details, please.
>
>
> I'm dyin here and nobody will talk to me. If the kernel is already doing
> these things, why aren't we already buggy? Is it because we don't actually
> modify the pageframes of these dma-to-from-kmalloced pages? But we were
> thinking of doing so in the future?
I think people are getting too het up about this.
DMA to or from memory should be done via the DMA mapping API. If we're
DMAing to/from a limited range within a page, either we should be using
dma_map_single(), or dma_map_page() with an appropriate offset and size.
Other cache flushing functions should not be called for DMA operations;
any cache handling required by non-coherent architectures should be done
by the DMA API only.
However, with non-coherent aliasing architectures (such as those with
aliasing VIPT or VIVT caches) there is an additional requirement on PIO
to page cache. If the page we're writing data has some cache lines
allocated to it, we potentially hit those cache lines and the data
doesn't hit the underlying page. Later on, when we come to map the
page into userspace, the data may still be sitting in the cache lines
corresponding with the kernel's mapping. Therefore, there is a
requirement to ensure that the cache state WRT the kernel's mapping is
the same irrespective of the method by which data ends up in the page.
That means that for these caches, the data PIO'd into the page must be
written back to the underlying page before the page is handed to
userspace.
The two are completely separate; it seems to me from the above discussion
that people are confusing the two scenarios, and mixing DMA with the PIO
cache handling. Please don't, you'll only get more and more confused.
(Note: with the dma_map_* API, architectures have to be sensible when
they're passed offests and sizes which aren't cacheline aligned.
Technically, it's buggy to ask for non-L1 line aligned offsets and
sizes, but they do happen. We handle this on ARM by writing back
the overlapped lines and invalidating the rest before the DMA operation
commences, and hope that the overlapped lines aren't touched for the
duration of the DMA.)
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
On Fri, 29 Jun 2007 13:45:29 -0700
Andrew Morton <[email protected]> wrote:
> On Fri, 29 Jun 2007 13:16:57 +0100
> Alan Cox <[email protected]> wrote:
>
> > > If those operations involve modifying that slab page's pageframe then what
> > > stops concurrent dma'ers from stomping on each other's changes? As in:
> > > why aren't we already buggy?
> >
> > Or DMA operations falling out with CPU operations in the same memory
> > area. Not all platforms have hardware consistency and some will blat the
> > entire page out of cache.
>
> Is that just a performance problem, or can data be lost here? It depends
> on the meaning of "blat": writeback? invalidate? More details, please.
Invalidate. Sorry didn't realise it they hadn't discovered that word down
under.
If you've got something packing objects in tight we are going
to have fun with cache handling simply because the CPU cache granularity
may mean that the invalidate also invalidates a few bytes on (ie a 12
byte object will invalidate 16 bytes of memory) and you've just removed
any CPU held changes in the start of the next object.
Alan
> DMA to or from memory should be done via the DMA mapping API. If we're
> DMAing to/from a limited range within a page, either we should be using
> dma_map_single(), or dma_map_page() with an appropriate offset and size.
If those ranges overlap a cache line then the dma mapping API will not
save your backside.
On a system with a 32 byte cache granularity what happens if you get two
dma mapping calls for x and x+16. Right now the thing that avoids this
occurring is that the allocators don't pack stuff in that hard so x+16
always belongs to the same driver and we can hope driver authors are
sensible
> sizes, but they do happen. We handle this on ARM by writing back
> the overlapped lines and invalidating the rest before the DMA operation
> commences, and hope that the overlapped lines aren't touched for the
> duration of the DMA.)
The combination of "hope" and "DMA" isn't a good one for stable system
design. In this situation we should be waving large red flags
On Sat, Jun 30, 2007 at 12:11:38AM +0100, Alan Cox wrote:
> > DMA to or from memory should be done via the DMA mapping API. If we're
> > DMAing to/from a limited range within a page, either we should be using
> > dma_map_single(), or dma_map_page() with an appropriate offset and size.
>
> If those ranges overlap a cache line then the dma mapping API will not
> save your backside.
There's nothing much that the DMA API can do though. Consider DMA
to a result buffer which is, eg, only 16 bytes in size. So you get
passed a size of '16' to the DMA API. What should you do at this
point? BUG()?
What if you have 64 or 128 byte cache lines?
> > sizes, but they do happen. We handle this on ARM by writing back
> > the overlapped lines and invalidating the rest before the DMA operation
> > commences, and hope that the overlapped lines aren't touched for the
> > duration of the DMA.)
>
> The combination of "hope" and "DMA" isn't a good one for stable system
> design. In this situation we should be waving large red flags
I agree.
However, I don't think this is an issue for the DMA API to handle; it's
something that driver authors need to be aware of. If they wish to do
a DMA to a kmalloc'd buffer or even a page, we could require that
offsets and sizes are cache line aligned.
However, remember that turning on slab debugging turns off cache line
alignment, so imposing such a requirement implies that the slab
debugging will break DMA, or driver authors also have to be aware of
that and do their own alignment internally, *or* we provide an allocator
which does unconditionally align.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
On Thu, Jun 28, 2007 at 10:24:24PM -0700, Andrew Morton wrote:
> > Really, it would be great if we could treat kmalloc() objects
> > just like real pages.
>
> >From a high level, that seems like a bad idea. kmalloc() gives you a
> virtual address and you really shouldn't be poking around at that memory's
> underlying page's pageframe metadata.
>
> However we can of course do tasteless and weird things if the benefit is
> sufficient....
Hey, when we had exactly that issues coming up with xfs/ext3 recovery over
iscsi/aoe you said it's fine :) End result is that XFS got fixed and ext3
is still broken..