2011-01-05 19:05:40

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

[sorry for the unthreaded insertion. We're seeing this on parisc too]

> On Wed, Jan 05, 2011 at 10:14:17AM -0500, Trond Myklebust wrote:
> > OK. So, the new behaviour in 2.6.37 is that we're writing to a series of
> > pages via the usual kmap_atomic()/kunmap_atomic() and kmap()/kunmap()
> > interfaces, but we can end up reading them via a virtual address range
> > that gets set up via vm_map_ram() (that range gets set up before the
> > write occurs).
>
> kmap of lowmem pages will always reuse the existing kernel direct
> mapping, so there won't be a problem there.
>
> > Do we perhaps need an invalidate_kernel_vmap_range() before we can read
> > the data on ARM in this kind of scenario?
>
> Firstly, vm_map_ram() does no cache maintenance of any sort, nor does
> it take care of page colouring - so any architecture where cache aliasing
> can occur will see this problem. It is not limited to ARM.
>
> Secondly, no, invalidate_kernel_vmap_range() probably isn't sufficient.
> There's two problems here:
>
> addr = kmap(lowmem_page);
> *addr = stuff;
> kunmap(lowmem_page);
>
> Such lowmem pages are accessed through their kernel direct mapping.
>
> ptr = vm_map_ram(lowmem_page);
> read = *ptr;
>
> This creates a new mapping which can alias with the kernel direct mapping.
> Now, as this is a new mapping, there should be no cache lines associated
> with it. (Looking at vm_unmap_ram(), it calls free_unmap_vmap_area_addr(),
> free_unmap_vmap_area(), which then calls flush_cache_vunmap() on the
> region. vb_free() also calls flush_cache_vunmap().)
>
> If the write after kmap() hits an already present cache line, the cache
> line will be updated, but it won't be written back to memory. So, on
> a subsequent vm_map_ram(), with any kind of aliasing cache, there's
> no guarantee that you'll hit that cache line and read the data just
> written there.
>
> The kernel direct mapping would need to be flushed.
>
> I'm really getting to the point of hating the proliferation of RAM
> remapping interfaces - it's going to (and is) causing nothing but lots
> of pain on virtual cache architectures, needing more and more cache
> flushing interfaces to be created.
>
> Is there any other solution to this?

I think the solution for the kernel direct mapping problem is to take
the expected flushes and invalidates into kmap/kunmap[_atomic]. I think
the original reason for not doing this was efficiency: the user should
know what they did with the data (i.e. if they're only reading it, it
doesn't need to be flushed on unmap). However, the difficulty of
getting all this right seems to outweigh the efficiency of only using
the necessary flushing. At least on some architectures, we can look at
the TLB flags to see if the page was dirtied (and only flush if it was).
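
A rough sketch of the idea, purely for illustration (the
arch_page_dirtied_in_cache() helper is invented here as a stand-in for
"look at the TLB/PTE dirty flags", and flush_kernel_dcache_page() stands
in for whatever the arch-appropriate flush is; architectures without such
dirty tracking would simply return true):

        #include <linux/highmem.h>      /* kunmap(), flush_kernel_dcache_page() */

        /* Hypothetical only: fold the cache maintenance into the unmap so
         * that callers cannot forget it, and pay for the flush only when
         * the page was actually dirtied through the kernel mapping. */
        static inline void kunmap_and_flush(struct page *page)
        {
                if (arch_page_dirtied_in_cache(page))
                        flush_kernel_dcache_page(page); /* write back the kernel alias */
                kunmap(page);
        }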

James




2011-01-10 20:15:25

by Myklebust, Trond

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Mon, 2011-01-10 at 11:34 -0800, Linus Torvalds wrote:
> 2011/1/10 Trond Myklebust <[email protected]>:
> >
> > I usually do this, but there is a slight problem with that approach:
> > Greg gets to do all the work of figuring out to which stable kernels
> > this particular patch applies. In this case, since we're only talking
> > about the 2.6.37 kernel, I prefer to use the mailing lists.
>
> Just do
>
> Cc: [email protected] [2.6.37]
>
> or similar. It's quite common.
>
> So EVEN IF you want to email people around about the patch separately,
> do add the "Cc: stable" marker. It's worthwhile information about the
> patch for everybody involved.

OK. Patch description amended and recommitted in git. Thanks to all for
the tips...

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-01-08 16:50:04

by Myklebust, Trond

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Fri, 2011-01-07 at 13:11 -0600, James Bottomley wrote:
> On the other hand, the xdr routines, since they take the pages anyway,
> could use a scatterlist approach to writing through the kernel mapping
> instead of using vmap ... we have all the machinery for this in
> lib/scatterlist.c ... it's not designed for this case, since it's
> designed to allow arbitrary linear reads and writes on a block
> scatterlist, but the principle is the same ... it looks like it would be
> rather a big patch, though ...

The following alternative seems to work for me, but has only been
lightly tested so far. It's a bit large for a stable patch, but not too
ungainly.

It modifies xdr_stream, adding the ability to iterate through page data.
To avoid kmap()/kunmap(), it does require that pages be allocated in
lowmem, but since the only use case here is when using page arrays as
temporary buffers, that seems like an acceptable compromise.
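
For reference, the decode-side usage ends up looking roughly like this
(it mirrors the nfs_readdir_page_filler() changes in the patch quoted
later in the thread; error handling is trimmed, and 'pages'/'buflen' are
assumed to be the lowmem page array and its length):

        struct xdr_stream stream;
        struct xdr_buf buf = {
                .pages    = pages,      /* lowmem pages holding the reply data */
                .page_len = buflen,
                .buflen   = buflen,
                .len      = buflen,
        };
        struct page *scratch = alloc_page(GFP_KERNEL);

        if (scratch == NULL)
                return -ENOMEM;

        xdr_init_decode(&stream, &buf, NULL);
        xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE);
        /*
         * xdr_inline_decode(&stream, nbytes) now returns a linear pointer
         * even when an item straddles a page boundary: in that case the
         * bytes are copied into the scratch page first.
         */
        /* ... decode loop ... */
        put_page(scratch);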

Cheers
Trond

---------------------------------------------------------------------------------

2011-01-06 18:01:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Thu, Jan 6, 2011 at 9:47 AM, Trond Myklebust
<[email protected]> wrote:
>
> Why is this line needed? We're not writing through the virtual mapping.

I haven't looked at the sequence of accesses, but you need to be
_very_ aware that "write-through" is absolutely NOT sufficient for
cache coherency.

In cache coherency, you have three options:

- true coherency (eg physically indexed/tagged caches)

- exclusion (eg virtual caches, but with an exclusion guarantee that
guarantees that aliases cannot happen: either by using physical
tagging or by not allowing cases that could cause virtual aliases)

- write-through AND non-cached reads (ie "no caching at all").

You seem to be forgetting the "no cached reads" part. It's not
sufficient to flush after a write - you need to make sure that you
also don't have a cached copy of the alias for the read.

So "We're not writing through the virtual mapping" is NOT a sufficient
excuse. If you're reading through the virtual mapping, you need to
make sure that the virtual mapping is flushed _after_ any writes
through any other mapping and _before_ any reads through the virtual
one.

This is why you really really really generally don't want to have
aliasing. Purely virtual caches are pure crap. Really.

Linus

2011-01-05 23:28:38

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, 2011-01-05 at 18:06 -0500, Trond Myklebust wrote:
> On Wed, 2011-01-05 at 13:30 -0800, Linus Torvalds wrote:
> > On Wed, Jan 5, 2011 at 1:16 PM, Trond Myklebust
> > <[email protected]> wrote:
> > >
> > > So what should be the preferred way to ensure data gets flushed when
> > > you've written directly to a page, and then want to read through the
> > > vm_map_ram() virtual range? Should we be adding new semantics to
> > > flush_kernel_dcache_page()?
> >
> > The "preferred way" is actually simple: "don't do that". IOW, if some
> > page is accessed through a virtual mapping you've set up, then
> > _always_ access it through that virtual mapping.
> >
> > Now, when that is impossible (and yes, it sometimes is), then you
> > should flush after doing all writes. And if you do the write through
> > the regular kernel mapping, you should use flush_dcache_page(). And if
> > you did it through the virtual mapping, you should use
> > "flush_kernel_vmap_range()" or whatever.
> >
> > NOTE! I really didn't look those up very closely, and if the accesses
> > can happen concurrently you are basically screwed, so you do need to
> > do locking or something else to guarantee that there is some nice
> > sequential order. And maybe I forgot something. Which is why I do
> > suggest "don't do that" as a primary approach to the problem if at all
> > possible.
> >
> > Oh, and you may need to flush before reading too (and many writes do
> > end up being "read-modify-write" cycles) in case it's possible that
> > you have stale data from a previous read that was then invalidated by
> > a write to the aliasing address. Even if that write was flushed out,
> > the stale read data may exist at the virtual address. I forget what
> > all we required - in the end the only sane model is "virtual caches
> > suck so bad that anybody who does them should be laughed at for being
> > a retard".
>
> Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(),
> which takes care of invalidating the cache prior to a virtual address
> read.
>
> My question was specifically about the write through the regular kernel
> mapping: according to Russell and my reading of the cachetlb.txt
> documentation, flush_dcache_page() is only guaranteed to have an effect
> on page cache pages.
> flush_kernel_dcache_page() (not to be confused with flush_dcache_page)
> would appear to be the closest fit according to my reading of the
> documentation, however the ARM implementation appears to be a no-op...

It depends on exactly what you're doing. In the worst case (ping-pong
reads and writes through both aliases) you have to flush and invalidate
both alias 1 and alias 2 each time you switch from one to the other.

Can you explain how the code works? it looks to me like you read the xdr
stuff through the vmap region then write it out directly to the pages?
*if* this is just a conversion, *and* you never need to read the new
data through the vmap alias, you might be able to get away with a
flush_dcache_page in nfs_readdir_release_array(). If the access pattern
is more complex, you'll need more stuff splashed through the loop
(including vmap invalidation/flushing).

Is there any way you could just rewrite nfs_readdir_add_to_array() to
use the vmap address instead of doing a kmap? That way everything will
go through a single alias and not end up with this incoherency.

James



2011-01-06 21:08:03

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Thu, 2011-01-06 at 12:25 -0600, James Bottomley wrote:
> OK, so thinking about this, it seems that the only danger is actually
> what NFS is doing: reading cache pages via a vmap. In that case, since
> the requirement is to invalidate the vmap range to prepare for read, we
> could have invalidate_kernel_vmap_range loop over the underlying pages
> and flush them through the kernel alias if the architecture specific
> flag indicates their contents might be dirty.
>
> The loop adds expense that is probably largely unnecessary to
> invalidate_kernel_vmap_range() but the alternative is adding to the API
> proliferation with something that only flushes the kernel pages if the
> arch specific flag says they're dirty.

This is what I think the ARM patch would look like (example only: I
can't compile it). Is something like this too expensive? The loop can't
be optimised away because of the need to check the pages (and
vmalloc_to_page is a three level page table lookup).

James

---

diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
index 3acd8fa..34469ca 100644
--- a/arch/arm/include/asm/cacheflush.h
+++ b/arch/arm/include/asm/cacheflush.h
@@ -414,8 +414,17 @@ static inline void flush_kernel_vmap_range(void *addr, int size)
 }
 static inline void invalidate_kernel_vmap_range(void *addr, int size)
 {
-        if ((cache_is_vivt() || cache_is_vipt_aliasing()))
-                __cpuc_flush_dcache_area(addr, (size_t)size);
+        if ((cache_is_vivt() || cache_is_vipt_aliasing())) {
+                void *cursor = addr;
+
+                for ( ; cursor < addr + size; cursor += PAGE_SIZE) {
+                        struct page *page = vmalloc_to_page(cursor);
+
+                        if (!test_and_set_bit(PG_dcache_clean, &page->flags))
+                                __flush_dcache_page(page_mapping(page), page);
+                }
+                __cpuc_flush_dcache_area(addr, (size_t)size);
+        }
 }
 
 #define ARCH_HAS_FLUSH_ANON_PAGE



2011-01-10 19:26:01

by Uwe Kleine-König

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

Hello Trond,

On Mon, Jan 10, 2011 at 12:20:35PM -0500, Trond Myklebust wrote:
> On Mon, 2011-01-10 at 18:08 +0100, Marc Kleine-Budde wrote:
> > On 01/10/2011 05:25 PM, Trond Myklebust wrote:
> > > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote:
> > >> Hi Trond,
> > >>
> > >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote:
> > >>> -----------------------------------------------------------------------------------
> > >> It would be great if you could add a "8<" in the line above next time.
> > >> Then git-am -c does the right thing (at least I think so).
> > >
> > > Sorry. I wasn't aware of that particular idiom. So something like
> > >
> > > 8<------------------------------------------------------------
> > > From: Trond Myklebust <[email protected]>
> > > Subject: .....
> >
> > Yes.
> >
> > From "man git-mailinfo":
> > "A line that mainly consists of scissors (either ">8" or "8<") and
> > perforation (dash "-")"
> >
> > BTW:
> > Is this patch a candidate for stable?
>
> Yes. I'm planning on sending it to the stable list after Linus merges it
> into mainline.
So there is another idiom for you: just put

Cc: [email protected]

in the S-o-b block and Greg will pick it off "automatically". (Just in
case you don't know, and if you do, maybe someone else learned
something.)

Best regards
Uwe

--
Pengutronix e.K. | Uwe Kleine-König |
Industrial Linux Solutions | http://www.pengutronix.de/ |

2011-01-10 10:50:58

by Uwe Kleine-König

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

Hi Trond,

On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote:
> -----------------------------------------------------------------------------------
It would be great if you could add a "8<" in the line above next time.
Then git-am -c does the right thing (at least I think so).

> From 8b2e60cef5c65eef41ab61286f62dec6bfb1ac27 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <[email protected]>
> Date: Sat, 8 Jan 2011 17:45:38 -0500
> Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir
>
> vm_map_ram() is not available on NOMMU platforms, and causes trouble
> on incoherent architectures such as ARM when we access the page data
> through both the direct and the virtual mapping.
>
> The alternative is to use the direct mapping to access page data
> for the case when we are not crossing a page boundary, but to copy
> the data into a linear scratch buffer when we are accessing data
> that spans page boundaries.
>
> Signed-off-by: Trond Myklebust <[email protected]>
Tested-by: Uwe Kleine-König <[email protected]>

on tx28.

Thanks
Uwe

> ---
> fs/nfs/dir.c | 44 ++++++-------
> fs/nfs/nfs2xdr.c | 6 --
> fs/nfs/nfs3xdr.c | 6 --
> fs/nfs/nfs4xdr.c | 6 --
> include/linux/sunrpc/xdr.h | 4 +-
> net/sunrpc/xdr.c | 155 +++++++++++++++++++++++++++++++++++---------
> 6 files changed, 148 insertions(+), 73 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 996dd89..0108cf4 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -33,7 +33,6 @@
> #include <linux/namei.h>
> #include <linux/mount.h>
> #include <linux/sched.h>
> -#include <linux/vmalloc.h>
> #include <linux/kmemleak.h>
>
> #include "delegation.h"
> @@ -459,25 +458,26 @@ out:
> /* Perform conversion from xdr to cache array */
> static
> int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry,
> - void *xdr_page, struct page *page, unsigned int buflen)
> + struct page **xdr_pages, struct page *page, unsigned int buflen)
> {
> struct xdr_stream stream;
> - struct xdr_buf buf;
> - __be32 *ptr = xdr_page;
> + struct xdr_buf buf = {
> + .pages = xdr_pages,
> + .page_len = buflen,
> + .buflen = buflen,
> + .len = buflen,
> + };
> + struct page *scratch;
> struct nfs_cache_array *array;
> unsigned int count = 0;
> int status;
>
> - buf.head->iov_base = xdr_page;
> - buf.head->iov_len = buflen;
> - buf.tail->iov_len = 0;
> - buf.page_base = 0;
> - buf.page_len = 0;
> - buf.buflen = buf.head->iov_len;
> - buf.len = buf.head->iov_len;
> -
> - xdr_init_decode(&stream, &buf, ptr);
> + scratch = alloc_page(GFP_KERNEL);
> + if (scratch == NULL)
> + return -ENOMEM;
>
> + xdr_init_decode(&stream, &buf, NULL);
> + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE);
>
> do {
> status = xdr_decode(desc, entry, &stream);
> @@ -506,6 +506,8 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en
> } else
> status = PTR_ERR(array);
> }
> +
> + put_page(scratch);
> return status;
> }
>
> @@ -521,7 +523,6 @@ static
> void nfs_readdir_free_large_page(void *ptr, struct page **pages,
> unsigned int npages)
> {
> - vm_unmap_ram(ptr, npages);
> nfs_readdir_free_pagearray(pages, npages);
> }
>
> @@ -530,9 +531,8 @@ void nfs_readdir_free_large_page(void *ptr, struct page **pages,
> * to nfs_readdir_free_large_page
> */
> static
> -void *nfs_readdir_large_page(struct page **pages, unsigned int npages)
> +int nfs_readdir_large_page(struct page **pages, unsigned int npages)
> {
> - void *ptr;
> unsigned int i;
>
> for (i = 0; i < npages; i++) {
> @@ -541,13 +541,11 @@ void *nfs_readdir_large_page(struct page **pages, unsigned int npages)
> goto out_freepages;
> pages[i] = page;
> }
> + return 0;
>
> - ptr = vm_map_ram(pages, npages, 0, PAGE_KERNEL);
> - if (!IS_ERR_OR_NULL(ptr))
> - return ptr;
> out_freepages:
> nfs_readdir_free_pagearray(pages, i);
> - return NULL;
> + return -ENOMEM;
> }
>
> static
> @@ -577,8 +575,8 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page,
> memset(array, 0, sizeof(struct nfs_cache_array));
> array->eof_index = -1;
>
> - pages_ptr = nfs_readdir_large_page(pages, array_size);
> - if (!pages_ptr)
> + status = nfs_readdir_large_page(pages, array_size);
> + if (status < 0)
> goto out_release_array;
> do {
> unsigned int pglen;
> @@ -587,7 +585,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page,
> if (status < 0)
> break;
> pglen = status;
> - status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen);
> + status = nfs_readdir_page_filler(desc, &entry, pages, page, pglen);
> if (status < 0) {
> if (status == -ENOSPC)
> status = 0;
> diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c
> index 5914a19..b382a1b 100644
> --- a/fs/nfs/nfs2xdr.c
> +++ b/fs/nfs/nfs2xdr.c
> @@ -487,12 +487,6 @@ nfs_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_se
>
> entry->d_type = DT_UNKNOWN;
>
> - p = xdr_inline_peek(xdr, 8);
> - if (p != NULL)
> - entry->eof = !p[0] && p[1];
> - else
> - entry->eof = 0;
> -
> return p;
>
> out_overflow:
> diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c
> index f6cc60f..ba91236 100644
> --- a/fs/nfs/nfs3xdr.c
> +++ b/fs/nfs/nfs3xdr.c
> @@ -647,12 +647,6 @@ nfs3_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry, struct nfs_s
> memset((u8*)(entry->fh), 0, sizeof(*entry->fh));
> }
>
> - p = xdr_inline_peek(xdr, 8);
> - if (p != NULL)
> - entry->eof = !p[0] && p[1];
> - else
> - entry->eof = 0;
> -
> return p;
>
> out_overflow:
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index 9f1826b..0662a98 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -6215,12 +6215,6 @@ __be32 *nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry,
> if (verify_attr_len(xdr, p, len) < 0)
> goto out_overflow;
>
> - p = xdr_inline_peek(xdr, 8);
> - if (p != NULL)
> - entry->eof = !p[0] && p[1];
> - else
> - entry->eof = 0;
> -
> return p;
>
> out_overflow:
> diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
> index 498ab93..7783c68 100644
> --- a/include/linux/sunrpc/xdr.h
> +++ b/include/linux/sunrpc/xdr.h
> @@ -201,6 +201,8 @@ struct xdr_stream {
>
> __be32 *end; /* end of available buffer space */
> struct kvec *iov; /* pointer to the current kvec */
> + struct kvec scratch; /* Scratch buffer */
> + struct page **page_ptr; /* pointer to the current page */
> };
>
> extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p);
> @@ -208,7 +210,7 @@ extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes);
> extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages,
> unsigned int base, unsigned int len);
> extern void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p);
> -extern __be32 *xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes);
> +extern void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen);
> extern __be32 *xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes);
> extern void xdr_read_pages(struct xdr_stream *xdr, unsigned int len);
> extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len);
> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> index cd9e841..679cd67 100644
> --- a/net/sunrpc/xdr.c
> +++ b/net/sunrpc/xdr.c
> @@ -552,6 +552,74 @@ void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int b
> }
> EXPORT_SYMBOL_GPL(xdr_write_pages);
>
> +static void xdr_set_iov(struct xdr_stream *xdr, struct kvec *iov,
> + __be32 *p, unsigned int len)
> +{
> + if (len > iov->iov_len)
> + len = iov->iov_len;
> + if (p == NULL)
> + p = (__be32*)iov->iov_base;
> + xdr->p = p;
> + xdr->end = (__be32*)(iov->iov_base + len);
> + xdr->iov = iov;
> + xdr->page_ptr = NULL;
> +}
> +
> +static int xdr_set_page_base(struct xdr_stream *xdr,
> + unsigned int base, unsigned int len)
> +{
> + unsigned int pgnr;
> + unsigned int maxlen;
> + unsigned int pgoff;
> + unsigned int pgend;
> + void *kaddr;
> +
> + maxlen = xdr->buf->page_len;
> + if (base >= maxlen)
> + return -EINVAL;
> + maxlen -= base;
> + if (len > maxlen)
> + len = maxlen;
> +
> + base += xdr->buf->page_base;
> +
> + pgnr = base >> PAGE_SHIFT;
> + xdr->page_ptr = &xdr->buf->pages[pgnr];
> + kaddr = page_address(*xdr->page_ptr);
> +
> + pgoff = base & ~PAGE_MASK;
> + xdr->p = (__be32*)(kaddr + pgoff);
> +
> + pgend = pgoff + len;
> + if (pgend > PAGE_SIZE)
> + pgend = PAGE_SIZE;
> + xdr->end = (__be32*)(kaddr + pgend);
> + xdr->iov = NULL;
> + return 0;
> +}
> +
> +static void xdr_set_next_page(struct xdr_stream *xdr)
> +{
> + unsigned int newbase;
> +
> + newbase = (1 + xdr->page_ptr - xdr->buf->pages) << PAGE_SHIFT;
> + newbase -= xdr->buf->page_base;
> +
> + if (xdr_set_page_base(xdr, newbase, PAGE_SIZE) < 0)
> + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len);
> +}
> +
> +static bool xdr_set_next_buffer(struct xdr_stream *xdr)
> +{
> + if (xdr->page_ptr != NULL)
> + xdr_set_next_page(xdr);
> + else if (xdr->iov == xdr->buf->head) {
> + if (xdr_set_page_base(xdr, 0, PAGE_SIZE) < 0)
> + xdr_set_iov(xdr, xdr->buf->tail, NULL, xdr->buf->len);
> + }
> + return xdr->p != xdr->end;
> +}
> +
> /**
> * xdr_init_decode - Initialize an xdr_stream for decoding data.
> * @xdr: pointer to xdr_stream struct
> @@ -560,41 +628,67 @@ EXPORT_SYMBOL_GPL(xdr_write_pages);
> */
> void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p)
> {
> - struct kvec *iov = buf->head;
> - unsigned int len = iov->iov_len;
> -
> - if (len > buf->len)
> - len = buf->len;
> xdr->buf = buf;
> - xdr->iov = iov;
> - xdr->p = p;
> - xdr->end = (__be32 *)((char *)iov->iov_base + len);
> + xdr->scratch.iov_base = NULL;
> + xdr->scratch.iov_len = 0;
> + if (buf->head[0].iov_len != 0)
> + xdr_set_iov(xdr, buf->head, p, buf->len);
> + else if (buf->page_len != 0)
> + xdr_set_page_base(xdr, 0, buf->len);
> }
> EXPORT_SYMBOL_GPL(xdr_init_decode);
>
> -/**
> - * xdr_inline_peek - Allow read-ahead in the XDR data stream
> - * @xdr: pointer to xdr_stream struct
> - * @nbytes: number of bytes of data to decode
> - *
> - * Check if the input buffer is long enough to enable us to decode
> - * 'nbytes' more bytes of data starting at the current position.
> - * If so return the current pointer without updating the current
> - * pointer position.
> - */
> -__be32 * xdr_inline_peek(struct xdr_stream *xdr, size_t nbytes)
> +static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes)
> {
> __be32 *p = xdr->p;
> __be32 *q = p + XDR_QUADLEN(nbytes);
>
> if (unlikely(q > xdr->end || q < p))
> return NULL;
> + xdr->p = q;
> return p;
> }
> -EXPORT_SYMBOL_GPL(xdr_inline_peek);
>
> /**
> - * xdr_inline_decode - Retrieve non-page XDR data to decode
> + * xdr_set_scratch_buffer - Attach a scratch buffer for decoding data.
> + * @xdr: pointer to xdr_stream struct
> + * @buf: pointer to an empty buffer
> + * @buflen: size of 'buf'
> + *
> + * The scratch buffer is used when decoding from an array of pages.
> + * If an xdr_inline_decode() call spans across page boundaries, then
> + * we copy the data into the scratch buffer in order to allow linear
> + * access.
> + */
> +void xdr_set_scratch_buffer(struct xdr_stream *xdr, void *buf, size_t buflen)
> +{
> + xdr->scratch.iov_base = buf;
> + xdr->scratch.iov_len = buflen;
> +}
> +EXPORT_SYMBOL_GPL(xdr_set_scratch_buffer);
> +
> +static __be32 *xdr_copy_to_scratch(struct xdr_stream *xdr, size_t nbytes)
> +{
> + __be32 *p;
> + void *cpdest = xdr->scratch.iov_base;
> + size_t cplen = (char *)xdr->end - (char *)xdr->p;
> +
> + if (nbytes > xdr->scratch.iov_len)
> + return NULL;
> + memcpy(cpdest, xdr->p, cplen);
> + cpdest += cplen;
> + nbytes -= cplen;
> + if (!xdr_set_next_buffer(xdr))
> + return NULL;
> + p = __xdr_inline_decode(xdr, nbytes);
> + if (p == NULL)
> + return NULL;
> + memcpy(cpdest, p, nbytes);
> + return xdr->scratch.iov_base;
> +}
> +
> +/**
> + * xdr_inline_decode - Retrieve XDR data to decode
> * @xdr: pointer to xdr_stream struct
> * @nbytes: number of bytes of data to decode
> *
> @@ -605,13 +699,16 @@ EXPORT_SYMBOL_GPL(xdr_inline_peek);
> */
> __be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes)
> {
> - __be32 *p = xdr->p;
> - __be32 *q = p + XDR_QUADLEN(nbytes);
> + __be32 *p;
>
> - if (unlikely(q > xdr->end || q < p))
> + if (nbytes == 0)
> + return xdr->p;
> + if (xdr->p == xdr->end && !xdr_set_next_buffer(xdr))
> return NULL;
> - xdr->p = q;
> - return p;
> + p = __xdr_inline_decode(xdr, nbytes);
> + if (p != NULL)
> + return p;
> + return xdr_copy_to_scratch(xdr, nbytes);
> }
> EXPORT_SYMBOL_GPL(xdr_inline_decode);
>
> @@ -671,16 +768,12 @@ EXPORT_SYMBOL_GPL(xdr_read_pages);
> */
> void xdr_enter_page(struct xdr_stream *xdr, unsigned int len)
> {
> - char * kaddr = page_address(xdr->buf->pages[0]);
> xdr_read_pages(xdr, len);
> /*
> * Position current pointer at beginning of tail, and
> * set remaining message length.
> */
> - if (len > PAGE_CACHE_SIZE - xdr->buf->page_base)
> - len = PAGE_CACHE_SIZE - xdr->buf->page_base;
> - xdr->p = (__be32 *)(kaddr + xdr->buf->page_base);
> - xdr->end = (__be32 *)((char *)xdr->p + len);
> + xdr_set_page_base(xdr, 0, len);
> }
> EXPORT_SYMBOL_GPL(xdr_enter_page);
>
> --
> 1.7.3.4
>
>
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>
>

--
Pengutronix e.K. | Uwe Kleine-König |
Industrial Linux Solutions | http://www.pengutronix.de/ |

2011-01-05 23:29:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, Jan 5, 2011 at 3:06 PM, Trond Myklebust
<[email protected]> wrote:
>
> Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(),
> which takes care of invalidating the cache prior to a virtual address
> read.
>
> My question was specifically about the write through the regular kernel
> mapping: according to Russell and my reading of the cachetlb.txt
> documentation, flush_dcache_page() is only guaranteed to have an effect
> on page cache pages.

I don't think that should ever matter. It's not like the hardware can
know whether it's a dcache page or not.

And if the sw implementation cares, it's doing something really odd.
But who knows - there's a lot of crap out there, and people sometimes
do really odd things to work around the brokenness of a VIVT cache
with aliases.

Linus

2011-01-05 20:34:00

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, 2011-01-05 at 20:00 +0000, Russell King - ARM Linux wrote:
> On Wed, Jan 05, 2011 at 01:36:09PM -0600, James Bottomley wrote:
> > On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote:
> > > On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley
> > > <[email protected]> wrote:
> > > >
> > > > I think the solution for the kernel direct mapping problem is to take
> > > > the expected flushes and invalidates into kmap/kunmap[_atomic].
> > >
> > > No, we really can't do that. Most of the time, the kmap() is the only
> > > way we access the page anyway, so flushing things would just be
> > > stupid. Why waste time and energy on doing something pointless?
> >
> > It's hardly pointless. The kmap sets up an inequivalent alias in the
> > cache.
>
> No it doesn't. For pages which are inaccessible, it sets up a mapping
> for those pages. On aliasing cache architectures, when you tear down
> such a mapping, you have to flush the cache before you do so (otherwise
> you can end up with cache lines existing in the cache for inaccessible
> mappings.)
>
> For lowmem pages, kmap() (should always) bypass the 'setup mapping' stage
> because all lowmem pages are already accessible. So kunmap() doesn't
> do anything - just like the !HIGHMEM implementation for these macros.

Well, that depends. For us on parisc, kmap of a user page in !HIGHMEM
sets up an inequivalent alias still ... because the cache colour of the
user and kernel virtual addresses are different. Depending on the
return path to userspace, we still usually have to flush to get the user
to see the changes the kernel has made.

James

> So, for highmem-enabled systems:
>
> low_addr = kmap_atomic(lowmem_page);
> high_addr = kmap_atomic(highmem_page);
>
> results in low_addr in the kernel direct-mapped region, and high_addr
> in the kmap_atomic region.



2011-01-10 16:25:56

by Myklebust, Trond

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote:
> Hi Trond,
>
> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote:
> > -----------------------------------------------------------------------------------
> It would be great if you could add a "8<" in the line above next time.
> Then git-am -c does the right thing (at least I think so).

Sorry. I wasn't aware of that particular idiom. So something like

8<------------------------------------------------------------
From: Trond Myklebust <[email protected]>
Subject: .....


--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-01-10 17:09:10

by Marc Kleine-Budde

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On 01/10/2011 05:25 PM, Trond Myklebust wrote:
> On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote:
>> Hi Trond,
>>
>> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote:
>>> -----------------------------------------------------------------------------------
>> It would be great if you could add a "8<" in the line above next time.
>> Then git-am -c does the right thing (at least I think so).
>
> Sorry. I wasn't aware of that particular idiom. So something like
>
> 8<------------------------------------------------------------
> From: Trond Myklebust <[email protected]>
> Subject: .....

Yes.

From "man git-mailinfo":
"A line that mainly consists of scissors (either ">8" or "8<") and
perforation (dash "-")"

BTW:
Is this patch a candidate for stable?

regards, Marc

--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |



2011-01-07 18:53:29

by Myklebust, Trond

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Thu, 2011-01-06 at 09:55 -0800, Linus Torvalds wrote:
> On Thu, Jan 6, 2011 at 9:47 AM, Trond Myklebust
> <[email protected]> wrote:
> >
> > Why is this line needed? We're not writing through the virtual mapping.
>
> I haven't looked at the sequence of accesses, but you need to be
> _very_ aware that "write-through" is absolutely NOT sufficient for
> cache coherency.
>
> In cache coherency, you have three options:
>
> - true coherency (eg physically indexed/tagged caches)
>
> - exclusion (eg virtual caches, but with an exclusion guarantee that
> guarantees that aliases cannot happen: either by using physical
> tagging or by not allowing cases that could cause virtual aliases)
>
> - write-through AND non-cached reads (ie "no caching at all").
>
> You seem to be forgetting the "no cached reads" part. It's not
> sufficient to flush after a write - you need to make sure that you
> also don't have a cached copy of the alias for the read.
>
> So "We're not writing through the virtual mapping" is NOT a sufficient
> excuse. If you're reading through the virtual mapping, you need to
> make sure that the virtual mapping is flushed _after_ any writes
> through any other mapping and _before_ any reads through the virtual
> one.

I'm aware of that. That part should be taken care of by the call to
invalidate_kernel_vmap_range() which was in both James and my patch.

There is already code in the SUNRPC layer that calls flush_dcache_page()
after writing (although as Russell pointed out earlier, that is
apparently a no-op for non-page cache pages such as these).

> This is why you really really really generally don't want to have
> aliasing. Purely virtual caches are pure crap. Really.

Well, it looks as if NOMMU is giving us problems due to the lack of a
vm_map_ram() (see https://bugzilla.kernel.org/show_bug.cgi?id=26262).

I'd still like to keep the existing code for those architectures that
don't have problems, since that allows us to send 32k READDIR requests
instead of being limited to 4k. For large directories, that is a clear
win.
For the NOMMU case we will just go back to using a single page for
storage (and 4k READDIR requests only). Should I just do the same for
architectures like ARM and PARISC?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-01-05 23:07:03

by Myklebust, Trond

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, 2011-01-05 at 13:30 -0800, Linus Torvalds wrote:
> On Wed, Jan 5, 2011 at 1:16 PM, Trond Myklebust
> <[email protected]> wrote:
> >
> > So what should be the preferred way to ensure data gets flushed when
> > you've written directly to a page, and then want to read through the
> > vm_map_ram() virtual range? Should we be adding new semantics to
> > flush_kernel_dcache_page()?
>
> The "preferred way" is actually simple: "don't do that". IOW, if some
> page is accessed through a virtual mapping you've set up, then
> _always_ access it through that virtual mapping.
>
> Now, when that is impossible (and yes, it sometimes is), then you
> should flush after doing all writes. And if you do the write through
> the regular kernel mapping, you should use flush_dcache_page(). And if
> you did it through the virtual mapping, you should use
> "flush_kernel_vmap_range()" or whatever.
>
> NOTE! I really didn't look those up very closely, and if the accesses
> can happen concurrently you are basically screwed, so you do need to
> do locking or something else to guarantee that there is some nice
> sequential order. And maybe I forgot something. Which is why I do
> suggest "don't do that" as a primary approach to the problem if at all
> possible.
>
> Oh, and you may need to flush before reading too (and many writes do
> end up being "read-modify-write" cycles) in case it's possible that
> you have stale data from a previous read that was then invalidated by
> a write to the aliasing address. Even if that write was flushed out,
> the stale read data may exist at the virtual address. I forget what
> all we required - in the end the only sane model is "virtual caches
> suck so bad that anybody who does them should be laughed at for being
> a retard".

Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(),
which takes care of invalidating the cache prior to a virtual address
read.

My question was specifically about the write through the regular kernel
mapping: according to Russell and my reading of the cachetlb.txt
documentation, flush_dcache_page() is only guaranteed to have an effect
on page cache pages.
flush_kernel_dcache_page() (not to be confused with flush_dcache_page)
would appear to be the closest fit according to my reading of the
documentation, however the ARM implementation appears to be a no-op...

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-01-07 19:03:05

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Fri, Jan 07, 2011 at 01:53:25PM -0500, Trond Myklebust wrote:
> I'd still like to keep the existing code for those architectures that
> don't have problems, since that allows us to send 32k READDIR requests
> instead of being limited to 4k. For large directories, that is a clear
> win.
> For the NOMMU case we will just go back to using a single page for
> storage (and 4k READDIR requests only). Should I just do the same for
> architectures like ARM and PARISC?

I think you said that readdir reads via the vmalloc mapping of the
group of pages, but XDR writes to the individual pages.

As I understand NFS, you receive a packet, you then have to use XDR
to unpack the data, which you presumably write into the set of
struct page *'s using kmap?

Isn't a solution to have XDR write directly into the vmalloc mapping
rather than using struct page * and kmap?

2011-01-06 00:00:29

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, Jan 05, 2011 at 03:28:53PM -0800, Linus Torvalds wrote:
> On Wed, Jan 5, 2011 at 3:06 PM, Trond Myklebust
> <[email protected]> wrote:
> >
> > Yes. The fix I sent out was a call to invalidate_kernel_vmap_range(),
> > which takes care of invalidating the cache prior to a virtual address
> > read.
> >
> > My question was specifically about the write through the regular kernel
> > mapping: according to Russell and my reading of the cachetlb.txt
> > documentation, flush_dcache_page() is only guaranteed to have an effect
> > on page cache pages.
>
> I don't think that should ever matter. It's not like the hardware can
> know whether it's a dcache page or not.
>
> And if the sw implementation cares, it's doing something really odd.

From the hardware perspective you're correct that it doesn't. However,
from the efficient implementation perspective it does matter.

Take for example the read-ahead done on block devices. We don't want to
flush all those pages that were read in when we don't know that they're
ever going to end up in a user mapping. So what's commonly done (as
suggested by DaveM) is that flush_dcache_page() detects that it's a
dcache page, ensures that there's no user mappings, and sets a 'dirty'
flag. This flag is guaranteed to be clear when new, clean, unread
pages enter the page cache.

When the page eventually ends up in a user mapping, that dirty flag is
checked and the necessary cache flushing done at that point.

Note that when there are user mappings, flush_dcache_page() has to flush
those mappings too, otherwise mmap() <-> read()/write() coherency breaks.
I believe this was what flush_dcache_page() was created to resolve.

flush_kernel_dcache_page() was to solve the problem of PIO drivers
writing to dcache pages, so that data written into the kernel mapping
would be visible to subsequent user mappings.

We chose a different overall approach - which had already been adopted by
PPC - where we invert the meaning of this 'dirty' bit to mean that it's
clean. So every new page cache page starts out life as being marked
dirty and so nothing needs to be done at flush_kernel_dcache_page().
We continue to use davem's optimization but with the changed meaning of
the bit, but as we now support SMP we do the flushing at set_pte_at()
time.

This also means that we don't have to rely on the (endlessly) buggy PIO
drivers remembering to add flush_kernel_dcache_page() calls - something
which has been a source of constant never-ending pain for us.

The final piece of the jigsaw is flush_anon_page() which deals with
kernel<->user coherency for anonymous pages by flushing both the user
and kernel sides of the mapping. This was to solve direct-io coherency
problems.

As the users of flush_anon_page() always do:

flush_anon_page(vma, page, addr);
flush_dcache_page(page);

and documentation doesn't appear to imply that this will always be the
case, we restrict flush_dcache_page() to only work on page cache pages,
otherwise we end up flushing the kernel-side mapping multiple times in
succession.

Maybe we should make flush_anon_page() only flush the user mapping,
stipulate that it shall always be followed by flush_dcache_page(),
which shall flush the kernel side mapping even for anonymous pages?
That sounds to me like a recipe for missing flush_dcache_page() calls
causing bugs.
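
As a rough paraphrase of the deferred-flush scheme described above
(simplified pseudocode, not the actual ARM implementation; PG_dcache_clean
is the inverted-meaning bit mentioned earlier):

        void flush_dcache_page(struct page *page)
        {
                struct address_space *mapping = page_mapping(page);

                if (mapping && !mapping_mapped(mapping)) {
                        /* no user mappings yet: just record that the kernel
                         * alias may be dirty and defer the flush */
                        clear_bit(PG_dcache_clean, &page->flags);
                } else {
                        __flush_dcache_page(mapping, page);     /* flush the kernel alias now */
                        set_bit(PG_dcache_clean, &page->flags);
                        /* (the real code also flushes any user-space aliases here) */
                }
        }

        /* ...and whoever later establishes the first user mapping (on ARM,
         * at set_pte_at()/update_mmu_cache() time) picks up the deferred
         * flush: */
        if (!test_and_set_bit(PG_dcache_clean, &page->flags))
                __flush_dcache_page(page_mapping(page), page);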

2011-01-08 23:17:06

by Myklebust, Trond

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Sat, 2011-01-08 at 11:49 -0500, Trond Myklebust wrote:
> On Fri, 2011-01-07 at 13:11 -0600, James Bottomley wrote:
> > On the other hand, the xdr routines, since they take the pages anyway,
> > could use a scatterlist approach to writing through the kernel mapping
> > instead of using vmap ... we have all the machinery for this in
> > lib/scatterlist.c ... it's not designed for this case, since it's
> > designed to allow arbitrary linear reads and writes on a block
> > scatterlist, but the principle is the same ... it looks like it would be
> > rather a big patch, though ...
>
> The following alternative seems to work for me, but has only been
> lightly tested so far. It's a bit large for a stable patch, but not too
> ungainly.
>
> It modifies xdr_stream, adding the ability to iterate through page data.
> To avoid kmap()/kunmap(), it does require that pages be allocated in
> lowmem, but since the only use case here is when using page arrays as
> temporary buffers, that seems like an acceptable compromise.

...and here is an update which makes the whole process transparent to
the decoder. It basically teaches xdr_inline_decode() how to switch
buffers when it hits the end of the current iovec and/or page.

Cheers
Trond
-----------------------------------------------------------------------------------

2011-01-10 17:20:53

by Myklebust, Trond

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Mon, 2011-01-10 at 18:08 +0100, Marc Kleine-Budde wrote:
> On 01/10/2011 05:25 PM, Trond Myklebust wrote:
> > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote:
> >> Hi Trond,
> >>
> >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote:
> >>> -----------------------------------------------------------------------------------
> >> It would be great if you could add a "8<" in the line above next time.
> >> Then git-am -c does the right thing (at least I think so).
> >
> > Sorry. I wasn't aware of that particular idiom. So something like
> >
> > 8<------------------------------------------------------------
> > From: Trond Myklebust <[email protected]>
> > Subject: .....
>
> Yes.
>
> From "man git-mailinfo":
> "A line that mainly consists of scissors (either ">8" or "8<") and
> perforation (dash "-")"
>
> BTW:
> Is this patch a candidate for stable?

Yes. I'm planning on sending it to the stable list after Linus merges it
into mainline.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-01-06 17:51:39

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Thu, 2011-01-06 at 12:47 -0500, Trond Myklebust wrote:
> On Thu, 2011-01-06 at 11:40 -0600, James Bottomley wrote:
> > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote:
> > > Can you explain how the code works? it looks to me like you read the xdr
> > > stuff through the vmap region then write it out directly to the pages?
> >
> > OK, I think I see how this is supposed to work: It's a sequential loop
> > of reading in via the pages (i.e. through the kernel mapping) and then
> > updating those pages via the vmap. In which case, I think this patch is
> > what you need.
> >
> > The theory of operation is that the readdir on pages actually uses the
> > network DMA operations to perform, so when it's finished, the underlying
> > page is up to date. After this you invalidate the vmap range, so we
> > have no cache lines above it (so it picks up the values from the
> > uptodate page). Finally, after the operation on the vmap region has
> > finished, you flush it so that any updated contents go back to the pages
> > themselves before the next iteration begins.
> >
> > Does this look right to people? I've verified it fixes the issues on
> > parisc.
> >
> > James
> >
> > ---
> >
> > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > index 996dd89..bde1911 100644
> > --- a/fs/nfs/dir.c
> > +++ b/fs/nfs/dir.c
> > @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page,
> > if (status < 0)
> > break;
> > pglen = status;
> > +
> > + invalidate_kernel_vmap_range(pages_ptr, pglen);
> > +
> > status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen);
> > if (status < 0) {
> > if (status == -ENOSPC)
> > status = 0;
> > break;
> > }
> > + flush_kernel_vmap_range(pages_ptr, pglen);
>
> Why is this line needed? We're not writing through the virtual mapping.

If you're not altering it, it isn't ... the problem on parisc is that
invalidate is a nop for us because flush does it all, but I can fix
that.

James

> We checked using just the invalidate_kernel_vmap_range(), and that
> appeared to suffice to fix the problem on ARM.
>
> Cheers
> Trond



2011-01-05 19:36:15

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote:
> On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley
> <[email protected]> wrote:
> >
> > I think the solution for the kernel direct mapping problem is to take
> > the expected flushes and invalidates into kmap/kunmap[_atomic].
>
> No, we really can't do that. Most of the time, the kmap() is the only
> way we access the page anyway, so flushing things would just be
> stupid. Why waste time and energy on doing something pointless?

It's hardly pointless. The kmap sets up an inequivalent alias in the
cache. When you write to the kmap region, you dirty the CPU caches for
that alias. If you tear down the mapping without flushing, the CPU will
write out the cache lines at its leisure. If you access the line via
the other mapping *before* the CPU does writeout, you see stale data.

When the kernel dirties a kmap region, it always has to flush somehow
before kunmap. One of the problems here is that that flush isn't in the
NFS code.

> In fact, kmap() here is a total non-issue. It's not the kmap() that
> introduces any virtual aliases, and never has been. It's the
> "vm_map_ram()" that is the problem. Unlike the kmap(), that really
> _does_ introduce a virtual alias, and is a problem for any virtual
> cache.
>
> So don't blame kmap(). It's innocent and irrelevant - the bug could
> happen entirely without it (think a 64-bit address space that doesn't
> even _have_ kmap, but has software that mixes vm_map_ram() with
> non-mapped accesses).

I didn't say it was entirely kmap's problem ... I just said, can't we
simplify some of this by consolidating the flushing into the interfaces?

James



2011-01-05 21:16:14

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, 2011-01-05 at 12:48 -0800, Linus Torvalds wrote:
> On Wed, Jan 5, 2011 at 12:33 PM, James Bottomley
> <[email protected]> wrote:
> >
> > Well, that depends. For us on parisc, kmap of a user page in !HIGHMEM
> > sets up an inequivalent alias still ... because the cache colour of the
> > user and kernel virtual addresses are different. Depending on the
> > return path to userspace, we still usually have to flush to get the user
> > to see the changes the kernel has made.
>
> Umm. Again, that has nothing to do with kmap().
>
> This time it's about the user space mapping.
>
> Repeat after me: even without the kmap(), the kernel access to that
> mapping would have caused cache aliases.
>
> See? Once more, the kmap() is entirely innocent. You can have a
> non-highmem mapping that you never use kmap for, and that you map into
> user space, and you'd see exactly the same aliases. Notice? Look ma,
> no kmap().

Yes, I understand that (we have no highmem on parisc, so kmap is a nop).
The problem (at least as I see it) is that once something within the
kernel (well, OK, mostly within drivers) touches a user page via its
kernel mapping, the flush often gets forgotten (mainly because it always
works on x86). What I was thinking about is that every time the kernel
touches a user space page, it has to be within a kmap/kunmap pair
(because the page might be highmem) ... so it's possible to make
kmap/kunmap do the flushing for this case so the driver writer can't
ever forget it.

I think the problem case is only really touching scatter/gather elements
outside of the DMA API (i.e. the driver PIO case), so this may be
overkill. Russell also pointed out that a lot of the PIO iterators do
excessive kmap_atomic/kunmap_atomic on the same page, so adding a flush
could damage performance to the point where the flash root devices on
ARM might not work. Plus the PIO iterators already contain the
appropriate flush, so perhaps just using them in every case fixes the
problem.
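
(For reference, the per-page pattern those PIO iterators use is roughly
the following - simplified, not lifted from any specific driver, with
'data'/'offset'/'len' standing in for the usual per-page copy loop
variables:)

        char *buf = kmap_atomic(page, KM_USER0);

        memcpy(buf + offset, data, len);        /* PIO write via the kernel mapping */
        flush_kernel_dcache_page(page);         /* make it visible to other aliases */
        kunmap_atomic(buf, KM_USER0);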

> So clearly kmap() is not the issue. The issue continues to be a
> totally separate virtual mapping. Whether it's a user mapping or
> vm_map_ram() is obviously immaterial - as far as the CPU is concerned,
> there is no difference between the two (apart from the trivial
> differences of virtual location and permissions).
>
> (You can also force the problem with vmalloc() and then following the
> kernel page tables, but I hope nobody does that any more. I suspect
> I'm wrong, though, there's probably code that mixes vmalloc and
> physical page accesses in various drivers)

Yes, unfortunately, we have seen this quite a bit; mainly to get large
buffers. It's not just confined to drivers: xfs used to fail on both
arm and parisc because it used a vmalloc region for its log buffer which
it then had to do I/O on.

James



2011-01-06 18:25:45

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Thu, 2011-01-06 at 12:14 -0600, James Bottomley wrote:
> On Thu, 2011-01-06 at 18:05 +0000, Russell King - ARM Linux wrote:
> > What network DMA operations - what if your NIC doesn't do DMA because
> > it's an SMSC device?
>
> So this is the danger area ... we might be caught by our own flushing
> tricks. I can't test this on parisc since all my network drivers use
> DMA (which automatically coheres the kernel mapping by
> flush/invalidate).
>
> What should happen is that the kernel mapping pages go through the
> ->readdir() path. Any return from this has to be ready to map the pages
> back to user space, so the kernel alias has to be flushed to make the
> underlying page up to date.
>
> The exception is pages we haven't yet mapped to userspace. Here we set
> the PG_dcache_dirty bit (sparc trick) but don't flush the page, since we
> expect the addition of a userspace mapping will detect this case and do
> the flush and clear the bit before the mapping goes live. I assume
> you're thinking that because this page is allocated and freed internally
> to NFS, it never gets a userspace mapping and therefore, we can return
> from ->readdir() with a dirty kernel cache (and the corresponding flag
> set)? I think that is a possible hypothesis in certain cases.

OK, so thinking about this, it seems that the only danger is actually
what NFS is doing: reading cache pages via a vmap. In that case, since
the requirement is to invalidate the vmap range to prepare for read, we
could have invalidate_kernel_vmap_range loop over the underlying pages
and flush them through the kernel alias if the architecture specific
flag indicates their contents might be dirty.

The loop adds expense that is probably largely unnecessary to
invalidate_kernel_vmap_range() but the alternative is adding to the API
proliferation with something that only flushes the kernel pages if the
arch specific flag says they're dirty.

James



2011-01-10 19:31:55

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Mon, 2011-01-10 at 14:29 -0500, Trond Myklebust wrote:
> I usually do this, but there is a slight problem with that approach:
> Greg gets to do all the work of figuring out to which stable kernels
> this particular patch applies. In this case, since we're only talking
> about the 2.6.37 kernel, I prefer to use the mailing lists.

So the non-standard, but accepted way of doing this is

Cc: Stable Tree <[email protected]> [2.6.37]

James



2011-01-06 18:14:35

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Thu, 2011-01-06 at 18:05 +0000, Russell King - ARM Linux wrote:
> On Thu, Jan 06, 2011 at 11:40:13AM -0600, James Bottomley wrote:
> > On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote:
> > > Can you explain how the code works? it looks to me like you read the xdr
> > > stuff through the vmap region then write it out directly to the pages?
> >
> > OK, I think I see how this is supposed to work: It's a sequential loop
> > of reading in via the pages (i.e. through the kernel mapping) and then
> > updating those pages via the vmap. In which case, I think this patch is
> > what you need.
> >
> > The theory of operation is that the readdir on pages actually uses the
> > network DMA operations to perform, so when it's finished, the underlying
>
> What network DMA operations - what if your NIC doesn't do DMA because
> it's an SMSC device?

So this is the danger area ... we might be caught by our own flushing
tricks. I can't test this on parisc since all my network drivers use
DMA (which automatically coheres the kernel mapping by
flush/invalidate).

What should happen is that the kernel mapping pages go through the
->readdir() path. Any return from this has to be ready to map the pages
back to user space, so the kernel alias has to be flushed to make the
underlying page up to date.

The exception is pages we haven't yet mapped to userspace. Here we set
the PG_dcache_dirty bit (sparc trick) but don't flush the page, since we
expect the addition of a userspace mapping will detect this case and do
the flush and clear the bit before the mapping goes live. I assume
you're thinking that because this page is allocated and freed internally
to NFS, it never gets a userspace mapping and therefore, we can return
from ->readdir() with a dirty kernel cache (and the corresponding flag
set)? I think that is a possible hypothesis in certain cases.

James



2011-01-05 21:09:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, Jan 5, 2011 at 1:04 PM, Russell King - ARM Linux
<[email protected]> wrote:
> On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote:
>> (You can also force the problem with vmalloc() and then following the
>> kernel page tables, but I hope nobody does that any more. I suspect
>> I'm wrong, though, there's probably code that mixes vmalloc and
>> physical page accesses in various drivers)
>
> Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be
> deprecated then? ;)

I do think that the "modern" way of doing it is
"vmap()"/"vm_map_ram()" and friends, and it should be preferred over
using vmalloc() and then looking up the pages.

But in the end, the two approaches really are equivalent, so it's not
like it really matters. So I don't think we need to deprecate things
officially, but obviously we should make people more aware of the
whole virtual alias thing that crops up whenever you use any of these
approaches.
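
(Roughly, the two equivalent-but-aliasing approaches being contrasted:)

        /* the "modern" way: map an existing page array when a linear view
         * is needed, then drop the mapping again */
        void *va = vm_map_ram(pages, npages, -1, PAGE_KERNEL);
        /* ... use va ... */
        vm_unmap_ram(va, npages);

        /* the older way: allocate with vmalloc() and dig the backing
         * pages back out of the kernel page tables */
        void *vb = vmalloc(npages * PAGE_SIZE);
        struct page *pg = vmalloc_to_page(vb);  /* page behind the first PAGE_SIZE */

        /* either way, va/vb are virtual aliases of the pages' kernel
         * direct mapping, with all the cache-aliasing consequences */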

Linus

2011-01-10 19:35:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

2011/1/10 Trond Myklebust <[email protected]>:
>
> I usually do this, but there is a slight problem with that approach:
> Greg gets to do all the work of figuring out to which stable kernels
> this particular patch applies. In this case, since we're only talking
> about the 2.6.37 kernel, I prefer to use the mailing lists.

Just do

Cc: [email protected] [2.6.37]

or similar. It's quite common.

So EVEN IF you want to email people around about the patch separately,
do add the "Cc: stable" marker. It's worthwhile information about the
patch for everybody involved.

Linus

2011-01-05 21:31:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, Jan 5, 2011 at 1:16 PM, Trond Myklebust
<[email protected]> wrote:
>
> So what should be the preferred way to ensure data gets flushed when
> you've written directly to a page, and then want to read through the
> vm_map_ram() virtual range? Should we be adding new semantics to
> flush_kernel_dcache_page()?

The "preferred way" is actually simple: "don't do that". IOW, if some
page is accessed through a virtual mapping you've set up, then
_always_ access it through that virtual mapping.

Now, when that is impossible (and yes, it sometimes is), then you
should flush after doing all writes. And if you do the write through
the regular kernel mapping, you should use flush_dcache_page(). And if
you did it through the virtual mapping, you should use
"flush_kernel_vmap_range()" or whatever.

NOTE! I really didn't look those up very closely, and if the accesses
can happen concurrently you are basically screwed, so you do need to
do locking or something else to guarantee that there is some nice
sequential order. And maybe I forgot something. Which is why I do
suggest "don't do that" as a primary approach to the problem if at all
possible.

Oh, and you may need to flush before reading too (and many writes do
end up being "read-modify-write" cycles) in case it's possible that
you have stale data from a previous read that was then invalidated by
a write to the aliasing address. Even if that write was flushed out,
the stale read data may exist at the virtual address. I forget what
all we required - in the end the only sane model is "virtual caches
suck so bad that anybody who does them should be laughed at for being
a retard".

Linus

2011-01-06 18:05:47

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Thu, Jan 06, 2011 at 11:40:13AM -0600, James Bottomley wrote:
> On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote:
> > Can you explain how the code works? it looks to me like you read the xdr
> > stuff through the vmap region then write it out directly to the pages?
>
> OK, I think I see how this is supposed to work: It's a sequential loop
> of reading in via the pages (i.e. through the kernel mapping) and then
> updating those pages via the vmap. In which case, I think this patch is
> what you need.
>
> The theory of operation is that the readdir on pages actually uses the
> network DMA operations to perform, so when it's finished, the underlying

What network DMA operations - what if your NIC doesn't do DMA because
it's an SMSC device?

2011-01-10 19:29:32

by Myklebust, Trond

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Mon, 2011-01-10 at 20:25 +0100, Uwe Kleine-König wrote:
> Hello Trond,
>
> On Mon, Jan 10, 2011 at 12:20:35PM -0500, Trond Myklebust wrote:
> > On Mon, 2011-01-10 at 18:08 +0100, Marc Kleine-Budde wrote:
> > > On 01/10/2011 05:25 PM, Trond Myklebust wrote:
> > > > On Mon, 2011-01-10 at 11:50 +0100, Uwe Kleine-König wrote:
> > > >> Hi Trond,
> > > >>
> > > >> On Sat, Jan 08, 2011 at 06:15:51PM -0500, Trond Myklebust wrote:
> > > >>> -----------------------------------------------------------------------------------
> > > >> It would be great if you could add a "8<" in the line above next time.
> > > >> Then git-am -c does the right thing (at least I think so).
> > > >
> > > > Sorry. I wasn't aware of that particular idiom. So something like
> > > >
> > > > 8<------------------------------------------------------------
> > > > From: Trond Myklebust <[email protected]>
> > > > Subject: .....
> > >
> > > Yes.
> > >
> > > From "man git-mailinfo":
> > > "A line that mainly consists of scissors (either ">8" or "8<") and
> > > perforation (dash "-")"
> > >
> > > BTW:
> > > Is this patch a candidate for stable?
> >
> > Yes. I'm planning on sending it to the stable list after Linus merges it
> > into mainline.
> So there is another idiom for you: just put
>
> Cc: [email protected]
>
> in the S-o-b block and Greg will pick it off "automatically". (Just in
> case you don't know, and if you do, maybe someone else learned
> something.)

I usually do this, but there is a slight problem with that approach:
Greg gets to do all the work of figuring out to which stable kernels
this particular patch applies. In this case, since we're only talking
about the 2.6.37 kernel, I prefer to use the mailing lists.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-01-05 21:05:34

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote:
> (You can also force the problem with vmalloc() and then following the
> kernel page tables, but I hope nobody does that any more. I suspect
> I'm wrong, though, there's probably code that mixes vmalloc and
> physical page accesses in various drivers)

Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be
deprecated then? ;)

However, we seem to have new ways of doing this - rather than asking
vmalloc() for some memory, and then getting at the pages by following
the page tables, we now have ways to create mappings using arrays of
struct pages and access them via their already known mappings:

- vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot)
- map_kernel_range_noflush(unsigned long addr, unsigned long size, pgprot_t prot, struct page **pages)
- map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages)
- vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot)

So really it's the same problem, just created by some other easier
to use methods.
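
Roughly, the two styles look like this (a sketch only; error handling
omitted, the page counts are arbitrary, and lowmem pages are assumed
where the direct mapping is touched):

#include <linux/vmalloc.h>
#include <linux/mm.h>

static void vmap_styles(void)
{
	struct page *first, *pages[4];
	void *buf, *map;
	int i;

	/* Old style: let vmalloc() build the mapping, then recover the
	 * backing struct pages by walking the kernel page tables. */
	buf = vmalloc(4 * PAGE_SIZE);
	first = vmalloc_to_page(buf);               /* page backing the first 4K */
	memset(page_address(first), 0, PAGE_SIZE);  /* direct-map access: this is
	                                             * the aliasing mix           */

	/* Newer style: allocate the pages yourself, then map the array
	 * you already hold. */
	for (i = 0; i < 4; i++)
		pages[i] = alloc_page(GFP_KERNEL);
	map = vm_map_ram(pages, 4, -1, PAGE_KERNEL);

	/* Either way the same physical pages have both a direct-map address
	 * and a vmalloc-space alias, so the same flushing rules apply. */

	vm_unmap_ram(map, 4);
	vfree(buf);
	for (i = 0; i < 4; i++)
		__free_page(pages[i]);
}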

2011-01-05 19:50:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

2011/1/5 James Bottomley <[email protected]>:
>>
>> No, we really can't do that. Most of the time, the kmap() is the only
>> way we access the page anyway, so flushing things would just be
>> stupid. Why waste time and energy on doing something pointless?
>
> It's hardly pointless. The kmap sets up an inequivalent alias in the
> cache.

NO IT DOES NOT.

Stop arguing, when you are so wrong.

kmap() does not create any aliases. For low-memory, it just returns
the physical address. No alias. And for high memory, there is no
equivalent low memory address to alias _with_.

Now, when you actually mix multiple kmap's and you have a virtually
based cache, then the kmap's obviously need to flush that particular
page when switching between each other. But that has nothing to do
with the actual page being kmap'ed, it's entirely an internal issue
about the particular virtual memory area being re-used. And ARM (and
any other virtually based CPU) already does that in __kunmap_atomic().

But notice the case of the low-mem. And understand that you are WRONG
about the "inequivalent alias" thing.

So I repeat: this has absolutely *NOTHING* to do with kmap(). Stop blathering.

It's _purely_ an issue of vm_map_ram(). Nothing else.

Linus

2011-01-05 19:18:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley
<[email protected]> wrote:
>
> I think the solution for the kernel direct mapping problem is to take
> the expected flushes and invalidates into kmap/kunmap[_atomic].

No, we really can't do that. Most of the time, the kmap() is the only
way we access the page anyway, so flushing things would just be
stupid. Why waste time and energy on doing something pointless?

In fact, kmap() here is a total non-issue. It's not the kmap() that
introduces any virtual aliases, and never has been. It's the
"vm_map_ram()" that is the problem. Unlike the kmap(), that really
_does_ introduce a virtual alias, and is a problem for any virtual
cache.

So don't blame kmap(). It's innocent and irrelevant - the bug could
happen entirely without it (think a 64-bit address space that doesn't
even _have_ kmap, but has software that mixes vm_map_ram() with
non-mapped accesses).

Linus

2011-01-05 21:16:59

by Myklebust, Trond

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, 2011-01-05 at 13:08 -0800, Linus Torvalds wrote:
> On Wed, Jan 5, 2011 at 1:04 PM, Russell King - ARM Linux
> <[email protected]> wrote:
> > On Wed, Jan 05, 2011 at 12:48:32PM -0800, Linus Torvalds wrote:
> >> (You can also force the problem with vmalloc() and then following the
> >> kernel page tables, but I hope nobody does that any more. I suspect
> >> I'm wrong, though, there's probably code that mixes vmalloc and
> >> physical page accesses in various drivers)
> >
> > Should vmalloc_to_page() (84 users)/vmalloc_to_pfn() (17 users) be
> > deprecated then? ;)
>
> I do think that the "modern" way of doing it is
> "vmap()"/"vm_map_ram()" and friends, and it should be preferred over
> using vmalloc() and then looking up the pages.
>
> But in the end, the two approaches really are equivalent, so it's not
> like it really matters. So I don't think we need to deprecate things
> officially, but obviously we should make people more aware of the
> whole virtual alias thing that crops up whenever you use any of these
> approaches.

So what should be the preferred way to ensure data gets flushed when
you've written directly to a page, and then want to read through the
vm_map_ram() virtual range? Should we be adding new semantics to
flush_kernel_dcache_page()?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-01-06 20:36:06

by John Stoffel

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

>>>>> "James" == James Bottomley <[email protected]> writes:

James> On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote:
>> Can you explain how the code works? it looks to me like you read the xdr
>> stuff through the vmap region then write it out directly to the pages?

James> OK, I think I see how this is supposed to work: It's a
James> sequential loop of reading in via the pages (i.e. through the
James> kernel mapping) and then updating those pages via the vmap. In
James> which case, I think this patch is what you need.

James> The theory of operation is that the readdir on pages actually
James> uses the network DMA operations to perform, so when it's
James> finished, the underlying page is up to date. After this you
James> invalidate the vmap range, so we have no cache lines above it
James> (so it picks up the values from the uptodate page). Finally,
James> after the operation on the vmap region has finished, you flush
James> it so that any updated contents go back to the pages themselves
James> before the next iteration begins.

You need to re-spin this patch to include the above description of
the magic steps you're taking here, or at least document more clearly
somewhere why you need to take these funky steps.

John


James> James

James> ---

James> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
James> index 996dd89..bde1911 100644
James> --- a/fs/nfs/dir.c
James> +++ b/fs/nfs/dir.c
James> @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page,
James> if (status < 0)
James> break;
James> pglen = status;
James> +
James> + invalidate_kernel_vmap_range(pages_ptr, pglen);
James> +
James> status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen);
James> if (status < 0) {
James> if (status == -ENOSPC)
James> status = 0;
James> break;
James> }
James> + flush_kernel_vmap_range(pages_ptr, pglen);
James> } while (array->eof_index < 0);

James> nfs_readdir_free_large_page(pages_ptr, pages, array_size);



2011-01-07 19:11:09

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Fri, 2011-01-07 at 19:02 +0000, Russell King - ARM Linux wrote:
> On Fri, Jan 07, 2011 at 01:53:25PM -0500, Trond Myklebust wrote:
> > I'd still like to keep the existing code for those architectures that
> > don't have problems, since that allows us to send 32k READDIR requests
> > instead of being limited to 4k. For large directories, that is a clear
> > win.
> > For the NOMMU case we will just go back to using a single page for
> > storage (and 4k READDIR requests only). Should I just do the same for
> > architectures like ARM and PARISC?
>
> I think you said that readdir reads via the vmalloc mapping of the
> group of pages, but XDR writes to the individual pages.

Actually it's the other way around, but the point still stands.

> As I understand NFS, you receive a packet, you then have to use XDR
> to unpack the data, which you presumably write into the set of
> struct page *'s using kmap?
>
> Isn't a solution to have XDR write directly into the vmalloc mapping
> rather than using struct page * and kmap?

So, unfortunately, I looked at doing this and we can't. The ->readdir()
call takes an array of pages, not a kernel virtual address of the pages,
so there's no way to tell it to use a different mapping from the usual
kernel one on them.

On the other hand, the xdr routines, since they take the pages anyway,
could use a scatterlist approach to writing through the kernel mapping
instead of using vmap ... we have all the machinery for this in
lib/scatterlist.c ... it's not designed for this case, since it's
designed to allow arbitrary linear reads and writes on a block
scatterlist, but the principle is the same ... it looks like it would be
rather a big patch, though ...
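
Something like the sg_mapping_iter loop below is the machinery in
question -- a sketch only, not the actual XDR code; the helper name
xdr_write_pages_sg() is made up for illustration:

#include <linux/kernel.h>
#include <linux/scatterlist.h>
#include <linux/string.h>

static void xdr_write_pages_sg(struct page **pages, unsigned int npages,
			       const char *src, size_t len)
{
	struct scatterlist sg[16];		/* arbitrary bound for the sketch */
	struct sg_mapping_iter miter;
	size_t copied = 0;
	unsigned int i;

	if (WARN_ON(npages > ARRAY_SIZE(sg)))
		return;

	sg_init_table(sg, npages);
	for (i = 0; i < npages; i++)
		sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

	/* Walk the pages through their kernel mapping, one page at a
	 * time, instead of going through a vmap alias. */
	sg_miter_start(&miter, sg, npages, SG_MITER_TO_SG | SG_MITER_ATOMIC);
	while (copied < len && sg_miter_next(&miter)) {
		size_t chunk = min(miter.length, len - copied);

		memcpy(miter.addr, src + copied, chunk);
		copied += chunk;
	}
	sg_miter_stop(&miter);
}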

James



2011-01-05 20:01:12

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, Jan 05, 2011 at 01:36:09PM -0600, James Bottomley wrote:
> On Wed, 2011-01-05 at 11:18 -0800, Linus Torvalds wrote:
> > On Wed, Jan 5, 2011 at 11:05 AM, James Bottomley
> > <[email protected]> wrote:
> > >
> > > I think the solution for the kernel direct mapping problem is to take
> > > the expected flushes and invalidates into kmap/kunmap[_atomic].
> >
> > No, we really can't do that. Most of the time, the kmap() is the only
> > way we access the page anyway, so flushing things would just be
> > stupid. Why waste time and energy on doing something pointless?
>
> It's hardly pointless. The kmap sets up an inequivalent alias in the
> cache.

No it doesn't. For pages which are inaccessible, it sets up a mapping
for those pages. On aliasing cache architectures, when you tear down
such a mapping, you have to flush the cache before you do so (otherwise
you can end up with cache lines existing in the cache for inaccessible
mappings.)

For lowmem pages, kmap() (should always) bypass the 'setup mapping' stage
because all lowmem pages are already accessible. So kunmap() doesn't
do anything - just like the !HIGHMEM implementation for these macros.

So, for highmem-enabled systems:

low_addr = kmap_atomic(lowmem_page);
high_addr = kmap_atomic(highmem_page);

results in low_addr in the kernel direct-mapped region, and high_addr
in the kmap_atomic region.
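
For reference, the generic kmap() fast path is essentially this
(simplified from mm/highmem.c):

#include <linux/highmem.h>

void *kmap(struct page *page)
{
	might_sleep();
	if (!PageHighMem(page))
		return page_address(page);  /* lowmem: existing direct mapping, no new alias */
	return kmap_high(page);             /* highmem: set up a temporary kernel mapping    */
}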

2011-01-06 17:40:19

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote:
> Can you explain how the code works? it looks to me like you read the xdr
> stuff through the vmap region then write it out directly to the pages?

OK, I think I see how this is supposed to work: It's a sequential loop
of reading in via the pages (i.e. through the kernel mapping) and then
updating those pages via the vmap. In which case, I think this patch is
what you need.

The theory of operation is that the readdir on pages actually uses the
network DMA operations to perform, so when it's finished, the underlying
page is up to date. After this you invalidate the vmap range, so we
have no cache lines above it (so it picks up the values from the
uptodate page). Finally, after the operation on the vmap region has
finished, you flush it so that any updated contents go back to the pages
themselves before the next iteration begins.

Does this look right to people? I've verified it fixes the issues on
parisc.

James

---

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 996dd89..bde1911 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page,
 		if (status < 0)
 			break;
 		pglen = status;
+
+		invalidate_kernel_vmap_range(pages_ptr, pglen);
+
 		status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen);
 		if (status < 0) {
 			if (status == -ENOSPC)
 				status = 0;
 			break;
 		}
+		flush_kernel_vmap_range(pages_ptr, pglen);
 	} while (array->eof_index < 0);
 
 	nfs_readdir_free_large_page(pages_ptr, pages, array_size);



2011-01-07 19:14:05

by Myklebust, Trond

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Fri, 2011-01-07 at 19:02 +0000, Russell King - ARM Linux wrote:
> On Fri, Jan 07, 2011 at 01:53:25PM -0500, Trond Myklebust wrote:
> > I'd still like to keep the existing code for those architectures that
> > don't have problems, since that allows us to send 32k READDIR requests
> > instead of being limited to 4k. For large directories, that is a clear
> > win.
> > For the NOMMU case we will just go back to using a single page for
> > storage (and 4k READDIR requests only). Should I just do the same for
> > architectures like ARM and PARISC?
>
> I think you said that readdir reads via the vmalloc mapping of the
> group of pages, but XDR writes to the individual pages.
>
> As I understand NFS, you receive a packet, you then have to use XDR
> to unpack the data, which you presumably write into the set of
> struct page *'s using kmap?

No. The socket or RDMA layers place the data directly into the struct
pages. We then unpack them in the XDR layer using the vmalloc mapping
and place the resulting processed readdir data into the page cache.

> Isn't a solution to have XDR write directly into the vmalloc mapping
> rather than using struct page * and kmap?

Unfortunately that isn't possible. :-(

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-01-10 12:44:41

by Marc Kleine-Budde

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On 01/09/2011 12:15 AM, Trond Myklebust wrote:
> From 8b2e60cef5c65eef41ab61286f62dec6bfb1ac27 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <[email protected]>
> Date: Sat, 8 Jan 2011 17:45:38 -0500
> Subject: [PATCH] NFS: Don't use vm_map_ram() in readdir
>
> vm_map_ram() is not available on NOMMU platforms, and causes trouble
> on incoherent architectures such as ARM when we access the page data
> through both the direct and the virtual mapping.
>
> The alternative is to use the direct mapping to access page data
> for the case when we are not crossing a page boundary, but to copy
> the data into a linear scratch buffer when we are accessing data
> that spans page boundaries.
>
> Signed-off-by: Trond Myklebust <[email protected]>
Tested-by: Marc Kleine-Budde <[email protected]>

..on AT91 (armv5)

regards, Marc

--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
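
A rough sketch of the strategy in the commit message quoted above --
illustrative only; the helper name and scratch-buffer handling are made
up, not the actual NFS/XDR code, and it assumes the requested range
spans at most two pages:

#include <linux/mm.h>
#include <linux/string.h>

static void *readdir_data(struct page **pages, size_t base, size_t len,
			  char *scratch)
{
	unsigned int pg = base >> PAGE_SHIFT;
	unsigned int off = base & ~PAGE_MASK;

	/* Fits within one page: hand back a pointer into the kernel
	 * direct mapping, no copy needed. */
	if (off + len <= PAGE_SIZE)
		return (char *)page_address(pages[pg]) + off;

	/* Spans a page boundary: linearise into the caller's scratch
	 * buffer via the direct mappings of both pages. */
	memcpy(scratch, (char *)page_address(pages[pg]) + off, PAGE_SIZE - off);
	memcpy(scratch + (PAGE_SIZE - off), page_address(pages[pg + 1]),
	       len - (PAGE_SIZE - off));
	return scratch;
}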



2011-01-05 20:35:21

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, 2011-01-05 at 11:49 -0800, Linus Torvalds wrote:
> 2011/1/5 James Bottomley <[email protected]>:
> >>
> >> No, we really can't do that. Most of the time, the kmap() is the only
> >> way we access the page anyway, so flushing things would just be
> >> stupid. Why waste time and energy on doing something pointless?
> >
> > It's hardly pointless. The kmap sets up an inequivalent alias in the
> > cache.
>
> NO IT DOES NOT.

Well, it does ... but not in this case, because the page is freshly
allocated (which I missed before) and so has no cache colour in use yet.

James

> Stop arguing, when you are so wrong.
>
> kmap() does not create any aliases. For low-memory, it just returns
> the physical address. No alias. And for high memory, there is no
> equivalent low memory address to alias _with_.
>
> Now, when you actually mix multiple kmap's and you have a virtually
> based cache, then the kmap's obviously need to flush that particular
> page when switching between each other. But that has nothing to do
> with the actual page being kmap'ed, it's entirely an internal issue
> about the particular virtual memory area being re-used. And ARM (and
> any other virtually based CPU) already does that in __kunmap_atomic().
>
> But notice the case of the low-mem. And understand that you are WRONG
> about the "inequivalent alias" thing.
>
> So I repeat: this has absolutely *NOTHING* to do with kmap(). Stop blathering.
>
> It's _purely_ an issue of vm_map_ram(). Nothing else.
>
> Linus



2011-01-07 19:05:04

by James Bottomley

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Fri, 2011-01-07 at 13:53 -0500, Trond Myklebust wrote:
> There is already code in the SUNRPC layer that calls flush_dcache_page()
> after writing (although as Russell pointed out earlier, that is
> apparently a no-op for non-page cache pages such as these).

Actually (and possibly fortunately), none of our flush_dcache_page()
implementations do this (i.e. check for an actual non-page-cache page and
no-op if they find one), although they could, according to the docs, which
say that flush_dcache_page() is only called on page cache pages.

But it's definitely using the API outside its documented scope. We have
lots of places in the VFS where we don't call flush_dcache_page() even
after altering a kernel page (even in the page cache) if we know the
page will never be mapped to userspace. The assumption here is that the
kernel never sets up non-user aliases of these pages, so not doing the
flushing is an optimisation since we only access them through the kernel
address space. Of course, setting up vmap areas of these pages within
the kernel violates this assumption.

> > This is why you really really really generally don't want to have
> > aliasing. Purely virtual caches are pure crap. Really.
>
> Well, it looks as if NOMMU is giving us problems due to the lack of a
> vm_map_ram() (see https://bugzilla.kernel.org/show_bug.cgi?id=26262).
>
> I'd still like to keep the existing code for those architectures that
> don't have problems, since that allows us to send 32k READDIR requests
> instead of being limited to 4k. For large directories, that is a clear
> win.
> For the NOMMU case we will just go back to using a single page for
> storage (and 4k READDIR requests only). Should I just do the same for
> architectures like ARM and PARISC?

Well, that would include any VI architecture (like SPARC and others) as
well. However, I think we can just make the
invalidate_kernel_vmap_range() work.

James



2011-01-05 20:49:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Wed, Jan 5, 2011 at 12:33 PM, James Bottomley
<[email protected]> wrote:
>
> Well, that depends. For us on parisc, kmap of a user page in !HIGHMEM
> sets up an inequivalent alias still ... because the cache colour of the
> user and kernel virtual addresses are different. Depending on the
> return path to userspace, we still usually have to flush to get the user
> to see the changes the kernel has made.

Umm. Again, that has nothing to do with kmap().

This time it's about the user space mapping.

Repeat after me: even without the kmap(), the kernel access to that
mapping would have caused cache aliases.

See? Once more, the kmap() is entirely innocent. You can have a
non-highmem mapping that you never use kmap for, and that you map into
user space, and you'd see exactly the same aliases. Notice? Look ma,
no kmap().

So clearly kmap() is not the issue. The issue continues to be a
totally separate virtual mapping. Whether it's a user mapping or
vm_map_ram() is obviously immaterial - as far as the CPU is concerned,
there is no difference between the two (apart from the trivial
differences of virtual location and permissions).

(You can also force the problem with vmalloc() and then following the
kernel page tables, but I hope nobody does that any more. I suspect
I'm wrong, though, there's probably code that mixes vmalloc and
physical page accesses in various drivers)

Linus

2011-01-06 17:47:54

by Myklebust, Trond

[permalink] [raw]
Subject: Re: still nfs problems [Was: Linux 2.6.37-rc8]

On Thu, 2011-01-06 at 11:40 -0600, James Bottomley wrote:
> On Wed, 2011-01-05 at 23:28 +0000, James Bottomley wrote:
> > Can you explain how the code works? it looks to me like you read the xdr
> > stuff through the vmap region then write it out directly to the pages?
>
> OK, I think I see how this is supposed to work: It's a sequential loop
> of reading in via the pages (i.e. through the kernel mapping) and then
> updating those pages via the vmap. In which case, I think this patch is
> what you need.
>
> The theory of operation is that the readdir on pages actually uses the
> network DMA operations to perform, so when it's finished, the underlying
> page is up to date. After this you invalidate the vmap range, so we
> have no cache lines above it (so it picks up the values from the
> uptodate page). Finally, after the operation on the vmap region has
> finished, you flush it so that any updated contents go back to the pages
> themselves before the next iteration begins.
>
> Does this look right to people? I've verified it fixes the issues on
> parisc.
>
> James
>
> ---
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 996dd89..bde1911 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -587,12 +587,16 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page,
> if (status < 0)
> break;
> pglen = status;
> +
> + invalidate_kernel_vmap_range(pages_ptr, pglen);
> +
> status = nfs_readdir_page_filler(desc, &entry, pages_ptr, page, pglen);
> if (status < 0) {
> if (status == -ENOSPC)
> status = 0;
> break;
> }
> + flush_kernel_vmap_range(pages_ptr, pglen);

Why is this line needed? We're not writing through the virtual mapping.

We checked using just the invalidate_kernel_vmap_range(), and that
appeared to suffice to fix the problem on ARM.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com