2007-10-12 16:58:58

by Jeremy Fitzhardinge

Subject: Interaction between Xen and XFS: stray RW mappings

Hi Dave & other XFS folk,

I'm tracking down a bug which appears to be a bad interaction between
XFS and Xen. It looks like XFS is holding RW mappings on free pages,
which Xen is trying to get an exclusive RO mapping on so it can turn
them into pagetables.

I'm assuming the pages are actually free, and this isn't a use after
free problem. From a quick poke around, the most likely piece of XFS
code is the stuff in xfs_map.c which creates a virtually contiguous
mapping of pages with vmap, and seems to delay unmapping them.

When pinning a pagetable, Xen tries to eliminate any RW aliases of the
pages it's using. This is generally trivial because pages returned by
get_free_page don't have any mappings aside from the normal kernel
mapping. High pages, when using CONFIG_HIGHPTE, may have a residual
kmap mapping, so we clear out the kmap cache if we encounter a highpage
in the pagetable.
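
Roughly, the pin-time cleanup looks like this today (a simplified
sketch, not the actual xen/mmu.c code; pin_prepare_page() is just a
made-up name for the per-page step):

#include <linux/mm.h>
#include <linux/highmem.h>

/*
 * Sketch: before a page is handed over as part of a pagetable, make
 * sure no stray RW alias of it remains.
 */
static void pin_prepare_page(struct page *page)
{
	if (PageHighMem(page)) {
		/*
		 * With CONFIG_HIGHPTE the page may still sit in the
		 * kmap cache with a writable kernel mapping; flushing
		 * the unused kmap slots drops that alias.
		 */
		kmap_flush_unused();
	}
	/*
	 * Lowmem pages only have the normal kernel mapping, which is
	 * remapped RO as part of pinning itself.
	 */
}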

I guess we could create a special-case interface to do the same thing
with XFS mappings, but it would be nicer to have something more generic.

Is my analysis correct? Or should XFS not be holding stray mappings?
Or is there already some kind of generic mechanism I can use to get it
to release its mappings?

Thanks,
J


2007-10-12 17:09:52

by Jeremy Fitzhardinge

Subject: Re: Interaction between Xen and XFS: stray RW mappings

Jeremy Fitzhardinge wrote:
> I guess we could create a special-case interface to do the same thing
> with XFS mappings, but it would be nicer to have something more generic.
>
> Is my analysis correct? Or should XFS not be holding stray mappings?
> Or is there already some kind of generic mechanism I can use to get it
> to release its mappings?

This test patch confirms my theory:

diff -r 36a518c1fb4b fs/xfs/linux-2.6/xfs_buf.c
--- a/fs/xfs/linux-2.6/xfs_buf.c Fri Oct 12 10:03:56 2007 -0700
+++ b/fs/xfs/linux-2.6/xfs_buf.c Fri Oct 12 10:07:03 2007 -0700
@@ -186,6 +186,11 @@ free_address(
void *addr)
{
a_list_t *aentry;
+
+#ifdef CONFIG_XEN
+ vunmap(addr);
+ return;
+#endif

aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
if (likely(aentry)) {


With this in place, the problem goes away.

J

2007-10-14 22:56:56

by David Chinner

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Fri, Oct 12, 2007 at 09:58:43AM -0700, Jeremy Fitzhardinge wrote:
> Hi Dave & other XFS folk,
>
> I'm tracking down a bug which appears to be a bad interaction between XFS
> and Xen. It looks like XFS is holding RW mappings on free pages, which Xen
> is trying to get an exclusive RO mapping on so it can turn them into
> pagetables.
>
> I'm assuming the pages are actually free, and this isn't a use after free
> problem. From a quick poke around, the most likely pieces of XFS code is
> the stuff in xfs_map.c which creates a virtually contiguous mapping of pages
> with vmap, and seems to delay unmapping them.

You mean xfs_buf.c.

And yes, we delay unmapping pages until we have a batch of them
to unmap. vmap and vunmap do not scale, so this batching helps
alleviate some of the worst of the problems.

> When pinning a pagetable, Xen tries to eliminate any RW aliases of the pages
> its using. This is generally trivial because pages returned by
> get_free_page don't have any mappings aside from the normal kernel mapping.
> High pages, when using CONFIG_HIGHPTE, may have a residual kmap mapping, so
> we clear out the kmap cache if we encounter a highpage in the pagetable.
>
> I guess we could create a special-case interface to do the same thing with
> XFS mappings, but it would be nicer to have something more generic.

*nod*

> Is my analysis correct? Or should XFS not be holding stray mappings? Or is
> there already some kind of generic mechanism I can use to get it to release
> its mappings?

The xfsbufd cleans out any stale mappings - it's woken by the memory
shrinker interface (i.e. calls xfsbufd_wakeup()). Otherwise every
64th buffer being vmap()d will flush out stale mappings.
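
For reference, the batching is just a small deferred list of addresses.
A simplified sketch of the relevant fs/xfs/linux-2.6/xfs_buf.c code
(with the xfsbufd wiring trimmed):

typedef struct a_list {
	void		*vm_addr;
	struct a_list	*next;
} a_list_t;

static a_list_t	*as_free_head;
static int	as_list_len;
static DEFINE_SPINLOCK(as_lock);

/* defer the expensive vunmap() instead of doing it immediately */
static void
free_address(
	void		*addr)
{
	a_list_t	*aentry;

	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
	if (likely(aentry)) {
		spin_lock(&as_lock);
		aentry->next = as_free_head;
		aentry->vm_addr = addr;
		as_free_head = aentry;
		as_list_len++;
		spin_unlock(&as_lock);
	} else {
		/* no memory for a list entry - unmap eagerly */
		vunmap(addr);
	}
}

/* run by xfsbufd/the shrinker, or from every 64th buffer mapping */
static void
purge_addresses(void)
{
	a_list_t	*aentry, *old;

	if (as_free_head == NULL)
		return;

	spin_lock(&as_lock);
	aentry = as_free_head;
	as_free_head = NULL;
	as_list_len = 0;
	spin_unlock(&as_lock);

	while ((old = aentry) != NULL) {
		vunmap(aentry->vm_addr);	/* global TLB flush each time */
		aentry = aentry->next;
		kfree(old);
	}
}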

Realistically, if this delayed release of vmaps is a problem for
Xen, then I think that some generic VM solution is needed to this
problem as vmap() is likely to become more common in future (think
large blocks in filesystems). Nick - any comments?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-10-14 23:12:30

by Jeremy Fitzhardinge

Subject: Re: Interaction between Xen and XFS: stray RW mappings

David Chinner wrote:
> You mean xfs_buf.c.
>

Yes, sorry.

> And yes, we delay unmapping pages until we have a batch of them
> to unmap. vmap and vunmap do not scale, so this is batching helps
> alleviate some of the worst of the problems.
>

How much performance does it cost? What kind of workloads would it show
up under?

> Realistically, if this delayed release of vmaps is a problem for
> Xen, then I think that some generic VM solution is needed to this
> problem as vmap() is likely to become more common in future (think
> large blocks in filesystems). Nick - any comments?
>

Well, the only real problem is that the pages are returned to the free
pool and reallocated while still being part of a mapping. If the pages
are still owned by the filesystem/pagecache, then there's no problem.

What's the lifetime of things being vmapped/unmapped in xfs? Are they
necessarily being freed when they're unmapped, or could unmapping of
freed memory be more immediate than other memory?

Maybe it just needs a notifier chain or something.

J

2007-10-14 23:24:22

by Nick Piggin

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Monday 15 October 2007 09:12, Jeremy Fitzhardinge wrote:
> David Chinner wrote:
> > You mean xfs_buf.c.
>
> Yes, sorry.
>
> > And yes, we delay unmapping pages until we have a batch of them
> > to unmap. vmap and vunmap do not scale, so this is batching helps
> > alleviate some of the worst of the problems.
>
> How much performance does it cost? What kind of workloads would it show
> up under?
>
> > Realistically, if this delayed release of vmaps is a problem for
> > Xen, then I think that some generic VM solution is needed to this
> > problem as vmap() is likely to become more common in future (think
> > large blocks in filesystems). Nick - any comments?
>
> Well, the only real problem is that the pages are returned to the free
> pool and reallocated while still being part of a mapping. If the pages
> are still owned by the filesystem/pagecache, then there's no problem.
>
> What's the lifetime of things being vmapped/unmapped in xfs? Are they
> necessarily being freed when they're unmapped, or could unmapping of
> freed memory be more immediate than other memory?

Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
because it generally has to invalidate TLBs on all CPUs.

I'm looking at some more general solutions to this (already have some
batching / lazy unmapping that replaces the XFS specific one), however
they are still likely going to leave vmap mappings around after freeing
the page.

We _could_ hold on to the pages as well, but that's pretty inefficient.
The memory cost of keeping the mappings around tends to be well under
1% of the cost of the page itself. OTOH we could also avoid lazy flushes
on architectures where it is not costly. Either way, it probably would
require an arch hook or even a couple of ifdefs in mm/vmalloc.c for
Xen. Although... it would be nice if Xen could take advantage of some
of these optimisations as well.

What's the actual problem for Xen? Anything that can be changed?

2007-10-14 23:34:16

by David Chinner

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Sun, Oct 14, 2007 at 04:12:20PM -0700, Jeremy Fitzhardinge wrote:
> David Chinner wrote:
> > You mean xfs_buf.c.
> >
>
> Yes, sorry.
>
> > And yes, we delay unmapping pages until we have a batch of them
> > to unmap. vmap and vunmap do not scale, so this is batching helps
> > alleviate some of the worst of the problems.
> >
>
> How much performance does it cost?

Every vunmap() call causes a global TLB sync, and the region lists
are global with a spin lock protecting them. I think Nick has shown
a 64p Altix with ~60 cpus spinning on the vmap locks under a
simple workload....

> What kind of workloads would it show
> up under?

A directory traversal when using large directory block sizes
with large directories....


> > Realistically, if this delayed release of vmaps is a problem for
> > Xen, then I think that some generic VM solution is needed to this
> > problem as vmap() is likely to become more common in future (think
> > large blocks in filesystems). Nick - any comments?
> >
>
> Well, the only real problem is that the pages are returned to the free
> pool and reallocated while still being part of a mapping. If the pages
> are still owned by the filesystem/pagecache, then there's no problem.

The pages are still attached to the blockdev address space mapping,
but there's nothing stopping them from being reclaimed before they are
unmapped.

> What's the lifetime of things being vmapped/unmapped in xfs? Are they
> necessarily being freed when they're unmapped, or could unmapping of
> freed memory be more immediate than other memory?

It's all "freed memory". At the time we pull the buffer down, there are
no further references to the buffer. The pages are released and the mapping
is never used again until it is torn down. It is torn down either on the
next xfsbufd run (either memory pressure or every 15s) or every 64th
new vmap() call to map new buffers.

> Maybe it just needs a notifier chain or something.

We've already got a memory shrinker hook that triggers this reclaim.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-10-15 00:57:50

by Jeremy Fitzhardinge

Subject: Re: Interaction between Xen and XFS: stray RW mappings

Nick Piggin wrote:
> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> because it generally has to invalidate TLBs on all CPUs.
>

I see.

> I'm looking at some more general solutions to this (already have some
> batching / lazy unmapping that replaces the XFS specific one), however
> they are still likely going to leave vmap mappings around after freeing
> the page.
>

Hm. Could there be a call to shoot down any lazy mappings of a page, so
the Xen pagetable code could use it on any pagetable page? Ideally one
that could be used on any page, but only causes expensive operations
where needed.

> We _could_ hold on to the pages as well, but that's pretty inefficient.
> The memory cost of keeping the mappings around tends to be well under
> 1% the cost of the page itself. OTOH we could also avoid lazy flushes
> on architectures where it is not costly. Either way, it probably would
> require an arch hook or even a couple of ifdefs in mm/vmalloc.c for
> Xen. Although... it would be nice if Xen could take advantage of some
> of these optimisations as well.
>

In general the lazy unmappings won't worry Xen. It's only for the
specific case of allocating memory for pagetables. Xen can do a bit of
extra optimisation for cross-cpu tlb flushes (if the target vcpus are
not currently running, then you don't need to do anything), but they're
still an expensive operation, so the optimisation is definitely useful.

> What's the actual problem for Xen? Anything that can be changed?
>

Not easily. Xen doesn't use shadow pagetables. Instead, it gives the
guest domains direct access to the real CPU's pagetable, but makes sure
they're always mapped RO so that the hypervisor can control updates to
the pagetables (either by trapping writes or via explicit hypercalls).
This means that when constructing a new pagetable, Xen will verify that
all the mappings of pages making up the new pagetable are RO before
allowing it to be used. If there are stray RW mappings of those pages,
pagetable construction will fail.
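
Conceptually the pin-time check amounts to something like the sketch
below. This is only an illustration, not how the hypervisor's page
type and reference counting machinery is actually written;
has_writable_alias() and make_page_readonly() are made-up names:

/* Sketch of the check performed when a new pagetable is pinned. */
static int pin_pagetable(struct page **pt_pages, int nr_pages)
{
	int i;

	/*
	 * Every page that makes up the new pagetable must have no
	 * remaining RW mapping anywhere in the guest...
	 */
	for (i = 0; i < nr_pages; i++)
		if (has_writable_alias(pt_pages[i]))
			return -EINVAL;		/* pinning fails */

	/*
	 * ...so that all further updates are forced through traps or
	 * explicit hypercalls, where they can be validated.
	 */
	for (i = 0; i < nr_pages; i++)
		make_page_readonly(pt_pages[i]);

	return 0;
}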

Aside from XFS, the only other case I've found where there could be
stray RW mappings is when using high pages which are still in the kmap
cache; I added an explicit call to flush the kmap cache to handle this.
If vmap and kmap can be unified (at least the lazy unmap aspects of
them), then that would be a nice little cleanup.

J

2007-10-15 02:35:22

by Nick Piggin

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Monday 15 October 2007 10:57, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> > because it generally has to invalidate TLBs on all CPUs.
>
> I see.
>
> > I'm looking at some more general solutions to this (already have some
> > batching / lazy unmapping that replaces the XFS specific one), however
> > they are still likely going to leave vmap mappings around after freeing
> > the page.
>
> Hm. Could there be a call to shoot down any lazy mappings of a page, so
> the Xen pagetable code could use it on any pagetable page? Ideally one
> that could be used on any page, but only causes expensive operations
> where needed.

Yeah, it would be possible. The easiest way would just be to shoot down
all lazy vmaps (because you're doing the global IPIs anyway, which are
the expensive thing, at which point you may as well purge the rest of
your lazy mappings).

If it is sufficiently rare, then it could be the simplest thing to do.
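
Something like the following, say. This is purely a sketch of the
interface I have in mind -- none of these names exist in the tree, and
free_lazy_va() is a placeholder -- the point being that one cross-CPU
shootdown is amortised over every stale mapping on the list:

struct lazy_vmap {
	struct list_head	list;
	unsigned long		addr;
	unsigned long		size;
};

static LIST_HEAD(lazy_vmaps);
static DEFINE_SPINLOCK(lazy_vmaps_lock);

/* shoot down all lazily retained vmap mappings in one go */
void purge_lazy_vmaps(void)
{
	struct lazy_vmap *lv, *next;
	LIST_HEAD(local);

	spin_lock(&lazy_vmaps_lock);
	list_splice_init(&lazy_vmaps, &local);
	spin_unlock(&lazy_vmaps_lock);

	if (list_empty(&local))
		return;

	list_for_each_entry_safe(lv, next, &local, list) {
		/*
		 * The kernel PTEs were already cleared when the mapping
		 * was lazily unmapped; only the address range and stale
		 * TLB entries remain to be cleaned up.
		 */
		free_lazy_va(lv->addr, lv->size);	/* placeholder */
		list_del(&lv->list);
		kfree(lv);
	}

	/* a single global shootdown instead of one IPI storm per vunmap() */
	flush_tlb_all();
}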


> > We _could_ hold on to the pages as well, but that's pretty inefficient.
> > The memory cost of keeping the mappings around tends to be well under
> > 1% the cost of the page itself. OTOH we could also avoid lazy flushes
> > on architectures where it is not costly. Either way, it probably would
> > require an arch hook or even a couple of ifdefs in mm/vmalloc.c for
> > Xen. Although... it would be nice if Xen could take advantage of some
> > of these optimisations as well.
>
> In general the lazy unmappings won't worry Xen. It's only for the
> specific case of allocating memory for pagetables. Xen can do a bit of
> extra optimisation for cross-cpu tlb flushes (if the target vcpus are
> not currently running, then you don't need to do anything), but they're
> still an expensive operation, so the optimisation is definitely useful.

OK.


> > What's the actual problem for Xen? Anything that can be changed?
>
> Not easily. Xen doesn't use shadow pagetables. Instead, it gives the
> guest domains direct access to the real CPU's pagetable, but makes sure
> they're always mapped RO so that the hypervisor can control updates to
> the pagetables (either by trapping writes or via explicit hypercalls).
> This means that when constructing a new pagetable, Xen will verify that
> all the mappings of pages making up the new pagetable are RO before
> allowing it to be used. If there are stray RW mappings of those pages,
> pagetable construction will fail.

OK, I see. Because even though it is technically safe where we are
using it (because nothing writes through the mappings after the page
is freed), a corrupted guest could use the same window to do bad
things with the pagetables?


> Aside from XFS, the only other case I've found where there could be
> stray RW mappings is when using high pages which are still in the kmap
> cache; I added an explicit call to flush the kmap cache to handle this.
> If vmap and kmap can be unified (at least the lazy unmap aspects of
> them), then that would be a nice little cleanup.

vmap is slightly harder than kmap in some respects. However it would
be really nice to get vmap fast and general enough to completely
replace all the kmap crud -- that's one goal, but the first thing
I'm doing is to concentrate on just vmap to work out how to make it
as fast as possible.

For Xen -- shouldn't be a big deal. We can have a single Linux mm API
to call, and we can do the right thing WRT vmap/kmap. I should try to
merge my current lazy vmap patches which replace the XFS stuff, so we
can implement such an API and fix your XFS issue? That's not going to
happen for at least a cycle or two though, so in the meantime maybe
an ifdef for that XFS vmap batching code would help?

2007-10-15 03:42:45

by Jeremy Fitzhardinge

Subject: Re: Interaction between Xen and XFS: stray RW mappings

Nick Piggin wrote:
> Yeah, it would be possible. The easiest way would just be to shoot down
> all lazy vmaps (because you're doing the global IPIs anyway, which are
> the expensive thing, at which point you may as well purge the rest of
> your lazy mappings).
>

Sure.

> If it is sufficiently rare, then it could be the simplest thing to do.
>

Yes. If there's some way to tell whether a particular page is in a lazy
mapping then that would help, since we could easily tell whether we need
to do the whole shootdown thing. I would expect the population of
lazily mapped pages in the whole freepage pool to be pretty small, but
if the allocator tends to return the most recently freed pages you might
hit them fairly regularly (shoving them at the other end of the freelist
might be useful).

> OK, I see. Because even though it is technically safe where we are
> using it (because nothing writes through the mappings after the page
> is freed), a corrupted guest could use the same window to do bad
> things with the pagetables?
>

That's right. The hypervisor doesn't trust the guests, so it prevents
them from getting into a state where they can do bad things.

> For Xen -- shouldn't be a big deal. We can have a single Linux mm API
> to call, and we can do the right thing WRT vmap/kamp. I should try to
> merge my current lazy vmap patches which replace the XFS stuff, so we
> can implement such an API and fix your XFS issue?

Sounds good.

> That's not going to
> happen for at least a cycle or two though, so in the meantime maybe
> an ifdef for that XFS vmap batching code would help?
>

For now I've proposed a patch to simply eagerly vunmap everything when
CONFIG_XEN is set. It certainly works, but I don't have a good feel for
how much of a performance hit that imposes on XFS. A slightly more
subtle change would be to test to see if we're actually running under
Xen before taking the eager path, so at least the performance burden
only affects actual Xen users (and I presume xfs+xen is a fairly rare
combination, or this problem would have turned up earlier, or perhaps
the old xenified kernels have some other workaround for it).
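
I.e. something along these lines (sketch only -- running_on_xen() is a
placeholder for whatever runtime predicate we would actually use, not
an existing interface):

static void
free_address(
	void		*addr)
{
	a_list_t	*aentry;

	/*
	 * Runtime variant of the CONFIG_XEN workaround: only actual
	 * Xen guests pay for the eager unmap.
	 */
	if (running_on_xen()) {
		vunmap(addr);
		return;
	}

	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
	if (likely(aentry)) {
		/* ... batch the address as before ... */
	} else {
		vunmap(addr);
	}
}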

J

2007-10-15 04:12:00

by David Chinner

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Sun, Oct 14, 2007 at 08:42:34PM -0700, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > That's not going to
> > happen for at least a cycle or two though, so in the meantime maybe
> > an ifdef for that XFS vmap batching code would help?
> >
>
> For now I've proposed a patch to simply eagerly vunmap everything when
> CONFIG_XEN is set. It certainly works, but I don't have a good feel for
> how much of a performance hit that imposes on XFS.

With defaults - little effect as vmap should never be used. It's
only when you start using larger block sizes for metadata that this
becomes an issue. The CONFIG_XEN workaround should be fine until we
get a proper vmap cache....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-10-15 04:18:27

by Jeremy Fitzhardinge

Subject: Re: Interaction between Xen and XFS: stray RW mappings

David Chinner wrote:
> With defaults - little effect as vmap should never be used. It's
> only when you start using larger block sizes for metadata that this
> becomes an issue. The CONFIG_XEN workaround should be fine until we
> get a proper vmap cache....

Hm, well I saw the problem with a filesystem made with mkfs.xfs with no
options, so there must be at least *some* vmapping going on there.

J

2007-10-15 04:26:18

by David Chinner

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Sun, Oct 14, 2007 at 09:18:17PM -0700, Jeremy Fitzhardinge wrote:
> David Chinner wrote:
> > With defaults - little effect as vmap should never be used. It's
> > only when you start using larger block sizes for metadata that this
> > becomes an issue. The CONFIG_XEN workaround should be fine until we
> > get a proper vmap cache....
>
> Hm, well I saw the problem with a filesystem made with mkfs.xfs with no
> options, so there must be at least *some* vmapping going on there.

Sorry - I should have been more precise - vmap should never be used in
performance critical paths on default configs. Log recovery will
trigger vmap/vunmap usage, so this is probably what you are seeing.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-10-15 08:33:09

by Christoph Hellwig

Subject: Re: [xfs-masters] Re: Interaction between Xen and XFS: stray RW mappings

On Mon, Oct 15, 2007 at 02:25:46PM +1000, David Chinner wrote:
> > Hm, well I saw the problem with a filesystem made with mkfs.xfs with no
> > options, so there must be at least *some* vmapping going on there.
>
> Sorry - I should have been more precise - vmap should never be used in
> performance critical paths on default configs. Log recovery will
> trigger vmap/vunmap usage, so this is probably what you are seeing.

The iclogs are also vmapped, but they aren't unmapped until unmount so
this optimization doesn't matter either.

2007-10-15 09:37:01

by Andi Kleen

Subject: Re: Interaction between Xen and XFS: stray RW mappings

David Chinner <[email protected]> writes:
>
> And yes, we delay unmapping pages until we have a batch of them
> to unmap. vmap and vunmap do not scale, so this is batching helps
> alleviate some of the worst of the problems.

You're keeping vmaps around for already freed pages?

That will be a big problem for proper PAT support, which needs
to track all mappings to memory. It's not just a problem for Xen.

In fact I suspect it is already broken with DRM or AGP for example, which
can use UC and WC mappings -- if you keep the mapping around and
DRM or AGP turns the page uncacheable in another mapping, you're
creating an illegal cache attribute alias. These are known to occasionally
create cache corruptions on several x86s, giving ___VERY___ hard to debug
bugs once in a blue moon.

Probably it'll require some generic VM batching mechanism where
Xen or PAT code can hook into the list or force unmap the mappings
as needed.

Definitely needs to be fixed if true. You're lucky that Xen caught it
in time.

-Andi

2007-10-15 09:46:45

by Nick Piggin

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Monday 15 October 2007 19:36, Andi Kleen wrote:
> David Chinner <[email protected]> writes:
> > And yes, we delay unmapping pages until we have a batch of them
> > to unmap. vmap and vunmap do not scale, so this is batching helps
> > alleviate some of the worst of the problems.
>
> You're keeping vmaps around for already freed pages?

> That will be a big problem for proper PAT support, which needs
> to track all mappings to memory. It's not just a problem for Xen.
>
> In fact I suspect it is already broken with DRM or AGP for example which
> can use UC and WC mappings -- if you keep the mapping around and
> DRM or AGP turns the page in another mapping uncacheable you're
> creating an illegal cache attribute alias. These are known to occasionally
> create cache corruptions on several x86s; giving ___VERY___ hard to debug
> bugs once a blue moon.

Is this true even if you don't write through those old mappings?
Is DRM or AGP then not also broken with lazy highmem flushing, or
how do they solve that?


> Probably it'll require some generic VM batching mechanism where
> Xen or PAT code can hook into the list or force unmap the mappings
> as needed.

Definitely.


> Definitely needs to be fixed if true. You're lucky that Xen caught it
> in time.

I've been thinking that a simple debug mode could be a good idea.
Have a new field in the struct page to count the number of possible
unflushed mappings to the page, so that things like PAT could
BUG_ON under the appropriate conditions. Would that be helpful?
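
Something like this, maybe (sketch only -- the config option, field and
helpers below are made up; they just show the kind of check that
PAT-style code could make):

#ifdef CONFIG_DEBUG_STRAY_MAPPINGS	/* hypothetical option */
/* bump/drop when a lazy kernel mapping of the page is created/flushed */
static inline void page_add_lazy_map(struct page *page)
{
	atomic_inc(&page->_lazy_mapcount);	/* hypothetical field */
}

static inline void page_remove_lazy_map(struct page *page)
{
	atomic_dec(&page->_lazy_mapcount);
}

/* e.g. called before remapping the page uncached, or before Xen pins it */
static inline void page_check_no_lazy_maps(struct page *page)
{
	BUG_ON(atomic_read(&page->_lazy_mapcount) != 0);
}
#else
static inline void page_add_lazy_map(struct page *page) {}
static inline void page_remove_lazy_map(struct page *page) {}
static inline void page_check_no_lazy_maps(struct page *page) {}
#endif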

2007-10-15 11:07:46

by Andi Kleen

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Tue, Oct 16, 2007 at 12:56:46AM +1000, Nick Piggin wrote:
> Is this true even if you don't write through those old mappings?

I think it happened for reads too. It is a little counter intuitive
because in theory the CPU doesn't need to write back non dirty lines,
but in the one case which took so long to debug exactly this happened
somehow.

As it is undefined for reads and writes in the architecture,
better be safe than sorry.

And x86 CPUs are out of order and do speculative execution,
and that can lead to arbitrary memory accesses even if the code
never touches a particular address.

Newer Intel CPUs have something called self-snoop which was supposed
to handle this; but in some situations it doesn't seem to catch it
either.

> Is DRM or AGP then not also broken with lazy highmem flushing, or
> how do they solve that?

AGP doesn't allocate highmem pages. Not sure about the DRM code.

-Andi

2007-10-15 11:28:38

by Nick Piggin

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Monday 15 October 2007 21:07, Andi Kleen wrote:
> On Tue, Oct 16, 2007 at 12:56:46AM +1000, Nick Piggin wrote:
> > Is this true even if you don't write through those old mappings?
>
> I think it happened for reads too. It is a little counter intuitive
> because in theory the CPU doesn't need to write back non dirty lines,
> but in the one case which took so long to debug exactly this happened
> somehow.
>
> At it is undefined for reads and writes in the architecture so
> better be safe than sorry.

Yes, typo. I meant reads or writes.


> And x86 CPUs are out of order and do speculative executation
> and that can lead to arbitary memory accesses even if the code
> never touches an particular address.
>
> Newer Intel CPUs have something called self-snoop which was supposed
> to handle this; but in some situations it doesn't seem to catch it
> either.

Fair enough, so we have to have this lazy tlb flush hook for
Xen/PAT/etc. I don't think it should be much of a problem to
implement.


> > Is DRM or AGP then not also broken with lazy highmem flushing, or
> > how do they solve that?
>
> AGP doesn't allocate highmem pages. Not sure about the DRM code.

Hmm, OK. It looks like DRM vmallocs memory (which gives highmem).

2007-10-15 12:54:20

by Andi Kleen

Subject: Re: Interaction between Xen and XFS: stray RW mappings

> Hmm, OK. It looks like DRM vmallocs memory (which gives highmem).

I meant I'm not sure if it uses that memory uncached. I admit
not quite understanding that code. There used to be at least
one place where it set UC for a user mapping though.

-Andi

2007-10-21 12:17:54

by Dave Airlie

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On 10/15/07, Andi Kleen <[email protected]> wrote:
> > Hmm, OK. It looks like DRM vmallocs memory (which gives highmem).
>
> I meant I'm not sure if it uses that memory uncached. I admit
> not quite understanding that code. There used to be at least
> one place where it set UC for an user mapping though.

Currently the only DRM memory handed to userspace is allocated with vmalloc_32 in drm_scatter.c

I notice the VIA driver does its own vmalloc for dmablit, so it may
have an issue with this if highmem is involved.

This will change with the upcoming memory manager so I'll need to
investigate it a bit perhaps...

Dave.

2007-10-21 22:17:43

by Benjamin Herrenschmidt

Subject: Re: Interaction between Xen and XFS: stray RW mappings


On Mon, 2007-10-15 at 13:07 +0200, Andi Kleen wrote:
> On Tue, Oct 16, 2007 at 12:56:46AM +1000, Nick Piggin wrote:
> > Is this true even if you don't write through those old mappings?
>
> I think it happened for reads too. It is a little counter intuitive
> because in theory the CPU doesn't need to write back non dirty lines,
> but in the one case which took so long to debug exactly this happened
> somehow.

The problem exists also on ppc, and afaik is due to the line being in
the cache at all (either dirty (write) or not (read)), thus causing the
snoop logic to hit; that is what's causing the problem vs. non-cached
accesses.

Also, on some processors, the simple fact of having the page mapped can
cause the CPU to prefetch from it even if it's not actually accessed
(speculative prefetch can cross page boundaries if things are mapped).

Ben.


2007-10-22 03:18:50

by dean gaudet

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Mon, 15 Oct 2007, Nick Piggin wrote:

> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> because it generally has to invalidate TLBs on all CPUs.

why is that? ignoring 32-bit archs we have heaps of address space
available... couldn't the kernel just burn address space and delay global
TLB invalidate by some relatively long time (say 1 second)?

-dean

2007-10-22 03:35:09

by Jeremy Fitzhardinge

Subject: Re: Interaction between Xen and XFS: stray RW mappings

dean gaudet wrote:
> On Mon, 15 Oct 2007, Nick Piggin wrote:
>
>
>> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
>> because it generally has to invalidate TLBs on all CPUs.
>>
>
> why is that? ignoring 32-bit archs we have heaps of address space
> available... couldn't the kernel just burn address space and delay global
> TLB invalidate by some relatively long time (say 1 second)?
>

Yes, that's precisely the problem. xfs does delay the unmap, leaving
stray mappings, which upsets Xen.

J

2007-10-22 04:28:41

by dean gaudet

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Sun, 21 Oct 2007, Jeremy Fitzhardinge wrote:

> dean gaudet wrote:
> > On Mon, 15 Oct 2007, Nick Piggin wrote:
> >
> >
> >> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> >> because it generally has to invalidate TLBs on all CPUs.
> >>
> >
> > why is that? ignoring 32-bit archs we have heaps of address space
> > available... couldn't the kernel just burn address space and delay global
> > TLB invalidate by some relatively long time (say 1 second)?
> >
>
> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> stray mappings, which upsets Xen.

sounds like a bug in xen to me :)

-dean

2007-10-22 04:45:58

by Nick Piggin

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Monday 22 October 2007 14:28, dean gaudet wrote:
> On Sun, 21 Oct 2007, Jeremy Fitzhardinge wrote:
> > dean gaudet wrote:
> > > On Mon, 15 Oct 2007, Nick Piggin wrote:
> > >> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> > >> because it generally has to invalidate TLBs on all CPUs.
> > >
> > > why is that? ignoring 32-bit archs we have heaps of address space
> > > available... couldn't the kernel just burn address space and delay
> > > global TLB invalidate by some relatively long time (say 1 second)?
> >
> > Yes, that's precisely the problem. xfs does delay the unmap, leaving
> > stray mappings, which upsets Xen.
>
> sounds like a bug in xen to me :)

You could call it a bug I think. I don't know much about Xen though,
whether or not it expects to be able to run an arbitrary OS kernel.

Presumably, the hypervisor _could_ write protect and trap writes to
*all* page table page mappings.

2007-10-22 09:49:18

by Andi Kleen

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Mon, Oct 22, 2007 at 08:16:01AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2007-10-15 at 13:07 +0200, Andi Kleen wrote:
> > On Tue, Oct 16, 2007 at 12:56:46AM +1000, Nick Piggin wrote:
> > > Is this true even if you don't write through those old mappings?
> >
> > I think it happened for reads too. It is a little counter intuitive
> > because in theory the CPU doesn't need to write back non dirty lines,
> > but in the one case which took so long to debug exactly this happened
> > somehow.
>
> The problem exist also on ppc, and afaik, is due to the line being in
> the cache at all (either dirty (write) or not (read)), thus causing the
> snoop logic to hit, that is what's causing the problem vs. non cached
> accesses.

That makes sense. Snoop can effectively turn a read into a write.

> Also, on some processors, the simple fact of having the page mapped can
> cause the CPU to prefetch from it even if it's not actually accessed
> (speculative prefetch can cross page boundaries if things are mapped).

Exactly that happens on x86. Normally prefetches stop on TLB miss,
but the CPU can do speculative TLB fetches too.

Also even without any prefetching the CPU does speculative execution
and that can lead to random addresses being followed.

-Andi

2007-10-22 13:47:37

by Andi Kleen

Subject: Re: Interaction between Xen and XFS: stray RW mappings

Jeremy Fitzhardinge <[email protected]> writes:
>
> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> stray mappings, which upsets Xen.

Again, it doesn't just upset Xen: keeping mappings to freed pages is wrong generally
and violates the x86 (and likely others like PPC) architecture because it can
cause illegal caching attribute aliases.

The patch that went into the tree was really not correct -- this
bogus optimization should have been unconditionally removed
or if you really wanted an ifdef made dependent on !CONFIG_XEN &&
!CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that
uses uncached mappings in memory).

You just worked around the obvious failure and left the non-obvious
rare corruptions in, which isn't a good strategy.

-Andi

2007-10-22 18:33:11

by Jeremy Fitzhardinge

Subject: Re: Interaction between Xen and XFS: stray RW mappings

dean gaudet wrote:
> sounds like a bug in xen to me :)
>

I explained at the head of this thread how and why Xen works in this
manner. It's certainly a change from native execution; whether you
consider it to be a bug is a different matter.

But it turns out that leaving stray mappings around on pages which get
later used in special ways is a bad idea, and can manifest in all kinds
of random unpleasant ways. Getting a clear error out of Xen is probably
the nicest way to discover and debug this problem.

J

2007-10-22 18:37:59

by Jeremy Fitzhardinge

Subject: Re: Interaction between Xen and XFS: stray RW mappings

Nick Piggin wrote:
> You could call it a bug I think. I don't know much about Xen though,
> whether or not it expects to be able to run an arbitrary OS kernel.
>

Xen's paravirtualized mode always requires a guest OS to be modified;
certainly some operating systems would be very hard to make work with
Xen. But you can always fall back to using shadow pagetables or full
hvm (VT/SVM) mode.

> Presumably, the hypervisor _could_ write protect and trap writes to
> *all* page table page mappings.
>

Xen manages this stuff with refcounts; it doesn't maintain an rmap for
these pages, so it would have to exhaustively search to do this. But
aside from that, Xen never actively modifies pagetables, so this would
be a new behaviour in the interface.

J

2007-10-22 18:41:22

by Jeremy Fitzhardinge

Subject: Re: Interaction between Xen and XFS: stray RW mappings

Andi Kleen wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>> Yes, that's precisely the problem. xfs does delay the unmap, leaving
>> stray mappings, which upsets Xen.
>>
>
> Again it not just upsets Xen, keeping mappings to freed pages is wrong generally
> and violates the x86 (and likely others like PPC) architecture because it can
> cause illegal caching attribute aliases.
>
> The patch that went into the tree was really not correct -- this
> bogus optimization should have been unconditionally removed
> or if you really wanted an ifdef made dependent on !CONFIG_XEN &&
> !CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that
> uses uncached mappings in memory).
>
> You just worked around the obvious failure and leave the non obvious
> rare corruptions in, which isn't a good strategy.

Well, at least it becomes a known issue and/or placeholder for when Nick
does his grand unified vmap manager. I guess a clean workaround would
be to add a CONFIG_XFS_LAZY_UNMAP, and do it at the Kconfig level...
I'll cook up a patch.

J

2007-10-22 19:07:51

by Andi Kleen

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Mon, Oct 22, 2007 at 11:40:52AM -0700, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > Jeremy Fitzhardinge <[email protected]> writes:
> >
> >> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> >> stray mappings, which upsets Xen.
> >>
> >
> > Again it not just upsets Xen, keeping mappings to freed pages is wrong generally
> > and violates the x86 (and likely others like PPC) architecture because it can
> > cause illegal caching attribute aliases.
> >
> > The patch that went into the tree was really not correct -- this
> > bogus optimization should have been unconditionally removed
> > or if you really wanted an ifdef made dependent on !CONFIG_XEN &&
> > !CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that
> > uses uncached mappings in memory).
> >
> > You just worked around the obvious failure and leave the non obvious
> > rare corruptions in, which isn't a good strategy.
>
> Well, at least it becomes a known issue and/or placeholder for when Nick

It's hidden now so it doesn't cause any obvious failures any more. Just
subtle ones, which is much worse.

But why not just disable it? It's not critical functionality,
just an optimization that unfortunately turned out to be incorrect.

It could be made correct short term by not freeing the pages until
vunmap for example.

> does his grand unified vmap manager. I guess a clean workaround would
> be to add a CONFIG_XFS_LAZY_UNMAP, and do it at the Kconfig level...
> I'll cook up a patch.

This option could only be safely set in architectures that don't
care about caching attribute aliases or never remap kernel pages
uncached. That's not x86, powerpc, ia64 at a minimum.

I think the only good short term fix is to turn your ifdef CONFIG_XEN
into a #if 0

-Andi

2007-10-22 19:12:20

by Jeremy Fitzhardinge

Subject: Re: Interaction between Xen and XFS: stray RW mappings

Andi Kleen wrote:
> It's hidden now so it causes any obvious failures any more. Just
> subtle ones which is much worse.
>

I think anything detected by Xen is still classed as "obscure" ;)

> But why not just disable it? It's not critical functionality,
> just a optimization that unfortunately turned out to be incorrect.
>
> It could be made correct short term by not freeing the pages until
> vunmap for example.
>

I think it only ends up holding 64 pages (or is it 64 mappings?), so it's
not a horrible use of memory. Particularly since it responds to memory
pressure callbacks.

>> does his grand unified vmap manager. I guess a clean workaround would
>> be to add a CONFIG_XFS_LAZY_UNMAP, and do it at the Kconfig level...
>> I'll cook up a patch.
>>
>
> This option could only be safely set in architectures that don't
> care about caching attribute aliases or never remap kernel pages
> uncached. That's not x86, powerpc, ia64 at a minimum.
>
> I think the only good short term fix is to turn your ifdef CONFIG_XEN
> into a #if 0
>

#if 1, but yes, I see your point.

J

2007-10-22 22:33:19

by David Chinner

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Mon, Oct 22, 2007 at 09:07:40PM +0200, Andi Kleen wrote:
> On Mon, Oct 22, 2007 at 11:40:52AM -0700, Jeremy Fitzhardinge wrote:
> > Andi Kleen wrote:
> > > Jeremy Fitzhardinge <[email protected]> writes:
> > >
> > >> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> > >> stray mappings, which upsets Xen.
> > >>
> > >
> > > Again it not just upsets Xen, keeping mappings to freed pages is wrong generally
> > > and violates the x86 (and likely others like PPC) architecture because it can
> > > cause illegal caching attribute aliases.
> > >
> > > The patch that went into the tree was really not correct -- this
> > > bogus optimization should have been unconditionally removed
> > > or if you really wanted an ifdef made dependent on !CONFIG_XEN &&
> > > !CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that
> > > uses uncached mappings in memory).
> > >
> > > You just worked around the obvious failure and leave the non obvious
> > > rare corruptions in, which isn't a good strategy.
> >
> > Well, at least it becomes a known issue and/or placeholder for when Nick
>
> It's hidden now so it causes any obvious failures any more. Just
> subtle ones which is much worse.

There's no evidence of any problems ever being caused by this code.
It's been there for years.

> But why not just disable it? It's not critical functionality,
> just a optimization that unfortunately turned out to be incorrect.

It is critical functionality for larger machines because vmap()/vunmap()
really does suck in terms of scalability.

> It could be made correct short term by not freeing the pages until
> vunmap for example.

Could vmap()/vunmap() take references to the pages that are
mapped? That way delaying the unmap would also delay the freeing
of the pages and hence we'd have no problems with the pages
being reused before the mapping is torn down. That'd work for
Xen even with XFS's lazy unmapping scheme, and would allow
Nick's more advanced methods to work as well....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-10-22 23:35:27

by Andi Kleen

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote:
> On Mon, Oct 22, 2007 at 09:07:40PM +0200, Andi Kleen wrote:
> > On Mon, Oct 22, 2007 at 11:40:52AM -0700, Jeremy Fitzhardinge wrote:
> > > Andi Kleen wrote:
> > > > Jeremy Fitzhardinge <[email protected]> writes:
> > > >
> > > >> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> > > >> stray mappings, which upsets Xen.
> > > >>
> > > >
> > > > Again it not just upsets Xen, keeping mappings to freed pages is wrong generally
> > > > and violates the x86 (and likely others like PPC) architecture because it can
> > > > cause illegal caching attribute aliases.
> > > >
> > > > The patch that went into the tree was really not correct -- this
> > > > bogus optimization should have been unconditionally removed
> > > > or if you really wanted an ifdef made dependent on !CONFIG_XEN &&
> > > > !CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that
> > > > uses uncached mappings in memory).
> > > >
> > > > You just worked around the obvious failure and leave the non obvious
> > > > rare corruptions in, which isn't a good strategy.
> > >
> > > Well, at least it becomes a known issue and/or placeholder for when Nick
> >
> > It's hidden now so it causes any obvious failures any more. Just
> > subtle ones which is much worse.
>
> There's no evidence of any problems ever being caused by this code.

We had a fairly nasty bug a couple of years ago with such
a mapping that took months to track down. Luckily we had
help from a hardware debug lab, without that it would have
been very near impossible. But just stabilizing the problem was
a nightmare.

You will only see problems if the memory you're still mapping
will be allocated by someone who turns it into an uncached
mapping. And even then it doesn't always happen because
the CPU does not always do the necessary prefetches.

Also the corruption will not be on the XFS side,
but on the uncached mapping so typically some data sent
to some device gets a few corrupted cache line sized blocks.
Depending on the device this might also not be fatal -- e.g.
if it is texture data some corrupted pixels occasionally are
not that visible. But there can still be cases where
it can cause real failures when it hits something critical
(the failures where it was tracked down were usually it hitting
some command ring and the hardware erroring out).

Also every time I broke change_page_attr() flushing (which
handles exactly this problem and is tricky so it got broken
a few times) I eventually got complaints/reports from some users who ran
into such data inconsistencies. Granted, it was mostly from closed
3d driver users, who seem to be the most aggressive in using
uncached mappings. But with more systems running more aggressive
3d drivers it'll likely get worse.

> It's been there for years.

That doesn't mean it is correct.

Basically you're saying: I don't care if I corrupt device driver
data. That's not a very nice thing to say.

> > But why not just disable it? It's not critical functionality,
> > just a optimization that unfortunately turned out to be incorrect.
>
> It is critical functionality for larger machines because vmap()/vunmap()
> really does suck in terms of scalability.

Critical perhaps, but also broken.

And if it's that big a problem, would it really be that difficult
to change only the time critical paths to not use it? I assume
it's only the directory code where it is a problem? That would
likely be even faster in general.

> > It could be made correct short term by not freeing the pages until
> > vunmap for example.
>
> Could vmap()/vunmap() take references to the pages that are
> mapped? That way delaying the unmap would also delay the freeing
> of the pages and hence we'd have no problems with the pages
> being reused before the mapping is torn down. That'd work for
> Xen even with XFS's lazy unmapping scheme, and would allow
> Nick's more advanced methods to work as well....

You could always just keep around an array of the pages and
then drop the reference count after unmap. Or walk the vmalloc mapping
and generate such an array before freeing, then unmap and then
drop the reference counts.

Or the other alternative would be to change the time critical code
that uses it not to.

-Andi

2007-10-23 00:15:28

by Zachary Amsden

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Tue, 2007-10-23 at 01:35 +0200, Andi Kleen wrote:
> On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote:
> > On Mon, Oct 22, 2007 at 09:07:40PM +0200, Andi Kleen wrote:
> > > On Mon, Oct 22, 2007 at 11:40:52AM -0700, Jeremy Fitzhardinge wrote:
> > > > Andi Kleen wrote:
> > > > > Jeremy Fitzhardinge <[email protected]> writes:
> > > > >
> > > > >> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> > > > >> stray mappings, which upsets Xen.
> > > > >>
> > > > >
> > > > > Again it not just upsets Xen, keeping mappings to freed pages is wrong generally
> > > > > and violates the x86 (and likely others like PPC) architecture because it can
> > > > > cause illegal caching attribute aliases.

It is a serious offense to leave stray mappings for memory which can get
re-mapped to I/O devices... especially with PCI and other device
hotplug. I have to back up Andi on this one unconditionally.

On architectures where you have multibyte, non-wordsize updates to
hardware page tables, you even have races here when setting, updating
and clearing PTEs that must be done in a sequence where no aliasing of
mappings to partially written PTEs can result in I/O memory getting
mapped in a cacheable state. The window here is only one instruction,
and yes, it is possible for a window this small to create a problem. A
large (or permanently lazy) window is extremely frightening.

These things do cause bugs. The bugs take a very long time to show up
and are very difficult to track down, since they can basically cause
random behavior, such as hanging the machine or losing orbit and
crashing into the moon.

Zach

2007-10-23 00:37:28

by David Chinner

Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Tue, Oct 23, 2007 at 01:35:14AM +0200, Andi Kleen wrote:
> On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote:
> > On Mon, Oct 22, 2007 at 09:07:40PM +0200, Andi Kleen wrote:
> > > It's hidden now so it causes any obvious failures any more. Just subtle
> > > ones which is much worse.
> >
> > There's no evidence of any problems ever being caused by this code.
.....
> Also the corruption will not be on the XFS side, but on the uncached mapping
> so typically some data sent to some device gets a few corrupted cache line
> sized blocks. Depending on the device this might also not be fatal -- e.g.
> if it is texture data some corrupted pixels occasionally are not that
> visible. But there can be still cases where it can cause real failures when
> it hits something critical (the failures were it was tracked down was
> usually it hitting some command ring and the hardware erroring out)

There seems to be little evidence of that occurring around the place.

> > It's been there for years.
>
> That doesn't mean it is correct.

Right, but it also points to the fact that it's not causing problems
for 99.999% of ppl out there.

> Basically you're saying: I don't care if I corrupt device driver data.
> That's not a very nice thing to say.

No, I did not say that. I said that there's no evidence that points to
this causing problems anywhere.

> > > But why not just disable it? It's not critical functionality, just a
> > > optimization that unfortunately turned out to be incorrect.
> >
> > It is critical functionality for larger machines because vmap()/vunmap()
> > really does suck in terms of scalability.
>
> Critical perhaps, but also broken.
>
> And if it's that big a problem would it be really that difficult to change
> only the time critical paths using it to not? I assume it's only the
> directory code where it is a problem? That would be likely even faster in
> general.

Difficult - yes. All the btree code in XFS would have to be rewritten
to remove the assumption that the buffer structures are contiguous in
memory. Any bug we introduce in doing this will result in on disk corruption,
which is far worse than occasionally crashing a machine with a stray
mapping.

> > > It could be made correct short term by not freeing the pages until
> > > vunmap for example.
> >
> > Could vmap()/vunmap() take references to the pages that are mapped? That
> > way delaying the unmap would also delay the freeing of the pages and hence
> > we'd have no problems with the pages being reused before the mapping is
> > torn down. That'd work for Xen even with XFS's lazy unmapping scheme, and
> > would allow Nick's more advanced methods to work as well....
>
> You could always just keep around an array of the pages and then drop the
> reference count after unmap. Or walk the vmalloc mapping and generate such
> an array before freeing, then unmap and then drop the reference counts.

You mean like vmap() could record the pages passed to it in the area->pages
array, and we walk and release them in __vunmap() like it already does
for vfree()?

If we did this, it would probably be best to pass a page release function
into the vmap or vunmap call - we'd need page_cache_release() called on
the page rather than __free_page()....

The solution belongs behind the vmap/vunmap interface, not in XFS....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-10-23 07:05:09

by David Chinner

Subject: [patch] Re: Interaction between Xen and XFS: stray RW mappings

On Tue, Oct 23, 2007 at 10:36:41AM +1000, David Chinner wrote:
> On Tue, Oct 23, 2007 at 01:35:14AM +0200, Andi Kleen wrote:
> > On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote:
> > > Could vmap()/vunmap() take references to the pages that are mapped? That
> > > way delaying the unmap would also delay the freeing of the pages and hence
> > > we'd have no problems with the pages being reused before the mapping is
> > > torn down. That'd work for Xen even with XFS's lazy unmapping scheme, and
> > > would allow Nick's more advanced methods to work as well....
> >
> > You could always just keep around an array of the pages and then drop the
> > reference count after unmap. Or walk the vmalloc mapping and generate such
> > an array before freeing, then unmap and then drop the reference counts.
>
> You mean like vmap() could record the pages passed to it in the area->pages
> array, and we walk and release than in __vunmap() like it already does
> for vfree()?
>
> If we did this, it would probably be best to pass a page release function
> into the vmap or vunmap call - we'd need page_cache_release() called on
> the page rather than __free_page()....
>
> The solution belongs behind the vmap/vunmap interface, not in XFS....

Lightly tested(*) patch that does this with lazy unmapping
below for comment.

(*) a) kernel boots, b) made an XFS filesystem with 64k directory
blocks, created ~100,000 files in a directory to get a wide btree
(~1700 blocks, still only a single level) and run repeated finds
across it dropping caches in between. Each traversal maps and
unmaps every btree block.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
fs/xfs/linux-2.6/xfs_buf.c | 21 +----
include/linux/vmalloc.h | 4 +
mm/vmalloc.c | 160 +++++++++++++++++++++++++++++++++++----------
3 files changed, 133 insertions(+), 52 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-10-18 17:11:06.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c 2007-10-23 14:48:32.131088616 +1000
@@ -187,19 +187,6 @@ free_address(
{
a_list_t *aentry;

-#ifdef CONFIG_XEN
- /*
- * Xen needs to be able to make sure it can get an exclusive
- * RO mapping of pages it wants to turn into a pagetable. If
- * a newly allocated page is also still being vmap()ed by xfs,
- * it will cause pagetable construction to fail. This is a
- * quick workaround to always eagerly unmap pages so that Xen
- * is happy.
- */
- vunmap(addr);
- return;
-#endif
-
aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
if (likely(aentry)) {
spin_lock(&as_lock);
@@ -209,7 +196,7 @@ free_address(
as_list_len++;
spin_unlock(&as_lock);
} else {
- vunmap(addr);
+ vunmap_pages(addr, put_page);
}
}

@@ -228,7 +215,7 @@ purge_addresses(void)
spin_unlock(&as_lock);

while ((old = aentry) != NULL) {
- vunmap(aentry->vm_addr);
+ vunmap_pages(aentry->vm_addr, put_page);
aentry = aentry->next;
kfree(old);
}
@@ -456,8 +443,8 @@ _xfs_buf_map_pages(
} else if (flags & XBF_MAPPED) {
if (as_list_len > 64)
purge_addresses();
- bp->b_addr = vmap(bp->b_pages, bp->b_page_count,
- VM_MAP, PAGE_KERNEL);
+ bp->b_addr = vmap_pages(bp->b_pages, bp->b_page_count,
+ VM_MAP, PAGE_KERNEL, xb_to_gfp(flags));
if (unlikely(bp->b_addr == NULL))
return -ENOMEM;
bp->b_addr += bp->b_offset;
Index: 2.6.x-xfs-new/include/linux/vmalloc.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/vmalloc.h 2007-09-12 15:41:41.000000000 +1000
+++ 2.6.x-xfs-new/include/linux/vmalloc.h 2007-10-23 14:36:56.264860921 +1000
@@ -51,6 +51,10 @@ extern void *vmap(struct page **pages, u
unsigned long flags, pgprot_t prot);
extern void vunmap(void *addr);

+extern void *vmap_pages(struct page **pages, unsigned int count,
+ unsigned long flags, pgprot_t prot, gfp_t gfp_mask);
+extern void vunmap_pages(void *addr, void (*release)(struct page *page));
+
extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
void vmalloc_sync_all(void);
Index: 2.6.x-xfs-new/mm/vmalloc.c
===================================================================
--- 2.6.x-xfs-new.orig/mm/vmalloc.c 2007-09-12 15:41:48.000000000 +1000
+++ 2.6.x-xfs-new/mm/vmalloc.c 2007-10-23 16:29:40.130384957 +1000
@@ -318,7 +318,56 @@ struct vm_struct *remove_vm_area(void *a
return v;
}

-static void __vunmap(void *addr, int deallocate_pages)
+static void vm_area_release_page(struct page *page)
+{
+ __free_page(page);
+}
+
+static int vm_area_alloc_pagearray(struct vm_struct *area, gfp_t gfp_mask,
+ unsigned int nr_pages, int node)
+{
+ struct page **pages;
+ unsigned int array_size;
+
+ array_size = (nr_pages * sizeof(struct page *));
+
+ area->nr_pages = nr_pages;
+ /* Please note that the recursion is strictly bounded. */
+ if (array_size > PAGE_SIZE) {
+ pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
+ PAGE_KERNEL, node);
+ area->flags |= VM_VPAGES;
+ } else {
+ pages = kmalloc_node(array_size,
+ (gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO,
+ node);
+ }
+ area->pages = pages;
+ if (!area->pages) {
+ remove_vm_area(area->addr);
+ kfree(area);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static void vm_area_release_pagearray(struct vm_struct *area,
+ void (*release)(struct page *))
+{
+ int i;
+
+ for (i = 0; i < area->nr_pages; i++) {
+ BUG_ON(!area->pages[i]);
+ release(area->pages[i]);
+ }
+
+ if (area->flags & VM_VPAGES)
+ vfree(area->pages);
+ else
+ kfree(area->pages);
+}
+
+static void __vunmap(void *addr, void (*release)(struct page *))
{
struct vm_struct *area;

@@ -341,19 +390,8 @@ static void __vunmap(void *addr, int dea

debug_check_no_locks_freed(addr, area->size);

- if (deallocate_pages) {
- int i;
-
- for (i = 0; i < area->nr_pages; i++) {
- BUG_ON(!area->pages[i]);
- __free_page(area->pages[i]);
- }
-
- if (area->flags & VM_VPAGES)
- vfree(area->pages);
- else
- kfree(area->pages);
- }
+ if (release)
+ vm_area_release_pagearray(area, release);

kfree(area);
return;
@@ -372,7 +410,7 @@ static void __vunmap(void *addr, int dea
void vfree(void *addr)
{
BUG_ON(in_interrupt());
- __vunmap(addr, 1);
+ __vunmap(addr, vm_area_release_page);
}
EXPORT_SYMBOL(vfree);

@@ -388,11 +426,30 @@ EXPORT_SYMBOL(vfree);
void vunmap(void *addr)
{
BUG_ON(in_interrupt());
- __vunmap(addr, 0);
+ __vunmap(addr, NULL);
}
EXPORT_SYMBOL(vunmap);

/**
+ * vunmap_pages - release virtual mapping obtained by vmap_pages()
+ * @addr: memory base address
+ * @release: release page callback function
+ *
+ * Free the virtually contiguous memory area starting at @addr,
+ * which was created from the page array passed to vmap_pages(),
+ * then call the release function on each page in the associated
+ * array of page attached to the area.
+ *
+ * Must not be called in interrupt context.
+ */
+void vunmap_pages(void *addr, void (*release)(struct page *))
+{
+ BUG_ON(in_interrupt());
+ __vunmap(addr, release);
+}
+EXPORT_SYMBOL(vunmap_pages);
+
+/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
* @count: number of pages to map
@@ -422,32 +479,63 @@ void *vmap(struct page **pages, unsigned
}
EXPORT_SYMBOL(vmap);

+/**
+ * vmap_pages - map an array of pages into virtually contiguous space
+ * @pages: array of page pointers
+ * @count: number of pages to map
+ * @flags: vm_area->flags
+ * @prot: page protection for the mapping
+ * @gfp_mask: flags for the page level allocator
+ *
+ * Maps @count pages from @pages into contiguous kernel virtual
+ * space taking a reference to each page and keeping track of all
+ * the pages within the vm area structure.
+ */
+void *vmap_pages(struct page **pages, unsigned int count,
+ unsigned long flags, pgprot_t prot, gfp_t gfp_mask)
+{
+ struct vm_struct *area;
+ struct page **pgp;
+ int i;
+
+ if (count > num_physpages)
+ return NULL;
+
+ area = get_vm_area((count << PAGE_SHIFT), flags);
+ if (!area)
+ return NULL;
+ if (vm_area_alloc_pagearray(area, gfp_mask, count, -1))
+ return NULL;
+
+ /* map_vm_area modifies pgp */
+ pgp = pages;
+ if (map_vm_area(area, prot, &pgp)) {
+ vunmap(area->addr);
+ return NULL;
+ }
+ /*
+ * now that the region is mapped, take a reference to each
+ * page and store them in the area page array.
+ */
+ for (i = 0; i < area->nr_pages; i++) {
+ get_page(pages[i]);
+ area->pages[i] = pages[i];
+ }
+
+ return area->addr;
+}
+EXPORT_SYMBOL(vmap_pages);
+
void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
pgprot_t prot, int node)
{
struct page **pages;
- unsigned int nr_pages, array_size, i;
+ unsigned int nr_pages;
+ int i;

nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
- array_size = (nr_pages * sizeof(struct page *));
-
- area->nr_pages = nr_pages;
- /* Please note that the recursion is strictly bounded. */
- if (array_size > PAGE_SIZE) {
- pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
- PAGE_KERNEL, node);
- area->flags |= VM_VPAGES;
- } else {
- pages = kmalloc_node(array_size,
- (gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO,
- node);
- }
- area->pages = pages;
- if (!area->pages) {
- remove_vm_area(area->addr);
- kfree(area);
+ if (vm_area_alloc_pagearray(area, gfp_mask, nr_pages, node))
return NULL;
- }

for (i = 0; i < area->nr_pages; i++) {
if (node < 0)
@@ -461,6 +549,8 @@ void *__vmalloc_area_node(struct vm_stru
}
}

+ /* map_vm_area modifies pages */
+ pages = area->pages;
if (map_vm_area(area, prot, &pages))
goto fail;
return area->addr;

2007-10-23 09:28:50

by Andi Kleen

[permalink] [raw]
Subject: Re: Interaction between Xen and XFS: stray RW mappings

On Tue, Oct 23, 2007 at 10:36:41AM +1000, David Chinner wrote:
> > That doesn't mean it is correct.
>
> Right, but it also points to the fact that it's not causing problems
> for 99.999% of people out there.

So you're waiting for someone to take months to debug this again?

> You mean like vmap() could record the pages passed to it in the area->pages
> array,

The page tables contain pointers to the pages anyway. vunmap() has to walk
them. It would not be very difficult to store the pages in an array during
the walk.
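
(Illustrative sketch only: one way to record the backing pages of a
vmap()ed range before tearing it down, using the existing
vmalloc_to_page() helper to do the page table walk. The function name
and signature are made up for the example and appear in no posted patch.)

static unsigned int collect_vmap_pages(void *addr, unsigned long size,
				       struct page **pages)
{
	unsigned int i, nr = size >> PAGE_SHIFT;

	/*
	 * Walk the kernel page tables for the mapped range and record
	 * each backing page so the caller can drop its reference after
	 * the area has been unmapped.
	 */
	for (i = 0; i < nr; i++)
		pages[i] = vmalloc_to_page(addr + ((unsigned long)i << PAGE_SHIFT));
	return nr;
}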

>
> If we did this, it would probably be best to pass a page release function
> into the vmap or vunmap call - we'd need page_cache_release() called on
> the page rather than __free_page()....

Possible. Normally vmalloc pages should not be in the LRU except yours,
so it would probably be fine to just change it.

> The solution belongs behind the vmap/vunmap interface, not in XFS....

You could also just keep the array from map time around yourself,
then drop the references yourself after the unmap.

-Andi
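
(A rough sketch of that caller-side alternative, for illustration only:
the references are taken when the region is mapped, the page array is
remembered alongside the deferred address, and everything is dropped when
the lazy unmap finally runs. The structure and helper below are
hypothetical and not part of any posted patch.)

struct lazy_vmap {
	void		*addr;		/* vmap()ed address		  */
	struct page	**pages;	/* page array kept from map time */
	unsigned int	count;
};

static void lazy_vmap_destroy(struct lazy_vmap *lv)
{
	unsigned int i;

	vunmap(lv->addr);
	/* drop the references taken with get_page() at map time */
	for (i = 0; i < lv->count; i++)
		put_page(lv->pages[i]);
	kfree(lv->pages);
	kfree(lv);
}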

2007-10-23 09:30:48

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] Re: Interaction between Xen and XFS: stray RW mappings

On Tue, Oct 23, 2007 at 05:04:14PM +1000, David Chinner wrote:
> On Tue, Oct 23, 2007 at 10:36:41AM +1000, David Chinner wrote:
> > On Tue, Oct 23, 2007 at 01:35:14AM +0200, Andi Kleen wrote:
> > > On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote:
> > > > Could vmap()/vunmap() take references to the pages that are mapped? That
> > > > way delaying the unmap would also delay the freeing of the pages and hence
> > > > we'd have no problems with the pages being reused before the mapping is
> > > > torn down. That'd work for Xen even with XFS's lazy unmapping scheme, and
> > > > would allow Nick's more advanced methods to work as well....
> > >
> > > You could always just keep around an array of the pages and then drop the
> > > reference count after unmap. Or walk the vmalloc mapping and generate such
> > > an array before freeing, then unmap and then drop the reference counts.
> >
> > You mean like vmap() could record the pages passed to it in the area->pages
> > array, and we walk and release than in __vunmap() like it already does
> > for vfree()?
> >
> > If we did this, it would probably be best to pass a page release function
> > into the vmap or vunmap call - we'd need page_cache_release() called on
> > the page rather than __free_page()....
> >
> > The solution belongs behind the vmap/vunmap interface, not in XFS....
>
> Lightly tested(*) patch that does this with lazy unmapping
> below for comment.

Thanks

>
> (*) a) kernel boots, b) made an XFS filesystem with 64k directory
> blocks, created ~100,000 files in a directory to get a wide btree
> (~1700 blocks, still only a single level) and ran repeated finds
> across it, dropping caches in between. Each traversal maps and
> unmaps every btree block.

Hmm, the __free_page -> page_cache_release() change in vfree() would
have been simpler, wouldn't it?

But if it works it is fine.

-Andi

2007-10-23 12:42:25

by David Chinner

[permalink] [raw]
Subject: Re: [patch] Re: Interaction between Xen and XFS: stray RW mappings

On Tue, Oct 23, 2007 at 11:30:35AM +0200, Andi Kleen wrote:
> On Tue, Oct 23, 2007 at 05:04:14PM +1000, David Chinner wrote:
> > > You mean like vmap() could record the pages passed to it in the area->pages
> > > array, and we walk and release than in __vunmap() like it already does
> > > for vfree()?
> > >
> > > If we did this, it would probably be best to pass a page release function
> > > into the vmap or vunmap call - we'd need page_cache_release() called on
> > > the page rather than __free_page()....
> > >
> > > The solution belongs behind the vmap/vunmap interface, not in XFS....
> >
> > Lightly tested(*) patch that does this with lazy unmapping
> > below for comment.
>
> Thanks
>
> >
> > (*) a) kernel boots, b) made an XFS filesystem with 64k directory
> > blocks, created ~100,000 files in a directory to get a wide btree
> > (~1700 blocks, still only a single level) and ran repeated finds
> > across it, dropping caches in between. Each traversal maps and
> > unmaps every btree block.
>
> Hmm, the __free_page -> page_cache_release() change in vfree() would
> have been simpler, wouldn't it?

Yes, it would, but I tried to leave vmalloc doing the same thing as
it did before. I think that it would be safe simply to call put_page()
directly in the __vunmap() code and drop all the release function
passing, but I don't know enough about that code to say for certain.
I'll spin up another patch tomorrow that does this and see how it goes.

> But if it works it is fine.

Seems to - it's passed XFSQA with 64k directories and a bunch of
dirstress workouts as well.

Nick, Jeremy, (others?) any objections to this approach to solve
the problem?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-10-23 14:33:41

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch] Re: Interaction between Xen and XFS: stray RW mappings

David Chinner wrote:
> Nick, Jeremy, (others?) any objections to this approach to solve
> the problem?

Seems sound in principle. I think Nick's shootdown-stray-mappings mm
API call is a better long-term answer, but this will do for now. I'll
test it out today.

J

2007-10-24 04:37:08

by David Chinner

[permalink] [raw]
Subject: [PATCH] Allow lazy unmapping by taking extra page references V2


Allow lazy unmapping of vmap()d regions by taking a reference
on each page in the region being vmap()ed. The page references
get released after the region is vunmap()ed, thereby ensuring
we don't leave stray mappings on freed pages that could lead to
pages being reallocated with incorrect mappings still associated
with them.

Tested on XFS filesystems with 64k directory blocks using
dirstress and XFSQA.

Version 2:
- drop the release function stuff and just call put_page()
as it falls down to the same code in all calling cases.

Signed-Off-By: Dave Chinner <[email protected]>

---
fs/xfs/linux-2.6/xfs_buf.c | 21 +-------
include/linux/vmalloc.h | 4 +
mm/vmalloc.c | 118 +++++++++++++++++++++++++++++++++++++--------
3 files changed, 106 insertions(+), 37 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-10-18 17:11:06.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c 2007-10-24 09:16:26.930345566 +1000
@@ -187,19 +187,6 @@ free_address(
{
a_list_t *aentry;

-#ifdef CONFIG_XEN
- /*
- * Xen needs to be able to make sure it can get an exclusive
- * RO mapping of pages it wants to turn into a pagetable. If
- * a newly allocated page is also still being vmap()ed by xfs,
- * it will cause pagetable construction to fail. This is a
- * quick workaround to always eagerly unmap pages so that Xen
- * is happy.
- */
- vunmap(addr);
- return;
-#endif
-
aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
if (likely(aentry)) {
spin_lock(&as_lock);
@@ -209,7 +196,7 @@ free_address(
as_list_len++;
spin_unlock(&as_lock);
} else {
- vunmap(addr);
+ vunmap_pages(addr);
}
}

@@ -228,7 +215,7 @@ purge_addresses(void)
spin_unlock(&as_lock);

while ((old = aentry) != NULL) {
- vunmap(aentry->vm_addr);
+ vunmap_pages(aentry->vm_addr);
aentry = aentry->next;
kfree(old);
}
@@ -456,8 +443,8 @@ _xfs_buf_map_pages(
} else if (flags & XBF_MAPPED) {
if (as_list_len > 64)
purge_addresses();
- bp->b_addr = vmap(bp->b_pages, bp->b_page_count,
- VM_MAP, PAGE_KERNEL);
+ bp->b_addr = vmap_pages(bp->b_pages, bp->b_page_count,
+ VM_MAP, PAGE_KERNEL, xb_to_gfp(flags));
if (unlikely(bp->b_addr == NULL))
return -ENOMEM;
bp->b_addr += bp->b_offset;
Index: 2.6.x-xfs-new/include/linux/vmalloc.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/vmalloc.h 2007-09-12 15:41:41.000000000 +1000
+++ 2.6.x-xfs-new/include/linux/vmalloc.h 2007-10-24 09:16:55.194741855 +1000
@@ -51,6 +51,10 @@ extern void *vmap(struct page **pages, u
unsigned long flags, pgprot_t prot);
extern void vunmap(void *addr);

+extern void *vmap_pages(struct page **pages, unsigned int count,
+ unsigned long flags, pgprot_t prot, gfp_t gfp_mask);
+extern void vunmap_pages(void *addr);
+
extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
void vmalloc_sync_all(void);
Index: 2.6.x-xfs-new/mm/vmalloc.c
===================================================================
--- 2.6.x-xfs-new.orig/mm/vmalloc.c 2007-09-12 15:41:48.000000000 +1000
+++ 2.6.x-xfs-new/mm/vmalloc.c 2007-10-24 09:30:53.447102103 +1000
@@ -318,6 +318,34 @@ struct vm_struct *remove_vm_area(void *a
return v;
}

+static int vm_area_alloc_pagearray(struct vm_struct *area, gfp_t gfp_mask,
+ unsigned int nr_pages, int node)
+{
+ struct page **pages;
+ unsigned int array_size;
+
+ array_size = (nr_pages * sizeof(struct page *));
+
+ area->nr_pages = nr_pages;
+ /* Please note that the recursion is strictly bounded. */
+ if (array_size > PAGE_SIZE) {
+ pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
+ PAGE_KERNEL, node);
+ area->flags |= VM_VPAGES;
+ } else {
+ pages = kmalloc_node(array_size,
+ (gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO,
+ node);
+ }
+ area->pages = pages;
+ if (!area->pages) {
+ remove_vm_area(area->addr);
+ kfree(area);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
static void __vunmap(void *addr, int deallocate_pages)
{
struct vm_struct *area;
@@ -346,7 +374,7 @@ static void __vunmap(void *addr, int dea

for (i = 0; i < area->nr_pages; i++) {
BUG_ON(!area->pages[i]);
- __free_page(area->pages[i]);
+ put_page(area->pages[i]);
}

if (area->flags & VM_VPAGES)
@@ -393,6 +421,23 @@ void vunmap(void *addr)
EXPORT_SYMBOL(vunmap);

/**
+ * vunmap_pages - release virtual mapping obtained by vmap_pages()
+ * @addr: memory base address
+ *
+ * Free the virtually contiguous memory area starting at @addr,
+ * which was created from the page array passed to vmap_pages(),
+ * releasing the reference on the pages gained in vmap_pages().
+ *
+ * Must not be called in interrupt context.
+ */
+void vunmap_pages(void *addr)
+{
+ BUG_ON(in_interrupt());
+ __vunmap(addr, 1);
+}
+EXPORT_SYMBOL(vunmap_pages);
+
+/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
* @count: number of pages to map
@@ -422,32 +467,63 @@ void *vmap(struct page **pages, unsigned
}
EXPORT_SYMBOL(vmap);

+/**
+ * vmap_pages - map an array of pages into virtually contiguous space
+ * @pages: array of page pointers
+ * @count: number of pages to map
+ * @flags: vm_area->flags
+ * @prot: page protection for the mapping
+ * @gfp_mask: flags for the page level allocator
+ *
+ * Maps @count pages from @pages into contiguous kernel virtual
+ * space taking a reference to each page and keeping track of all
+ * the pages within the vm area structure.
+ */
+void *vmap_pages(struct page **pages, unsigned int count,
+ unsigned long flags, pgprot_t prot, gfp_t gfp_mask)
+{
+ struct vm_struct *area;
+ struct page **pgp;
+ int i;
+
+ if (count > num_physpages)
+ return NULL;
+
+ area = get_vm_area((count << PAGE_SHIFT), flags);
+ if (!area)
+ return NULL;
+ if (vm_area_alloc_pagearray(area, gfp_mask, count, -1))
+ return NULL;
+
+ /* map_vm_area modifies pgp */
+ pgp = pages;
+ if (map_vm_area(area, prot, &pgp)) {
+ vunmap(area->addr);
+ return NULL;
+ }
+ /*
+ * now that the region is mapped, take a reference to each
+ * page and store them in the area page array.
+ */
+ for (i = 0; i < area->nr_pages; i++) {
+ get_page(pages[i]);
+ area->pages[i] = pages[i];
+ }
+
+ return area->addr;
+}
+EXPORT_SYMBOL(vmap_pages);
+
void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
pgprot_t prot, int node)
{
struct page **pages;
- unsigned int nr_pages, array_size, i;
+ unsigned int nr_pages;
+ int i;

nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
- array_size = (nr_pages * sizeof(struct page *));
-
- area->nr_pages = nr_pages;
- /* Please note that the recursion is strictly bounded. */
- if (array_size > PAGE_SIZE) {
- pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
- PAGE_KERNEL, node);
- area->flags |= VM_VPAGES;
- } else {
- pages = kmalloc_node(array_size,
- (gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO,
- node);
- }
- area->pages = pages;
- if (!area->pages) {
- remove_vm_area(area->addr);
- kfree(area);
+ if (vm_area_alloc_pagearray(area, gfp_mask, nr_pages, node))
return NULL;
- }

for (i = 0; i < area->nr_pages; i++) {
if (node < 0)
@@ -461,6 +537,8 @@ void *__vmalloc_area_node(struct vm_stru
}
}

+ /* map_vm_area modifies pages */
+ pages = area->pages;
if (map_vm_area(area, prot, &pages))
goto fail;
return area->addr;
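
(For illustration only: a hypothetical caller of the new interface would
use it roughly as follows. The helper names and the GFP_KERNEL choice are
assumptions made for the example, not part of the patch.)

/* Hypothetical user of vmap_pages()/vunmap_pages(), for illustration. */
static void *map_buffer_pages(struct page **pages, unsigned int count)
{
	/*
	 * vmap_pages() takes its own reference on each page, so the
	 * mapping remains valid even if the original owner drops its
	 * references before the (possibly lazy) unmap happens.
	 */
	return vmap_pages(pages, count, VM_MAP, PAGE_KERNEL, GFP_KERNEL);
}

static void unmap_buffer_pages(void *addr)
{
	/* tears down the mapping and drops the per-page references */
	vunmap_pages(addr);
}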

2007-10-24 05:08:57

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] Allow lazy unmapping by taking extra page references V2

David Chinner wrote:
> Allow lazy unmapping of vmap()d regions by taking a reference
> on each page in the region being vmap()ed. The page references
> get released after the region is vunmap()ed, thereby ensuring
> we don't leave stray mappings on freed pages that could lead to
> pages being reallocated with incorrect mappings still associated
> with them.
>
> Tested on XFS filesystems with 64k directory blocks using
> dirstress and XFSQA.
>
> Version 2:
> - drop the release function stuff and just call put_page()
> as it falls down to the same code in all calling cases.
>

This doesn't apply cleanly to the current git tree.

J

2007-10-24 21:49:19

by David Chinner

[permalink] [raw]
Subject: [PATCH] Allow lazy unmapping by taking extra page references V3

Allow lazy unmapping of vmap()d regions by taking a reference
on each page in the region being vmap()ed. The page references
get released after the region is vunmap()ed, thereby ensuring
we don't leave stray mappings on freed pages that could lead to
pages being reallocated with incorrect mappings still associated
with them.

Tested on XFS filesystems with 64k directory blocks using
dirstress and XFSQA.

Version 3:
- compile on latest -git

Version 2:
- drop the release function stuff and just call put_page()
as it falls down to the same code in all calling cases.

Signed-off-by: Dave Chinner <[email protected]>

---

fs/xfs/linux-2.6/xfs_buf.c | 21 +-------
include/linux/vmalloc.h | 4 +
mm/vmalloc.c | 118 +++++++++++++++++++++++++++++++++++++--------
3 files changed, 106 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b9c8589..38f073f 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -187,19 +187,6 @@ free_address(
{
a_list_t *aentry;

-#ifdef CONFIG_XEN
- /*
- * Xen needs to be able to make sure it can get an exclusive
- * RO mapping of pages it wants to turn into a pagetable. If
- * a newly allocated page is also still being vmap()ed by xfs,
- * it will cause pagetable construction to fail. This is a
- * quick workaround to always eagerly unmap pages so that Xen
- * is happy.
- */
- vunmap(addr);
- return;
-#endif
-
aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
if (likely(aentry)) {
spin_lock(&as_lock);
@@ -209,7 +196,7 @@ #endif
as_list_len++;
spin_unlock(&as_lock);
} else {
- vunmap(addr);
+ vunmap_pages(addr);
}
}

@@ -228,7 +215,7 @@ purge_addresses(void)
spin_unlock(&as_lock);

while ((old = aentry) != NULL) {
- vunmap(aentry->vm_addr);
+ vunmap_pages(aentry->vm_addr);
aentry = aentry->next;
kfree(old);
}
@@ -458,8 +445,8 @@ _xfs_buf_map_pages(
} else if (flags & XBF_MAPPED) {
if (as_list_len > 64)
purge_addresses();
- bp->b_addr = vmap(bp->b_pages, bp->b_page_count,
- VM_MAP, PAGE_KERNEL);
+ bp->b_addr = vmap_pages(bp->b_pages, bp->b_page_count,
+ VM_MAP, PAGE_KERNEL, xb_to_gfp(flags));
if (unlikely(bp->b_addr == NULL))
return -ENOMEM;
bp->b_addr += bp->b_offset;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 89338b4..40c34da 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -51,6 +51,10 @@ extern void *vmap(struct page **pages, u
unsigned long flags, pgprot_t prot);
extern void vunmap(void *addr);

+extern void *vmap_pages(struct page **pages, unsigned int count,
+ unsigned long flags, pgprot_t prot, gfp_t gfp_mask);
+extern void vunmap_pages(void *addr);
+
extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
void vmalloc_sync_all(void);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index af77e17..95ba0fd 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -319,6 +319,34 @@ struct vm_struct *remove_vm_area(void *a
return v;
}

+static int vm_area_alloc_pagearray(struct vm_struct *area, gfp_t gfp_mask,
+ unsigned int nr_pages, int node)
+{
+ struct page **pages;
+ unsigned int array_size;
+
+ array_size = (nr_pages * sizeof(struct page *));
+
+ area->nr_pages = nr_pages;
+ /* Please note that the recursion is strictly bounded. */
+ if (array_size > PAGE_SIZE) {
+ pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
+ PAGE_KERNEL, node);
+ area->flags |= VM_VPAGES;
+ } else {
+ pages = kmalloc_node(array_size,
+ (gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO,
+ node);
+ }
+ area->pages = pages;
+ if (!area->pages) {
+ remove_vm_area(area->addr);
+ kfree(area);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
static void __vunmap(void *addr, int deallocate_pages)
{
struct vm_struct *area;
@@ -347,7 +375,7 @@ static void __vunmap(void *addr, int dea

for (i = 0; i < area->nr_pages; i++) {
BUG_ON(!area->pages[i]);
- __free_page(area->pages[i]);
+ put_page(area->pages[i]);
}

if (area->flags & VM_VPAGES)
@@ -394,6 +422,23 @@ void vunmap(void *addr)
EXPORT_SYMBOL(vunmap);

/**
+ * vunmap_pages - release virtual mapping obtained by vmap_pages()
+ * @addr: memory base address
+ *
+ * Free the virtually contiguous memory area starting at @addr,
+ * which was created from the page array passed to vmap_pages(),
+ * releasing the reference on the pages gained in vmap_pages().
+ *
+ * Must not be called in interrupt context.
+ */
+void vunmap_pages(void *addr)
+{
+ BUG_ON(in_interrupt());
+ __vunmap(addr, 1);
+}
+EXPORT_SYMBOL(vunmap_pages);
+
+/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
* @count: number of pages to map
@@ -423,32 +468,63 @@ void *vmap(struct page **pages, unsigned
}
EXPORT_SYMBOL(vmap);

+/**
+ * vmap_pages - map an array of pages into virtually contiguous space
+ * @pages: array of page pointers
+ * @count: number of pages to map
+ * @flags: vm_area->flags
+ * @prot: page protection for the mapping
+ * @gfp_mask: flags for the page level allocator
+ *
+ * Maps @count pages from @pages into contiguous kernel virtual
+ * space taking a reference to each page and keeping track of all
+ * the pages within the vm area structure.
+ */
+void *vmap_pages(struct page **pages, unsigned int count,
+ unsigned long flags, pgprot_t prot, gfp_t gfp_mask)
+{
+ struct vm_struct *area;
+ struct page **pgp;
+ int i;
+
+ if (count > num_physpages)
+ return NULL;
+
+ area = get_vm_area((count << PAGE_SHIFT), flags);
+ if (!area)
+ return NULL;
+ if (vm_area_alloc_pagearray(area, gfp_mask, count, -1))
+ return NULL;
+
+ /* map_vm_area modifies pgp */
+ pgp = pages;
+ if (map_vm_area(area, prot, &pgp)) {
+ vunmap(area->addr);
+ return NULL;
+ }
+ /*
+ * now that the region is mapped, take a reference to each
+ * page and store them in the area page array.
+ */
+ for (i = 0; i < area->nr_pages; i++) {
+ get_page(pages[i]);
+ area->pages[i] = pages[i];
+ }
+
+ return area->addr;
+}
+EXPORT_SYMBOL(vmap_pages);
+
void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
pgprot_t prot, int node)
{
struct page **pages;
- unsigned int nr_pages, array_size, i;
+ unsigned int nr_pages;
+ int i;

nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
- array_size = (nr_pages * sizeof(struct page *));
-
- area->nr_pages = nr_pages;
- /* Please note that the recursion is strictly bounded. */
- if (array_size > PAGE_SIZE) {
- pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
- PAGE_KERNEL, node);
- area->flags |= VM_VPAGES;
- } else {
- pages = kmalloc_node(array_size,
- (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
- node);
- }
- area->pages = pages;
- if (!area->pages) {
- remove_vm_area(area->addr);
- kfree(area);
+ if (vm_area_alloc_pagearray(area, gfp_mask, nr_pages, node))
return NULL;
- }

for (i = 0; i < area->nr_pages; i++) {
if (node < 0)
@@ -462,6 +538,8 @@ void *__vmalloc_area_node(struct vm_stru
}
}

+ /* map_vm_area modifies pages */
+ pages = area->pages;
if (map_vm_area(area, prot, &pages))
goto fail;
return area->addr;

2007-10-24 22:46:43

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] Allow lazy unmapping by taking extra page references V3

David Chinner wrote:
> Allow lazy unmapping of vmap()d regions by taking a reference
> on each page in the region being vmap()ed. The page references
> get released after the region is vunmap()ed, thereby ensuring
> we don't leave stray mappings on freed pages that could lead to
> pages being reallocated with incorrect mappings still associated
> with them.
>
> Tested on XFS filesystems with 64k directory blocks using
> dirstress and XFSQA.
>
> Version 3:
> - compile on latest -git
>

Not quite:

CC mm/vmalloc.o
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c: In function 'vm_area_alloc_pagearray':
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: 'GFP_LEVEL_MASK' undeclared (first use in this function)
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: (Each undeclared identifier is reported only once
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: for each function it appears in.)
make[3]: *** [mm/vmalloc.o] Error 1
make[2]: *** [mm] Error 2
make[2]: *** Waiting for unfinished jobs....


GFP_RECLAIM_MASK now?

J

2007-10-24 23:21:48

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH] Allow lazy unmapping by taking extra page references V3

On Wed, Oct 24, 2007 at 03:46:16PM -0700, Jeremy Fitzhardinge wrote:
> David Chinner wrote:
> > Version 3:
> > - compile on latest -git
> >
>
> Not quite:
>
> CC mm/vmalloc.o
> /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c: In function 'vm_area_alloc_pagearray':
> /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: 'GFP_LEVEL_MASK' undeclared (first use in this function)
> /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: (Each undeclared identifier is reported only once
> /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: for each function it appears in.)
> make[3]: *** [mm/vmalloc.o] Error 1
> make[2]: *** [mm] Error 2
> make[2]: *** Waiting for unfinished jobs....
>
>
> GFP_RECLAIM_MASK now?

Yeah, it is. Not sure what happened there - I did make that change.
New diff below.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group


diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b9c8589..38f073f 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -187,19 +187,6 @@ free_address(
{
a_list_t *aentry;

-#ifdef CONFIG_XEN
- /*
- * Xen needs to be able to make sure it can get an exclusive
- * RO mapping of pages it wants to turn into a pagetable. If
- * a newly allocated page is also still being vmap()ed by xfs,
- * it will cause pagetable construction to fail. This is a
- * quick workaround to always eagerly unmap pages so that Xen
- * is happy.
- */
- vunmap(addr);
- return;
-#endif
-
aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
if (likely(aentry)) {
spin_lock(&as_lock);
@@ -209,7 +196,7 @@ #endif
as_list_len++;
spin_unlock(&as_lock);
} else {
- vunmap(addr);
+ vunmap_pages(addr);
}
}

@@ -228,7 +215,7 @@ purge_addresses(void)
spin_unlock(&as_lock);

while ((old = aentry) != NULL) {
- vunmap(aentry->vm_addr);
+ vunmap_pages(aentry->vm_addr);
aentry = aentry->next;
kfree(old);
}
@@ -458,8 +445,8 @@ _xfs_buf_map_pages(
} else if (flags & XBF_MAPPED) {
if (as_list_len > 64)
purge_addresses();
- bp->b_addr = vmap(bp->b_pages, bp->b_page_count,
- VM_MAP, PAGE_KERNEL);
+ bp->b_addr = vmap_pages(bp->b_pages, bp->b_page_count,
+ VM_MAP, PAGE_KERNEL, xb_to_gfp(flags));
if (unlikely(bp->b_addr == NULL))
return -ENOMEM;
bp->b_addr += bp->b_offset;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 89338b4..40c34da 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -51,6 +51,10 @@ extern void *vmap(struct page **pages, u
unsigned long flags, pgprot_t prot);
extern void vunmap(void *addr);

+extern void *vmap_pages(struct page **pages, unsigned int count,
+ unsigned long flags, pgprot_t prot, gfp_t gfp_mask);
+extern void vunmap_pages(void *addr);
+
extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
void vmalloc_sync_all(void);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index af77e17..720f338 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -319,6 +319,34 @@ struct vm_struct *remove_vm_area(void *a
return v;
}

+static int vm_area_alloc_pagearray(struct vm_struct *area, gfp_t gfp_mask,
+ unsigned int nr_pages, int node)
+{
+ struct page **pages;
+ unsigned int array_size;
+
+ array_size = (nr_pages * sizeof(struct page *));
+
+ area->nr_pages = nr_pages;
+ /* Please note that the recursion is strictly bounded. */
+ if (array_size > PAGE_SIZE) {
+ pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
+ PAGE_KERNEL, node);
+ area->flags |= VM_VPAGES;
+ } else {
+ pages = kmalloc_node(array_size,
+ (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
+ node);
+ }
+ area->pages = pages;
+ if (!area->pages) {
+ remove_vm_area(area->addr);
+ kfree(area);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
static void __vunmap(void *addr, int deallocate_pages)
{
struct vm_struct *area;
@@ -347,7 +375,7 @@ static void __vunmap(void *addr, int dea

for (i = 0; i < area->nr_pages; i++) {
BUG_ON(!area->pages[i]);
- __free_page(area->pages[i]);
+ put_page(area->pages[i]);
}

if (area->flags & VM_VPAGES)
@@ -394,6 +422,23 @@ void vunmap(void *addr)
EXPORT_SYMBOL(vunmap);

/**
+ * vunmap_pages - release virtual mapping obtained by vmap_pages()
+ * @addr: memory base address
+ *
+ * Free the virtually contiguous memory area starting at @addr,
+ * which was created from the page array passed to vmap_pages(),
+ * releasing the reference on the pages gained in vmap_pages().
+ *
+ * Must not be called in interrupt context.
+ */
+void vunmap_pages(void *addr)
+{
+ BUG_ON(in_interrupt());
+ __vunmap(addr, 1);
+}
+EXPORT_SYMBOL(vunmap_pages);
+
+/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
* @count: number of pages to map
@@ -423,32 +468,63 @@ void *vmap(struct page **pages, unsigned
}
EXPORT_SYMBOL(vmap);

+/**
+ * vmap_pages - map an array of pages into virtually contiguous space
+ * @pages: array of page pointers
+ * @count: number of pages to map
+ * @flags: vm_area->flags
+ * @prot: page protection for the mapping
+ * @gfp_mask: flags for the page level allocator
+ *
+ * Maps @count pages from @pages into contiguous kernel virtual
+ * space taking a reference to each page and keeping track of all
+ * the pages within the vm area structure.
+ */
+void *vmap_pages(struct page **pages, unsigned int count,
+ unsigned long flags, pgprot_t prot, gfp_t gfp_mask)
+{
+ struct vm_struct *area;
+ struct page **pgp;
+ int i;
+
+ if (count > num_physpages)
+ return NULL;
+
+ area = get_vm_area((count << PAGE_SHIFT), flags);
+ if (!area)
+ return NULL;
+ if (vm_area_alloc_pagearray(area, gfp_mask, count, -1))
+ return NULL;
+
+ /* map_vm_area modifies pgp */
+ pgp = pages;
+ if (map_vm_area(area, prot, &pgp)) {
+ vunmap(area->addr);
+ return NULL;
+ }
+ /*
+ * now that the region is mapped, take a reference to each
+ * page and store them in the area page array.
+ */
+ for (i = 0; i < area->nr_pages; i++) {
+ get_page(pages[i]);
+ area->pages[i] = pages[i];
+ }
+
+ return area->addr;
+}
+EXPORT_SYMBOL(vmap_pages);
+
void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
pgprot_t prot, int node)
{
struct page **pages;
- unsigned int nr_pages, array_size, i;
+ unsigned int nr_pages;
+ int i;

nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
- array_size = (nr_pages * sizeof(struct page *));
-
- area->nr_pages = nr_pages;
- /* Please note that the recursion is strictly bounded. */
- if (array_size > PAGE_SIZE) {
- pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
- PAGE_KERNEL, node);
- area->flags |= VM_VPAGES;
- } else {
- pages = kmalloc_node(array_size,
- (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
- node);
- }
- area->pages = pages;
- if (!area->pages) {
- remove_vm_area(area->addr);
- kfree(area);
+ if (vm_area_alloc_pagearray(area, gfp_mask, nr_pages, node))
return NULL;
- }

for (i = 0; i < area->nr_pages; i++) {
if (node < 0)
@@ -462,6 +538,8 @@ void *__vmalloc_area_node(struct vm_stru
}
}

+ /* map_vm_area modifies pages */
+ pages = area->pages;
if (map_vm_area(area, prot, &pages))
goto fail;
return area->addr;