2004-04-08 13:41:49

by Hugh Dickins

Subject: rmap: parisc __flush_dcache_page

Something to notice about that parisc __flush_dcache_page I sent you:
there's no locking around searching the tree for vmas; there was never
any locking around searching the list for vmas. arm is similar, but
at least has no CONFIG_SMP, just a preemption issue. Any ideas?

Thanks,
Hugh


2004-04-08 13:53:04

by James Bottomley

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, 2004-04-08 at 08:41, Hugh Dickins wrote:
> Something to notice about that parisc __flush_dcache_page I sent you:
> there's no locking around searching the tree for vmas; there was never
> any locking around searching the list for vmas. arm is similar, but
> at least has no CONFIG_SMP, just a preemption issue. Any ideas?

I don't think you sent it to the parisc list?

I'm afraid we've just been pretty heavily updating flush_dcache_page
recently to fill a number of holes in the implementation.

As far as list traversal goes...we don't require the list to freeze:
accidentally flushing dead vmas would be harmless and added ones wouldn't
need flushing, so all we need would probably be a safe traversal and a
reference to prevent the vma being deallocated.

James


2004-04-08 14:16:50

by Hugh Dickins

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On 8 Apr 2004, James Bottomley wrote:
> On Thu, 2004-04-08 at 08:41, Hugh Dickins wrote:
> > Something to notice about that parisc __flush_dcache_page I sent you:
> > there's no locking around searching the tree for vmas; there was never
> > any locking around searching the list for vmas. arm is similar, but
> > at least has no CONFIG_SMP, just a preemption issue. Any ideas?
>
> I don't think you sent it to the parisc list?

That's right, at present it's just something in Andrea's -aa tree
and Martin's -mjb tree. Will try to remember to copy maintainers
when sending to Andrew. But the problem was there before the patch.

> I'm afraid we've just been pretty heavily updating flush_dcache_page
> recently to fill a number of holes in the implementation.

Don't be afraid, that's good! Is it still going vertically down
i_mmap_shared and i_mmap? Whereas it's only interested in vmas of
the one mm, so could go horizontally along it. Just an option to
play with, but I don't believe it solves anything, just as unsafe
when threaded.

> As far as list traversal goes...we don't require the list to freeze:
> accidentally flushing dead vmas would be harmless and added ones wouldn't
> need flushing,

Yes.

> so all we need would probably be a safe traversal and a
> reference to prevent the vma being deallocated.

Which we're not giving you at all at present. I guess another layer
of spinlocking/nopreemption, for parisc and arm, dissolving to nothing
on other arches.

Hugh

2004-04-08 15:02:33

by James Bottomley

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, 2004-04-08 at 09:16, Hugh Dickins wrote:
> Don't be afraid, that's good! Is it still going vertically down
> i_mmap_shared and i_mmap? Whereas it's only interested in vmas of
> the one mm, so could go horizontally along it. Just an option to
> play with, but I don't believe it solves anything, just as unsafe
> when threaded.

Yes, I've attached the current code. I'm afraid we have space
separation in our caches and tlbs, so we're interested in all mm's not
just the current one (that was the bug that just got fixed).

> Which we're not giving you at all at present. I guess another layer
> of spinlocking/nopreemption, for parisc and arm, dissolving to nothing
> on other arches.

Whatever seems most convenient that won't impact the flushing fast path,
I suppose. It's one of the hottest paths in the system since all data
transfers go through it for user visibility.

James

void __flush_dcache_page(struct page *page)
{
	struct mm_struct *mm = current->active_mm;
	struct list_head *l;
	struct vm_area_struct *anyvma = NULL;

	flush_kernel_dcache_page(page_address(page));

	if (!page->mapping)
		return;

	/* We have ensured in arch_get_unmapped_area() that all shared
	 * mappings are mapped at equivalent addresses, so we only need
	 * to flush one for them all to become coherent */
	list_for_each(l, &page->mapping->i_mmap_shared) {
		struct vm_area_struct *mpnt;
		unsigned long off, addr;

		mpnt = list_entry(l, struct vm_area_struct, shared);

		if (page->index < mpnt->vm_pgoff)
			continue;

		off = page->index - mpnt->vm_pgoff;
		if (off >= (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT)
			continue;

		addr = mpnt->vm_start + (off << PAGE_SHIFT);

		/* Flush instructions produce non-access tlb misses.
		 * On PA, we nullify these instructions rather than
		 * taking a page fault if the pte doesn't exist, so we
		 * have to find a congruent address with an existing
		 * translation */
		if (!translation_exists(mpnt, addr))
			continue;

		anyvma = mpnt;

		/*
		 * We try first to find a page in our current user process
		 */
		if (mpnt->vm_mm != mm)
			continue;

		__flush_cache_page(mpnt, addr);

		/* All user shared mappings should be equivalently mapped,
		 * so once we've flushed one we should be ok
		 */
		goto flush_unshared;
	}

	/* OK, shared page but not in our current process' address space */
	if (anyvma) {
		unsigned long addr = anyvma->vm_start
			+ ((page->index - anyvma->vm_pgoff) << PAGE_SHIFT);
		__flush_cache_page(anyvma, addr);
	}

flush_unshared:

	/* Private mappings will not have congruent addresses, so we
	 * have to flush each of them individually to make the change
	 * in the kernel page visible */
	list_for_each(l, &page->mapping->i_mmap) {
		struct vm_area_struct *mpnt;
		unsigned long off, addr;

		mpnt = list_entry(l, struct vm_area_struct, shared);

		if (page->index < mpnt->vm_pgoff)
			continue;

		off = page->index - mpnt->vm_pgoff;
		if (off >= (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT)
			continue;

		addr = mpnt->vm_start + (off << PAGE_SHIFT);

		/* This is just for speed.  If the page translation isn't
		 * there, there's no point exciting the nadtlb handler into
		 * a nullification frenzy */
		if (!translation_exists(mpnt, addr))
			continue;

		__flush_cache_page(mpnt, addr);
	}
}

2004-04-08 15:14:24

by Andrea Arcangeli

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, Apr 08, 2004 at 09:40:37AM -0500, James Bottomley wrote:
> Whatever seems most convenient that won't impact the flushing fast path,
> I suppose. It's one of the hottest paths in the system since all data
> transfers go through it for user visibility.

you'd need to take a semaphore there to be safe, so it's basically
unfixable since you can't sleep or just trylock.

2004-04-08 15:29:02

by James Bottomley

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, 2004-04-08 at 10:14, Andrea Arcangeli wrote:
> you'd need to take a semaphore there to be safe, so it's basically
> unfixable since you can't sleep or just trylock.

That's a bit of an unhelpful suggestion.

flush_dcache_page() exists to support coherency problems with virtual
aliasing and a feature of that is that you have to flush every
inequivalent user address which might be cached, hence the need for list
traversal.

Exactly why wouldn't a simple spinlock to protect page->mapping work? I
know we don't want to bloat struct page, but such a thing could go in
struct address_space?

James


2004-04-08 15:35:29

by Hugh Dickins

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, 8 Apr 2004, Andrea Arcangeli wrote:
> On Thu, Apr 08, 2004 at 09:40:37AM -0500, James Bottomley wrote:
> > Whatever seems most convenient that won't impact the flushing fast path,
> > I suppose. It's one of the hottest paths in the system since all data
> > transfers go through it for user visibility.
>
> you'd need to take a semaphore there to be safe, so it's basically
> unfixable since you can't sleep or just trylock.

It's not fixable via the i_shared_sem, but we can add another layer
of spin_lock around the i_mmap* list/tree manipulations, one that
preprocesses away to nothing on all arches but parisc and arm, and
is used in their __flush_dcache_page. *Not* the page_table_lock ;)
Unappealing, but a lot better than leaving them unsafe.
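
For concreteness, a sketch only: the config test and the lock name are
purely illustrative (nothing like this is in any posted patch), and
i_mmap_flush_lock would be a new spinlock_t in struct address_space,
compiled in only on the arches that need it.

/* Per-arch layer: a real spinlock only where __flush_dcache_page
 * has to walk the i_mmap lists, a no-op everywhere else. */
#if defined(CONFIG_PARISC) || defined(CONFIG_ARM)
#define flush_list_lock(mapping) \
		spin_lock(&(mapping)->i_mmap_flush_lock)
#define flush_list_unlock(mapping) \
		spin_unlock(&(mapping)->i_mmap_flush_lock)
#else
#define flush_list_lock(mapping)	do { } while (0)
#define flush_list_unlock(mapping)	do { } while (0)
#endif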

Hugh

2004-04-08 15:34:18

by Andrea Arcangeli

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, Apr 08, 2004 at 10:28:44AM -0500, James Bottomley wrote:
> Exactly why wouldn't a simple spinlock to protect page->mapping work? I
> know we don't want to bloat struct page, but such a thing could go in
> struct address_space?

yes, the spinlock in struct address_space would be enough, and that's
what 2.4 does too; Andrew changed it to a semaphore in 2.6 but it can be
made a spinlock again. Then you can fix it (as long as you never call it
from an irq and as long as you don't generate exceptions inside the
critical section, but I'm sure you don't).

2004-04-08 15:47:29

by James Bottomley

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, 2004-04-08 at 10:34, Andrea Arcangeli wrote:
> yes, the spinlock in struct address_space would be enough, and that's
> what 2.4 does too; Andrew changed it to a semaphore in 2.6 but it can be
> made a spinlock again. Then you can fix it (as long as you never call it
> from an irq and as long as you don't generate exceptions inside the
> critical section, but I'm sure you don't).

Well, yes, of course we do. We're a software tlb arch, so we generate
exceptions on tlb misses, which can occur anywhere (even in critical
sections). However, the exceptions are carefully crafted not to take
spinlocks, so everything should be safe.

I'm not sure about the never-from-irq assertion. The biggest use of
flush_dcache_page is on the I/O return path ... that looks like a good
candidate for being in interrupt (even though most drivers should be
offloading to softirq/tasklets).

James




2004-04-08 16:13:25

by Andrea Arcangeli

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, Apr 08, 2004 at 04:35:12PM +0100, Hugh Dickins wrote:
> It's not fixable via the i_shared_sem, but we can add another layer

I meant it's unfixable unless we change the VM common code.

> of spin_lock around the i_mmap* list/tree manipulations, one that
> preprocesses away to nothing on all arches but parisc and arm, and
> is used in their __flush_dcache_page. *Not* the page_table_lock ;)

I'd prefer to use only a spinlock rather than carry around two overlapping
locks; the need_resched() check is needed anyway, even with preempt, in
the really costly paths.

2004-04-08 16:16:14

by Andrea Arcangeli

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, Apr 08, 2004 at 10:47:23AM -0500, James Bottomley wrote:
> candidate for being in interrupt (even though most drivers should be
> offloading to softirq/tasklets).

softirq/tasklets would be unsafe too. Oh well, if you really take it
from irq context (irq/softirq/tasklet) then just a spinlock isn't
enough: it'd need to be an irq-safe lock or something similar, plus you
must be sure never to generate exceptions triggering the call inside the
critical section. Sounds like we need some per-arch abstraction to cover
this; we for sure don't want an irq-safe spinlock for everyone, so we
may as well leave the semaphore for all archs but parisc.
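
To make the difference concrete, a sketch (the i_shared_lock name here
is hypothetical, standing in for whatever lock would replace
i_shared_sem):

/* plain version: deadlocks if flush_dcache_page() runs from an irq
 * arriving while this same cpu already holds the lock */
static void walk_mappings(struct address_space *mapping)
{
	spin_lock(&mapping->i_shared_lock);
	/* ... walk i_mmap / i_mmap_shared ... */
	spin_unlock(&mapping->i_shared_lock);
}

/* irq-safe version: safe against irq callers, but every taker now
 * pays for disabling interrupts */
static void walk_mappings_irqsafe(struct address_space *mapping)
{
	unsigned long flags;

	spin_lock_irqsave(&mapping->i_shared_lock, flags);
	/* ... walk i_mmap / i_mmap_shared ... */
	spin_unlock_irqrestore(&mapping->i_shared_lock, flags);
}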

2004-04-08 16:29:57

by James Bottomley

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, 2004-04-08 at 11:16, Andrea Arcangeli wrote:
> softirq/tasklets would be unsafe too. Oh well, if you really take it
> from irq context (irq/softirq/tasklet) then just a spinlock isn't
> enough: it'd need to be an irq-safe lock or something similar, plus you
> must be sure never to generate exceptions triggering the call inside the
> critical section. Sounds like we need some per-arch abstraction to cover
> this; we for sure don't want an irq-safe spinlock for everyone, so we
> may as well leave the semaphore for all archs but parisc.

Erm, well, I think this is a global problem. All VI archs have to use
the flush_ APIs in cachetlb.txt to ensure coherence. It's just that
sparc seems to have some nice cache manipulation instructions that
relieve it of the necessity of traversing the mappings.

Why don't we want an irq safe spinlock? As Hugh said, we'd abstract it
so as to be a nop on PI archs.

James


2004-04-08 17:10:24

by Andrea Arcangeli

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, Apr 08, 2004 at 11:29:50AM -0500, James Bottomley wrote:
> On Thu, 2004-04-08 at 11:16, Andrea Arcangeli wrote:
> > softirq/tasklets would be unsafe too. Oh well, if you really take it
> > from irq context (irq/softirq/tasklet) then just a spinlock isn't
> > enough: it'd need to be an irq-safe lock or something similar, plus you
> > must be sure never to generate exceptions triggering the call inside the
> > critical section. Sounds like we need some per-arch abstraction to cover
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > this; we for sure don't want an irq-safe spinlock for everyone, so we
> > may as well leave the semaphore for all archs but parisc.
>
> Erm, well, I think this is a global problem. All VI archs have to use
> the flush_ APIs in cachetlb.txt to ensure coherence. It's just that
> sparc seems to have some nice cache manipulation instructions that
> relieve it of the necessity of traversing the mappings.
>
> Why don't we want an irq safe spinlock? As Hugh said, we'd abstract it
> so as to be a nop on PI archs.

I said above per-arch abstraction; a per-arch abstraction isn't an
irq-safe spinlock. We cannot add an irq-safe spinlock there, it'd be too
costly for all the common archs that don't need to walk those lists
(actually trees in my -aa tree) from irq context.

If you implement the locking abstraction with an irq-safe spinlock,
that's your own business then.

2004-04-08 17:44:00

by James Bottomley

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, 2004-04-08 at 12:10, Andrea Arcangeli wrote:
> I said above per-arch abstraction; a per-arch abstraction isn't an
> irq-safe spinlock. We cannot add an irq-safe spinlock there, it'd be too
> costly for all the common archs that don't need to walk those lists
> (actually trees in my -aa tree) from irq context.

I think we agree on the abstraction thing. I was more wondering what
you thought was so costly about an irq-safe spinlock as opposed to an
ordinary one? Is there something adding to this cost I don't know
about? I.e. should we be thinking about something like an RCU or
phased-tree approach to walking the mapping lists?

James


2004-04-08 17:52:14

by Andrea Arcangeli

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, Apr 08, 2004 at 12:43:45PM -0500, James Bottomley wrote:
> On Thu, 2004-04-08 at 12:10, Andrea Arcangeli wrote:
> > I said above per-arch abstraction; a per-arch abstraction isn't an
> > irq-safe spinlock. We cannot add an irq-safe spinlock there, it'd be too
> > costly for all the common archs that don't need to walk those lists
> > (actually trees in my -aa tree) from irq context.
>
> I think we agree on the abstraction thing. I was more wondering what
> you thought was so costly about an irq-safe spinlock as opposed to an
> ordinary one? Is there something adding to this cost I don't know
> about? I.e. should we be thinking about something like an RCU or
> phased-tree approach to walking the mapping lists?

that path can take as long as a timeslice to run, and not taking
interrupts for a whole scheduler timeslice is pretty bad.

Note that the data structure will become a tree soon, but a prio-tree;
walking it locklessly with RCU sounds very tricky to me, but it may be
doable. For the short term I doubt you want the RCU prio-tree, I guess
you want to stabilize the kernel first with the irq-safe spinlock, then
you can try to hack on the prio-tree to read it in a lockless fashion.
If you can make the reading lockless we can give up the abstraction too,
since we can make all archs walk it locklessly. But warning: freeing
vmas in rcu callbacks means freeing the mm in rcu callbacks, which then
means freeing the pgd in rcu callbacks; the whole mm layer will collapse
on you as soon as you try to read that tree without any locking, only
the inode will still be there as long as you've a reference on the page
(and as long as you don't use nonlinear :-/ ).

2004-04-08 18:07:44

by James Bottomley

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, 2004-04-08 at 12:51, Andrea Arcangeli wrote:
> that path can take as long as a timeslice to run, and not taking
> interrupts for a whole scheduler timeslice is pretty bad.

OK, now I'm confused. Where's the whole timeslice bit coming from? The
parisc flush_dcache_page code had better not take a timeslice to
execute, otherwise we're in very serious performance trouble.

Does it take as long as a timeslice to do mmap[_shared] list insertion?

James


2004-04-08 18:18:42

by Andrea Arcangeli

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, Apr 08, 2004 at 01:07:31PM -0500, James Bottomley wrote:
> On Thu, 2004-04-08 at 12:51, Andrea Arcangeli wrote:
> > that path can take as long as a timeslice to run, and not taking
> > interrupts for a whole scheduler timeslice is pretty bad.
>
> OK, now I'm confused. Where's the whole timeslice bit coming from? The
> parisc flush_dcache_page code had better not take a timeslice to
> execute, otherwise we're in very serious performance trouble.
>
> Does it take as long as a timeslice to do mmap[_shared] list insertion?

it entirely depends on the workload. On a desktop machine there may be
only some hundred entries in those lists at maximum, with glibc being the
biggest offender:

cat /proc/*/maps | grep libc.so.6 | wc -l

with shared memory on some server there can easily be several thousand
entries for some inode even on 64bit, but a timeslice was probably
exaggerated (the timeslice was for the walking of the ptes in each
mapping too; I don't think you need to look at every pte).

2004-04-08 18:28:26

by James Bottomley

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, 2004-04-08 at 13:18, Andrea Arcangeli wrote:
> it entirely depends on the workload. On a desktop machine there may be
> only some hundred entries in those lists at maximum, with glibc being the
> biggest offender:
>
> cat /proc/*/maps | grep libc.so.6 | wc -l
>
> with shared memory on some server there can easily be several thousand
> entries for some inode even on 64bit, but a timeslice was probably
> exaggerated (the timeslice was for the walking of the ptes in each
> mapping too; I don't think you need to look at every pte).

So you're worried about our code? OK, if you look, you'll see we only
have to flush one address in the mmap_shared list (which is usually the
long list).

I'd constructed it on the predicate that flushing a non-current space is
more expensive than finding a current one, but I can alter it to flush
the first vma it comes to with a present translation, as sketched below.
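
Something like this, in place of the first loop in the function I
posted earlier (a sketch against that code, untested):

	list_for_each(l, &page->mapping->i_mmap_shared) {
		struct vm_area_struct *mpnt;
		unsigned long off, addr;

		mpnt = list_entry(l, struct vm_area_struct, shared);
		if (page->index < mpnt->vm_pgoff)
			continue;
		off = page->index - mpnt->vm_pgoff;
		if (off >= (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT)
			continue;
		addr = mpnt->vm_start + (off << PAGE_SHIFT);
		if (!translation_exists(mpnt, addr))
			continue;
		/* all shared mappings are equivalently mapped: the first
		 * live translation is as good as any, current mm or not */
		__flush_cache_page(mpnt, addr);
		break;
	}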

The mmap list is usually empty. We only excite that case for multiple
private mappings of a file which for some reason gets updated.

I'd be very surprised if flush_dcache_page executes more than a few
hundred instructions all told...that's certainly nowhere close to a
timeslice.

James


2004-04-08 18:42:50

by Andrea Arcangeli

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, Apr 08, 2004 at 01:28:17PM -0500, James Bottomley wrote:
> So you're worried about our code? OK, if you look, you'll see we only
> have to flush one address in the mmap_shared list (which is usually the
> long list).

if you need to flush just one address, the prio-tree may give you a
huge boost; overall flush_dcache_page should become pretty quick in most
situations.

> I'd be very surprised if flush_dcache_page executes more than a few
> hundred instructions all told...that's certainly nowhere close to a
> timeslice.

What you miss is that the problem is not in flush_dcache_page; the
problem is that the _other_ users of the prio-tree may take as long as a
timeslice. So it's the _other_ users that you've no control over (i.e.
truncate) that may run for timeslices with irqs disabled.

But I've a fairly optimal solution for you: you should make it a
read-write spinlock, with the readers not disabling interrupts and the
writer disabling interrupts. The writer of the prio-tree will not take a
timeslice; the readers instead may take a timeslice, but since they're
readers, and you've only to read in the flush_dcache_page irq context,
you don't need to disable irqs for the readers. I don't have better
solutions than this one at the moment (yeah, there's the rcu reading of
the prio-tree but I'd leave it for later...)
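
In kernel terms the scheme looks like this (a sketch: assume a
hypothetical rwlock_t i_shared_rwlock field in struct address_space,
and the function names are illustrative only):

/* readers (e.g. flush_dcache_page) may run from irq context;
 * readers never exclude other readers, so irqs can stay enabled */
static void mapping_walk(struct address_space *mapping)
{
	read_lock(&mapping->i_shared_rwlock);
	/* ... walk the i_mmap lists / prio-tree ... */
	read_unlock(&mapping->i_shared_rwlock);
}

/* the writer (vma insert/remove) must disable irqs: an irq reader
 * on this same cpu would otherwise spin forever on the held lock */
static void mapping_update(struct address_space *mapping)
{
	unsigned long flags;

	write_lock_irqsave(&mapping->i_shared_rwlock, flags);
	/* ... modify the i_mmap lists / prio-tree ... */
	write_unlock_irqrestore(&mapping->i_shared_rwlock, flags);
}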

2004-04-08 18:50:13

by James Bottomley

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, 2004-04-08 at 13:42, Andrea Arcangeli wrote:
> What you miss is that the problem is not in flush_dcache_page; the
> problem is that the _other_ users of the prio-tree may take as long as a
> timeslice. So it's the _other_ users that you've no control over (i.e.
> truncate) that may run for timeslices with irqs disabled.

So the problem isn't in the parisc code, it's in the generic vm code,
OK.

> But I've a fairly optimal solution for you: you should make it a
> read-write spinlock, with the readers not disabling interrupts and the
> writer disabling interrupts. The writer of the prio-tree will not take a
> timeslice; the readers instead may take a timeslice, but since they're
> readers, and you've only to read in the flush_dcache_page irq context,
> you don't need to disable irqs for the readers. I don't have better
> solutions than this one at the moment (yeah, there's the rcu reading of
> the prio-tree but I'd leave it for later...)

Yes, I'll go for that. The write need only be done on vma insert, which
should be very fast. So do we agree this is a generic solution, or were
you still thinking of trying to abstract it per-arch?

James


2004-04-08 19:02:39

by Andrea Arcangeli

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, Apr 08, 2004 at 01:49:55PM -0500, James Bottomley wrote:
> Yes, I'll go for that. The write need only be done on vma insert, which
> should be very fast. So do we agree this is a generic solution, or were
> you still thinking of trying to abstract it per-arch?

I'm unsure; the semaphore simplifies a lot the need_resched() handling
in the vmtruncate/zap_pte path, plus it avoids wasting time in the other
cpus while vmtruncate is working on it (potentially for more than a
timeslice).

btw, I already considered making the semaphore a rw semaphore (note, not
a rw spinlock) to boost the scalability of the paging too, but OTOH the
paging has such a small critical section under the lock (objrmap) that I
wasn't sure it would pay off; the biggest cost will still be the
bouncing of the cacheline. So I gave up on the idea of making it a
rwsem and decided the plain semaphore was ideal, as Andrew had told me a
few days before I had the rwsem idea. Plus concurrent truncates aren't
worth optimizing since they're serialized by i_sem in the first place.

Ideally it should be a semaphore for all archs except the ones that
need to walk it from irqs, which want a rw spinlock.
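
Something like this (all names hypothetical, just to show the shape of
the per-arch split; readers side only, the writer side is analogous,
with write_lock_irqsave() in the rwlock case):

#ifdef ARCH_WALKS_MMAP_FROM_IRQ			/* e.g. parisc */
typedef rwlock_t i_shared_lock_t;
#define i_shared_read_lock(m)	read_lock(&(m)->i_shared_lock)
#define i_shared_read_unlock(m)	read_unlock(&(m)->i_shared_lock)
#else						/* everyone else can sleep */
typedef struct semaphore i_shared_lock_t;
#define i_shared_read_lock(m)	down(&(m)->i_shared_lock)
#define i_shared_read_unlock(m)	up(&(m)->i_shared_lock)
#endif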

2004-04-10 01:21:45

by Paul E. McKenney

Subject: Re: [parisc-linux] rmap: parisc __flush_dcache_page

On Thu, Apr 08, 2004 at 08:42:45PM +0200, Andrea Arcangeli wrote:
> But I've a fairly optimal solution for you: you should make it a
> read-write spinlock, with the readers not disabling interrupts and the
> writer disabling interrupts. The writer of the prio-tree will not take a
> timeslice; the readers instead may take a timeslice, but since they're
> readers, and you've only to read in the flush_dcache_page irq context,
> you don't need to disable irqs for the readers. I don't have better
> solutions than this one at the moment (yeah, there's the rcu reading of
> the prio-tree but I'd leave it for later...)

FWIW, agreed. Past attempts at RCU-based tree algorithms have been
a bit on the complex side. While I believe that simpler versions are
possible, RCU-based trees should be approached with caution and with
long lead times.

Thanx, Paul