2004-03-03 07:09:01

by Andrea Arcangeli

[permalink] [raw]
Subject: 230-objrmap fixes for 2.6.3-mjb2

While merging 230-objrmap in my tree I spotted 2 bugs potentially
generating random memory corruption and 1 superflous bit that I dropped
(mostly for documentation reasons, I like strict and in turn self
documenting). Here below the fixes.

in the first file we needs the page_table_lock while changing the rbtree
etc... both the page_table_lock and the down_write must be held during
all writes, so the reader can choose between a down_read or a spin_lock.

The second one is a bug in mainline 2.6 too apparently, maybe I'm
missing something but I don't see how you prevent the vm to swapout a
reserved region without my fix. A reserved region must not be messed
from the vm since it's a dma hardware region that we page lazily instead
of using PG_reserved (or MMIO) + remap_page_range. It's different from
VM_LOCKED so you can't clear that bit IIRC but that's the same,
VM_LOCKED == VM_RESERVED in VM terms. As said I believe you inherit
this bug from mainline 2.6 (2.4 has always been safe instead).

The third is a superflous down_read, it's not needed because the
page_table_lock is held during the call and it seems not to need to drop
it to schedule (and either we use the spinlock or the semaphore, both
doesn't make much sense for a reader).

Please double check, thanks.

I'm running some shm swap regression test on this right now and I'll
leave it running for a day. In a few hours I will proceed starting
dropping the pte_chain from the page sturcture and then I'll test the
anon swapout. I will also follow the 6 great-effort anobjrmap posted by
Hugh against objrmap while doing that, they're quite old (almost 1 year)
but they still apply cleanly by hand so they're useful.

--- sles-objrmap/mm/mmap.c.~1~ 2004-03-03 06:45:38.980596736 +0100
+++ sles-objrmap/mm/mmap.c 2004-03-03 06:53:46.945414808 +0100
@@ -1284,8 +1284,8 @@ int do_munmap(struct mm_struct *mm, unsi
/*
* Remove the vma's, and unmap the actual pages
*/
- detach_vmas_to_be_unmapped(mm, mpnt, prev, end);
spin_lock(&mm->page_table_lock);
+ detach_vmas_to_be_unmapped(mm, mpnt, prev, end);
unmap_region(mm, mpnt, prev, start, end);
spin_unlock(&mm->page_table_lock);

--- sles-objrmap/mm/rmap.c.~1~ 2004-03-03 06:45:38.995594456 +0100
+++ sles-objrmap/mm/rmap.c 2004-03-03 07:01:39.200621104 +0100
@@ -470,7 +470,7 @@ try_to_unmap_obj_one(struct vm_area_stru
if (!pte)
goto out;

- if (vma->vm_flags & VM_LOCKED) {
+ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
ret = SWAP_FAIL;
goto out_unmap;
}
--- sles-objrmap/mm/swapfile.c.~1~ 2004-03-03 06:45:39.023590200 +0100
+++ sles-objrmap/mm/swapfile.c 2004-03-03 07:03:33.128301464 +0100
@@ -499,7 +499,6 @@ static int unuse_process(struct mm_struc
/*
* Go through process' page directory.
*/
- down_read(&mm->mmap_sem);
spin_lock(&mm->page_table_lock);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
pgd_t * pgd = pgd_offset(mm, vma->vm_start);
@@ -507,7 +506,6 @@ static int unuse_process(struct mm_struc
break;
}
spin_unlock(&mm->page_table_lock);
- up_read(&mm->mmap_sem);
pte_chain_free(pte_chain);
return 0;
}



About 2.5:1.5 it seems not everybody is happy to lose 512m (and it's not
Oracle), but before ruling it out I'd like to get some real life number,
to be sure the performance of 2.0^W4:4 are really close (if not
"better") than 3:1 as someone said. If we go with 4:4 IMHO at the very
least the vgettimeofday backport from x86-64 is a must. In the meantime
I keep going with the rmap removal to fixup the fork and to get back
the 128m of normal zone useful on the 32G boxes. Could be also that new
cpus are a lot better at reloading the tlbs from the pagetables dunno,
the first numbers I recall about 4:4 predates to 2000 when PII was quite
optimal. I'd only like to see an opteron and a xeon dealing with 4:4.


2004-03-03 10:59:49

by Andrew Morton

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

Andrea Arcangeli <[email protected]> wrote:
>
> --- sles-objrmap/mm/rmap.c.~1~ 2004-03-03 06:45:38.995594456 +0100
> +++ sles-objrmap/mm/rmap.c 2004-03-03 07:01:39.200621104 +0100
> @@ -470,7 +470,7 @@ try_to_unmap_obj_one(struct vm_area_stru
> if (!pte)
> goto out;
>
> - if (vma->vm_flags & VM_LOCKED) {
> + if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
> ret = SWAP_FAIL;
> goto out_unmap;

I keep on wanting to put that in there too. But pages in a VM_RESERVED vma
should not find their way onto the LRU. Maybe we should be checking for
that in do_no_page().


2004-03-03 15:50:12

by Martin J. Bligh

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

--Andrew Morton <[email protected]> wrote (on Wednesday, March 03, 2004 02:58:20 -0800):

> Andrea Arcangeli <[email protected]> wrote:
>>
>> --- sles-objrmap/mm/rmap.c.~1~ 2004-03-03 06:45:38.995594456 +0100
>> +++ sles-objrmap/mm/rmap.c 2004-03-03 07:01:39.200621104 +0100
>> @@ -470,7 +470,7 @@ try_to_unmap_obj_one(struct vm_area_stru
>> if (!pte)
>> goto out;
>>
>> - if (vma->vm_flags & VM_LOCKED) {
>> + if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
>> ret = SWAP_FAIL;
>> goto out_unmap;
>
> I keep on wanting to put that in there too. But pages in a VM_RESERVED vma
> should not find their way onto the LRU. Maybe we should be checking for
> that in do_no_page().

There was talk at one point of moving the "unswappable" state down into
the struct page. Is that still realistic? It would seem rather more
efficient, but I forget what problem we ran into with it.

M.

2004-03-03 16:00:13

by Dave McCracken

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2


--On Wednesday, March 03, 2004 07:46:32 -0800 "Martin J. Bligh"
<[email protected]> wrote:

> There was talk at one point of moving the "unswappable" state down into
> the struct page. Is that still realistic? It would seem rather more
> efficient, but I forget what problem we ran into with it.

It'd mean the page struct would have to have a count of the number of
mlock()ed regions it belongs to, and we'd have to update all the pages each
time we call it.

Dave McCracken

2004-03-03 15:59:54

by Martin J. Bligh

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2



--Andrea Arcangeli <[email protected]> wrote (on Wednesday, March 03, 2004 08:09:33 +0100):

> While merging 230-objrmap in my tree I spotted 2 bugs potentially
> generating random memory corruption and 1 superflous bit that I dropped
> (mostly for documentation reasons, I like strict and in turn self
> documenting). Here below the fixes.

Looks good to me, but Dave is more familiar with this stuff ... Dave?

> I'm running some shm swap regression test on this right now and I'll
> leave it running for a day. In a few hours I will proceed starting
> dropping the pte_chain from the page sturcture and then I'll test the
> anon swapout. I will also follow the 6 great-effort anobjrmap posted by
> Hugh against objrmap while doing that, they're quite old (almost 1 year)
> but they still apply cleanly by hand so they're useful.

Bill has been keeping that up to date - he may have some more recent
changes somewhere?

> About 2.5:1.5 it seems not everybody is happy to lose 512m (and it's not
> Oracle), but before ruling it out I'd like to get some real life number,
> to be sure the performance of 2.0^W4:4 are really close (if not
> "better") than 3:1 as someone said. If we go with 4:4 IMHO at the very
> least the vgettimeofday backport from x86-64 is a must. In the meantime

John has that going - there's a copy in 2.6.3-mjb1, but it has a problem
at the moment on some boxes, which he's trying to fix up. I've been
pushing for the same thing - fixing the most common syscall will help ;-)
We're trying to get some numbers here, but have had a small setback with
some disk problems - should be this week still though.

> I keep going with the rmap removal to fixup the fork and to get back
> the 128m of normal zone useful on the 32G boxes. Could be also that new
> cpus are a lot better at reloading the tlbs from the pagetables dunno,
> the first numbers I recall about 4:4 predates to 2000 when PII was quite
> optimal. I'd only like to see an opteron and a xeon dealing with 4:4.

Makes sense. But why would you want 4:4 on opteron? I'm not sure we care
about the perf for anyone nutty enough to run a 32 bit kernel there ;-)

M.

2004-03-03 16:08:09

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Wed, Mar 03, 2004 at 09:58:44AM -0600, Dave McCracken wrote:
> It'd mean the page struct would have to have a count of the number of
> mlock()ed regions it belongs to, and we'd have to update all the pages each
> time we call it.

That would add another atomic_t to struct pages..

But if we did it it would help some xfs fixes I'm doing currently a whole lot

2004-03-03 16:13:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Wed, Mar 03, 2004 at 07:54:39AM -0800, Martin J. Bligh wrote:
> Makes sense. But why would you want 4:4 on opteron? I'm not sure we care
> about the perf for anyone nutty enough to run a 32 bit kernel there ;-)

it would be purerly a theorical benchmark, to see if the current state
of the art technology is hurted more or less by 4:4. If it's hurted less
it's reasonable to assume newer cpus will be hurted less too. However
we'll probably deal with only with a few more generation of 32bit cpus
and it will be only from intel so I agree with you the xeon test is the
most interesting for this.

2004-03-03 16:57:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Wed, Mar 03, 2004 at 07:46:32AM -0800, Martin J. Bligh wrote:
> --Andrew Morton <[email protected]> wrote (on Wednesday, March 03, 2004 02:58:20 -0800):
>
> > Andrea Arcangeli <[email protected]> wrote:
> >>
> >> --- sles-objrmap/mm/rmap.c.~1~ 2004-03-03 06:45:38.995594456 +0100
> >> +++ sles-objrmap/mm/rmap.c 2004-03-03 07:01:39.200621104 +0100
> >> @@ -470,7 +470,7 @@ try_to_unmap_obj_one(struct vm_area_stru
> >> if (!pte)
> >> goto out;
> >>
> >> - if (vma->vm_flags & VM_LOCKED) {
> >> + if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
> >> ret = SWAP_FAIL;
> >> goto out_unmap;
> >
> > I keep on wanting to put that in there too. But pages in a VM_RESERVED vma
> > should not find their way onto the LRU. Maybe we should be checking for
> > that in do_no_page().
>
> There was talk at one point of moving the "unswappable" state down into
> the struct page. Is that still realistic? It would seem rather more
> efficient, but I forget what problem we ran into with it.

that already exists and it's PG_reserved, but it's inefficient compared
to VM_RESERVED, since it forces the vm to check all ptes.

2004-03-03 17:05:46

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Wed, Mar 03, 2004 at 02:58:20AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > --- sles-objrmap/mm/rmap.c.~1~ 2004-03-03 06:45:38.995594456 +0100
> > +++ sles-objrmap/mm/rmap.c 2004-03-03 07:01:39.200621104 +0100
> > @@ -470,7 +470,7 @@ try_to_unmap_obj_one(struct vm_area_stru
> > if (!pte)
> > goto out;
> >
> > - if (vma->vm_flags & VM_LOCKED) {
> > + if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
> > ret = SWAP_FAIL;
> > goto out_unmap;
>
> I keep on wanting to put that in there too. But pages in a VM_RESERVED vma
> should not find their way onto the LRU. Maybe we should be checking for
> that in do_no_page().

you're right, so we can turn this into a BUG_ON, I prefer it here, since
this is really the place that VM_RESERVED is mean to guard against (plus
the vma merging).

thanks for the explanation.

2004-03-03 17:08:52

by Dave McCracken

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2


--On Wednesday, March 03, 2004 17:57:46 +0100 Andrea Arcangeli
<[email protected]> wrote:

>> There was talk at one point of moving the "unswappable" state down into
>> the struct page. Is that still realistic? It would seem rather more
>> efficient, but I forget what problem we ran into with it.
>
> that already exists and it's PG_reserved, but it's inefficient compared
> to VM_RESERVED, since it forces the vm to check all ptes.

What we've actually discussed before was more along the lines of PG_locked
to match VM_LOCKED, but the main idea was to be able to skip over
ineligible pages without having ot look up their mappings during pageout.

Dave McCracken

2004-03-03 18:38:23

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Wed, Mar 03, 2004 at 11:07:39AM -0600, Dave McCracken wrote:
>
> --On Wednesday, March 03, 2004 17:57:46 +0100 Andrea Arcangeli
> <[email protected]> wrote:
>
> >> There was talk at one point of moving the "unswappable" state down into
> >> the struct page. Is that still realistic? It would seem rather more
> >> efficient, but I forget what problem we ran into with it.
> >
> > that already exists and it's PG_reserved, but it's inefficient compared
> > to VM_RESERVED, since it forces the vm to check all ptes.
>
> What we've actually discussed before was more along the lines of PG_locked
> to match VM_LOCKED, but the main idea was to be able to skip over
> ineligible pages without having ot look up their mappings during pageout.

I'm not very excited using PG_locked for that, that doesn't only mean
"unswappable", it means also that the page is under I/O (or uner some
other operation that needs serialization like unmapping the page) which
is quite a different concept from VM_LOCKED. a wait_on_page would
deadlock on such a PG_locked page, while wait_on_page on a page of a
mlocked vma doesn't normally deadlock.

But I see what you want to do here, and PG_reserved would be too much
for that since it changes the semantics for cow and free_pages, but
PG_locked doesn't look good enough either, the VM_LOCKED and PG_locked
concept is quite different, PG_reserved and VM_RESERVED is quite close
instead (except for the page->count accounting).

2004-03-03 18:44:40

by Dave McCracken

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2


--On Wednesday, March 03, 2004 19:39:01 +0100 Andrea Arcangeli
<[email protected]> wrote:

>> What we've actually discussed before was more along the lines of
>> PG_locked to match VM_LOCKED, but the main idea was to be able to skip
>> over ineligible pages without having ot look up their mappings during
>> pageout.
>
> I'm not very excited using PG_locked for that, that doesn't only mean
> "unswappable", it means also that the page is under I/O (or uner some
> other operation that needs serialization like unmapping the page) which
> is quite a different concept from VM_LOCKED. a wait_on_page would
> deadlock on such a PG_locked page, while wait_on_page on a page of a
> mlocked vma doesn't normally deadlock.
>
> But I see what you want to do here, and PG_reserved would be too much
> for that since it changes the semantics for cow and free_pages, but
> PG_locked doesn't look good enough either, the VM_LOCKED and PG_locked
> concept is quite different, PG_reserved and VM_RESERVED is quite close
> instead (except for the page->count accounting).

Sorry, I misspoke. I didn't mean the current PG_locked. I meant a new
flag, or more probably a counter, that represents all the VM_LOCKED regions
a page is in.

Dave McCracken

2004-03-03 18:50:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Wed, Mar 03, 2004 at 12:44:07PM -0600, Dave McCracken wrote:
>
> --On Wednesday, March 03, 2004 19:39:01 +0100 Andrea Arcangeli
> <[email protected]> wrote:
>
> >> What we've actually discussed before was more along the lines of
> >> PG_locked to match VM_LOCKED, but the main idea was to be able to skip
> >> over ineligible pages without having ot look up their mappings during
> >> pageout.
> >
> > I'm not very excited using PG_locked for that, that doesn't only mean
> > "unswappable", it means also that the page is under I/O (or uner some
> > other operation that needs serialization like unmapping the page) which
> > is quite a different concept from VM_LOCKED. a wait_on_page would
> > deadlock on such a PG_locked page, while wait_on_page on a page of a
> > mlocked vma doesn't normally deadlock.
> >
> > But I see what you want to do here, and PG_reserved would be too much
> > for that since it changes the semantics for cow and free_pages, but
> > PG_locked doesn't look good enough either, the VM_LOCKED and PG_locked
> > concept is quite different, PG_reserved and VM_RESERVED is quite close
> > instead (except for the page->count accounting).
>
> Sorry, I misspoke. I didn't mean the current PG_locked. I meant a new
> flag, or more probably a counter, that represents all the VM_LOCKED regions
> a page is in.

ok, you used PG_locked that already exists for another purpose so it was
not clear, another bitflag would be ok.

the main remaining issue to solve (and run at runtime) is the logic is
to keep this flag consistent with all vmas pointing to the page having
VM_LOCKED set. I'm not sure if it's worth it.

what we do in 2.4 and that works pretty well, that is simply to refile
pages into the active list if they're mlocked, so we don't waste too
much cpu on them since we don't analyze them too often. this should work
pretty well for everybody, or peraphs google may prefer to have a fully
consistent PG_mlocked.

2004-03-03 19:02:25

by Dave McCracken

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2


--On Wednesday, March 03, 2004 19:51:22 +0100 Andrea Arcangeli
<[email protected]> wrote:

> ok, you used PG_locked that already exists for another purpose so it was
> not clear, another bitflag would be ok.

I forgot PG_locked already existed :)

> the main remaining issue to solve (and run at runtime) is the logic is
> to keep this flag consistent with all vmas pointing to the page having
> VM_LOCKED set. I'm not sure if it's worth it.

That's been the main stumbling block to anyone actually trying it. It
involves per-page housekeeping at every mlock() and every time a locked vma
gets unmapped.

> what we do in 2.4 and that works pretty well, that is simply to refile
> pages into the active list if they're mlocked, so we don't waste too
> much cpu on them since we don't analyze them too often. this should work
> pretty well for everybody, or peraphs google may prefer to have a fully
> consistent PG_mlocked.

The point would be to have a way of finding and skipping the locked page
before we go look up the vmas for it. 2.4 doesn't have that problem
because it's working from the vma.

Dave McCracken

2004-03-03 19:04:45

by Mike Kravetz

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Wed, Mar 03, 2004 at 04:08:02PM +0000, Christoph Hellwig wrote:
> On Wed, Mar 03, 2004 at 09:58:44AM -0600, Dave McCracken wrote:
> > It'd mean the page struct would have to have a count of the number of
> > mlock()ed regions it belongs to, and we'd have to update all the pages each
> > time we call it.
>
> That would add another atomic_t to struct pages..
>
> But if we did it it would help some xfs fixes I'm doing currently a whole lot
>

As someone looking into hotplug memory, I can imagine this also helping
out in that area.

--
Mike

2004-03-03 21:05:19

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Wed, Mar 03, 2004 at 01:01:21PM -0600, Dave McCracken wrote:
> The point would be to have a way of finding and skipping the locked page
> before we go look up the vmas for it. 2.4 doesn't have that problem
> because it's working from the vma.

it has the same problem because we scan those pages always at the
physical side and always at the pagetable side. we could avoid scanning
them at the pagetable side, but we scan them anyways exactly to be able
to call mark_page_accessed that will move the page in the active list,
so the physical side doesn't waste an excessive amount of cpu on
unfreeable mlocked pages.

the trick is to keep marking those mlocked pages as accessed (so they go
into the active list) every time we encounter them and we find them
mlocked, so they're not in our way all the time.

2004-03-03 21:41:14

by Martin J. Bligh

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

> what we do in 2.4 and that works pretty well, that is simply to refile
> pages into the active list if they're mlocked, so we don't waste too
> much cpu on them since we don't analyze them too often. this should work
> pretty well for everybody, or peraphs google may prefer to have a fully
> consistent PG_mlocked.

If the page is actually mlocked, wouldn't it make more sense to remove
it from both the active and inactive lists altogether? Scanning it seems
like it'd be less than fruitful.

M.

2004-03-03 23:16:34

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Wed, Mar 03, 2004 at 01:39:35PM -0800, Martin J. Bligh wrote:
> > what we do in 2.4 and that works pretty well, that is simply to refile
> > pages into the active list if they're mlocked, so we don't waste too
> > much cpu on them since we don't analyze them too often. this should work
> > pretty well for everybody, or peraphs google may prefer to have a fully
> > consistent PG_mlocked.
>
> If the page is actually mlocked, wouldn't it make more sense to remove
> it from both the active and inactive lists altogether? Scanning it seems
> like it'd be less than fruitful.

yes, that's probably better, but the brainer part of the code is
probably the same as for maintaining the bitflag.

2004-03-04 15:42:02

by Rik van Riel

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Wed, 3 Mar 2004, Christoph Hellwig wrote:
> On Wed, Mar 03, 2004 at 09:58:44AM -0600, Dave McCracken wrote:
> > It'd mean the page struct would have to have a count of the number of
> > mlock()ed regions it belongs to, and we'd have to update all the pages each
> > time we call it.
>
> That would add another atomic_t to struct pages..

No need for that. If a page is mlocked, it shouldn't be on any
of the LRU lists (since it can't be swapped out yadda yadda).

That means the locked count can share space in the struct page
with the list head used for the lru.

The only reason I haven't done this yet is that I didn't get
around to it...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-03-04 15:47:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Thu, Mar 04, 2004 at 10:41:51AM -0500, Rik van Riel wrote:
> On Wed, 3 Mar 2004, Christoph Hellwig wrote:
> > On Wed, Mar 03, 2004 at 09:58:44AM -0600, Dave McCracken wrote:
> > > It'd mean the page struct would have to have a count of the number of
> > > mlock()ed regions it belongs to, and we'd have to update all the pages each
> > > time we call it.
> >
> > That would add another atomic_t to struct pages..
>
> No need for that. If a page is mlocked, it shouldn't be on any
> of the LRU lists (since it can't be swapped out yadda yadda).
>
> That means the locked count can share space in the struct page
> with the list head used for the lru.
>
> The only reason I haven't done this yet is that I didn't get
> around to it...

so would you be okay with an inkernel interface like:

void mlock_page(struct page *page)
{
if (!test_and_set_bit(PG_mlocked, &page->flags)
remove_from_lru_if_there();
atomic_inc(&page.some_union->mlock_count);
}

if so that would help me greatly for xfs, but I'd also need a 2.4 variant..

2004-03-04 16:21:27

by Rik van Riel

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Thu, 4 Mar 2004, Christoph Hellwig wrote:

> void mlock_page(struct page *page)
> {
> if (!test_and_set_bit(PG_mlocked, &page->flags)
> remove_from_lru_if_there();
> atomic_inc(&page.some_union->mlock_count);
> }

Looks ok to me.

> if so that would help me greatly for xfs, but I'd also need a 2.4 variant..

I don't see why the same thing wouldn't work for 2.4 ...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-03-04 17:03:41

by Matt Mackall

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Thu, Mar 04, 2004 at 10:41:51AM -0500, Rik van Riel wrote:
> On Wed, 3 Mar 2004, Christoph Hellwig wrote:
> > On Wed, Mar 03, 2004 at 09:58:44AM -0600, Dave McCracken wrote:
> > > It'd mean the page struct would have to have a count of the number of
> > > mlock()ed regions it belongs to, and we'd have to update all the pages each
> > > time we call it.
> >
> > That would add another atomic_t to struct pages..
>
> No need for that. If a page is mlocked, it shouldn't be on any
> of the LRU lists (since it can't be swapped out yadda yadda).
>
> That means the locked count can share space in the struct page
> with the list head used for the lru.

And we can drop the VM_RESERVED flag, which is currently used in a
bunch of places where it's not on lists already.

--
Matt Mackall : http://www.selenic.com : Linux development and consulting

2004-03-25 21:44:39

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Wed, Mar 03, 2004 at 08:09:33AM +0100, Andrea Arcangeli wrote:
> --- sles-objrmap/mm/mmap.c.~1~ 2004-03-03 06:45:38.980596736 +0100
> +++ sles-objrmap/mm/mmap.c 2004-03-03 06:53:46.945414808 +0100
> @@ -1284,8 +1284,8 @@ int do_munmap(struct mm_struct *mm, unsi
> /*
> * Remove the vma's, and unmap the actual pages
> */
> - detach_vmas_to_be_unmapped(mm, mpnt, prev, end);
> spin_lock(&mm->page_table_lock);
> + detach_vmas_to_be_unmapped(mm, mpnt, prev, end);
> unmap_region(mm, mpnt, prev, start, end);
> spin_unlock(&mm->page_table_lock);
>
> --- sles-objrmap/mm/swapfile.c.~1~ 2004-03-03 06:45:39.023590200 +0100
> +++ sles-objrmap/mm/swapfile.c 2004-03-03 07:03:33.128301464 +0100
> @@ -499,7 +499,6 @@ static int unuse_process(struct mm_struc
> /*
> * Go through process' page directory.
> */
> - down_read(&mm->mmap_sem);
> spin_lock(&mm->page_table_lock);
> for (vma = mm->mmap; vma; vma = vma->vm_next) {
> pgd_t * pgd = pgd_offset(mm, vma->vm_start);
> @@ -507,7 +506,6 @@ static int unuse_process(struct mm_struc
> break;
> }
> spin_unlock(&mm->page_table_lock);
> - up_read(&mm->mmap_sem);
> pte_chain_free(pte_chain);
> return 0;
> }

Martin, I just found I was wrong about the above, I would been right in
the 2.4 VM, but in 2.6 the above is not needed. So you can delete the
above part from your tree too. it seems 2.6 really splitted the
page_table_lock from the mmap_sem and the only way to touch vmas is to
get the mmap_sem, the page_table_lock is unrelated to the vmas. This
wasn't the case in 2.4 where readers were allowed to take the
page_table_lock only.

2004-03-26 11:59:42

by Hugh Dickins

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Thu, 25 Mar 2004, Andrea Arcangeli wrote:
> On Wed, Mar 03, 2004 at 08:09:33AM +0100, Andrea Arcangeli wrote:
> > --- sles-objrmap/mm/mmap.c.~1~ 2004-03-03 06:45:38.980596736 +0100
> > +++ sles-objrmap/mm/mmap.c 2004-03-03 06:53:46.945414808 +0100
> > @@ -1284,8 +1284,8 @@ int do_munmap(struct mm_struct *mm, unsi
> > /*
> > * Remove the vma's, and unmap the actual pages
> > */
> > - detach_vmas_to_be_unmapped(mm, mpnt, prev, end);
> > spin_lock(&mm->page_table_lock);
> > + detach_vmas_to_be_unmapped(mm, mpnt, prev, end);
> > unmap_region(mm, mpnt, prev, start, end);
> > spin_unlock(&mm->page_table_lock);
> >
> > --- sles-objrmap/mm/swapfile.c.~1~ 2004-03-03 06:45:39.023590200 +0100
> > +++ sles-objrmap/mm/swapfile.c 2004-03-03 07:03:33.128301464 +0100
> > @@ -499,7 +499,6 @@ static int unuse_process(struct mm_struc
> > /*
> > * Go through process' page directory.
> > */
> > - down_read(&mm->mmap_sem);
> > spin_lock(&mm->page_table_lock);
> > for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > pgd_t * pgd = pgd_offset(mm, vma->vm_start);
> > @@ -507,7 +506,6 @@ static int unuse_process(struct mm_struc
> > break;
> > }
> > spin_unlock(&mm->page_table_lock);
> > - up_read(&mm->mmap_sem);
> > pte_chain_free(pte_chain);
> > return 0;
> > }
>
> Martin, I just found I was wrong about the above, I would been right in
> the 2.4 VM, but in 2.6 the above is not needed. So you can delete the
> above part from your tree too. it seems 2.6 really splitted the
> page_table_lock from the mmap_sem and the only way to touch vmas is to
> get the mmap_sem, the page_table_lock is unrelated to the vmas. This
> wasn't the case in 2.4 where readers were allowed to take the
> page_table_lock only.

Yes you were wrong, but no you're not right ;)
Leaving 2.4 out of it, taking it tree by tree:

Linus' 2.6.5-rc2: rmap.c's try_to_unmap_one uses find_vma under
protection of page_table_lock alone, swapfile.c's unuse_process
walks mm->mmap vma list under protection of page_table_lock alone
(other cases of interest? I've not looked further), mmap.c uses
page_table_lock to protect vma list manipulations, mmap_sem also.

Andrew's 2.6.5-rc2-mm3: as Linus.

Martin's 2.6.4-mjb1: unsafe: try_to_unmap_one (only used for anon
in this tree) and unuse_process rely on page_table_lock as in Linus,
but page_table_lock has been removed from some of the vma manipulation
in mmap.c e.g. vma_merge does __vma_unlink without page_table_lock.
Previous -mjb safer in unuse_process (using mmap_sem), unsafer in
detach_vmas_to_be_unmapped (no page_table_lock held).

Andrea's 2.6.5-rc2-aa4 (anon_vma): based on Martin's, but very
likely safe since it does not use find_vma at all in swapout, and
unuse_process downs mmap_sem as Martin's used to before.

Hugh's anobjrmap patches: as Linus (I left out -mjb mmap.c mods).

I believe that Martin should revert all the mmap.c page_table_lock
diffs from Linus in his tree (instead of reverting the patches you
show above), but that you Andrea can probably keep them with yours
(I've not reviewed enough to be sure).

(And of course, perhaps later on Martin will merge yours into
his and do one thing, or merge mine into his and do another.)

Hugh

2004-03-26 18:01:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Fri, Mar 26, 2004 at 11:57:23AM +0000, Hugh Dickins wrote:
> Andrea's 2.6.5-rc2-aa4 (anon_vma): based on Martin's, but very
> likely safe since it does not use find_vma at all in swapout, and
> unuse_process downs mmap_sem as Martin's used to before.

Hugh, thanks for the review! I also don't see locking bugs in this area
in my tree and I like not to take the page_table_lock during vma
manipulations since I don't seem to need it.

2004-03-26 23:23:25

by Andrew Morton

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

Andrea Arcangeli <[email protected]> wrote:
>
> On Fri, Mar 26, 2004 at 11:57:23AM +0000, Hugh Dickins wrote:
> > Andrea's 2.6.5-rc2-aa4 (anon_vma): based on Martin's, but very
> > likely safe since it does not use find_vma at all in swapout, and
> > unuse_process downs mmap_sem as Martin's used to before.
>
> Hugh, thanks for the review! I also don't see locking bugs in this area
> in my tree and I like not to take the page_table_lock during vma
> manipulations since I don't seem to need it.

It would be really, really nice if we could clean this crap up. mmap_sem
protects the vma tree, i_shared_sem protects the per-address_space vma
lists and page_table_lock protects the pagetables.

Does this sound like something we can achieve?

(Could page_table_lock then become per-vma?)

2004-03-27 14:23:30

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2

On Fri, Mar 26, 2004 at 03:25:23PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > On Fri, Mar 26, 2004 at 11:57:23AM +0000, Hugh Dickins wrote:
> > > Andrea's 2.6.5-rc2-aa4 (anon_vma): based on Martin's, but very
> > > likely safe since it does not use find_vma at all in swapout, and
> > > unuse_process downs mmap_sem as Martin's used to before.
> >
> > Hugh, thanks for the review! I also don't see locking bugs in this area
> > in my tree and I like not to take the page_table_lock during vma
> > manipulations since I don't seem to need it.
>
> It would be really, really nice if we could clean this crap up. mmap_sem
> protects the vma tree, i_shared_sem protects the per-address_space vma
> lists and page_table_lock protects the pagetables.
>
> Does this sound like something we can achieve?

yes, and I already achieved this in my tree, so if you plan to merge my
stuff you don't need to touch it in your tree (so it also avoids me to
reject on stuff I already did).

> (Could page_table_lock then become per-vma?)

I probably could make it per-vma, but it's low prio at this time.

2004-03-27 15:30:00

by Dave McCracken

[permalink] [raw]
Subject: Re: 230-objrmap fixes for 2.6.3-mjb2


--On Friday, March 26, 2004 15:25:23 -0800 Andrew Morton <[email protected]>
wrote:

> It would be really, really nice if we could clean this crap up. mmap_sem
> protects the vma tree, i_shared_sem protects the per-address_space vma
> lists and page_table_lock protects the pagetables.
>
> Does this sound like something we can achieve?
>
> (Could page_table_lock then become per-vma?)

The page_table_lock protection includes pgd and pmd. Would they receive
adequate protection with a per-vma lock? Could we come up with some
lockless scheme for managing them? They're pretty much write-once, as it
stands.

Dave McCracken