2001-10-30 11:32:42

by Frank Dekervel

Subject: need help interpreting 'free' output.

hello,

since i saw strange things happening with my free memory numbers, i tried
this:
- i compiled and booted a fresh kernel (no proprietary modules, no patches,
just 2.4.14-pre4)
- i did free.

bakvis:~# free
             total       used       free     shared    buffers     cached
Mem:        384912      55644     329268          0       3652      29880
-/+ buffers/cache:      22112     362800
Swap:       136512          0     136512

so i have 22 meg used right ?

- i started the daily cron jobs (updatedb and htdig and some minor things
like log rotation)

- i did 'free' again.

bakvis:~# free
             total       used       free     shared    buffers     cached
Mem:        384912     377060       7852          0      29424     125660
-/+ buffers/cache:     221976     162936
Swap:       136512        752     135760

so now there is 220 meg used memory right ?
and the memory is definitely used, because as soon as i start a memory hog
the system hits swap ...

so what am i missing here ?
should i provide more info about my kernel configuration ? vmstat numbers ?

greetings,
Frank
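
The `-/+ buffers/cache` row can be recomputed from the `Mem:` row, which is a quick way to check how the numbers above fit together. A minimal sketch in C, assuming the column layout printed by this version of free and values in KiB; the struct and function names are made up for illustration and are not taken from procps:

```c
#include <stdio.h>

/* One "Mem:" row of free output, all values in KiB.
 * Illustrative type only, not procps code. */
struct mem_row {
    long total, used, free_kb, buffers, cached;
};

/* "used" column of the -/+ buffers/cache row */
long used_without_cache(const struct mem_row *m)
{
    return m->used - m->buffers - m->cached;
}

/* "free" column of the -/+ buffers/cache row */
long free_with_cache(const struct mem_row *m)
{
    return m->free_kb + m->buffers + m->cached;
}

/* Print the derived row in roughly the shape free uses. */
void print_adjusted(const struct mem_row *m)
{
    printf("-/+ buffers/cache: %10ld %10ld\n",
           used_without_cache(m), free_with_cache(m));
}
```

Fed the two `Mem:` rows above, this gives 22112 / 362800 and 221976 / 162936, matching what free printed. The second "used" figure is therefore memory that is neither free nor buffers nor page cache; that bucket covers process memory but also kernel allocations.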


2001-10-30 11:46:43

by Mike Fedyk

Subject: Re: need help interpreting 'free' output.

On Tue, Oct 30, 2001 at 12:32:52PM +0100, Frank Dekervel wrote:
> so now there is 220 meg used memory right ?
> and the memory is definitely used, because as soon as i start a memory hog
> the system hits swap ...
>
> so what am i missing here ?
> should i provide more info about my kernel configuration ? vmstat numbers ?
>

Ahh, are you a new convert from a 2.2 kernel?

In 2.4 the kernel will swap out much earlier to make room for the running
programs, and disk cache. This is normal.

Earlier 2.4 kernels didn't do so well, but I won't go into detail because
there is already enough about that in the archives...

When you watch vmstat, if you see a lot of swapping traffic without much
good reason, then you should probably report something...

Mike

2001-10-30 14:02:38

by Frank Dekervel

Subject: Re: need help interpreting 'free' output.

On Tuesday 30 October 2001 12:46, Mike Fedyk wrote:
> Ahh, are you a new convert from a 2.2 kernel?
>
> In 2.4 the kernel will swap out much earlier to make room for the running
> programs, and disk cache. This is normal.
>
> Earlier 2.4 kernels didn't do so well, but I won't go into detail because
> there is already enough about that in the archives...
>
> When you watch vmstat, if you see a lot of swapping traffic without much
> good reason, then you should probably report something...

Hi,

i have been using 2.4 for some time already. the thing that bugs me is that the
'used' figures go up while no processes actually use that memory (i don't mean
the buffers/cached figures: they go up too, but that's normal). so it seems the
memory is 'lost' somewhere: i don't see any processes using it up, and 200 meg
of ram in 70 seconds is a lot ...
So either i am misinterpreting something, or i am completely clueless, or there
is a leak somewhere..

greetings,
Frank

2001-10-30 16:05:43

by Hugh Dickins

Subject: Re: need help interpreting 'free' output.

On Tue, 30 Oct 2001, Frank Dekervel wrote:
>
> since i saw strange things happening with my free memory numbers, i tried
> this:
> - i compiled and booted a fresh kernel (no proprietary modules, no patches,
> just 2.4.14-pre4)
> - i did free.
>
> bakvis:~# free
>              total       used       free     shared    buffers     cached
> Mem:        384912      55644     329268          0       3652      29880
> -/+ buffers/cache:      22112     362800
> Swap:       136512          0     136512
>
> so i have 22 meg used right ?
>
> - i started the daily cron jobs (updatedb and htdig and some minor things
> like log rotation)
>
> - i did 'free' again.
>
> bakvis:~# free
>              total       used       free     shared    buffers     cached
> Mem:        384912     377060       7852          0      29424     125660
> -/+ buffers/cache:     221976     162936
> Swap:       136512        752     135760
>
> so now there is 220 meg used memory right ?
> and the memory is definitely used, because as soon as i start a memory hog
> the system hits swap ...
>
> so what am i missing here ?
> should i provide more info about my kernel configuration ? vmstat numbers ?

I'm fairly sure /proc/slabinfo will show large inode_cache and large
dentry_cache: which is natural after updatedb, nothing wrong with that.

However, unlike 2.4.13, 2.4.14-pre (you tried pre4, I just tried pre5)
seems much too unwilling to shrink_dcache and shrink_icache: your
memory hog should shrink them, but it seems not to. Linus?

Hugh

2001-10-30 16:53:45

by Andrea Arcangeli

Subject: Re: need help interpreting 'free' output.

On Tue, Oct 30, 2001 at 04:07:45PM +0000, Hugh Dickins wrote:
> On Tue, 30 Oct 2001, Frank Dekervel wrote:
> >
> > since i saw strange things happening with my free memory numbers, i tried
> > this:
> > - i compiled and booted a fresh kernel (no proprietary modules, no patches,
> > just 2.4.14-pre4)
> > - i did free.
> >
> > bakvis:~# free
> >              total       used       free     shared    buffers     cached
> > Mem:        384912      55644     329268          0       3652      29880
> > -/+ buffers/cache:      22112     362800
> > Swap:       136512          0     136512
> >
> > so i have 22 meg used right ?
> >
> > - i started the daily cron jobs (updatedb and htdig and some minor things
> > like log rotation)
> >
> > - i did 'free' again.
> >
> > bakvis:~# free
> >              total       used       free     shared    buffers     cached
> > Mem:        384912     377060       7852          0      29424     125660
> > -/+ buffers/cache:     221976     162936
> > Swap:       136512        752     135760
> >
> > so now there is 220 meg used memory right ?
> > and the memory is definitely used, because as soon as i start a memory hog
> > the system hits swap ...
> >
> > so what am i missing here ?
> > should i provide more info about my kernel configuration ? vmstat numbers ?
>
> I'm fairly sure /proc/slabinfo will show large inode_cache and large
> dentry_cache: which is natural after updatedb, nothing wrong with that.
>
> However, unlike 2.4.13, 2.4.14-pre (you tried pre4, I just tried pre5)
> seems much too unwilling to shrink_dcache and shrink_icache: your
> memory hog should shrink them, but it seems not to. Linus?

2.4.14pre5aa1 has logic to try to shrink those caches at a better
time. Frank, could you try again with pre5aa1 and see if it goes better?

Not shrinking the vfs caches when shrink_cache failed is wrong:
allocations from ZONE_NORMAL will fail with no way to recover as soon as
all of ZONE_NORMAL is eaten up by the vfs caches.

Andrea

2001-10-30 16:55:05

by Linus Torvalds

Subject: Re: need help interpreting 'free' output.


On Tue, 30 Oct 2001, Hugh Dickins wrote:
>
> However, unlike 2.4.13, 2.4.14-pre (you tried pre4, I just tried pre5)
> seems much too unwilling to shrink_dcache and shrink_icache: your
> memory hog should shrink them, but it seems not to. Linus?

Yes. It's next on my list.

My _preferred_ approach would actually be to move the slab pages to the
LRU list too, and have a special "slab" address space (we don't need to
actually hash them, we just make page->mapping point to it), and have the
cache shrink be done naturally as part of writepage().

That way "shrink_cache()" reacts very naturally to slab pressure, while
right now it's more of a random behaviour. That's what the "anonymous
pages in the LRU" approach fixes - the VM scanning reacts very naturally
(instead of with subtle tweaking and almost random behaviour) to mapped
page pressure.

The "slab address space" is a longer-range plan, though. It might be
really simple (the writepage would just move the page to the active list
and try to shrink the slab that was hit), but I think the current stuff is
"good enough".

So in the short range, I haven't come up with any really good approaches,
but I suspect I'll just have to move the shrink_[di]cache() back to the
caller, which will at least shrink them on swapouts (a bit too much, I
think, but on the other hand maybe not).

Patch attached,

Linus


Attachments:
slab (1.17 kB)

2001-10-30 17:06:45

by Andrea Arcangeli

Subject: Re: need help interpreting 'free' output.

On Tue, Oct 30, 2001 at 08:52:58AM -0800, Linus Torvalds wrote:
> So in the short range, I haven't come up with any really good approaches,
> but I suspect I'll just have to move the shrink_[di]cache() back to the
> caller, which will at least shrink them on swapouts (a bit too much, I
> think, but on the other hand maybe not).

Agreed.

It is still interesting to hear if it makes a big performance difference
under swap though. In particular it would be very nice to keep inodes
with pagecache in them out of the unused-inode-list, but it would need
additional bookkeeping in inode.c.

I'm also wondering why you dropped the early-cow for the write swapins:
was it just to avoid managing the anon pages in the lru in do_swap_page and
to have the logic in only one place? I kept the early-cow logic so I only
get 1 page fault for every write-swapped-in page.

Andrea

2001-10-30 17:30:33

by Linus Torvalds

Subject: Re: need help interpreting 'free' output.


On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
>
> It is still interesting to hear if it makes a big performance difference
> under swap though. In particular it would be very nice to keep inodes
> with pagecache in them out of the unused-inode-list, but it would need
> additional bookkeeping in inode.c.

Yes. I'm worried about the fact that icache shrinking was one of the top
CPU users under heavy swapout, so I'd like to do _something_. The LRU
approach is probably the cleanest and least random approach.

> I'm also wondering why you dropped the early-cow for the write swapins:
> was it just to avoid managing the anon pages in the lru in do_swap_page and
> to have the logic in only one place? I kept the early-cow logic so I only
> get 1 page fault for every write-swapped-in page.

I only dropped it because the locking rules for how exclusive swap pages
work were too unclear, and I wanted to have the "remove on write" in just
one place.

Then I cleaned up the logic and made the thing use the pagecache lock
properly and turned it into "remove_exclusive_swap_page()", and now I'm
not worried about it any more, so I'm considering moving it back again.

HOWEVER, _then_ I started wondering about whether the thing needs to be
removed from the swap cache at all, and came to the conclusion that for
the only case we really care about (and the only case where we _can_
re-use the swap cache page), we don't actually need to remove it from the
cache in the first place.

I think we should just share the page, and make the WP (and early-COW in
do_swap_page()) logic just be

        /* Are we now the only user? */
        if (swap_count(page) == 1 && page_count(page) == 2) {
                pte = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
                install_pte();
                return;
        }

There's no real reason to remove the page from the swap cache - that only
means that we have to wait for the page to unlock (because you need to
lock the page in order to remove the buffers that you need to remove
_before_ you free the swap entry) and other crap that has no real point to
it.

When we fork() and possibly share the page non-exclusively, we will
_already_ mark the page read-only and do the COW - so after that point we
will correctly just copy the page on demand.

Much simpler, I think.

Does anybody see why we have to remove it from the swap cache at all?

Linus
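
The proposed test is small enough to model outside the kernel. A toy sketch, assuming a stand-in struct rather than the kernel's `struct page`, and assuming the two page references are the swap cache's own plus the one held by the faulting process (the message does not spell that out); all names are invented for illustration:

```c
#include <stdbool.h>

/* Toy stand-in for a swap-cache page; NOT the kernel's struct page.
 * The caller is assumed to hold whatever lock serializes both counters. */
struct toy_swap_page {
    int page_count;  /* references to the in-memory page */
    int swap_count;  /* references to the swap entry */
};

/* Mirrors the test in the message: exactly one swap reference and exactly
 * two page references (assumed: swap cache + the faulting process). */
bool toy_only_user(const struct toy_swap_page *p)
{
    return p->swap_count == 1 && p->page_count == 2;
}
```

In this model a page that another process has also mapped (page count 3) or whose swap entry is still shared (swap count 2) fails the test and would be copied on write as before.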

2001-10-30 17:39:23

by Andrea Arcangeli

Subject: Re: need help interpreting 'free' output.

On Tue, Oct 30, 2001 at 09:28:29AM -0800, Linus Torvalds wrote:
> Does anybody see why we have to remove it from the swap cache at all?

the only reason is to avoid wasting the swap space, so at least Rik's
vm_swap_full logic should be added to it. The only advantage of dirty
swap cache persistence is that it will maintain the same position on
disk across a swapin/swapout cycle.

But anyways you can do that "swap persistence" work in do_swap_page too
to save a page fault for the write swapins. Ok, it's in one more place
but it will be less costly than running into another pagefault just
after returning to userspace.

Andrea

2001-10-30 17:55:26

by Linus Torvalds

Subject: Re: need help interpreting 'free' output.


On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
>
> On Tue, Oct 30, 2001 at 09:28:29AM -0800, Linus Torvalds wrote:
> > Does anybody see why we have to remove it from the swap cache at all?
>
> the only reason is to avoid wasting the swap space, so at least Rik's
> vm_swap_full logic should be added to it.

I agree, but that's true both for reads and writes, and then we want to
delete it. So the logic might be something like

        remove = 0;
        if ((vm_swap_full() && (remove = exclusive_swap_cache_delete())) ||
            only_swap_user()) {
                pte = mk_pte(page, vma->vm_page_prot);
                if (remove || write_access)
                        pte = pte_mkdirty(pte);
                if (vma->vm_flags & VM_WRITE)
                        pte = pte_mkwrite(pte);
                install_pte();
                return;
        }

ie we _remove_ it if we're low on swap entries and it is exclusive (that
doesn't really save memory, but it allows us to re-use the swap entries
for "better" pages), and we just re-use it without removing it if we're
the only users (it doesn't even have to be a write access - we can do it
even for reads, as if we're the only user we might as well just give the
page to the process anyway - and let fork() do the thing it does in any
case).

Then we'll just trust the dirty bit when shared, like we always have done
before anyway (we need to set it on removal, and we want to set it early
on a write access to avoid unnecessary faults on architectures which do
the dirty bit in software - that's why we have the "remove ||
write_access" test there).

> The only advantage of dirty swap cache persistence is that it will
> maintain the same position on disk across a swapin/swapout cycle.

Well, the _big_ advantage is not the persistence, but the fact that the
page might be in-flight when the user wants to use it, and the swap cache
is just busy. Right now we _wait_ for the write to complete, which is
silly. We might as well just let the user start using the page (including
writing more stuff to it), and later on write it again.

So right now the "remove from swap cache" is actually an IO-serializing
operation, and we're doing it for no really good reason.

Linus
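
The control flow in this message can be written as a pure decision function, which makes the cases easier to enumerate. This is only a model: the struct and field names are hypothetical, the helper calls from the message are replaced by boolean inputs, and `exclusive_swap_cache_delete()` is assumed to succeed exactly when the page is exclusive:

```c
#include <stdbool.h>

/* Hypothetical inputs to the swap-in decision sketched in the message. */
struct swapin_state {
    bool swap_full;     /* what vm_swap_full() would report */
    bool exclusive;     /* deletion from the swap cache would succeed */
    bool only_user;     /* we are the only user of page and swap entry */
    bool write_access;  /* the fault was a write */
    bool vma_writable;  /* the mapping permits writes */
};

struct swapin_result {
    bool reuse_page;    /* map the existing page right away */
    bool removed;       /* page was dropped from the swap cache */
    bool dirty;
    bool writable;
};

struct swapin_result decide_swapin(const struct swapin_state *s)
{
    struct swapin_result r = { false, false, false, false };

    /* Delete only when swap is nearly full and the page is exclusive;
     * this frees the swap entry for some other page. */
    r.removed = s->swap_full && s->exclusive;

    if (r.removed || s->only_user) {
        r.reuse_page = true;
        /* Dirty on removal (the on-disk copy is gone) or on a write fault. */
        r.dirty = r.removed || s->write_access;
        r.writable = s->vma_writable;
    }
    return r;
}
```

In this model a read fault by the only user, with swap not full, reuses the page clean and leaves it in the swap cache, which is the case the message argues should not have to wait for anything.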

2001-10-30 18:10:56

by Frank Dekervel

Subject: Re: need help interpreting 'free' output.


On Tuesday 30 October 2001 17:07, Hugh Dickins wrote:
> I'm fairly sure /proc/slabinfo will show large inode_cache and large
> dentry_cache: which is natural after updatedb, nothing wrong with that.

indeed.

before updatedb:

inode_cache        10594  10605    512  1515  1515    1
dentry_cache       18239  18240    128   608   608    1

after:

inode_cache       220883 220913    512 31558 31559    1
dentry_cache      229471 229500    128  7650  7650    1

but i guess this comes a bit late :)

greetings,
frank
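
To relate the slabinfo lines to the "used" figure from free, the footprint of each cache can be estimated from its slab count. A rough sketch, assuming the 2.4 slabinfo column order (name, active objects, total objects, object size in bytes, active slabs, total slabs, pages per slab) and 4 KiB pages; both are assumptions, and the names are illustrative:

```c
/* One line of /proc/slabinfo as quoted above. The column meaning is an
 * assumption: name, active objects, total objects, object size (bytes),
 * active slabs, total slabs, pages per slab. */
struct slab_line {
    const char *name;
    long active_objs, total_objs, obj_size;
    long active_slabs, total_slabs, pages_per_slab;
};

#define ASSUMED_PAGE_KB 4  /* assumption: 4 KiB pages, as on i386 */

/* Memory held by a cache in KiB: every slab occupies whole pages,
 * whether or not all of its objects are in use. */
long slab_footprint_kb(const struct slab_line *s)
{
    return s->total_slabs * s->pages_per_slab * ASSUMED_PAGE_KB;
}
```

With the numbers above this gives about 8492 KiB for the two caches before updatedb and about 156836 KiB after, i.e. roughly 145 MB of growth. If the samples are comparable with the earlier free output, that would account for most of the jump in "used".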

2001-10-30 18:17:37

by Andrea Arcangeli

Subject: Re: need help interpreting 'free' output.

On Tue, Oct 30, 2001 at 09:53:28AM -0800, Linus Torvalds wrote:
>
> On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
> >
> > On Tue, Oct 30, 2001 at 09:28:29AM -0800, Linus Torvalds wrote:
> > > Does anybody see why we have to remove it from the swap cache at all?
> >
> > the only reason is to avoid wasting the swap space, so at least Rik's
> > vm_swap_full logic should be added to it.
>
> I agree, but that's true both for reads and writes, and then we want to

yes.

> delete it. So the logic might be something like
>
>         remove = 0;
>         if ((vm_swap_full() && (remove = exclusive_swap_cache_delete())) ||
>             only_swap_user()) {

I preferred the previous exclusive_swap_page logic. It couldn't race
because we had the lock on the page, it's equivalent, and it looked
cleaner and simpler to me: we had to bother about the rest only if the
page was exclusive. Now this only_swap_user basically replaces the
exclusive_swap_cache check, and you will end up doing double the work
if the vm is full and the page isn't exclusive, since both
exclusive_swap_cache_delete and only_swap_user will have to run and
fail.

>                 pte = mk_pte(page, vma->vm_page_prot);
>                 if (remove || write_access)
>                         pte = pte_mkdirty(pte);
>                 if (vma->vm_flags & VM_WRITE)
>                         pte = pte_mkwrite(pte);
>                 install_pte();
>                 return;
>         }
>
> ie we _remove_ it if we're low on swap entries and it is exclusive (that
> doesn't really save memory, but it allows us to re-use the swap entries
> for "better" pages), and we just re-use it without removing it if we're
> the only users (it doesn't even have to be a write access - we can do it
> even for reads, as if we're the only user we might as well just give the
> page to the process anyway - and let fork() do the thing it does in any
> case.
>
> Then we'll just trust the dirty bit when shared, like we always have done
> before anyway (we need to set it on removal, and we want to set it early
> on a write access to avoid unnecessary faults on architectures which do
> the dirty bit in software - that's why we have the "remove ||
> write_access" test there.

ok.

>
> > The only advantage of dirty swap cache persistence is that it will
> > maintain the same position on disk across a swapin/swapout cycle.
>
> Well, the _big_ advantage is not the persistence, but the fact that the
> page might be in-flight when the user wants to use it, and the swap cache
> is just busy. Right now we _wait_ for the write to complete, which is
> silly. We might as well just let the user start using the page (including
> writing more stuff to it), and later on write it again.

if we remove all write-swapins from the swap cache those pages cannot be
in flight, we cannot do I/O on anon memory if it's not in the swapcache
or we would race badly. all I/O to the swap space has to pass through
the swap cache to be safe. So I don't see how an anonymous page can be
in flight.

> So right now the "remove from swap cache" is actually a IO-serializing
> operation, and we're doing it for no really good reason.

I think this is not true. remove_from_swap_cache can be run only if:

1) we hold the lock on the page
2) this means all I/O is complete and so we can safely convert this
non in-flight page to a clean anonymous page where any further
I/O will be impossible

So I still think the only advantage is to keep the swap position
persistent across a swapin/swapout cycle.

Andrea

2001-10-30 18:19:51

by ebiederman

Subject: Re: need help interpreting 'free' output.

Linus Torvalds <[email protected]> writes:

> HOWEVER, _then_ I started wondering about whether the thing needs to be
> removed from the swap cache at all, and came to the conclusion that for
> the only case we really care about (and the only case where we _can_
> re-use the swap cache page), we don't actually need to remove it from the
> cache in the first place.

There is a second case, though you may be handling it differently now.
Typically the case is swap < RAM. But basically when we don't have
enough swap pages it pays to drop pages from the swap
cache. So in as many places as we can, figuring out how to drop
swap pages when the swap space is practically full is important.

The other alternative implementation is to create a logical backing
store for anonymous pages (so they don't need a presence in the page
table) and then we could just walk that backing store and free up swap
space on demand. Though if you can put anonymous pages in the page
cache now, a variation on that idea may be possible. We don't
want to remove the swap from pages that aren't in ram.


> Does anybody see why we have to remove it from the swap cache at all?

Not just for cow.

Eric

2001-10-30 18:30:31

by Linus Torvalds

Subject: Re: need help interpreting 'free' output.


On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
> > delete it. So the logic might be something like
> >
> >         remove = 0;
> >         if ((vm_swap_full() && (remove = exclusive_swap_cache_delete())) ||
> >             only_swap_user()) {
>
> I preferred the previous exclusive_swap_page logic. It couldn't race
> because we had the lock on the page, it's equivalent

It is _not_ equivalent.

Think for five seconds about what you just wrote..

"It couldn't race because we had the lock on the page.."

In short: the old code needed to get the page lock. In fact, it needed to
get the page lock even for reads that don't need it at all - only because
there could be a write from another process that shared the swap page.

Ie we optimized for the very very uncommon case. Sharing swap pages is
uncommon in itself, and it only happens when they _really_ aren't accessed
over a fork() etc. In short, writing the code to deal with that by default
is the wrong optimization.

Now, the _common_ case is that the page is truly exclusive, and you don't
want to lock the page - because locking the page means that you can pause
for a _long_ time waiting for the page to be written out when there is IO
pending.

This is especially true since we need to get the swap device lock
_anyway_, so locking the page is (a) inefficient and (b) overkill.

The new re-org gets no new locks, and drops the page lock, allowing people
to do the optimization without holding the page locked, which in turn
means that you don't need to wait for potential IO to complete just to
read a value from a page that you already have in memory.

> if we remove all write-swapins from the swap cache those pages cannot be
> in flight,

What?

The page is busy being written out by another process - the page is locked
but up-to-date. We have _no_ reason to not give it immediately.

This is something we do for all page cache pages - go read
filemap_nopage() etc. They don't wait for data that is up-to-date.

> So I don't see how an anonymous page can be in flight.

It's being swapped out. What's so hard to see about that? Look at
mm/vmscan.c: writepage-> swap_writepage().

The page is up-to-date but locked (it's obviously up-to-date, or we
wouldn't be able to write it out). It won't be unlocked until the IO has
completed, which is, under heavy swap load, easily half a second. Why do
you want to wait for that?

Linus

2001-10-30 18:58:42

by Andrea Arcangeli

Subject: Re: need help interpreting 'free' output.

On Tue, Oct 30, 2001 at 10:28:29AM -0800, Linus Torvalds wrote:
> want to lock the page - because locking the page means that you can pause
> for a _long_ time waiting for the page to be written out when there is IO
> pending.

ok I see what you mean, I agree (going to merge those important bits
into my tree! :)

however those locking bits have nothing to do with exclusive_swap_page
and the early cow I believe. exclusive_swap_page is faster than
remove_exclusive_swap_page + only_swap_page as said in the earlier email
and don't forget you somehow need the page lock too for
remove_exclusive_swap_page.

The magic word here is "_trylock_" after your wait_on_page if the page
wasn't uptodate; it's not that avoiding the early-cow or your
remove_exclusive_swap_cache will change anything (they only slow things down).

So in short we only need to replace the lock_page with a TryLockPage
(plus your wait_on_page if page is not uptodate to catch the major
faults) and here we go, faster than pre5.

In previous emails I was thinking of major faults; of course the whole
optimization here is for the _minor_ faults where we don't need to block
and where pre5aa1 blocks and where pre5 vanilla doesn't block! Very good
point.

Andrea

2001-10-30 19:23:57

by Linus Torvalds

Subject: Re: need help interpreting 'free' output.


On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
>
> So in short we only need to replace the lock_page with a TryLockPage
> (plus your wait_on_page if page is not uptodate to catch the major
> faults) and here we go, faster than pre5.

Wrong.

If _anybody_ accesses the page unlocked, you cannot do the swap_count() at
all, because then you don't have anything that serializes the accesses to
swap_count vs page_count any more.

Sure, it will _look_ like it is working (because 99.9% of the time we tend
to have exclusive pages anyway), but the fact is that the old scheme
_depended_ on swap_in getting the page lock - not just for testing, but
for everybody else who wasn't even _interested_ in testing, but just
wanted to increment the page count and decrement the swap count.

See?

Do you _now_ understand why pre5 does this atomically? It needs to test
the swap count _and_ the page count atomically under the same lock.

The page lock has NOTHING to do with anything. If we ever have any user
that does not take the page lock (and you now seem to realize why we want
to have such users), the pagelock is WORTHLESS, because suddenly it
doesn't end up protecting the counts at all.

So making it a trylock doesn't help. See?

Linus

2001-10-30 20:08:37

by Andrea Arcangeli

Subject: Re: need help interpreting 'free' output.

On Tue, Oct 30, 2001 at 11:21:46AM -0800, Linus Torvalds wrote:
>
> On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
> >
> > So in short we only need to replace the lock_page with a TryLockPage
> > (plus your wait_on_page if page is not uptodate to catch the major
> > faults) and here we go, faster than pre5.
>
> Wrong.
>
> If _anybody_ accesses the page unlocked, you cannot do the swap_count() at
> all, because then you don't have anything that serializes the accesses to
> swap_count vs page_count any more.

incidentally if trylock fails do_wp_page doesn't even try to check the
swap count, it just leaves the swap cache there. do_swap_page can do the
same thing at the early-cow stage. this is the only point I'm making.

and as said if you want to do any remove_exclusive_swap_page() in
do_swap_page as you claimed in earlier email you also need to get the
page lock.

As far as I can tell the magic key here is "trylock" and nothing else. It's
not that the remove_exclusive_swap_page or the avoidance of the
early-cow per se can make any difference (let's ignore swapoff) except
running slower; the only improvement during swapout load is that
you're delegating the work of remove_exclusive_swap_page to do_wp_page,
which will do a trylock instead of a lock_page, as far as I can tell. This is
the only point I'm making.

Go ahead and implement this thing in do_swap_page:

        remove = 0;
        if ((vm_swap_full() && (remove = exclusive_swap_cache_delete())) ||
            only_swap_user()) {
                pte = mk_pte(page, vma->vm_page_prot);
                if (remove || write_access)
                        pte = pte_mkdirty(pte);
                if (vma->vm_flags & VM_WRITE)
                        pte = pte_mkwrite(pte);
                install_pte();
                return;
        }

and you'll find yourself grabbing the page lock somehow first in the
do_swap_page path, or exclusive_swap_cache_delete will obviously BUG()
on you.

This is why I'm saying the real magic is to convert the lock_page of pre4
into a TryLockPage; all other changes are not interesting under real load,
and I obviously agree that it's a very good idea to fix the minor faults,
which in pre4 (and all previous kernels including all -ac and -aa) are
running as slow as major faults!

Now about the real need of exclusive_swap_cache_delete compared to
exclusive_swap_page I need to think a little more about it to be sure.

In short, previously we ran exclusive_swap_page only with the page lock held;
page->buffers is constant if the page is locked. And swap count and page
count _can't_ increase under us once the page happens to be exclusive.
This was the previous rule at least, but as usual there's the swapoff
evil coming out and doing the lookup on an exclusive swap page... Hugh
may provide more hints on this case.

Andrea
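
The difference between a blocking lock and a trylock in the write-fault path can be illustrated with a toy page. A sketch using C11 `atomic_flag` as a stand-in for the page lock bit; `lockable_page`, `toy_trylock` and `try_reuse_swap_page` are invented names, not kernel functions, and the `exclusive` field is assumed to have been decided by some other, properly serialized check:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy page; the flag stands in for the kernel's page lock bit. */
struct lockable_page {
    atomic_flag locked;
    bool exclusive;      /* assumed input, not computed here */
    bool in_swap_cache;
};

/* True if we got the lock, false if someone else holds it. Never sleeps. */
bool toy_trylock(struct lockable_page *p)
{
    return !atomic_flag_test_and_set(&p->locked);
}

void toy_unlock(struct lockable_page *p)
{
    atomic_flag_clear(&p->locked);
}

/* Write-fault path: reuse the page only if the lock is free right now.
 * A busy page (for example one under write-out) is left in the swap
 * cache and the caller falls back to copying. */
bool try_reuse_swap_page(struct lockable_page *p)
{
    bool reused = false;

    if (!toy_trylock(p))
        return false;
    if (p->exclusive) {
        p->in_swap_cache = false;
        reused = true;
    }
    toy_unlock(p);
    return reused;
}
```

This only shows the "do not wait" behaviour; it says nothing about how the exclusivity test itself is kept consistent, which is the part the rest of the thread argues about.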

2001-10-30 20:26:30

by Linus Torvalds

Subject: Re: need help interpreting 'free' output.



On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
>
> incidentally if trylock fails do_wp_page doesn't even try to check the
> swap count, it just leaves the swap cache there. do_swap_page can do the
> same thing at the early-cow stage. this is the only point I'm making.

Yes.

At some point we need to lock the page _if_ we actually decide we have to
do something with it. The current strategy is along the lines: if we can
obviously share it, let's do so, but let's not wait for it to be
unsharable.

> Go ahead and implement this thing in do_swap_page:
>
>         remove = 0;
>         if ((vm_swap_full() && (remove = exclusive_swap_cache_delete())) ||
>             only_swap_user()) {
>                 pte = mk_pte(page, vma->vm_page_prot);
>                 if (remove || write_access)
>                         pte = pte_mkdirty(pte);
>                 if (vma->vm_flags & VM_WRITE)
>                         pte = pte_mkwrite(pte);
>                 install_pte();
>                 return;
>         }
>
> and you'll find yourself grabbing the page lock somehow first in the
> do_swap_page path, or exclusive_swap_cache_delete will obviously BUG()
> on you.

We'll trylock it yes, but that has nothing to do with the page count
protections. We'll trylock it if we end up _deleting_ the page, but not
for count reasons, but because deletion needs the lock in order to wait
for pages.

And realize that that is the really rare case, where we really don't care
for performance - we've just realized that we don't even have enough swap
for the kind of load the machine is under.

So your point is that the "we're out of swap space" case is a bit slower
because we potentially take the swapspace spinlocks twice? Sure.

But look at the fast paths: no waiting anywhere, and no unnecessary locks.

> This is why I'm saying the real magic is to convert the lock_page of pre4
> into a TryLockPage; all other changes are not interesting under real load

NO NO NO.

Read my mails again.

The page lock used to protect the integrity of the "swap_count()" test
(which is part of the old "exclusive_swap_page()"). The rule was: the
swap count cannot change on a page when it is locked.

Making the lock_page() be a TryLockPage, and doing the "swap_free()"
without holding the page lock means that that integrity NO LONGER EXISTS.

Which means that the old "exclusive_swap_page()" DOES NOT WORK RELIABLY.
It tested "page_count()" and "swap_count()" in ways that were no longer
guaranteed to be meaningful - swap_count() could go down to 1 _after_
somebody else had incremented "page_count()" on another CPU due to a
swap-in of another process that shared the swap entry (or even another
thread on the same MM - we don't hold the page table spinlock there)..

Do you get it now?

By making the unconditional "always lock the page on swap-in" be a "try to
lock the page if you _need_ to", exclusive_swap_page() no longer worked,
and had to be gotten rid of or at least changed to do the right thing.
Considering that all users of the function also wanted to remove the page,
the change was obvious.

Might we want to split it up differently if we do the "only_user()"?
Maybe. But _please_ realize that the changes in pre5 are correctness
fixes, not some random movement of code.

Linus
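
The failure described in this message can be replayed deterministically, without threads, by injecting the other CPU's update between the two reads. A toy model with invented names; the starting values (page count 2, swap count 2) are an assumed example of a swap entry shared by two processes:

```c
#include <stdbool.h>

/* The two counters the exclusivity test looks at (toy version). */
struct shared_counts {
    int page_count;
    int swap_count;
};

/* Another CPU swapping in the same entry: it takes a page reference
 * and drops its swap reference, without the page lock. */
void other_cpu_swapin(struct shared_counts *c)
{
    c->page_count++;
    c->swap_count--;
}

/* Unserialized check: the two reads happen at different times, and
 * `between` models whatever runs on another CPU in the gap. */
bool racy_exclusive_check(struct shared_counts *c,
                          void (*between)(struct shared_counts *))
{
    bool page_ok = (c->page_count == 2);

    if (between)
        between(c);
    return page_ok && (c->swap_count == 1);
}

/* Serialized check: both counters are sampled with nothing in between,
 * which is what testing them under one common lock would guarantee. */
bool serialized_exclusive_check(const struct shared_counts *c)
{
    return c->page_count == 2 && c->swap_count == 1;
}
```

Starting from page count 2 and swap count 2, the unserialized check with `other_cpu_swapin` injected reports the page as exclusive even though it ends up with three references, while the serialized check says no both before and after the update.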

2001-10-30 20:47:50

by David Miller

Subject: Re: need help interpreting 'free' output.

   From: Linus Torvalds <[email protected]>
   Date: Tue, 30 Oct 2001 08:52:58 -0800 (PST)

   My _preferred_ approach would actually be to move the slab pages to the
   LRU list too, and have a special "slab" address space (we don't need to
   actually hash them, we just make page->mapping point to it), and have the
   cache shrink be done naturally as part of writepage().

This is a cool idea.

So when a SLAB block gets allocated from, we "reference" the
underlying page?

Franks a lot,
David S. Miller
[email protected]