2002-09-03 04:00:29

by Andrew Morton

Subject: 2.5.33-mm1

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.33/2.5.33-mm1/


Seven new patches - mostly just code cleanups.

+slablru-speedup.patch

A patch to improve slablru cpu efficiency. Ed is
redoing this.

+oom-fix.patch

Fix an OOM-killing episode on large highmem machines.

+tlb-cleanup.patch

Remove debug code from the tlb_gather rework, tidy up a couple of
things.

+dump-stack.patch

Arch-independent stack-dumping debug function

+madvise-move.patch

Move the madvise implementation out of filemap.c into madvise.c

+split-vma.patch

Rationalise lots of the VMA-manipulation code.

+buffer-ops-move.patch

Move the buffer_head IO functions out of ll_rw_blk.c, into buffer.c




scsi_hack.patch
Fix block-highmem for scsi

ext3-htree.patch
Indexed directories for ext3

rmap-locking-move.patch
move rmap locking inlines into their own header file.

discontig-paddr_to_pfn.patch
Convert page pointers into pfns for i386 NUMA

discontig-setup_arch.patch
Rework setup_arch() for i386 NUMA

discontig-mem_init.patch
Restructure mem_init for i386 NUMA

discontig-i386-numa.patch
discontigmem support for i386 NUMA

cleanup-mem_map-1.patch
Clean up lots of open-coded uses of mem_map[]. For ia32 NUMA

zone-pages-reporting.patch
Fix the boot-time reporting of each zone's available pages

enospc-recovery-fix.patch
Fix the __block_write_full_page() error path.

fix-faults.patch
Back out the initial work for atomic copy_*_user()

spin-lock-check.patch
spinlock/rwlock checking infrastructure

refill-rate.patch
refill the inactive list more quickly

copy_user_atomic.patch

kmap_atomic_reads.patch
Use kmap_atomic() for generic_file_read()

kmap_atomic_writes.patch
Use kmap_atomic() for generic_file_write()

throttling-fix.patch
Fix throttling of heavy write()rs.

dirty-state-accounting.patch
Make the global dirty memory accounting more accurate

rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)

discontig-cleanup-1.patch
i386 discontigmem coding cleanups

discontig-cleanup-2.patch
i386 discontigmem cleanups

writeback-thresholds.patch
Downward adjustments to the default dirty memory thresholds

buffer-strip.patch
Limit the consumption of ZONE_NORMAL by buffer_heads

rmap-speedup.patch
rmap pte_chain space and CPU reductions

wli-highpte.patch
Resurrect CONFIG_HIGHPTE - ia32 pagetables in highmem

readv-writev.patch
O_DIRECT support for readv/writev

slablru.patch
age slab pages on the LRU

slablru-speedup.patch
slablru optimisations

llzpr.patch
Reduce scheduling latency across zap_page_range

buffermem.patch
Resurrect buffermem accounting

config-PAGE_OFFSET.patch
Configurable kernel/user memory split

lpp.patch
ia32 huge tlb pages

ext3-sb.patch
u.ext3_sb -> generic_sbp

oom-fix.patch
Fix an OOM condition on big highmem machines

tlb-cleanup.patch
Clean up the tlb gather code

dump-stack.patch
arch-neutral dump_stack() function

madvise-move.patch
move madvise implementation into mm/madvise.c

split-vma.patch
VMA splitting patch

buffer-ops-move.patch
Move submit_bh() and ll_rw_block() into fs/buffer.c


2002-09-04 00:42:57

by William Lee Irwin III

Subject: Re: 2.5.33-mm1

On Mon, Sep 02, 2002 at 09:16:44PM -0700, Andrew Morton wrote:
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.33/2.5.33-mm1/
> Seven new patches - mostly just code cleanups.
> +slablru-speedup.patch
> A patch to improve slablru cpu efficiency. Ed is
> redoing this.

count_list() appears to be the largest consumer of cpu after this is
done, or so say the profiles after running updatedb by hand on
2.5.33-mm1 on a 900MHz P-III T21 Thinkpad with 256MB of RAM.

4608 __rdtsc_delay 164.5714
2627 __generic_copy_to_user 36.4861
2401 count_list 42.8750
1415 find_inode_fast 29.4792
1325 do_anonymous_page 3.3801

It also looks like there's either a bit of internal fragmentation or a
missing kmem_cache_reap() somewhere:

ext3_inode_cache: 20001KB 51317KB 38.97
dentry_cache: 4734KB 18551KB 25.52
radix_tree_node: 1811KB 1923KB 94.20
buffer_head: 1132KB 1378KB 82.12

It does stay quite a bit more nicely bounded than without slablru though.

Maybe it's old news. Just thought I'd try running a test on something tiny
for once. (new kbd/mouse config options were a PITA BTW)


Cheers,
Bill

2002-09-04 00:49:10

by Rik van Riel

Subject: Re: 2.5.33-mm1

On Tue, 3 Sep 2002, William Lee Irwin III wrote:

> count_list() appears to be the largest consumer of cpu after this is
> done, or so say the profiles after running updatedb by hand on
> 2.5.33-mm1 on a 900MHz P-III T21 Thinkpad with 256MB of RAM.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> Maybe it's old news. Just thought I'd try running a test on something
> tiny for once. (new kbd/mouse config options were a PITA BTW)

You've got an interesting idea of tiny ;)

Somehow I have the idea that the Linux users with 64 MB
of RAM or less have _more_ memory together than what's
present in all the >8GB Linux servers together...

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-04 01:10:55

by Andrew Morton

Subject: Re: 2.5.33-mm1

William Lee Irwin III wrote:
>
> On Mon, Sep 02, 2002 at 09:16:44PM -0700, Andrew Morton wrote:
> > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.33/2.5.33-mm1/
> > Seven new patches - mostly just code cleanups.
> > +slablru-speedup.patch
> > A patch to improve slablru cpu efficiency. Ed is
> > redoing this.
>
> count_list() appears to be the largest consumer of cpu after this is
> done, or so say the profiles after running updatedb by hand on
> 2.5.33-mm1 on a 900MHz P-III T21 Thinkpad with 256MB of RAM.

That's my /proc/meminfo:buffermem counter-upper. I said it would suck ;)
Probably count_list(&inode_unused) can just be nuked. I don't think blockdev
inodes ever go onto inode_unused.
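
(For the record, count_list() is a linear walk over an inode list -- the
sketch below is a guessed reconstruction, not the actual buffermem patch
code, but it shows why the cost scales with the number of cached inodes,
which is enormous right after an updatedb run:)

/*
 * Hypothetical sketch: sum the pagecache pages attached to every
 * inode on a list.  Caller is presumed to hold inode_lock, and the
 * walk is O(number of inodes on the list) per call.
 */
static unsigned long count_list(struct list_head *head)
{
        struct list_head *tmp;
        unsigned long pages = 0;

        list_for_each(tmp, head) {
                struct inode *inode = list_entry(tmp, struct inode, i_list);

                pages += inode->i_mapping->nrpages;
        }
        return pages;
}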

> 4608 __rdtsc_delay 164.5714
> 2627 __generic_copy_to_user 36.4861
> 2401 count_list 42.8750
> 1415 find_inode_fast 29.4792
> 1325 do_anonymous_page 3.3801
>
> It also looks like there's either a bit of internal fragmentation or a
> missing kmem_cache_reap() somewhere:
>
> ext3_inode_cache: 20001KB 51317KB 38.97
> dentry_cache: 4734KB 18551KB 25.52
> radix_tree_node: 1811KB 1923KB 94.20
> buffer_head: 1132KB 1378KB 82.12

That's really outside the control of slablru. It's determined
by the cache-specific LRU algorithms, and the allocation order.

You'll need to look at the second-last and third-last columns in
/proc/slabinfo (boy I wish that thing had a heading line, or a nice
program to interpret it):

ext3_inode_cache 959 2430 448 264 270 1

That's 264 pages in use, 270 total. If there's a persistent gap between
these then there is a problem - could well be that slablru is not locating
the pages which were liberated by the pruning sufficiently quickly.

Calling kmem_cache_reap() after running the pruners will fix that up.
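
(A strawman for that "nice program to interpret it": a minimal user-space
sketch, assuming the 2.4/2.5-style /proc/slabinfo line format shown above
-- name, active objects, total objects, object size, active slabs, total
slabs, pages per slab -- which just reports the in-use/allocated page gap
for each cache:)

#include <stdio.h>

int main(void)
{
        char line[256], name[64];
        unsigned long act_objs, num_objs, objsize;
        unsigned long act_slabs, num_slabs, pps;
        FILE *f = fopen("/proc/slabinfo", "r");

        if (!f) {
                perror("/proc/slabinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                /* skip the version header and any line we can't parse */
                if (sscanf(line, "%63s %lu %lu %lu %lu %lu %lu",
                           name, &act_objs, &num_objs, &objsize,
                           &act_slabs, &num_slabs, &pps) != 7)
                        continue;
                if (num_slabs > act_slabs)
                        printf("%-20s %6lu pages in use, %6lu allocated\n",
                               name, act_slabs * pps, num_slabs * pps);
        }
        fclose(f);
        return 0;
}

Against the numbers above it would report ext3_inode_cache as 264 pages
in use, 270 allocated.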

2002-09-04 01:17:30

by William Lee Irwin III

Subject: Re: 2.5.33-mm1

William Lee Irwin III wrote:
>> It also looks like there's either a bit of internal fragmentation or a
>> missing kmem_cache_reap() somewhere:
>> ext3_inode_cache: 20001KB 51317KB 38.97
>> dentry_cache: 4734KB 18551KB 25.52
>> radix_tree_node: 1811KB 1923KB 94.20
>> buffer_head: 1132KB 1378KB 82.12

On Tue, Sep 03, 2002 at 06:13:17PM -0700, Andrew Morton wrote:
> That's really outside the control of slablru. It's determined
> by the cache-specific LRU algorithms, and the allocation order.
> You'll need to look at the second-last and third-last columns in
> /proc/slabinfo (boy I wish that thing had a heading line, or a nice
> program to interpret it):
> ext3_inode_cache 959 2430 448 264 270 1
> That's 264 pages in use, 270 total. If there's a persistent gap between
> these then there is a problem - could well be that slablru is not locating
> the pages which were liberated by the pruning sufficiently quickly.
> Calling kmem_cache_reap() after running the pruners will fix that up.

# grep ext3_inode_cache /proc/slabinfo
ext3_inode_cache 18917 87012 448 7686 9668 1
...
ext3_inode_cache: 8098KB 38052KB 21.28

Looks like a persistent gap from here.


Cheers,
Bill

2002-09-04 01:37:16

by Andrew Morton

Subject: Re: 2.5.33-mm1

William Lee Irwin III wrote:
>
> ...
> > Calling kmem_cache_reap() after running the pruners will fix that up.
>
> # grep ext3_inode_cache /proc/slabinfo
> ext3_inode_cache 18917 87012 448 7686 9668 1
> ...
> ext3_inode_cache: 8098KB 38052KB 21.28
>
> Looks like a persistent gap from here.

OK, thanks. We need to reap those pages up-front rather than waiting
for them to come to the tail of the LRU.

What on earth is going on with kmem_cache_reap? Am I missing
something, or is that thing 700% overdesigned? Why not just
free the darn pages in kmem_cache_free_one()? Maybe hang onto
a few pages for cache warmth, but heck.
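
(Sketching that suggestion -- this is not the actual 2.5 slab code;
slab_of(), put_object_back(), cachep->warm_slabs and the destroy call
are stand-ins.  The idea is just: if a free leaves its slab completely
empty, hand the pages straight back, keeping at most a couple of empty
slabs around for warmth.)

#define WARM_SLABS_MAX  2       /* arbitrary small warm reserve */

static void cache_free_one(kmem_cache_t *cachep, void *objp)
{
        slab_t *slabp = slab_of(objp);          /* stand-in helper */

        put_object_back(cachep, slabp, objp);   /* stand-in helper */
        if (slabp->inuse)
                return;                         /* slab still has live objects */

        if (cachep->warm_slabs < WARM_SLABS_MAX) {
                cachep->warm_slabs++;           /* keep it for cache warmth */
        } else {
                list_del(&slabp->list);
                kmem_slab_destroy(cachep, slabp); /* give the pages back now */
        }
}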

2002-09-04 02:48:57

by Ed Tomlinson

Subject: Re: 2.5.33-mm1

On September 3, 2002 09:13 pm, Andrew Morton wrote:
> ext3_inode_cache 959 2430 448 264 270 1
>
> That's 264 pages in use, 270 total. If there's a persistent gap between
> these then there is a problem - could well be that slablru is not locating
> the pages which were liberated by the pruning sufficiently quickly.

Sufficiently quickly is a relative thing. It could also be that by the time
the pages are reclaimed another <n> have been cleaned. IMO it's no worse
than having freeable pages on the lru from any other source. If we get close
to oom we will call kmem_cache_reap, otherwise we let the lru find the pages.

> Calling kmem_cache_reap() after running the pruners will fix that up.

More specifically, kmem_cache_reap will clean the one cache with the most
free pages...

>What on earth is going on with kmem_cache_reap? Am I missing
>something, or is that thing 700% overdesigned? Why not just
>free the darn pages in kmem_cache_free_one()? Maybe hang onto
>a few pages for cache warmth, but heck.

This might be as simple as the fact that we can see the free pages in
slabs. We cannot see other freeable pages on the lru. This makes slabs
seem like a problem - just because we can see them.

On the other hand we could set things up to call __kmem_cache_shrink_locked
after pruning a cache - as it is now this will use page_cache_release
to free the pages... We would need to be careful coding this though.

Andrew, you stated that we need to consider dcache and icache pages
as very important ones. I submit that this is what slablru is doing.
It is keeping more of these objects around than the previous design,
which is what you wanted to see happen.

Still working on a good reply to your design suggestions/questions.

Ed


2002-09-04 02:52:39

by Ed Tomlinson

Subject: Re: 2.5.33-mm1

On September 3, 2002 09:15 pm, William Lee Irwin III wrote:
> William Lee Irwin III <[email protected]>

What are the numbers telling you? Is your test faster or slower
with slablru? Does it page more or less? Is looking at the number
of objects the way to determine if slablru is helping? I submit
the paging and runtimes are much better indications. What story
do they tell?

Thanks
Ed

2002-09-04 02:57:19

by William Lee Irwin III

Subject: Re: 2.5.33-mm1

On September 3, 2002 09:15 pm, William Lee Irwin III wrote:
>> William Lee Irwin III <[email protected]>
[something must have gotten snipped]

On Tue, Sep 03, 2002 at 10:55:43PM -0400, Ed Tomlinson wrote:
> What are the numbers telling you? Is your test faster or slower
> with slablru? Does it page more or less? Is looking at the number
> of objects the way to determine if slablru is helping? I submit
> the paging and runtimes are much better indications. What story
> do they tell?

Everything else is pretty much fine-tuning. Prior to this there was
zero control exerted over the things. Now it's much better behaved
with far less "swapping while buttloads of instantly reclaimable slab
memory is available" going on. Almost no swapping out of user memory
in favor of bloated slabs.

It's really that binary distinction that's most visible.


Cheers,
Bill

2002-09-04 03:16:56

by Andrew Morton

Subject: Re: 2.5.33-mm1

Ed Tomlinson wrote:
>
> On September 3, 2002 09:13 pm, Andrew Morton wrote:
>
> > ext3_inode_cache 959 2430 448 264 270 1
> >
> > That's 264 pages in use, 270 total. If there's a persistent gap between
> > these then there is a problem - could well be that slablru is not locating
> > the pages which were liberated by the pruning sufficiently quickly.
>
> Sufficiently quickly is a relative thing.

Those pages are useless! It's silly having slab hanging onto them
while we go and reclaim useful pagecache instead.

I *really* think we need to throw away those pages instantly.

The only possible reason for hanging onto them is because they're
cache-warm. And we need a global-scope cpu-local hot pages queue
anyway.

And once we have that, slab _must_ release its warm pages into it.
It's counterproductive for slab to hang onto warm pages when, say,
a pagefault needs one.

> It could also be that by the time the
> pages are reclaimed another <n> have been cleaned. IMO it's no worse than
> having freeable pages on the lru from any other source. If we get close to oom
> we will call kmem_cache_reap, otherwise we let the lru find the pages.

As I say, by not releasing those (useless to slab) pages, we're causing
other (useful) stuff to be reclaimed.
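
(A rough sketch of the sort of global-scope, cpu-local hot pages queue
meant here -- illustrative only, not real kernel code; the names and
sizes are invented, and a real version would need preemption disabled
around the per-CPU access:)

#define HOT_QUEUE_PAGES 16      /* arbitrary per-CPU reserve */

struct hot_queue {
        int count;
        struct page *page[HOT_QUEUE_PAGES];
};

static struct hot_queue hot_queue[NR_CPUS];

/* Anyone releasing a cache-warm page (slab included) drops it here
 * instead of keeping it to itself. */
void free_hot_page(struct page *page)
{
        struct hot_queue *hq = &hot_queue[smp_processor_id()];

        if (hq->count < HOT_QUEUE_PAGES)
                hq->page[hq->count++] = page;
        else
                __free_page(page);      /* queue full: back to the buddy */
}

/* A pagefault (or any order-0 allocation) takes from the warm queue
 * first and falls back to the page allocator. */
struct page *alloc_hot_page(void)
{
        struct hot_queue *hq = &hot_queue[smp_processor_id()];

        if (hq->count)
                return hq->page[--hq->count];
        return alloc_page(GFP_HIGHUSER);
}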

2002-09-04 19:21:06

by Stephen C. Tweedie

Subject: Re: 2.5.33-mm1

Hi,

On Tue, Sep 03, 2002 at 08:33:37PM -0700, Andrew Morton wrote:

> I *really* think we need to throw away those pages instantly.
>
> The only possible reason for hanging onto them is because they're
> cache-warm. And we need a global-scope cpu-local hot pages queue
> anyway.

Yep --- except for caches with constructors, for which we do save a
bit more by hanging onto the pages for longer.

--Stephen

2002-09-04 20:15:48

by Andrew Morton

Subject: Re: 2.5.33-mm1

"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Tue, Sep 03, 2002 at 08:33:37PM -0700, Andrew Morton wrote:
>
> > I *really* think we need to throw away those pages instantly.
> >
> > The only possible reason for hanging onto them is because they're
> > cache-warm. And we need a global-scope cpu-local hot pages queue
> > anyway.
>
> Yep --- except for caches with constructors, for which we do save a
> bit more by hanging onto the pages for longer.

Ah, of course. Thanks.

We'll still have a significant volume of pre-constructed objects
in the partially-full slabs: it seems that these things are fairly
prone to internal fragmentation, which works to our advantage in
this case.

So yes, perhaps we need to hang onto some preconstructed pages
for these slabs, if the internal fragmentation of the existing
part-filled slabs is low.
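
(Back-of-envelope, using the ext3_inode_cache line quoted earlier in the
thread -- 959 active objects out of 2430, 448-byte objects, 264 of 270
single-page slabs in use -- and assuming 4096-byte pages:)

        objects per slab          = 2430 / 270 = 9     (9 * 448 = 4032, fits in one page)
        objects in use per slab   =  959 / 264 ~ 3.6   (averaged over the active slabs)
        constructed-but-free objs =    9 - 3.6 ~ 5.4   per part-filled page

So even if every completely-empty slab page were handed back immediately,
each part-filled page would still carry around five pre-constructed free
objects, on average.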