2002-09-12 06:09:07

by Andrew Morton

Subject: 2.5.34-mm2


url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.34/2.5.34-mm2/

-throttling-fix.patch
-sleeping-release_page.patch
-dirty-state-accounting.patch
-discontig-cleanup-1.patch
-discontig-cleanup-2.patch
-writeback-thresholds.patch
-buffer-strip.patch
-rmap-speedup.patch
-wli-highpte.patch

Merged

-lpp2.patch

Folded into lpp.patch - hugetlb fixes

+lpp-update.patch

More hugetlb fixes from Rohit.

+pf_nowarn.patch

Prevent some `page allocation failure' warnings which aren't supposed
to come out.
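
For illustration, the shape of the mechanism (a sketch, not the patch
itself): a caller which can cope with allocation failure sets PF_NOWARN
around the attempt, and the allocator's failure path checks the flag
before printing.  The helper name below is hypothetical.

    /* Hypothetical caller which tolerates failure and wants no warning. */
    static struct page *try_alloc_quietly(unsigned int gfp_mask, unsigned int order)
    {
            unsigned long old_flags = current->flags;
            struct page *page;

            current->flags |= PF_NOWARN;    /* suppress the allocator's printk */
            page = alloc_pages(gfp_mask, order);
            current->flags = old_flags;     /* restore the caller's flags */

            return page;                    /* may be NULL; the caller falls back */
    }

    /* Allocator failure path: only warn when the caller didn't opt out. */
    if (!(current->flags & PF_NOWARN))
            printk(KERN_WARNING "%s: page allocation failure."
                   " order:%d, mode:0x%x\n",
                   current->comm, order, gfp_mask);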

+jeremy.patch

Spel Jermy's naim wright

-segq.patch

SEGQ had an interaction with the dirty memory management. This interaction
was the source of Badari's IO bandwidth regression. Removed until I have
time to poke at it.

+wake-speedup.patch

Badari's pagecache writeout is back up to 270 megs/sec. The CPUs are pegged
and the hottest functions are

5348 __wake_up 111.4167
6954 unlock_page 72.4375
187676 generic_file_write_nolock 71.9617
9577 __scsi_end_request 54.4148

I cannot reproduce these profiles with mortal numbers of hard disks, but
the wakeup code can be sped up heaps.

The patch implements a new wait/wakeup mechanism which removes wait_queues
from wait_queue_head's within __wake_up(), rather than within the woken
process.
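
A sketch of the waiter side under that scheme, in the
prepare_to_wait()/finish_wait() style of interface this work introduced
(the queue and the event_pending() predicate are hypothetical):

    static wait_queue_head_t my_waitq;      /* hypothetical queue */

    static void wait_for_event(void)
    {
            DEFINE_WAIT(wait);      /* entry whose wake function dequeues it */

            for (;;) {
                    prepare_to_wait(&my_waitq, &wait, TASK_UNINTERRUPTIBLE);
                    if (event_pending())            /* hypothetical predicate */
                            break;
                    schedule();
            }
            finish_wait(&my_waitq, &wait);  /* usually a no-op: the waker
                                               already unhooked the entry */
    }

The waker side is unchanged for callers: wake_up() still does the job,
but __wake_up() now removes each entry from the wait_queue_head as it
wakes it, so repeated wakeups stop re-walking chains of already-woken
waiters.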

+buddyinfo.patch

/proc/buddyinfo - stats on free page fragmentation.
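
Reading it: one line per zone, with column N giving the number of free
blocks of 2^N contiguous pages.  Illustrative output (the exact layout in
this patch may differ from what later went mainline):

    $ cat /proc/buddyinfo
    Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
    Node 0, zone   Normal    145     52     30     14     12      9      4      2      1      1      0

A badly fragmented zone shows large counts in the low orders and zeros on
the right: plenty of free pages, but nothing contiguous.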

+free_area.patch

Nail another gratuitous typedef

+radix_tree_gang_lookup.patch

Multipage pagecache scan and lookup.

+truncate_inode_pages.patch

Redo the truncate/invalidate code to use gang lookups.
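
The shape of the new loop (a sketch under assumed names, not the patch
itself): one radix-tree descent pulls out a batch of pages, which are
pinned and truncated together, instead of one tree probe per index.  The
lock and per-page helper names below are illustrative for this era.

    #define GANG_BATCH 16

    static void truncate_pages_sketch(struct address_space *mapping,
                                      unsigned long start)
    {
            struct page *pages[GANG_BATCH];
            unsigned long next = start;
            unsigned int i, nr;

            for (;;) {
                    read_lock(&mapping->page_lock);
                    nr = radix_tree_gang_lookup(&mapping->page_tree,
                                    (void **)pages, next, GANG_BATCH);
                    for (i = 0; i < nr; i++)
                            page_cache_get(pages[i]);   /* pin before unlocking */
                    read_unlock(&mapping->page_lock);
                    if (!nr)
                            break;

                    for (i = 0; i < nr; i++) {
                            struct page *page = pages[i];

                            next = page->index + 1;
                            lock_page(page);
                            truncate_complete_page(mapping, page);  /* per-page work */
                            unlock_page(page);
                            page_cache_release(page);
                    }
            }
    }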

linus.patch
cset-1.568.17.13-to-1.648.txt.gz

scsi_hack.patch
Fix block-highmem for scsi

ext3-htree.patch
Indexed directories for ext3

spin-lock-check.patch
spinlock/rwlock checking infrastructure

rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)

readv-writev.patch
O_DIRECT support for readv/writev

llzpr.patch
Reduce scheduling latency across zap_page_range

buffermem.patch
Resurrect buffermem accounting

lpp.patch
ia32 huge tlb pages

lpp-update.patch
hugetlbpage fixes

sharedmem.patch
Add /proc/meminfo:Mapped - the amount of memory which is mapped into pagetables

ext3-sb.patch
u.ext3_sb -> generic_sbp

oom-fix.patch
Fix an OOM condition on big highmem machines

tlb-cleanup.patch
Clean up the tlb gather code

dump-stack.patch
arch-neutral dump_stack() function

wli-cleanup.patch
random cleanups

madvise-move.patch
move madvise implementation into mm/madvise.c

split-vma.patch
VMA splitting patch

mmap-fixes.patch
mmap.c cleanup and lock ranking fixes

buffer-ops-move.patch
Move submit_bh() and ll_rw_block() into fs/buffer.c

slab-stats.patch
Display total slab memory in /proc/meminfo

writeback-control.patch
Cleanup and extension of the writeback paths

free_area_init-cleanup.patch
free_area_init() code cleanup

alloc_pages-cleanup.patch
alloc_pages cleanup and optimisation

statm_pgd_range-sucks.patch
Remove the pagetable walk from /proc/<pid>/statm

remove-sync_thresh.patch
Remove /proc/sys/vm/dirty_sync_thresh

pf_nowarn.patch
Fix up the handling of PF_NOWARN

jeremy.patch
Spel Jermy's naim wright

queue-congestion.patch
Infrastructure for communicating request queue congestion to the VM

nonblocking-ext2-preread.patch
avoid ext2 inode prereads if the queue is congested

nonblocking-pdflush.patch
non-blocking writeback infrastructure, use it for pdflush

nonblocking-vm.patch
Non-blocking page reclaim

wake-speedup.patch
Faster wakeup code

sync-helper.patch
Speed up sys_sync() against multiple spindles

slabasap.patch
Early and smarter shrinking of slabs

write-deadlock.patch
Fix the generic_file_write-from-same-mmapped-page deadlock

buddyinfo.patch
Add /proc/buddyinfo - stats on the free pages pool

free_area.patch
Remove struct free_area_struct and free_area_t, use `struct free_area'

radix_tree_gang_lookup.patch
radix tree gang lookup

truncate_inode_pages.patch
truncate/invalidate_inode_pages rewrite


2002-09-15 03:38:54

by Daniel Phillips

Subject: Re: 2.5.34-mm2

On Thursday 12 September 2002 08:29, Andrew Morton wrote:
> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.34/2.5.34-mm2/
>
> -sleeping-release_page.patch

What's this one? Couldn't find it as a broken-out patch.

On the nonblocking vm front, does it rule or suck? I heard you
mention, on the one hand, huge speedups on some load (dbench I think)
but your in-patch comments mention slowdown by 1.7X on kernel
compile.

--
Daniel

2002-09-15 03:51:49

by Andrew Morton

Subject: Re: 2.5.34-mm2

Daniel Phillips wrote:
>
> On Thursday 12 September 2002 08:29, Andrew Morton wrote:
> > url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.34/2.5.34-mm2/
> >
> > -sleeping-release_page.patch
>
> What's this one? Couldn't find it as a broken-out patch.

The `-' means it was removed from the patchset. Linus merged it.
See 2.5.34/2.5.34-mm1/broken-out/sleeping-release_page.patch

> On the nonblocking vm front, does it rule or suck?

It rules, until someone finds something at which it sucks.

> I heard you
> mention, on the one hand, huge speedups on some load (dbench I think)
> but your in-patch comments mention slowdown by 1.7X on kernel
> compile.

You misread. Relative times for running `make -j6 bzImage' with mem=512m:

Unloaded system: 1.0
2.5.34-mm4, while running 4 x `dbench 100' 1.7
Any other kernel while running 4 x `dbench 100' basically infinity

2002-09-15 04:16:04

by Daniel Phillips

Subject: Re: 2.5.34-mm2

On Sunday 15 September 2002 06:12, Andrew Morton wrote:
> Daniel Phillips wrote:
> > I heard you
> > mention, on the one hand, huge speedups on some load (dbench I think)
> > but your in-patch comments mention slowdown by 1.7X on kernel
> > compile.
>
> You misread. Relative times for running `make -j6 bzImage' with mem=512m:
>
> Unloaded system: 1.0
> 2.5.34-mm4, while running 4 x `dbench 100' 1.7
> Any other kernel while running 4 x `dbench 100' basically infinity

Oh good :-)

We can make the rescanning go away in time, with more lru lists, but
that sure looks like the low hanging fruit.

--
Daniel

2002-09-15 05:16:37

by Andrew Morton

Subject: Re: 2.5.34-mm2

Daniel Phillips wrote:
>
> On Sunday 15 September 2002 06:12, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > > I heard you
> > > mention, on the one hand, huge speedups on some load (dbench I think)
> > > but your in-patch comments mention slowdown by 1.7X on kernel
> > > compile.
> >
> > You misread. Relative times for running `make -j6 bzImage' with mem=512m:
> >
> > Unloaded system: 1.0
> > 2.5.34-mm4, while running 4 x `dbench 100' 1.7
> > Any other kernel while running 4 x `dbench 100' basically infinity
>
> Oh good :-)
>
> We can make the rescanning go away in time, with more lru lists,

We don't actually need more lists, I expect. Dirty and under writeback
pages just don't go on a list at all - cut them off the LRU and
bring them back at IO completion. We can't do anything useful with
a list of dirty/writeback pages anyway, so why have the list?

It kind of depends whether we want to put swapcache on that list. I
may just give swapper_inode a superblock and let pdflush write swap.

The interrupt-time page motion is of course essential if we are to
avoid long scans of that list.
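
A minimal sketch of that idea, with entirely hypothetical helpers and
approximate field names (this is a design being floated in the mail
above, not code that exists anywhere):

    /* When writeback starts, take the page off the LRU entirely. */
    static void writeback_unlink_lru(struct zone *zone, struct page *page)
    {
            spin_lock_irq(&zone->lru_lock);
            list_del_init(&page->lru);
            spin_unlock_irq(&zone->lru_lock);
    }

    /* At IO completion (interrupt time), the now-clean page rejoins the
     * inactive list, where the scanner can reclaim it immediately. */
    static void writeback_relink_lru(struct zone *zone, struct page *page)
    {
            unsigned long flags;

            spin_lock_irqsave(&zone->lru_lock, flags);
            list_add(&page->lru, &zone->inactive_list);
            spin_unlock_irqrestore(&zone->lru_lock, flags);
    }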

That, and replacing the blk_congestion_wait() throttling with a per-classzone
wait_for_some_pages_to_come_clean() throttling pretty much eliminates the
remaining pointless scan activity from the VM, and fixes a current false OOM
scenario in -mm4.

> but that sure looks like the low hanging fruit.

It's low alright. AFAIK Linux has always had this problem of
seizing up when there's a lot of dirty data around.

Let me quantify infinity:


With mem=512m, on the quad:

`make -j6 bzImage' takes two minutes and two seconds.

On 2.5.34, a concurrent 4 x `dbench 100' slows that same kernel
build down to 35 minutes and 16 seconds.

On 2.5.34-mm4, while running 4 x `dbench 100' that kernel build
takes three minutes and 45 seconds.



That's with seven disks: four for the dbenches, one for the kernel
build, one for swap and one for the executables. Things would be
worse with fewer disks because of seek contention. But that's
to be expected. The intent of this work is to eliminate this
crosstalk between different activities. And to avoid blocking things
which aren't touching disk at all.

2002-09-15 14:53:57

by Rik van Riel

Subject: Re: 2.5.34-mm2

On Sat, 14 Sep 2002, Andrew Morton wrote:
> Daniel Phillips wrote:

> > but that sure looks like the low hanging fruit.
>
> It's low alright. AFAIK Linux has always had this problem of
> seizing up when there's a lot of dirty data around.

Somehow I doubt the "seizing up" problem is caused by too much
scanning. In fact, I'm pretty convinced it is caused by having
too much IO submitted at once (and stalling in __get_request_wait).

The scanning is probably not relevant at all and it may be
beneficial to just ignore the scanning for now and do our best
to keep the pages in better LRU order.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Spamtraps of the month: [email protected] [email protected]

2002-09-15 16:53:03

by Andrew Morton

Subject: Re: 2.5.34-mm2

Rik van Riel wrote:
>
> On Sat, 14 Sep 2002, Andrew Morton wrote:
> > Daniel Phillips wrote:
>
> > > but that sure looks like the low hanging fruit.
> >
> > It's low alright. AFAIK Linux has always had this problem of
> > seizing up when there's a lot of dirty data around.
>
> Somehow I doubt the "seizing up" problem is caused by too much
> scanning. In fact, I'm pretty convinced it is caused by having
> too much IO submitted at once (and stalling in __get_request_wait).

Yes, the latency is due to request queue contention.

Dirty data reaches the tail of the LRU and "innocent" processes are
forced to write it. But the queue is full. They sleep until 32
requests are free. They wake; but so does the heavy dirtier. The
heavy dirtier immediately fills the queue again. The innocent
page allocator finds some more dirty data. Repeat.

It's DoS-via-request queue. It's made worse by the fact that
kswapd is also DoS'ed, so pretty much all tasks need to perform
direct reclaim.

There are also latency problems, with similar causes, when page-allocating
processes encounter under-writeback pages at the tail of the LRU, but
this happens less often.

> The scanning is probably not relevant at all and it may be
> beneficial to just ignore the scanning for now and do our best
> to keep the pages in better LRU order.
>

Yes, I'm not particularly fussed about (moderate) excess CPU use in these
situations, and nor about page replacement accuracy, really - pages
are being slushed through the system so fast that correct aging of the
ones on the inactive list probably just doesn't count.

The use of "how much did we scan" to determine when we're out
of memory is a bit of a problem; but the main problem (of which
I'm aware) is that the global throttling via blk_congestion_wait()
is not a sufficiently accurate indication that "pages came clean
in ZONE_NORMAL" on big highmem boxes.

Processes which are performing GFP_KERNEL allocations can keep
on getting woken up for ZONE_HIGHMEM completion, and they eventually
decide it's OOM. This has only been observed when the dirty memory
limits are manually increased a lot, but it points to a design problem.

I don't know what's going on in `contest', nor in Alex's X build. We'll
see...

2002-09-15 17:00:44

by Daniel Phillips

Subject: Re: 2.5.34-mm2

On Sunday 15 September 2002 19:13, Andrew Morton wrote:
> Yes, I'm not particularly fussed about (moderate) excess CPU use in these
> situations, and nor about page replacement accuracy, really - pages
> are being slushed through the system so fast that correct aging of the
> ones on the inactive list probably just doesn't count.

What you really mean is, it hasn't gotten to the top of the list
of things that suck. When we do get around to fashioning a really
effective page ager (LRU-er, more likely) the further improvement
will be obvious, especially under heavy streaming IO load, which
is getting more important all the time.

--
Daniel