2002-10-01 09:27:05

by Andrew Morton

Subject: 2.5.40-mm1


url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.40/2.5.40-mm1/

Mainly a resync.

- A few minor problems in the per-cpu-pages code have been fixed.

- Updated dcache RCU code.

- Significant brain surgery on the SARD patch.

- Decreased the disk scheduling tunable `fifo_batch' from 32 to 16 to
improve disk read latency.

- Updated ext3 htree patch from Ted.

- Included a patch from Mala Anand which _should_ speed up kernel<->userspace
memory copies for Intel ia32 hardware. But I can't measure any difference
with poorly-aligned pagecache copies.


-scsi_hack.patch
-might_sleep-2.patch
-slab-fix.patch
-hugetlb-doc.patch
-get_user_pages-PG_reserved.patch
-move_one_page_fix.patch
-zab-list_heads.patch
-remove-gfp_nfs.patch
-buddyinfo.patch
-free_area.patch
-per-node-kswapd.patch
-topology-api.patch
-topology_fixes.patch

Merged

+misc.patch

Trivia

+ioperm-fix.patch

Fix the sys_ioperm() might-sleep-while-atomic bug

-sard.patch
+bd-sard.patch

Somewhat rewritten to not key everything off minors and majors - use
pointers instead.

+bio-get-nr-vecs.patch

use bio_get_nr_vecs in fs/mpage.c

+dio-nr-segs.patch

use bio_get_nr_vecs in fs/direct-io.c

-per-node-zone_normal.patch
+per-node-mem_map.patch

Renamed

+free_area_init-cleanup.patch

Clean up some mm init code.

+intel-user-copy.patch

Supposedly faster copy_*_user.



ext3-dxdir.patch
ext3 htree

spin-lock-check.patch
spinlock/rwlock checking infrastructure

rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)

misc.patch
misc

write-deadlock.patch
Fix the generic_file_write-from-same-mmapped-page deadlock

ioperm-fix.patch
sys_ioperm() atomicity fix

radix_tree_gang_lookup.patch
radix tree gang lookup

truncate_inode_pages.patch
truncate/invalidate_inode_pages rewrite

proc_vmstat.patch
Move the vm accounting out of /proc/stat

kswapd-reclaim-stats.patch
Add kswapd_steal to /proc/vmstat

iowait.patch
I/O wait statistics

bd-sard.patch

dio-bio-add-page.patch
Use bio_add_page() in direct-io.c

tcp-wakeups.patch
Use fast wakeups in TCP/IPV4

swapoff-deadlock.patch
Fix a tmpfs swapoff deadlock

dirty-and-uptodate.patch
page state cleanup

shmem_rename.patch
shmem_rename() directory link count fix

dirent-size.patch
tmpfs: show a non-zero size for directories

tmpfs-trivia.patch
tmpfs: small fixlets

per-zone-vm.patch
separate the kswapd and direct reclaim code paths

swsusp-feature.patch
add shrink_all_memory() for swsusp

bio-get-nr-vecs.patch
use bio_get_nr_vecs() in fs/mpage.c

dio-nr-segs.patch
Use bio_get_nr_vecs() in direct-io.c

remove-page-virtual.patch
remove page->virtual for !WANT_PAGE_VIRTUAL

dirty-memory-clamp.patch
sterner dirty-memory clamping

mempool-wakeup-fix.patch
Fix for stuck tasks in mempool_alloc()

remove-write_mapping_buffers.patch
Remove write_mapping_buffers

buffer_boundary-scheduling.patch
IO scheduling for indirect blocks

ll_rw_block-cleanup.patch
cleanup ll_rw_block()

lseek-ext2_readdir.patch
remove lock_kernel() from ext2_readdir()

discontig-no-contig_page_data.patch
undefine contig_page_data for discontigmem

per-node-mem_map.patch
ia32 NUMA: per-node ZONE_NORMAL

alloc_pages_node-cleanup.patch
alloc_pages_node cleanup

free_area_init-cleanup.patch
free_area_init_node cleanup

batched-slab-asap.patch
batched slab shrinking

akpm-deadline.patch
deadline scheduler tweaks

rmqueue_bulk.patch
bulk page allocator

free_pages_bulk.patch
Bulk page freeing function

hot_cold_pages.patch
Hot/Cold pages and zone->lock amortisation
readahead-cold-pages.patch
Use cache-cold pages for pagecache reads.

pagevec-hot-cold-hint.patch
hot/cold hints for truncate and page reclaim

intel-user-copy.patch

read_barrier_depends.patch
extended barrier primitives

rcu_ltimer.patch
RCU core

dcache_rcu.patch
Use RCU for dcache


2002-10-09 23:19:54

by Mala Anand

Subject: Re: 2.5.40-mm1


Andrew Morton wrote:

>So. Patch is a huge win as-is. For the PIII it looks like we need
>to enable it at all alignments except mod32. And we need to test
>with aligned dest, unaligned source.

Pentium III (Coppermine) 997MHz 2-way
Read from pagecache to user buffer misaligning the source
The copy size is 262144 bytes and each test performs 16384 iterations.
Patch++ - uses copy_user_int if size > 64
Patch - uses copy_user_int if size > 64, or src and dst
are not aligned on an 8-byte boundary

dst aligned on a 4k boundary and src misaligned

        2.5.40      2.5.40+patch  2.5.40+patch++
Align   throughput  throughput    throughput
(bytes) KB/sec      KB/sec        KB/sec
0       275592      281356        285567
1       124266      197361
2       120157      200270
4       125935      197558
8       157244      156655        162189
16      167296      173202        173702
32      283731      285222        290810

Looks like the patch can be used for all the above tested
alignments on Pentium III.
>Can you please do some P4 testing?

P4 Xeon CPU 1.50 GHz 4-way - hyperthreading disabled
Src is aligned and dst is misaligned as follows:

Dst     2.5.40      2.5.40+patch  2.5.40+patch++
Align   throughput  throughput    throughput
(bytes) KB/sec      KB/sec        KB/sec
0       1360071     1314783       912359
1       323674      340447
2       329202      336425
4       512955      693170
8       523223      615097        506641
12      517184      558701        553700
16      966598      872080        932736
32      846937      838514        845178

I see too much variance in the test results so I ran
each test 3 times. I tried increasing the iterations
but it did not reduce the variance.

Dst is aligned and src is misaligned as follows:

Src     2.5.40      2.5.40+patch
Align   throughput  throughput
(bytes) KB/sec      KB/sec
0       1275372     1029815
1       529907      511815
2       534811      530850
4       643196      627013
8       568000      626676
12      574468      658793
16      631707      635979
32      741485      592938

Since there is 5-10% variance in these tests' results, I am not
sure whether we can use this data for validation. I will try
to run this on another Pentium 4 machine.

However, I have seen that using floating-point registers instead of
integer registers on the Pentium IV improves performance considerably
at some alignments. I need to do more testing, and then I will create
a patch for the Pentium IV.

Regards,
Mala


Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail:[email protected]
http://www-124.ibm.com/developerworks/opensource/linuxperf
http://www-124.ibm.com/developerworks/projects/linuxperf
Phone:838-8088; Tie-line:678-8088




2002-10-09 23:29:56

by Andrew Morton

Subject: Re: 2.5.40-mm1

Mala Anand wrote:
>
> ...
> P4 Xeon CPU 1.50 GHz 4-way - hyperthreading disabled
> Src is aligned and dst is misaligned as follows:
>
> Dst     2.5.40      2.5.40+patch  2.5.40+patch++
> Align   throughput  throughput    throughput
> (bytes) KB/sec      KB/sec        KB/sec
> 0       1360071     1314783       912359
> 1       323674      340447
> 2       329202      336425
> 4       512955      693170
> 8       523223      615097        506641
> 12      517184      558701        553700
> 16      966598      872080        932736
> 32      846937      838514        845178

Note the tremendous slowdown which the P4 suffers when you're not
cacheline aligned. Even 32-byte-aligned is down a lot.


> I see too much variance in the test results so I ran
> each test 3 times. I tried increasing the iterations
> but it did not reduce the variance.
>
> Dst is aligned and src is misaligned as follows:
>
> Src     2.5.40      2.5.40+patch
> Align   throughput  throughput
> (bytes) KB/sec      KB/sec
> 0       1275372     1029815
> 1       529907      511815
> 2       534811      530850
> 4       643196      627013
> 8       568000      626676
> 12      574468      658793
> 16      631707      635979
> 32      741485      592938

This differs a little from my P4 testing - the rep;movsl approach
seemed OK for 8,16,32 alignment.

But still, that's something we can tune later.

>
> However I have seen using floating point registers instead of integer
> registers on Pentium IV improves performance to a greater extent on
> some alignments. I need to do more testing and then I will create a
> patch for pentium IV.

I believe there are "issues" using those registers in-kernel. Related
to the need to save/restore them, or errata; not too sure about that.