2002-09-29 20:23:26

by Andrew Morton

Subject: 2.5.39-mm1


url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.39/2.5.39-mm1/

This patchset includes the per-cpu-pages stuff which Martin and I
have been working on. This is designed to:

- increase the probability of the page allocator returning a page
which is cache-warm on the calling CPU

- amortise zone->lock contention via work batching.

- provide the basis for a page reservation API, so we can guarantee
that some troublesome inside-spinlock allocations will succeed.
Mainly pte_chains and radix_tree_nodes (haven't implemented this yet).
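
For reference, the per-CPU pool which hot_cold_pages.patch introduces
looks roughly like this (a sketch only; field names are approximate,
see mm/page_alloc.c in the patched tree):

	struct per_cpu_pages {
		int count;		/* pages currently on the list */
		int low;		/* refill from the buddy below this */
		int high;		/* drain back to the buddy above this */
		int batch;		/* chunk size for buddy transfers */
		struct list_head list;	/* the pages themselves */
	};

	struct per_cpu_pageset {
		struct per_cpu_pages pcp[2];	/* 0: hot, 1: cold */
	} ____cacheline_aligned_in_smp;

Allocation pops a page from the hot list when one is available;
otherwise rmqueue_bulk() pulls `batch' pages across from the buddy
allocator under a single zone->lock acquisition.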

I must say that based on a small amount of performance testing the
benefits of the cache warmness thing are disappointing. Maybe 1% if
you squint. Martin, could you please do a before-and-after on the
NUMAQ's, double check that it is actually doing the right thing?

And Anton, could you please test this, make sure that it fixes the
rmqueue() and __free_pages_ok() lock contention as effectively as
the old per-cpu-pages patch did? Thanks.

There is a reiserfs compilation problem at present.

Rick has a modified version of iostat. Please use that for extracting the
SARD info. Also the version at http://linux.inet.hr/ is reasonably up to date.

New versions of procps are at http://surriel.com/procps/ - the version
from cygnus CVS works for me.



+module-fix.patch

Compile fix for current Linus BK diff (Ingo)

+might_sleep-2.patch

Additional might_sleep checks

+slab-fix.patch

Fix a kmem_cache_destroy problem

+hugetlb-doc.patch

hugetlbpage docco

+get_user_pages-PG_reserved.patch

Don't bump the refcount on PageReserved pages in get_user_pages()

+move_one_page_fix.patch

kmap_atomic atomicity fix

+zab-list_heads.patch

Initialise some uninitialised VMA list_heads

+batched-slab-asap.patch

Batch up the slab shrinking work.

+rmqueue_bulk.patch
+free_pages_bulk.patch

Multipage page allocation and freeing

+hot_cold_pages.patch

Per-cpu hot-n-cold page lists.

+readahead-cold-pages.patch

Select cold pages for reading into pagecache

+pagevec-hot-cold-hint.patch

Pages which are freed by page reclaim and truncate are probably
cache-cold. Don't pollute the cache-hot pool with them.



linus.patch
cset-1.622.1.14-to-1.651.txt.gz

module-fix.patch
compile fixes from Ingo

ide-high-1.patch

scsi_hack.patch
Fix block-highmem for scsi

ext3-dxdir.patch
ext3 htree

spin-lock-check.patch
spinlock/rwlock checking infrastructure

rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)

might_sleep-2.patch
more sleep-inside-spinlock checks

slab-fix.patch
slab: put the spare page on cachep->pages_free

hugetlb-doc.patch
hugetlbpage documentation

get_user_pages-PG_reserved.patch
Check for PageReserved pages in get_user_pages()

move_one_page_fix.patch
pte_highmem atomicity fix in move_one_page()

zab-list_heads.patch
vm_area_struct list_head initialisation

remove-gfp_nfs.patch
remove GFP_NFS

buddyinfo.patch
Add /proc/buddyinfo - stats on the free pages pool

free_area.patch
Remove struct free_area_struct and free_area_t, use `struct free_area'

per-node-kswapd.patch
Per-node kswapd instance

topology-api.patch
Simple topology API

topology_fixes.patch
topology-api cleanups

write-deadlock.patch
Fix the generic_file_write-from-same-mmapped-page deadlock

radix_tree_gang_lookup.patch
radix tree gang lookup

truncate_inode_pages.patch
truncate/invalidate_inode_pages rewrite

proc_vmstat.patch
Move the vm accounting out of /proc/stat

kswapd-reclaim-stats.patch
Add kswapd_steal to /proc/vmstat

iowait.patch
I/O wait statistics

sard.patch
SARD disk accounting

dio-bio-add-page.patch
Use bio_add_page() in direct-io.c

tcp-wakeups.patch
Use fast wakeups in TCP/IPv4

swapoff-deadlock.patch
Fix a tmpfs swapoff deadlock

dirty-and-uptodate.patch
page state cleanup

shmem_rename.patch
shmem_rename() directory link count fix

dirent-size.patch
tmpfs: show a non-zero size for directories

tmpfs-trivia.patch
tmpfs: small fixlets

per-zone-vm.patch
separate the kswapd and direct reclaim code paths

swsusp-feature.patch
add shrink_all_memory() for swsusp

remove-page-virtual.patch
remove page->virtual for !WANT_PAGE_VIRTUAL

dirty-memory-clamp.patch
sterner dirty-memory clamping

mempool-wakeup-fix.patch
Fix for stuck tasks in mempool_alloc()

remove-write_mapping_buffers.patch
Remove write_mapping_buffers

buffer_boundary-scheduling.patch
IO scheduling for indirect blocks

ll_rw_block-cleanup.patch
cleanup ll_rw_block()

lseek-ext2_readdir.patch
remove lock_kernel() from ext2_readdir()

discontig-no-contig_page_data.patch
undefine contig_page_data for discontigmem

per-node-zone_normal.patch
ia32 NUMA: per-node ZONE_NORMAL

alloc_pages_node-cleanup.patch
alloc_pages_node cleanup

batched-slab-asap.patch
batched slab shrinking

akpm-deadline.patch
deadline scheduler tweaks

rmqueue_bulk.patch
bulk page allocator

free_pages_bulk.patch
Bulk page freeing function

hot_cold_pages.patch
Hot/Cold pages and zone->lock amortisation

readahead-cold-pages.patch
Use cache-cold pages for pagecache reads.

pagevec-hot-cold-hint.patch
hot/cold hints for truncate and page reclaim

read_barrier_depends.patch
extended barrier primitives (usage sketch below)

rcu_ltimer.patch
RCU core

dcache_rcu.patch
Use RCU for dcache
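
For read_barrier_depends.patch: the primitive orders a dependent
pointer dereference after the load of the pointer itself, and is a
no-op on everything except Alpha. Illustratively:

	struct foo *p;

	p = global_foo_ptr;		/* pointer published by another CPU */
	read_barrier_depends();		/* Alpha: order the dependent */
					/* dereference below after the load */
	val = p->field;

This is what the RCU patches which follow it rely upon.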


2002-09-30 01:20:35

by Ed Tomlinson

Subject: Re: 2.5.39-mm1

On September 29, 2002 04:26 pm, Andrew Morton wrote:
> There is a reiserfs compilation problem at present.

make[2]: Entering directory `/poole/src/39-mm1/fs/reiserfs'
gcc -Wp,-MD,./.bitmap.o.d -D__KERNEL__ -I/poole/src/39-mm1/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=k6 -I/poole/src/39-mm1/arch/i386/mach-generic -nostdinc -iwithprefix include -DKBUILD_BASENAME=bitmap -c -o bitmap.o bitmap.c
In file included from bitmap.c:8:
/poole/src/39-mm1/include/linux/reiserfs_fs.h:1635: parse error before `reiserfs_commit_thread_tq'
/poole/src/39-mm1/include/linux/reiserfs_fs.h:1635: warning: type defaults to `int' in declaration of `reiserfs_commit_thread_tq'
/poole/src/39-mm1/include/linux/reiserfs_fs.h:1635: warning: data definition has no type or storage class
make[2]: *** [bitmap.o] Error 1
make[2]: Leaving directory `/poole/src/39-mm1/fs/reiserfs'
make[1]: *** [reiserfs] Error 2
make[1]: Leaving directory `/poole/src/39-mm1/fs'
make: *** [fs] Error 2

which is:

extern task_queue reiserfs_commit_thread_tq ;

from the BK changes:

[email protected], 2002-09-29 11:00:25-07:00, [email protected]
[PATCH] smptimers, old BH removal, tq-cleanup

<omitted>

- removed the ability to define your own task-queue; what can be done is
to schedule_task() a given task to keventd, and to flush all pending
tasks.
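
The conversion is mechanical; judging from the changelog, something
like:

	/* old: private task queue plus a private thread to run it */
	queue_task(&ct->task, &reiserfs_commit_thread_tq);
	wake_up(&reiserfs_commit_thread_wait);

	/* new: hand the tq_struct to keventd ... */
	schedule_task(&ct->task);
	/* ... and, where the entries must have run to completion: */
	flush_scheduled_tasks();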

Ingo?

Ed Tomlinson

2002-09-30 01:30:33

by Andrew Morton

Subject: Re: 2.5.39-mm1

Ed Tomlinson wrote:
>
> On September 29, 2002 04:26 pm, Andrew Morton wrote:
> > There is a reiserfs compilation problem at present.
>
> make[2]: Entering directory `/poole/src/39-mm1/fs/reiserfs'
> gcc -Wp,-MD,./.bitmap.o.d -D__KERNEL__ -I/poole/src/39-mm1/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=k6 -I/poole/src/39-mm1/arch/i386/mach-generic -nostdinc -iwithprefix include -DKBUILD_BASENAME=bitmap -c -o bitmap.o bitmap.c
> In file included from bitmap.c:8:
> /poole/src/39-mm1/include/linux/reiserfs_fs.h:1635: parse error before `reiserfs_commit_thread_tq'

Ingo sent me the temp fix below. I left it out because it's
probably better to leave the fs broken until we have a firm,
tested solution.



--- linux/drivers/char/drm/radeon_irq.c.orig Sun Sep 29 20:55:34 2002
+++ linux/drivers/char/drm/radeon_irq.c Sun Sep 29 20:56:27 2002
@@ -69,8 +69,7 @@

atomic_inc(&dev_priv->irq_received);
#ifdef __linux__
- queue_task(&dev->tq, &tq_immediate);
- mark_bh(IMMEDIATE_BH);
+ schedule_task(&dev->tq);
#endif /* __linux__ */
#ifdef __FreeBSD__
taskqueue_enqueue(taskqueue_swi, &dev->task);
--- linux/fs/reiserfs/journal.c.orig Sun Sep 29 21:03:48 2002
+++ linux/fs/reiserfs/journal.c Sun Sep 29 21:04:49 2002
@@ -65,13 +65,6 @@
*/
static int reiserfs_mounted_fs_count = 0 ;

-/* wake this up when you add something to the commit thread task queue */
-DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_wait) ;
-
-/* wait on this if you need to be sure you task queue entries have been run */
-static DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_done) ;
-DECLARE_TASK_QUEUE(reiserfs_commit_thread_tq) ;
-
#define JOURNAL_TRANS_HALF 1018 /* must be correct to keep the desc and commit
structs at 4k */
#define BUFNR 64 /*read ahead */
@@ -1339,12 +1332,9 @@
do_journal_end(&myth, p_s_sb,1, FLUSH_ALL) ;
}

- /* we decrement before we wake up, because the commit thread dies off
- ** when it has been woken up and the count is <= 0
- */
reiserfs_mounted_fs_count-- ;
- wake_up(&reiserfs_commit_thread_wait) ;
- sleep_on(&reiserfs_commit_thread_done) ;
+ /* wait for all commits to finish */
+ flush_scheduled_tasks();

release_journal_dev( p_s_sb, SB_JOURNAL( p_s_sb ) );
free_journal_ram(p_s_sb) ;
@@ -1815,6 +1805,10 @@
static void reiserfs_journal_commit_task_func(struct reiserfs_journal_commit_task *ct) {

struct reiserfs_journal_list *jl ;
+
+ /* FIXME: is this needed? */
+ lock_kernel();
+
jl = SB_JOURNAL_LIST(ct->p_s_sb) + ct->jindex ;

flush_commit_list(ct->p_s_sb, SB_JOURNAL_LIST(ct->p_s_sb) + ct->jindex, 1) ;
@@ -1824,6 +1818,7 @@
kupdate_one_transaction(ct->p_s_sb, jl) ;
}
reiserfs_kfree(ct->self, sizeof(struct reiserfs_journal_commit_task), ct->p_s_sb) ;
+ unlock_kernel();
}

static void setup_commit_task_arg(struct reiserfs_journal_commit_task *ct,
@@ -1850,8 +1845,7 @@
ct = reiserfs_kmalloc(sizeof(struct reiserfs_journal_commit_task), GFP_NOFS, p_s_sb) ;
if (ct) {
setup_commit_task_arg(ct, p_s_sb, jindex) ;
- queue_task(&(ct->task), &reiserfs_commit_thread_tq);
- wake_up(&reiserfs_commit_thread_wait) ;
+ schedule_task(&ct->task) ;
} else {
#ifdef CONFIG_REISERFS_CHECK
reiserfs_warning("journal-1540: kmalloc failed, doing sync commit\n") ;
@@ -1860,49 +1854,6 @@
}
}

-/*
-** this is the commit thread. It is started with kernel_thread on
-** FS mount, and journal_release() waits for it to exit.
-**
-** It could do a periodic commit, but there is a lot code for that
-** elsewhere right now, and I only wanted to implement this little
-** piece for starters.
-**
-** All we do here is sleep on the j_commit_thread_wait wait queue, and
-** then run the per filesystem commit task queue when we wakeup.
-*/
-static int reiserfs_journal_commit_thread(void *nullp) {
-
- daemonize() ;
-
- spin_lock_irq(&current->sigmask_lock);
- sigfillset(&current->blocked);
- recalc_sigpending();
- spin_unlock_irq(&current->sigmask_lock);
-
- sprintf(current->comm, "kreiserfsd") ;
- lock_kernel() ;
- while(1) {
-
- while(TQ_ACTIVE(reiserfs_commit_thread_tq)) {
- run_task_queue(&reiserfs_commit_thread_tq) ;
- }
-
- /* if there aren't any more filesystems left, break */
- if (reiserfs_mounted_fs_count <= 0) {
- run_task_queue(&reiserfs_commit_thread_tq) ;
- break ;
- }
- wake_up(&reiserfs_commit_thread_done) ;
- if (current->flags & PF_FREEZE)
- refrigerator(PF_IOTHREAD);
- interruptible_sleep_on_timeout(&reiserfs_commit_thread_wait, 5 * HZ) ;
- }
- unlock_kernel() ;
- wake_up(&reiserfs_commit_thread_done) ;
- return 0 ;
-}
-
static void journal_list_init(struct super_block *p_s_sb) {
int i ;
for (i = 0 ; i < JOURNAL_LIST_COUNT ; i++) {
@@ -2175,10 +2126,6 @@
return 0;

reiserfs_mounted_fs_count++ ;
- if (reiserfs_mounted_fs_count <= 1) {
- kernel_thread((void *)(void *)reiserfs_journal_commit_thread, NULL,
- CLONE_FS | CLONE_FILES | CLONE_VM) ;
- }
return 0 ;

}
--- linux/include/linux/reiserfs_fs.h.orig Sun Sep 29 20:58:23 2002
+++ linux/include/linux/reiserfs_fs.h Sun Sep 29 20:58:30 2002
@@ -1632,9 +1632,6 @@
/* 12 */ struct journal_params jh_journal;
} ;

-extern task_queue reiserfs_commit_thread_tq ;
-extern wait_queue_head_t reiserfs_commit_thread_wait ;
-
/* biggest tunable defines are right here */
#define JOURNAL_BLOCK_COUNT 8192 /* number of blocks in the journal */
#define JOURNAL_TRANS_MAX_DEFAULT 1024 /* biggest possible single transaction, don't change for now (8/3/99) */

2002-09-30 07:48:47

by Martin J. Bligh

Subject: Re: 2.5.39-mm1

> I must say that based on a small amount of performance testing the
> benefits of the cache warmness thing are disappointing. Maybe 1% if
> you squint. Martin, could you please do a before-and-after on the
> NUMAQ's, double check that it is actually doing the right thing?

Seems to work just fine:

2.5.38-mm1 + my original hot/cold code.
Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4%

2.5.39-mm1
Elapsed: 19.532s User: 192.25s System: 42.642s CPU: 1203.2%

And it's a lot more than 1% for me ;-) About 12% of systime
on kernel compile, IIRC.

M.



2002-09-30 07:56:24

by Andrew Morton

Subject: Re: 2.5.39-mm1

"Martin J. Bligh" wrote:
>
> > I must say that based on a small amount of performance testing the
> > benefits of the cache warmness thing are disappointing. Maybe 1% if
> > you squint. Martin, could you please do a before-and-after on the
> > NUMAQ's, double check that it is actually doing the right thing?
>
> Seems to work just fine:
>
> 2.5.38-mm1 + my original hot/cold code.
> Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4%
>
> 2.5.39-mm1
> Elapsed: 19.532s User: 192.25s System: 42.642s CPU: 1203.2%
>
> And it's a lot more than 1% for me ;-) About 12% of systime
> on kernel compile, IIRC.

Well that's still a 1% bottom line. But we don't have a
comparison which shows the effects of this patch alone.

Can you patch -R the five patches and retest sometime?

I just get the feeling that it should be doing better.

2002-09-30 16:26:30

by Martin J. Bligh

Subject: Re: 2.5.39-mm1

> Well that's still a 1% bottom line. But we don't have a
> comparison which shows the effects of this patch alone.
>
> Can you patch -R the five patches and retest sometime?
>
> I just get the feeling that it should be doing better.

Well, I think something is indeed wrong.

Average times of 5 kernel compiles on 16-way NUMA-Q:

2.5.38-mm1
Elapsed: 20.44s User: 192.118s System: 46.346s CPU: 1166.6%
2.5.38-mm1 + the original hot/cold stuff I sent you
Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4%

Reduction in both system and elapsed time.

2.5.39-mm1 w/o hot/cold stuff
Elapsed: 19.538s User: 191.91s System: 44.746s CPU: 1210.8%
2.5.39-mm1
Elapsed: 19.532s User: 192.25s System: 42.642s CPU: 1203.2%

No change in elapsed time, system time down somewhat.

Looking at differences in averaged profiles:

Going from 38-mm1 to 38-mm1-hot (+ made things worse, - better)
Everything below 50 ticks difference excluded.

960 alloc_percpu_page
355 free_percpu_page
266 page_remove_rmap
96 file_read_actor
89 vm_enough_memory
56 page_add_rmap
-50 do_wp_page
-53 __pagevec_lru_add
-56 schedule
-73 dentry_open
-93 __generic_copy_from_user
-96 atomic_dec_and_lock
-97 get_empty_filp
-131 __fput
-144 __set_page_dirty_buffers
-147 do_softirq
-169 __alloc_pages
-187 .text.lock.file_table
-263 pgd_alloc
-323 pte_alloc_one
-396 zap_pte_range
-408 do_anonymous_page
-733 __free_pages_ok
-1301 rmqueue
-6709 default_idle
-9776 total

Going from 39-mm1 w/o hot to 39-mm1

1600 default_idle
896 buffered_rmqueue
421 free_hot_cold_page
271 page_remove_rmap
197 vm_enough_memory
161 .text.lock.file_table
132 get_empty_filp
95 __fput
90 atomic_dec_and_lock
50 filemap_nopage
-55 do_no_page
-55 __pagevec_lru_add
-62 schedule
-65 fd_install
-70 file_read_actor
-73 find_get_page
-81 d_lookup
-111 __set_page_dirty_buffers
-285 pgd_alloc
-350 pte_alloc_one
-382 do_anonymous_page
-412 zap_pte_range
-508 total
-717 __free_pages_ok
-1285 rmqueue

Which looks about the same to me? Me slightly confused. Will try
adding the original hot/cold stuff onto 39-mm1 if you like?

M.

2002-09-30 18:19:17

by Andrew Morton

Subject: Re: 2.5.39-mm1

"Martin J. Bligh" wrote:
>
> Which looks about the same to me? Me slightly confused.

I expect that with the node-local allocations you're not getting
a lot of benefit from the lock amortisation. Anton will.

It's the lack of improvement of cache-niceness which is irksome.
Perhaps the heuristic should be based on recency-of-allocation and
not recency-of-freeing. I'll play with that.

> Will try
> adding the original hot/cold stuff onto 39-mm1 if you like?

Well, it's all in the noise floor, isn't it? Better off trying
broader tests. I had a play with netperf and the chatroom
benchmark. But the latter varied from 80,000 msgs/sec up
to 350,000 between runs.

2002-10-01 03:41:39

by Maneesh Soni

Subject: Re: 2.5.39-mm1

On Mon, 30 Sep 2002 23:55:50 +0530, Andrew Morton wrote:

> "Martin J. Bligh" wrote:
>>
>> Which looks about the same to me? Me slightly confused.
>
> I expect that with the node-local allocations you're not getting a lot
> of benefit from the lock amortisation. Anton will.
>
> It's the lack of improvement of cache-niceness which is irksome. Perhaps
> the heuristic should be based on recency-of-allocation and not
> recency-of-freeing. I'll play with that.
>
>> Will try
>> adding the original hot/cold stuff onto 39-mm1 if you like?
>
> Well, it's all in the noise floor, isn't it? Better off trying broader
> tests. I had a play with netperf and the chatroom benchmark. But the
> latter varied from 80,000 msgs/sec up to 350,000 between runs.

Hello Andrew,

The chatroom benchmark gives more consistent results with some delay
(sleep 60) between runs.

Maneesh
--
Maneesh Soni
IBM Linux Technology Center,
IBM India Software Lab, Bangalore.
Phone: +91-80-5044999 email: [email protected]
http://lse.sourceforge.net/