url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.39/2.5.39-mm1/
This patchset includes the per-cpu-pages stuff which Martin and I
have been working on. This is designed to:
- increase the probability of the page allocator returning a page
which is cache-warm on the calling CPU
- amortise zone->lock contention via work batching.
- provide the basis for a page reservation API, so we can guarantee
that some troublesome inside-spinlock allocations will succeed.
Mainly pte_chains and radix_tree_nodes (haven't implemented this yet).
I must say that based on a small amount of performance testing the
benefits of the cache warmness thing are disappointing. Maybe 1% if
you squint. Martin, could you please do a before-and-after on the
NUMAQ's, double check that it is actually doing the right thing?
And Anton, could you please test this, make sure that it fixes the
rmqueue() and __free_pages_ok() lock contention as effectively as
the old per-cpu-pages patch did? Thanks.
There is a reiserfs compilation problem at present.
Rick has a modified version of iostat. Please use that for extracting the
SARD info. Also the version at http://linux.inet.hr/ is reasonably up to date.
New versions of procps are at http://surriel.com/procps/ - the version
from cygnus CVS works for me.
+module-fix.patch
Compile fix for current Linus BK diff (Ingo)
+might_sleep-2.patch
Additional might_sleep checks
+slab-fix.patch
Fix a kmem_cache_destroy problem
+hugetlb-doc.patch
hugetlbpage docco
+get_user_pages-PG_reserved.patch
Don't bump the refcount on PageReserved pages in get_user_pages()
+move_one_page_fix.patch
kmap_atomic atomicity fix
+zab-list_heads.patch
Initialise some uninitialised VMA list_heads
+batched-slab-asap.patch
Batch up the slab shrinking work.
+rmqueue_bulk.patch
+free_pages_bulk.patch
Multipage page allocation and freeing
+hot_cold_pages.patch
Per-cpu hot-n-cold page lists.
+readahead-cold-pages.patch
Select cold pages for reading into pagecache
+pagevec-hot-cold-hint.patch
Pages which are freed by page reclaim and truncate are probably
cache-cold. Don't pollute the cache-hot pool with them.
linus.patch
cset-1.622.1.14-to-1.651.txt.gz
module-fix.patch
compile fixes from Ingo
ide-high-1.patch
scsi_hack.patch
Fix block-highmem for scsi
ext3-dxdir.patch
ext3 htree
spin-lock-check.patch
spinlock/rwlock checking infrastructure
rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)
might_sleep-2.patch
more sleep-inside-spinlock checks
slab-fix.patch
slab: put the spare page on cachep->pages_free
hugetlb-doc.patch
hugetlbpage documentation
get_user_pages-PG_reserved.patch
Check for PageReserved pages in get_user_pages()
move_one_page_fix.patch
pte_highmem atomicity fix in move_one_page()
zab-list_heads.patch
vm_area_struct list_head initialisation
remove-gfp_nfs.patch
remove GFP_NFS
buddyinfo.patch
Add /proc/buddyinfo - stats on the free pages pool
free_area.patch
Remove struct free_area_struct and free_area_t, use `struct free_area'
per-node-kswapd.patch
Per-node kswapd instance
topology-api.patch
Simple topology API
topology_fixes.patch
topology-api cleanups
write-deadlock.patch
Fix the generic_file_write-from-same-mmapped-page deadlock
radix_tree_gang_lookup.patch
radix tree gang lookup
truncate_inode_pages.patch
truncate/invalidate_inode_pages rewrite
proc_vmstat.patch
Move the vm accounting out of /proc/stat
kswapd-reclaim-stats.patch
Add kswapd_steal to /proc/vmstat
iowait.patch
I/O wait statistics
sard.patch
SARD disk accounting
dio-bio-add-page.patch
Use bio_add_page() in direct-io.c
tcp-wakeups.patch
Use fast wakeups in TCP/IPV4
swapoff-deadlock.patch
Fix a tmpfs swapoff deadlock
dirty-and-uptodate.patch
page state cleanup
shmem_rename.patch
shmem_rename() directory link count fix
dirent-size.patch
tmpfs: show a non-zero size for directories
tmpfs-trivia.patch
tmpfs: small fixlets
per-zone-vm.patch
separate the kswapd and direct reclaim code paths
swsusp-feature.patch
add shrink_all_memory() for swsusp
remove-page-virtual.patch
remove page->virtual for !WANT_PAGE_VIRTUAL
dirty-memory-clamp.patch
sterner dirty-memory clamping
mempool-wakeup-fix.patch
Fix for stuck tasks in mempool_alloc()
remove-write_mapping_buffers.patch
Remove write_mapping_buffers
buffer_boundary-scheduling.patch
I/O scheduling for indirect blocks
ll_rw_block-cleanup.patch
cleanup ll_rw_block()
lseek-ext2_readdir.patch
remove lock_kernel() from ext2_readdir()
discontig-no-contig_page_data.patch
undefine contig_page_data for discontigmem
per-node-zone_normal.patch
ia32 NUMA: per-node ZONE_NORMAL
alloc_pages_node-cleanup.patch
alloc_pages_node cleanup
batched-slab-asap.patch
batched slab shrinking
akpm-deadline.patch
deadline scheduler tweaks
rmqueue_bulk.patch
bulk page allocator
free_pages_bulk.patch
Bulk page freeing function
hot_cold_pages.patch
Hot/Cold pages and zone->lock amortisation
readahead-cold-pages.patch
Use cache-cold pages for pagecache reads.
pagevec-hot-cold-hint.patch
hot/cold hints for truncate and page reclaim
read_barrier_depends.patch
extended barrier primitives
rcu_ltimer.patch
RCU core
dcache_rcu.patch
Use RCU for dcache
On September 29, 2002 04:26 pm, Andrew Morton wrote:
> There is a reiserfs compilation problem at present.
make[2]: Entering directory `/poole/src/39-mm1/fs/reiserfs'
gcc -Wp,-MD,./.bitmap.o.d -D__KERNEL__ -I/poole/src/39-mm1/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=k6 -I/poole/src/39-mm1/arch/i386/mach-generic -nostdinc -iwithprefix include -DKBUILD_BASENAME=bitmap -c -o bitmap.o bitmap.c
In file included from bitmap.c:8:
/poole/src/39-mm1/include/linux/reiserfs_fs.h:1635: parse error before `reiserfs_commit_thread_tq'
/poole/src/39-mm1/include/linux/reiserfs_fs.h:1635: warning: type defaults to `int' in declaration of `reiserfs_commit_thread_tq'
/poole/src/39-mm1/include/linux/reiserfs_fs.h:1635: warning: data definition has no type or storage class
make[2]: *** [bitmap.o] Error 1
make[2]: Leaving directory `/poole/src/39-mm1/fs/reiserfs'
make[1]: *** [reiserfs] Error 2
make[1]: Leaving directory `/poole/src/39-mm1/fs'
make: *** [fs] Error 2
which is:
extern task_queue reiserfs_commit_thread_tq ;
from the BK changes:
[email protected], 2002-09-29 11:00:25-07:00, [email protected]
[PATCH] smptimers, old BH removal, tq-cleanup
<omitted>
- removed the ability to define your own task-queue, what can be done is
to schedule_task() a given task to keventd, and to flush all pending
tasks.
Ingo?
Ed Tomlinson
Ed Tomlinson wrote:
>
> On September 29, 2002 04:26 pm, Andrew Morton wrote:
> > There is a reiserfs compilation problem at present.
>
> make[2]: Entering directory `/poole/src/39-mm1/fs/reiserfs'
> gcc -Wp,-MD,./.bitmap.o.d -D__KERNEL__ -I/poole/src/39-mm1/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=k6 -I/poole/src/39-mm1/arch/i386/mach-generic -nostdinc -iwithprefix include -DKBUILD_BASENAME=bitmap -c -o bitmap.o bitmap.c
> In file included from bitmap.c:8:
> /poole/src/39-mm1/include/linux/reiserfs_fs.h:1635: parse error before `reiserfs_commit_thread_tq'
Ingo sent me the below temp fix. I left it out because it's
probably better to leave the fs broken until we have a firm,
tested solution.
--- linux/drivers/char/drm/radeon_irq.c.orig Sun Sep 29 20:55:34 2002
+++ linux/drivers/char/drm/radeon_irq.c Sun Sep 29 20:56:27 2002
@@ -69,8 +69,7 @@
atomic_inc(&dev_priv->irq_received);
#ifdef __linux__
- queue_task(&dev->tq, &tq_immediate);
- mark_bh(IMMEDIATE_BH);
+ schedule_task(&dev->tq);
#endif /* __linux__ */
#ifdef __FreeBSD__
taskqueue_enqueue(taskqueue_swi, &dev->task);
--- linux/fs/reiserfs/journal.c.orig Sun Sep 29 21:03:48 2002
+++ linux/fs/reiserfs/journal.c Sun Sep 29 21:04:49 2002
@@ -65,13 +65,6 @@
*/
static int reiserfs_mounted_fs_count = 0 ;
-/* wake this up when you add something to the commit thread task queue */
-DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_wait) ;
-
-/* wait on this if you need to be sure you task queue entries have been run */
-static DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_done) ;
-DECLARE_TASK_QUEUE(reiserfs_commit_thread_tq) ;
-
#define JOURNAL_TRANS_HALF 1018 /* must be correct to keep the desc and commit
structs at 4k */
#define BUFNR 64 /*read ahead */
@@ -1339,12 +1332,9 @@
do_journal_end(&myth, p_s_sb,1, FLUSH_ALL) ;
}
- /* we decrement before we wake up, because the commit thread dies off
- ** when it has been woken up and the count is <= 0
- */
reiserfs_mounted_fs_count-- ;
- wake_up(&reiserfs_commit_thread_wait) ;
- sleep_on(&reiserfs_commit_thread_done) ;
+ /* wait for all commits to finish */
+ flush_scheduled_tasks();
release_journal_dev( p_s_sb, SB_JOURNAL( p_s_sb ) );
free_journal_ram(p_s_sb) ;
@@ -1815,6 +1805,10 @@
static void reiserfs_journal_commit_task_func(struct reiserfs_journal_commit_task *ct) {
struct reiserfs_journal_list *jl ;
+
+ /* FIXME: is this needed? */
+ lock_kernel();
+
jl = SB_JOURNAL_LIST(ct->p_s_sb) + ct->jindex ;
flush_commit_list(ct->p_s_sb, SB_JOURNAL_LIST(ct->p_s_sb) + ct->jindex, 1) ;
@@ -1824,6 +1818,7 @@
kupdate_one_transaction(ct->p_s_sb, jl) ;
}
reiserfs_kfree(ct->self, sizeof(struct reiserfs_journal_commit_task), ct->p_s_sb) ;
+ unlock_kernel();
}
static void setup_commit_task_arg(struct reiserfs_journal_commit_task *ct,
@@ -1850,8 +1845,7 @@
ct = reiserfs_kmalloc(sizeof(struct reiserfs_journal_commit_task), GFP_NOFS, p_s_sb) ;
if (ct) {
setup_commit_task_arg(ct, p_s_sb, jindex) ;
- queue_task(&(ct->task), &reiserfs_commit_thread_tq);
- wake_up(&reiserfs_commit_thread_wait) ;
+ schedule_task(&ct->task) ;
} else {
#ifdef CONFIG_REISERFS_CHECK
reiserfs_warning("journal-1540: kmalloc failed, doing sync commit\n") ;
@@ -1860,49 +1854,6 @@
}
}
-/*
-** this is the commit thread. It is started with kernel_thread on
-** FS mount, and journal_release() waits for it to exit.
-**
-** It could do a periodic commit, but there is a lot code for that
-** elsewhere right now, and I only wanted to implement this little
-** piece for starters.
-**
-** All we do here is sleep on the j_commit_thread_wait wait queue, and
-** then run the per filesystem commit task queue when we wakeup.
-*/
-static int reiserfs_journal_commit_thread(void *nullp) {
-
- daemonize() ;
-
- spin_lock_irq(&current->sigmask_lock);
- sigfillset(&current->blocked);
- recalc_sigpending();
- spin_unlock_irq(&current->sigmask_lock);
-
- sprintf(current->comm, "kreiserfsd") ;
- lock_kernel() ;
- while(1) {
-
- while(TQ_ACTIVE(reiserfs_commit_thread_tq)) {
- run_task_queue(&reiserfs_commit_thread_tq) ;
- }
-
- /* if there aren't any more filesystems left, break */
- if (reiserfs_mounted_fs_count <= 0) {
- run_task_queue(&reiserfs_commit_thread_tq) ;
- break ;
- }
- wake_up(&reiserfs_commit_thread_done) ;
- if (current->flags & PF_FREEZE)
- refrigerator(PF_IOTHREAD);
- interruptible_sleep_on_timeout(&reiserfs_commit_thread_wait, 5 * HZ) ;
- }
- unlock_kernel() ;
- wake_up(&reiserfs_commit_thread_done) ;
- return 0 ;
-}
-
static void journal_list_init(struct super_block *p_s_sb) {
int i ;
for (i = 0 ; i < JOURNAL_LIST_COUNT ; i++) {
@@ -2175,10 +2126,6 @@
return 0;
reiserfs_mounted_fs_count++ ;
- if (reiserfs_mounted_fs_count <= 1) {
- kernel_thread((void *)(void *)reiserfs_journal_commit_thread, NULL,
- CLONE_FS | CLONE_FILES | CLONE_VM) ;
- }
return 0 ;
}
--- linux/include/linux/reiserfs_fs.h.orig Sun Sep 29 20:58:23 2002
+++ linux/include/linux/reiserfs_fs.h Sun Sep 29 20:58:30 2002
@@ -1632,9 +1632,6 @@
/* 12 */ struct journal_params jh_journal;
} ;
-extern task_queue reiserfs_commit_thread_tq ;
-extern wait_queue_head_t reiserfs_commit_thread_wait ;
-
/* biggest tunable defines are right here */
#define JOURNAL_BLOCK_COUNT 8192 /* number of blocks in the journal */
#define JOURNAL_TRANS_MAX_DEFAULT 1024 /* biggest possible single transaction, don't change for now (8/3/99) */
> I must say that based on a small amount of performance testing the
> benefits of the cache warmness thing are disappointing. Maybe 1% if
> you squint. Martin, could you please do a before-and-after on the
> NUMAQ's, double check that it is actually doing the right thing?
Seems to work just fine:
2.5.38-mm1 + my original hot/cold code.
Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4%
2.5.39-mm1
Elapsed: 19.532s User: 192.25s System: 42.642s CPU: 1203.2%
And it's a lot more than 1% for me ;-) About 12% of systime
on kernel compile, IIRC.
M.
"Martin J. Bligh" wrote:
>
> > I must say that based on a small amount of performance testing the
> > benefits of the cache warmness thing are disappointing. Maybe 1% if
> > you squint. Martin, could you please do a before-and-after on the
> > NUMAQ's, double check that it is actually doing the right thing?
>
> Seems to work just fine:
>
> 2.5.38-mm1 + my original hot/cold code.
> Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4%
>
> 2.5.39-mm1
> Elapsed: 19.532s User: 192.25s System: 42.642s CPU: 1203.2%
>
> And it's a lot more than 1% for me ;-) About 12% of systime
> on kernel compile, IIRC.
Well that's still a 1% bottom line. But we don't have a
comparison which shows the effects of this patch alone.
Can you patch -R the five patches and retest sometime?
I just get the feeling that it should be doing better.
> Well that's still a 1% bottom line. But we don't have a
> comparison which shows the effects of this patch alone.
>
> Can you patch -R the five patches and retest sometime?
>
> I just get the feeling that it should be doing better.
Well, I think something is indeed wrong.
Average times of 5 kernel compiles on a 16-way NUMA-Q:
2.5.38-mm1
Elapsed: 20.44s User: 192.118s System: 46.346s CPU: 1166.6%
2.5.38-mm1 + the original hot/cold stuff I sent you
Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4%
Reduction in both system and elapsed time.
2.5.39-mm1 w/o hot/cold stuff
Elapsed: 19.538s User: 191.91s System: 44.746s CPU: 1210.8%
2.5.39-mm1
Elapsed: 19.532s User: 192.25s System: 42.642s CPU: 1203.2%
No change in elapsed time, system time down somewhat.
Looking at differences in averaged profiles:
Going from 38-mm1 to 38-mm1-hot (+ made things worse, - better)
Everything below 50 ticks difference excluded.
960 alloc_percpu_page
355 free_percpu_page
266 page_remove_rmap
96 file_read_actor
89 vm_enough_memory
56 page_add_rmap
-50 do_wp_page
-53 __pagevec_lru_add
-56 schedule
-73 dentry_open
-93 __generic_copy_from_user
-96 atomic_dec_and_lock
-97 get_empty_filp
-131 __fput
-144 __set_page_dirty_buffers
-147 do_softirq
-169 __alloc_pages
-187 .text.lock.file_table
-263 pgd_alloc
-323 pte_alloc_one
-396 zap_pte_range
-408 do_anonymous_page
-733 __free_pages_ok
-1301 rmqueue
-6709 default_idle
-9776 total
Going from 39-mm1 w/o hot to 39-mm1
1600 default_idle
896 buffered_rmqueue
421 free_hot_cold_page
271 page_remove_rmap
197 vm_enough_memory
161 .text.lock.file_table
132 get_empty_filp
95 __fput
90 atomic_dec_and_lock
50 filemap_nopage
-55 do_no_page
-55 __pagevec_lru_add
-62 schedule
-65 fd_install
-70 file_read_actor
-73 find_get_page
-81 d_lookup
-111 __set_page_dirty_buffers
-285 pgd_alloc
-350 pte_alloc_one
-382 do_anonymous_page
-412 zap_pte_range
-508 total
-717 __free_pages_ok
-1285 rmqueue
Which looks about the same to me? Me slightly confused. Will try
adding the original hot/cold stuff onto 39-mm1 if you like?
M.
"Martin J. Bligh" wrote:
>
> Which looks about the same to me? Me slightly confused.
I expect that with the node-local allocations you're not getting
a lot of benefit from the lock amortisation. Anton will.
It's the lack of improvement of cache-niceness which is irksome.
Perhaps the heuristic should be based on recency-of-allocation and
not recency-of-freeing. I'll play with that.
> Will try
> adding the original hot/cold stuff onto 39-mm1 if you like?
Well, it's all in the noise floor, isn't it? Better off trying
broader tests. I had a play with netperf and the chatroom
benchmark. But the latter varied from 80,000 msgs/sec up
to 350,000 between runs.
On Mon, 30 Sep 2002 23:55:50 +0530, Andrew Morton wrote:
> "Martin J. Bligh" wrote:
>>
>> Which looks about the same to me? Me slightly confused.
>
> I expect that with the node-local allocations you're not getting a lot
> of benefit from the lock amortisation. Anton will.
>
> It's the lack of improvement of cache-niceness which is irksome. Perhaps
> the heuristic should be based on recency-of-allocation and not
> recency-of-freeing. I'll play with that.
>
>> Will try
>> adding the original hot/cold stuff onto 39-mm1 if you like?
>
> Well, it's all in the noise floor, isn't it? Better off trying broader
> tests. I had a play with netperf and the chatroom benchmark. But the
> latter varied from 80,000 msgs/sec up to 350,000 between runs.
Hello Andrew,
chatroom benchmark gives more consistent results with some delay
(sleep 60) between two runs.
Maneesh
--
Maneesh Soni
IBM Linux Technology Center,
IBM India Software Lab, Bangalore.
Phone: +91-80-5044999 email: [email protected]
http://lse.sourceforge.net/