2002-09-23 04:15:57

by Andrew Morton

Subject: 2.5.38-mm2


url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.38/2.5.38-mm2/

+linus.patch

Linus's current diff.

-filemap-fixes.patch

Merged

+unbreak-writeback-mode.patch

ext3 in data=writeback mode was oopsing on writeback of MAP_SHARED
data.

+read-latency.patch

Fix the writer-starves-reader elevator problem. This is basically
the read_latency2 patch from -ac kernels.

On IDE it provides a 100x improvement in read throughput when there
is heavy writeback happening. 40x on SCSI. You need to disable
tagged command queueing on scsi - it appears to be quite stupidly
implemented.
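
For the curious, here is a minimal sketch of the pass-over idea behind
read_latency2, written against a toy singly-linked queue rather than the
real request list. All of the names and numbers (struct rq,
elv_insert_sketch, the budget constants) are illustrative, not taken
from the patch:

/*
 * Toy model: every queued request carries a budget of how many later
 * arrivals may still be inserted ahead of it.  Reads get a small
 * budget, so once a read is queued, arriving writes can only jump it
 * a bounded number of times.  Seek-order merging, which the real
 * elevator also does, is omitted entirely.
 */
enum { READ_LATENCY = 128, WRITE_LATENCY = 8192 };	/* made-up values */

struct rq {
	int is_read;
	int latency;		/* remaining pass-over budget */
	struct rq *next;
};

static void elv_insert_sketch(struct rq **head, struct rq *new_rq)
{
	struct rq **pos = head;
	struct rq **walk;
	struct rq *r;

	new_rq->latency = new_rq->is_read ? READ_LATENCY : WRITE_LATENCY;

	/*
	 * Writes always go to the tail.  A read goes as close to the
	 * head as it can, but may only jump writes that still have
	 * budget left - never another read or an exhausted write.
	 */
	for (walk = head; *walk; walk = &(*walk)->next)
		if (!new_rq->is_read || (*walk)->is_read ||
		    (*walk)->latency <= 0)
			pos = &(*walk)->next;

	/* Charge every request the newcomer jumps. */
	for (r = *pos; r; r = r->next)
		r->latency--;

	new_rq->next = *pos;
	*pos = new_rq;
}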


linus.patch
cset-1.580.1.4-to-1.597.txt.gz

ide-high-1.patch

ide-block-fix-1.patch

scsi_hack.patch
Fix block-highmem for scsi

ext3-htree.patch
Indexed directories for ext3

spin-lock-check.patch
spinlock/rwlock checking infrastructure

rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)

might_sleep.patch
debug code to detect might-sleep-inside-spinlock bugs

unbreak-writeback-mode.patch
Fix ext3's data=writeback mode

queue-congestion.patch
Infrastructure for communicating request queue congestion to the VM

nonblocking-ext2-preread.patch
avoid ext2 inode prereads if the queue is congested

nonblocking-pdflush.patch
non-blocking writeback infrastructure, use it for pdflush

nonblocking-vm.patch
Non-blocking page reclaim

set_page_dirty-locking-fix.patch
don't call __mark_inode_dirty under spinlock

prepare_to_wait.patch
prepare_to_wait/finish_wait: new sleep/wakeup API

vm-wakeups.patch
Use the faster wakeups in the VM and block layers

sync-helper.patch
Speed up sys_sync() against multiple spindles

slabasap.patch
Early and smarter shrinking of slabs

write-deadlock.patch
Fix the generic_file_write-from-same-mmapped-page deadlock

buddyinfo.patch
Add /proc/buddyinfo - stats on the free pages pool

free_area.patch
Remove struct free_area_struct and free_area_t, use `struct free_area'

per-node-kswapd.patch
Per-node kswapd instance

topology-api.patch
Simple topology API

radix_tree_gang_lookup.patch
radix tree gang lookup

truncate_inode_pages.patch
truncate/invalidate_inode_pages rewrite

proc_vmstat.patch
Move the vm accounting out of /proc/stat

kswapd-reclaim-stats.patch
Add kswapd_steal to /proc/vmstat

iowait.patch
I/O wait statistics

sard.patch
SARD disk accounting

remove-gfp_nfs.patch
remove GFP_NFS

tcp-wakeups.patch
Use fast wakeups in TCP/IPV4

swapoff-deadlock.patch
Fix a tmpfs swapoff deadlock

dirty-and-uptodate.patch
page state cleanup

shmem_rename.patch
shmem_rename() directory link count fix

dirent-size.patch
tmpfs: show a non-zero size for directories

tmpfs-trivia.patch
tmpfs: small fixlets

per-zone-vm.patch
separate the kswapd and direct reclaim code paths

swsusp-feature.patch
add shrink_all_memory() for swsusp

adaptec-fix.patch
partial fix for aic7xxx error recovery

remove-page-virtual.patch
remove page->virtual for !WANT_PAGE_VIRTUAL

dirty-memory-clamp.patch
sterner dirty-memory clamping

mempool-wakeup-fix.patch
Fix for stuck tasks in mempool_alloc()

remove-write_mapping_buffers.patch
Remove write_mapping_buffers

buffer_boundary-scheduling.patch
IO scheduling for indirect blocks

ll_rw_block-cleanup.patch
cleanup ll_rw_block()

lseek-ext2_readdir.patch
remove lock_kernel() from ext2_readdir()

discontig-no-contig_page_data.patch
undefine contig_page_data for discontigmem

per-node-zone_normal.patch
ia32 NUMA: per-node ZONE_NORMAL

alloc_pages_node-cleanup.patch
alloc_pages_node cleanup

read_barrier_depends.patch
extended barrier primitives

rcu_ltimer.patch
RCU core

dcache_rcu.patch
Use RCU for dcache

read-latency.patch
Elevator fix for writes-starving-reads


2002-09-23 07:11:34

by Jens Axboe

Subject: Re: 2.5.38-mm2

On Sun, Sep 22 2002, Andrew Morton wrote:
> +read-latency.patch
>
> Fix the writer-starves-reader elevator problem. This is basically
> the read_latency2 patch from -ac kernels.
>
> On IDE it provides a 100x improvement in read throughput when there
> is heavy writeback happening. 40x on SCSI. You need to disable

Ah, interesting. I do still think it is worth investigating _why_
neither elevator_linus nor deadline prevents the read starvation.
The read-latency patch is a hack, not a solution imo.

> tagged command queueing on scsi - it appears to be quite stupidly
> implemented.

Ahem, I think you are being excessively harsh, or maybe passing judgement
on something you haven't even looked at. Did you consider that your
_drive_ may be the broken component? Excessive turn-around times for
requests when using deep TCQ are not unusual, by far.

--
Jens Axboe

2002-09-23 07:38:28

by Andrew Morton

Subject: Re: 2.5.38-mm2

Jens Axboe wrote:
>
> On Sun, Sep 22 2002, Andrew Morton wrote:
> > +read-latency.patch
> >
> > Fix the writer-starves-reader elevator problem. This is basically
> > the read_latency2 patch from -ac kernels.
> >
> > On IDE it provides a 100x improvement in read throughput when there
> > is heavy writeback happening. 40x on SCSI. You need to disable
>
> Ah, interesting. I do still think it is worth investigating _why_
> neither elevator_linus nor deadline prevents the read starvation.

I did. See below.

> The read-latency patch is a hack, not a solution imo.

Well it clearly _is_ a solution. To a grave problem. But hopefully not
the best solution. Really, this is just me saying "ouch". This is
your stuff ;)

> > tagged command queueing on scsi - it appears to be quite stupidly
> > implemented.
>
> Ahem, I think you are being excessively harsh, or maybe passing judgement
> on something you haven't even looked at. Did you consider that your
> _drive_ may be the broken component? Excessive turn-around times for
> requests when using deep TCQ are not unusual, by far.

It's a Fujitsu SCA-2 thing. Could be that other drive manufacturers
have a slight clue, but I doubt it. I bet they just went and designed
the queueing for optimum throughput, with the assumption that reads
and writes are muchly the same thing.

But they're not. They are vastly different things. Your fancy 2GHz
processor twiddles thumbs waiting for reads. But not for writes.
The "hack" _recognises_ this fact - that reads are very different
things from writes.


Let's run the numbers: a 128-slot write request queue, 512k writes,
30 Mbyte/sec of bandwidth. That's two seconds' worth of writes in the
request queue.

The reads have basically no chance of getting inserted between those
writes, so the first read has a two-second latency, and that's before
adding in any of the passovers which additional writes will enjoy.

It works out that the latency per read is about three seconds. I
have all the traces of this.
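
A quick back-of-envelope check of those figures (same numbers as above,
nothing here is measured):

#include <stdio.h>

int main(void)
{
	const double slots = 128;		/* request queue depth */
	const double req_kb = 512;		/* KB per write request */
	const double bw_kb = 30 * 1024;		/* ~30 Mbyte/sec writeback */
	double queued_kb = slots * req_kb;	/* data ahead of the read */
	double latency = queued_kb / bw_kb;	/* best-case read latency */

	printf("%.0f KB queued -> %.1f seconds before the read is serviced\n",
	       queued_kb, latency);		/* 65536 KB -> 2.1 seconds */
	return 0;
}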

Now think about what userspace wants to do. It reads a block from
the directory. Three seconds. Parse the directory, go read an
inode block. Three seconds. Go read the file. Three seconds
if it's less than 56k. Six seconds otherwise.

That's nine seconds since we read the directory block. I'm running
with mem=192m. So by now, the directory block has been reclaimed.

Move onto the next file.


So there is no bug or coding error present in the elevator. Everything
is working as it is designed to. But a streaming write slows read
performance by a factor of 4000.

2002-09-23 09:35:30

by Dipankar Sarma

Subject: Re: 2.5.38-mm2 [PATCH]

On Mon, Sep 23, 2002 at 04:22:28AM +0000, Andrew Morton wrote:
> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.38/2.5.38-mm2/
> read_barrier_depends.patch
> extended barrier primitives
>
> rcu_ltimer.patch
> RCU core
>
> dcache_rcu.patch
> Use RCU for dcache
>

Hi Andrew,

The following patch fixes a typo for preemptible kernels.

Later I will submit a full rcu_ltimer patch that contains
the call_rcu_preempt() interface, which can be useful for
module unloading and the like. This doesn't affect
the non-preemption path.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.


--- include/linux/rcupdate.h	Mon Sep 23 11:47:26 2002
+++ /tmp/rcupdate.h	Mon Sep 23 12:45:21 2002
@@ -116,7 +116,7 @@
 	return 0;
 }
 
-#ifdef CONFIG_PREEMPTION
+#ifdef CONFIG_PREEMPT
 #define rcu_read_lock()		preempt_disable()
 #define rcu_read_unlock()	preempt_enable()
 #else

2002-09-23 09:45:48

by Dipankar Sarma

Subject: Re: 2.5.38-mm2 [PATCH] (dcache)

On Mon, Sep 23, 2002 at 04:22:28AM +0000, Andrew Morton wrote:
> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.38/2.5.38-mm2/
>
> read_barrier_depends.patch
> extended barrier primitives
>
> rcu_ltimer.patch
> RCU core
>
> dcache_rcu.patch
> Use RCU for dcache
>

Hi Andrew,

dcache_rcu orders writes using wmb() (in list_del_rcu()) when deleting
from the hash list, and the d_lookup() hash list traversal requires a
matching rmb() on alpha. So we need to use the read_barrier_depends()
interface there. This isn't a problem on any other arch, AFAIK.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.


--- fs/dcache.c	Mon Sep 23 11:47:26 2002
+++ /tmp/dcache.c	Mon Sep 23 12:54:33 2002
@@ -870,7 +870,9 @@
 	rcu_read_lock();
 	tmp = head->next;
 	for (;;) {
-		struct dentry * dentry = list_entry(tmp, struct dentry, d_hash);
+		struct dentry * dentry;
+		read_barrier_depends();
+		dentry = list_entry(tmp, struct dentry, d_hash);
 		if (tmp == head)
 			break;
 		tmp = tmp->next;
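
For context, the pairing the patch relies on looks roughly like this
(illustrative code, not from the patch; the struct and the names are
made up):

struct thing {
	int data;
	struct thing *next;
};

/* Publisher: commit the entry's contents before linking it in. */
void publish(struct thing **head, struct thing *new)
{
	new->data = 42;
	new->next = *head;
	wmb();			/* order the stores above before the link */
	*head = new;
}

/* Lockless reader, as in d_lookup() above. */
int read_first(struct thing **head)
{
	struct thing *p = *head;

	read_barrier_depends();	/* alpha: order the pointer load before
				   loading what it points to */
	return p ? p->data : -1;
}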

2002-09-24 04:48:01

by Rusty Russell

Subject: Re: 2.5.38-mm2 [PATCH]

On Mon, 23 Sep 2002 15:15:59 +0530
Dipankar Sarma <[email protected]> wrote:
> Later I will submit a full rcu_ltimer patch that contains
> the call_rcu_preempt() interface which can be useful for
> module unloading and the likes. This doesn't affect
> the non-preemption path.

You don't need this: I've dropped the requirement for module
unload.

Cheers!
Rusty.
--
there are those who do and those who hang on and you don't see too
many doers quoting their contemporaries. -- Larry McVoy

2002-09-24 10:13:53

by Dipankar Sarma

Subject: Re: 2.5.38-mm2 [PATCH]

On Tue, Sep 24, 2002 at 02:41:09PM +1000, Rusty Russell wrote:
> On Mon, 23 Sep 2002 15:15:59 +0530
> Dipankar Sarma <[email protected]> wrote:
> > Later I will submit a full rcu_ltimer patch that contains
> > the call_rcu_preempt() interface which can be useful for
> > module unloading and the likes. This doesn't affect
> > the non-preemption path.
>
> You don't need this: I've dropped the requirement for module
> unload.

Isn't wait_for_later() similar to synchronize_kernel(), or has the
entire module unloading design been changed since?

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-24 14:59:10

by Rusty Russell

Subject: Re: 2.5.38-mm2 [PATCH]

In message <[email protected]> you write:
> On Tue, Sep 24, 2002 at 02:41:09PM +1000, Rusty Russell wrote:
> > On Mon, 23 Sep 2002 15:15:59 +0530
> > Dipankar Sarma <[email protected]> wrote:
> > > Later I will submit a full rcu_ltimer patch that contains
> > > the call_rcu_preempt() interface which can be useful for
> > > module unloading and the likes. This doesn't affect
> > > the non-preemption path.
> >
> > You don't need this: I've dropped the requirement for module
> > unload.
>
> Isn't wait_for_later() similar to synchronize_kernel(), or has the
> entire module unloading design been changed since?

Yes, that was *days* ago 8)

I now just use a synchronize_kernel() which schedules on every CPU,
and disable preemption in magic places.
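
Something like this, modulo details (a sketch only, not the real code;
the affinity API names here are from memory):

/*
 * Bounce the current task across every online CPU.  Once it has run
 * on a CPU, that CPU has context-switched, so any reader that had
 * preemption disabled there when we started must have finished.
 */
static void synchronize_kernel_sketch(void)
{
	unsigned long saved_mask = current->cpus_allowed;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpu_online(cpu))
			continue;
		set_cpus_allowed(current, 1UL << cpu);	/* migrate there */
	}
	set_cpus_allowed(current, saved_mask);		/* restore affinity */
}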

Ingo growled at me...
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-24 21:13:13

by Bill Davidsen

[permalink] [raw]
Subject: Re: 2.5.38-mm2

On Mon, 23 Sep 2002, Jens Axboe wrote:

> Ah, interesting. I do still think it is worth investigating _why_
> neither elevator_linus nor deadline prevents the read starvation.
> The read-latency patch is a hack, not a solution imo.
>
> > tagged command queueing on scsi - it appears to be quite stupidly
> > implemented.
>
> Ahem, I think you are being excessively harsh, or maybe passing judgement
> on something you haven't even looked at. Did you consider that your
> _drive_ may be the broken component? Excessive turn-around times for
> requests when using deep TCQ are not unusual, by far.

I do think that's what he meant! I think most drives are optimized this
way, and performance would be better if the kernel used the queueing more
sparingly, so the drive couldn't just run with the writes and let the
reads take the leftovers.

I think that's a better long-run solution, although the fix addresses the
immediate problem.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.