2002-08-11 07:24:44

by Andrew Morton

Subject: [patch 1/21] random fixes

Sorry, but there's a ton of stuff here. It ends up as a 4600 line
diff. Some code dating back to 2.5.24. It's almost all performance
work and it has been very painful getting its effectiveness tested
on the big machines; the main problem has been getting them booting
2.5 at all. The results still are not as conclusive as I'd like,
but the signs are good, and there are no other proposals around to
fix these problems.



This one is mainly a resend.

- I changed the sector_t thing in max_block to use davem's approach.
I agree with Anton, but making it explicit doesn't hurt.

- Remove a dead comment in copy_strings.

Old stuff:

- Remove the IO error warning in end_buffer_io_sync(). Failed READA
attempts trigger it.

- Emit a warning when ext2 is mounting an ext3 filesystem.

We have had quite a few problem reports related to this, mainly
arising from initrd problems. And mount(8) tends to report the
fstype from /etc/fstab rather than reporting what has really
happened.

Fixes some bogosity which I added to max_block():

- `size' doesn't need to be sector_t

- `retval' should not be initialised to "~0UL" because that is
0x00000000ffffffff with 64-bit sector_t.

- Allocate task_structs with GFP_KERNEL, as discussed.

- Convert the EXPORT_SYMBOL for generic_file_direct_IO() to
EXPORT_SYMBOL_GPL. That was only exported as a practicality for the
raw driver.

- Make the loop thread run balance_dirty_pages() after dirtying the
backing file. So it will perform writeback of the backing file when
dirty memory levels are high. Export balance_dirty_pages to GPL
modules for this.

This makes loop work a lot better - I suspect it broke when callers
of balance_dirty_pages() started writing back only their own queue.

There are many page allocation failures under heavy loop writeout.
Coming from blk_queue_bounce()'s allocation from the page_pool
mempool. So...

- Disable page allocation warnings around the initial atomic
allocation attempt in mempool_alloc() - the one where __GFP_WAIT and
__GFP_IO were turned off. That one can easily fail.

- Add some commentary in block_write_full_page()
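
The max_block() initialiser issue described above can be reproduced in user space. This is only a sketch: `sector64_t` is a stand-in for a 64-bit sector_t, the function names are invented for illustration, and it assumes a platform where unsigned int is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 64-bit sector_t; the real typedef lives in kernel headers. */
typedef uint64_t sector64_t;

/* Old initialiser: ~0U is a 32-bit unsigned int, so widening zero-extends it. */
sector64_t sentinel_from_unsigned(void)
{
	return ~0U;
}

/* Patched initialiser: the complement is taken after widening to 64 bits. */
sector64_t sentinel_from_cast(void)
{
	return ~((sector64_t)0);
}

/* Returns 1 when the two initialisers disagree, i.e. when the bug bites. */
int sentinels_differ(void)
{
	assert(sizeof(sector64_t) == 8);
	return sentinel_from_unsigned() != sentinel_from_cast();
}
```

With a 32-bit sector_t both forms give the same value, which is why the problem only shows up once sector_t is widened.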


drivers/block/loop.c | 2 ++
fs/block_dev.c | 6 +++---
fs/buffer.c | 13 +++++++++++--
fs/exec.c | 5 -----
fs/ext2/super.c | 3 +++
kernel/fork.c | 5 +++--
kernel/ksyms.c | 2 +-
mm/mempool.c | 2 ++
mm/page-writeback.c | 1 +
9 files changed, 26 insertions(+), 13 deletions(-)

--- 2.5.31/fs/buffer.c~misc Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/fs/buffer.c Sat Aug 10 23:23:35 2002
@@ -180,7 +180,10 @@ void end_buffer_io_sync(struct buffer_he
if (uptodate) {
set_buffer_uptodate(bh);
} else {
- buffer_io_error(bh);
+ /*
+ * This happens, due to failed READA attempts.
+ * buffer_io_error(bh);
+ */
clear_buffer_uptodate(bh);
}
unlock_buffer(bh);
@@ -2283,7 +2286,13 @@ int block_write_full_page(struct page *p
return -EIO;
}

- /* The page straddles i_size */
+ /*
+ * The page straddles i_size. It must be zeroed out on each and every
+ * writepage invokation because it may be mmapped. "A file is mapped
+ * in multiples of the page size. For a file that is not a multiple of
+ * the page size, the remaining memory is zeroed when mapped, and
+ * writes to that region are not written out to the file."
+ */
kaddr = kmap(page);
memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
--- 2.5.31/fs/block_dev.c~misc Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/fs/block_dev.c Sat Aug 10 23:23:35 2002
@@ -26,12 +26,12 @@

static sector_t max_block(struct block_device *bdev)
{
- sector_t retval = ~0U;
+ sector_t retval = ~((sector_t)0);
loff_t sz = bdev->bd_inode->i_size;

if (sz) {
- sector_t size = block_size(bdev);
- unsigned sizebits = blksize_bits(size);
+ unsigned int size = block_size(bdev);
+ unsigned int sizebits = blksize_bits(size);
retval = (sz >> sizebits);
}
return retval;
--- 2.5.31/fs/ext2/super.c~misc Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/fs/ext2/super.c Sat Aug 10 23:23:35 2002
@@ -698,6 +698,9 @@ static int ext2_fill_super(struct super_
printk(KERN_ERR "EXT2-fs: get root inode failed\n");
goto failed_mount2;
}
+ if (EXT2_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_HAS_JOURNAL))
+ ext2_warning(sb, __FUNCTION__,
+ "mounting ext3 filesystem as ext2\n");
ext2_setup_super (sb, es, sb->s_flags & MS_RDONLY);
return 0;
failed_mount2:
--- 2.5.31/kernel/fork.c~misc Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/kernel/fork.c Sat Aug 10 23:23:35 2002
@@ -106,9 +106,10 @@ static struct task_struct *dup_task_stru
struct thread_info *ti;

ti = alloc_thread_info();
- if (!ti) return NULL;
+ if (!ti)
+ return NULL;

- tsk = kmem_cache_alloc(task_struct_cachep,GFP_ATOMIC);
+ tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
if (!tsk) {
free_thread_info(ti);
return NULL;
--- 2.5.31/kernel/ksyms.c~misc Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/kernel/ksyms.c Sat Aug 10 23:23:35 2002
@@ -340,7 +340,7 @@ EXPORT_SYMBOL(register_disk);
EXPORT_SYMBOL(read_dev_sector);
EXPORT_SYMBOL(init_buffer);
EXPORT_SYMBOL(wipe_partitions);
-EXPORT_SYMBOL(generic_file_direct_IO);
+EXPORT_SYMBOL_GPL(generic_file_direct_IO);

/* tty routines */
EXPORT_SYMBOL(tty_hangup);
--- 2.5.31/drivers/block/loop.c~misc Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/drivers/block/loop.c Sat Aug 10 23:23:35 2002
@@ -74,6 +74,7 @@
#include <linux/slab.h>
#include <linux/loop.h>
#include <linux/suspend.h>
+#include <linux/writeback.h>
#include <linux/buffer_head.h> /* for invalidate_bdev() */

#include <asm/uaccess.h>
@@ -235,6 +236,7 @@ do_lo_send(struct loop_device *lo, struc
up(&mapping->host->i_sem);
out:
kunmap(bvec->bv_page);
+ balance_dirty_pages(mapping);
return ret;

unlock:
--- 2.5.31/mm/page-writeback.c~misc Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/mm/page-writeback.c Sat Aug 10 23:23:35 2002
@@ -133,6 +133,7 @@ void balance_dirty_pages(struct address_
if (!writeback_in_progress(bdi) && ps.nr_dirty > background_thresh)
pdflush_operation(background_writeout, 0);
}
+EXPORT_SYMBOL_GPL(balance_dirty_pages);

/**
* balance_dirty_pages_ratelimited - balance dirty memory state
--- 2.5.31/mm/mempool.c~misc Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/mm/mempool.c Sat Aug 10 23:23:35 2002
@@ -189,7 +189,9 @@ void * mempool_alloc(mempool_t *pool, in
int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);

repeat_alloc:
+ current->flags |= PF_NOWARN;
element = pool->alloc(gfp_nowait, pool->pool_data);
+ current->flags &= ~PF_NOWARN;
if (likely(element != NULL))
return element;

--- 2.5.31/fs/exec.c~misc Sat Aug 10 23:23:40 2002
+++ 2.5.31-akpm/fs/exec.c Sat Aug 10 23:24:12 2002
@@ -209,11 +209,6 @@ int copy_strings(int argc,char ** argv,
/* XXX: add architecture specific overflow check here. */
pos = bprm->p;

- /*
- * The only sleeping function which we are allowed to call in
- * this loop is copy_from_user(). Otherwise, copy_user_state
- * could get trashed.
- */
while (len > 0) {
int i, new, err;
int offset, bytes_to_copy;



2002-08-11 07:53:12

by Alexander Viro

Subject: Re: [patch 1/21] random fixes



On Sun, 11 Aug 2002, Andrew Morton wrote:

> flush_dcache_page(page);
> --- 2.5.31/fs/block_dev.c~misc Sat Aug 10 23:23:35 2002
> +++ 2.5.31-akpm/fs/block_dev.c Sat Aug 10 23:23:35 2002
> @@ -26,12 +26,12 @@
>
> static sector_t max_block(struct block_device *bdev)
> {
> - sector_t retval = ~0U;
> + sector_t retval = ~((sector_t)0);
> loff_t sz = bdev->bd_inode->i_size;
>
> if (sz) {
> - sector_t size = block_size(bdev);
> - unsigned sizebits = blksize_bits(size);
> + unsigned int size = block_size(bdev);
> + unsigned int sizebits = blksize_bits(size);
> retval = (sz >> sizebits);

Ugh. Why do we have all that stuff, anyway?

bdev->bd_inode->i_size >> bdev->bd_inode->i_blkbits

should work just fine...
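
A user-space sketch of that one-liner, for illustration only. `struct inode_model` and `max_block_model()` are hypothetical names that model just the two fields involved; the shift alone yields 0 for a zero-sized device, so the all-ones sentinel used by max_block() is not covered here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of the two inode fields involved. */
struct inode_model {
	int64_t  i_size;     /* device size in bytes (loff_t in the kernel) */
	unsigned i_blkbits;  /* log2 of the block size */
};

/* Number of whole blocks that fit in i_size, computed with a single shift. */
uint64_t max_block_model(const struct inode_model *inode)
{
	assert(inode->i_size >= 0 && inode->i_blkbits < 63);
	return (uint64_t)inode->i_size >> inode->i_blkbits;
}
```

For a 600 MB device with 4096-byte blocks (i_blkbits of 12) this gives 153600 blocks.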

2002-08-11 14:25:55

by Adam Kropelin

Subject: Re: [patch 1/21] random fixes

On Sun, Aug 11, 2002 at 12:38:19AM -0700, Andrew Morton wrote:
> Sorry, but there's a ton of stuff here. It ends up as a 4600 line
> diff. Some code dating back to 2.5.24. It's almost all performance

Andrew,

Nearly all the patches against mm/vmscan.c are failing when applied
to the 2.5.31 Linus just released. Are these patches against a
slightly older BK rev?

--Adam

2002-08-11 17:55:20

by Andrew Morton

Subject: Re: [patch 1/21] random fixes

Adam Kropelin wrote:
>
> On Sun, Aug 11, 2002 at 12:38:19AM -0700, Andrew Morton wrote:
> > Sorry, but there's a ton of stuff here. It ends up as a 4600 line
> > diff. Some code dating back to 2.5.24. It's almost all performance
>
> Andrew,
>
> Nearly all the patches against mm/vmscan.c are failing when applied
> to the 2.5.31 Linus just released. Are these patches against a
> slightly older BK rev?

Gee I hope not.

Try getting them from http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/,
or the big rollup http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/everything.gz

2002-08-12 00:23:58

by Adam Kropelin

Subject: Re: [patch 1/21] random fixes

On Sun, Aug 11, 2002 at 11:09:02AM -0700, Andrew Morton wrote:
> Adam Kropelin wrote:
> >
> > On Sun, Aug 11, 2002 at 12:38:19AM -0700, Andrew Morton wrote:
> > > Sorry, but there's a ton of stuff here. It ends up as a 4600 line
> > > diff. Some code dating back to 2.5.24. It's almost all performance
> >
> > Andrew,
> >
> > Nearly all the patches against mm/vmscan.c are failing when applied
> > to the 2.5.31 Linus just released. Are these patches against a
> > slightly older BK rev?
>
> Gee I hope not.
>
> Try getting them from http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/,
> or the big rollup http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/everything.gz

The big rollup applied fine, thanks.

I did a bit of testing since I've always thought 2.4 (and 2.5) writeout behavior
left something to be desired. Testbed was a SMP x86 (2xPPro-200) with 160 MB
of RAM. I used everyone's favorite 2.5 scapegoat: IDE, with a single not-very-
fast IBM disk. Filesystem was ext3 in data=ordered mode. Test workload was an
inbound (from the point of view of the system under test) FTP transfer of a
600 MB iso image. All test runs were from a clean boot with all unnecessary
services shut down.

Results (average of 4 runs):

2.5.31-akpm: 2m 43s
2.5.31: 2m 33s
2.4.19: 2m 18s

`vmstat 1` shows some differences, especially with respect to 2.4 vs. 2.5. In
about 40% of the cases when the bo drops to (near) 0, the machine stalled (FTP
transfer halted, vmstat output paused, etc.). With 2.5.31-akpm, the stalls were
about 3-4 seconds in length. With 2.5.31, the stalls were of the same duration,
but slightly less frequent. With 2.4.19, the stalls were very frequent (closer
to 70% of the time bo hit 0), but were only 1-2 seconds in duration.

Below are representative samples of `vmstat 1` for each kernel during the test. (Note that the low cache usage in the 2.5.31 sample is because the snapshot is
from early in the run when the cache is still filling.)

Let me know if I can provide more information...

--Adam

2.5.31-akpm:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 2 1 112 3436 0 140956 0 0 4 15480 5454 400 0 39 60
0 2 1 112 3436 0 140956 0 0 0 7696 1093 69 0 2 98
0 2 1 112 3436 0 140956 0 0 0 6268 1084 85 0 31 69
1 0 2 112 2476 0 142012 0 0 0 4 2863 250 0 23 77
1 0 0 112 3080 0 142080 0 0 0 68 6730 485 0 46 53
0 1 1 112 2940 0 141968 0 0 0 11720 5025 340 1 33 67
0 1 1 112 2936 0 141968 0 0 0 264 1085 45 0 1 99
1 0 1 112 2812 0 142344 0 0 0 52 3104 203 0 18 82
0 0 0 112 3300 0 141972 0 0 0 4 6761 469 1 42 57
1 0 0 112 3492 0 141684 0 0 0 0 6859 495 1 42 56
0 1 1 112 3548 0 141204 0 0 0 15508 4769 328 0 31 69
0 1 1 112 3544 0 141204 0 0 0 2268 1081 63 0 2 98
0 0 0 112 2436 0 142248 0 0 0 56 2006 147 0 10 90
1 0 0 112 2952 0 142328 0 0 0 4 6760 452 1 43 56
1 0 1 112 3432 0 141716 0 0 0 0 6955 464 1 42 57
0 1 1 112 2940 0 141816 0 0 0 15612 4301 262 0 28 72
0 1 1 112 2932 0 141816 0 0 0 588 1095 78 0 2 98
1 0 0 112 2620 0 142660 0 0 0 52 4554 314 1 30 69
1 0 0 112 3420 0 141808 0 0 0 4 6673 465 0 43 57
0 0 0 112 2628 0 142456 0 0 0 4 6931 491 1 44 55

2.5.31:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 118940 0 28240 0 0 4 0 4171 256 1 21 78
1 0 0 0 110904 0 36036 0 0 0 0 8937 590 1 53 46
1 0 0 0 103260 0 43452 0 0 0 0 8558 559 1 50 49
0 0 0 0 97100 0 49424 0 0 0 0 6919 460 1 41 58
0 1 1 0 96048 0 50104 0 0 0 21036 1798 67 0 9 90
0 1 1 0 96044 0 50104 0 0 0 3888 1087 55 0 2 98
0 1 1 0 96044 0 50104 0 0 0 0 1081 65 0 1 99
1 0 0 0 91516 0 54544 0 0 0 72 5305 352 0 33 67
0 0 0 0 85392 0 60560 0 0 0 0 6972 458 0 44 56
0 0 1 0 79344 0 66476 0 0 0 10788 6384 3173 1 48 50
1 0 0 0 73296 0 72416 0 0 0 44 6705 1392 1 49 50
0 0 0 0 67156 0 78444 0 0 0 0 6975 475 1 62 37
1 0 0 0 61392 0 84104 0 0 0 0 6603 442 0 37 62
0 1 1 0 55272 0 90016 0 0 0 15500 6940 451 1 42 57
0 1 1 0 55272 0 90016 0 0 0 7696 1123 13 0 3 97

2.4.19:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 4384 2124 140132 0 0 0 52 6961 645 0 54 45
1 0 0 0 4372 2132 140024 0 0 0 0 6994 653 1 50 50
0 1 1 0 4360 2136 139916 0 0 0 3956 6189 577 1 44 55
0 1 1 0 4360 2136 139916 0 0 0 8196 223 14 0 2 97
0 0 1 0 4344 2140 139908 0 0 0 6080 1189 90 0 9 91
0 1 1 0 4440 2140 139764 0 0 4 7296 5902 557 0 43 57
1 0 0 0 4360 2144 140044 0 0 0 56 3515 307 0 29 71
0 1 1 0 4468 2144 139936 0 0 0 4036 5672 519 0 42 57
0 1 1 0 4468 2144 139936 0 0 0 7960 220 14 0 1 99
1 0 1 0 4464 2144 139980 0 0 0 5160 2073 178 0 17 82
1 0 0 0 4396 2164 140092 0 0 0 3148 6965 656 1 51 48
1 0 0 0 4396 2164 140068 0 0 0 0 7193 656 1 44 54
0 2 1 0 4384 2164 139996 0 0 0 5848 4923 454 1 37 62
0 2 1 0 4384 2164 139996 0 0 0 6148 222 10 0 0 99
1 0 1 0 4400 2168 139900 0 0 0 7400 2961 258 0 24 75
1 0 0 0 4464 2184 140004 0 0 0 52 7076 659 1 51 48
1 0 0 0 4452 2184 139936 0 0 0 0 6960 638 0 54 46
0 1 2 0 4404 2188 139932 0 0 0 5968 4332 399 0 30 69
0 1 1 0 4404 2188 139932 0 0 0 4804 222 12 0 1 99

2002-08-12 00:37:49

by Rik van Riel

Subject: Re: [patch 1/21] random fixes

On Sun, 11 Aug 2002, Adam Kropelin wrote:

> fast IBM disk. Filesystem was ext3 in data=ordered mode. Test workload
> was an inbound (from the point of view of the system under test) FTP
> transfer of a 600 MB iso image. All test runs were from a clean boot
> with all unnecessary services shut down.

> machine stalled (FTP transfer halted, vmstat output paused, etc.). With
> 2.5.31-akpm, the stalls were about 3-4 seconds in length. With 2.5.31,
> the stalls were of the same duration, but slightly less frequent. With

Definitely some writeout silliness. Why would we ever stop
writing pages to disk while a transfer is going on and then
suddenly decide to stall the system because pages are being
dirtied at a rate faster than we write them ?

If we can smooth out the writing we can keep the disks busy
all the time and should in theory perform better. I wonder
why Andrew made the writeout in 2.5 _more_ bursty ...

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-08-12 02:50:48

by Adam Kropelin

Subject: Re: [patch 1/21] random fixes

FYI, just got this while un-tarring a kernel tree with 2.5.31+everything.gz:
(no nvidia ;)

--Adam

ksymoops 2.4.1 on i686 2.5.31-akpm. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.5.31-akpm/ (default)
-m /boot/System.map-2.5.31-akpm (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

No modules in ksyms, skipping objects
Warning (read_lsmod): no symbols in lsmod, is /proc/modules a valid lsmod file?
Warning (compare_maps): ksyms_base symbol GPLONLY___wake_up_sync not found in System.map. Ignoring ksyms_base entry
Warning (compare_maps): ksyms_base symbol GPLONLY_balance_dirty_pages not found in System.map. Ignoring ksyms_base entry
Warning (compare_maps): ksyms_base symbol GPLONLY_generic_file_direct_IO not found in System.map. Ignoring ksyms_base entry
Warning (compare_maps): ksyms_base symbol GPLONLY_idle_cpu not found in System.map. Ignoring ksyms_base entry
Warning (compare_maps): ksyms_base symbol GPLONLY_set_cpus_allowed not found in System.map. Ignoring ksyms_base entry
kernel BUG at page_alloc.c:98!
invalid operand: 0000
CPU: 1
EIP: 0010:[<c0132503>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010282
eax: c89d5840 ebx: c10c7000 ecx: 00000000 edx: 00000000
esi: c51f5e70 edi: 00000005 ebp: 00000010 esp: c51f5e14
ds: 0018 es: 0018 ss: 0018
Stack: 00009000 c1000018 c1123238 c1028018 c0313c60 00000206 ffffffff 00001a66
00000000 00000008 c51f5e70 00000005 00000010 c0132f7a c10caa48 00000009
c0130e1b c51f5e6c 00000000 c89d5e20 c2f88dd0 00000000 00000009 c10570e8
Call Trace: [<c0132f7a>] [<c0130e1b>] [<c0129f01>] [<c0114791>] [<c0165c04>]
[<c0116569>] [<c011b3f9>] [<c0111370>] [<c0107183>]
Code: 0f 0b 62 00 85 b6 2c c0 8b 03 ba 04 00 00 00 83 e0 10 74 1d

>>EIP; c0132503 <__free_pages_ok+93/300> <=====
Trace; c0132f7a <__pagevec_free+1a/20>
Trace; c0130e1b <__pagevec_release+fb/110>
Trace; c0129f01 <exit_mmap+1a1/280>
Trace; c0114791 <default_wake_function+21/40>
Trace; c0165c04 <ext3_release_file+14/20>
Trace; c0116569 <mmput+49/70>
Trace; c011b3f9 <do_exit+d9/2c0>
Trace; c0111370 <smp_apic_timer_interrupt+e0/120>
Trace; c0107183 <syscall_call+7/b>
Code; c0132503 <__free_pages_ok+93/300>
00000000 <_EIP>:
Code; c0132503 <__free_pages_ok+93/300> <=====
0: 0f 0b ud2a <=====
Code; c0132505 <__free_pages_ok+95/300>
2: 62 00 bound %eax,(%eax)
Code; c0132507 <__free_pages_ok+97/300>
4: 85 b6 2c c0 8b 03 test %esi,0x38bc02c(%esi)
Code; c013250d <__free_pages_ok+9d/300>
a: ba 04 00 00 00 mov $0x4,%edx
Code; c0132512 <__free_pages_ok+a2/300>
f: 83 e0 10 and $0x10,%eax
Code; c0132515 <__free_pages_ok+a5/300>
12: 74 1d je 31 <_EIP+0x31> c0132534 <__free_pages_ok+c4/300>


7 warnings issued. Results may not be reliable.

2002-08-12 03:26:44

by Andrew Morton

Subject: Re: [patch 1/21] random fixes

Adam Kropelin wrote:
>
> FYI, just got this while un-tarring a kernel tree with 2.5.31+everything.gz:
> (no nvidia ;)
>

That'll be this one:

BUG_ON(page->pte.chain != NULL);

we've had a few reports of this dribbling in since rmap went in. But
nothing repeatable enough for it to be hunted down.

But we do have a repeatable inconsistency happening with ntpd and
memory pressure. That may be related, but in that case it's probably
related to mlock().

So. An open bug, alas.

2002-08-12 04:44:34

by Andrew Morton

Subject: Re: [patch 1/21] random fixes

Adam Kropelin wrote:
>
> ...
> I did a bit of testing since I've always thought 2.4 (and 2.5) writeout behavior
> left something to be desired. Testbed was a SMP x86 (2xPPro-200) with 160 MB
> of RAM. I used everyone's favorite 2.5 scapegoat: IDE, with a single not-very-
> fast IBM disk. Filesystem was ext3 in data=ordered mode.

ext3 performs its own writeback alongside the core kernel's writeback
decisions, so that complicates things.

> Test workload was an
> inbound (from the point of view of the system under test) FTP transfer of a
> 600 MB iso image. All test runs were from a clean boot with all unnecessary
> services shut down.
>
> Results (average of 4 runs):
>
> 2.5.31-akpm: 2m 43s
> 2.5.31: 2m 33s
> 2.4.19: 2m 18s

yes. For this workload (10 mbyte/sec ftp transfer onto a >20 meg/sec
disk) the application should never block on IO - all writeback should
happen via pdflush.

2.4 starts background writeback at 30% dirty and synchronous writeback
at 60% dirty.

2.5 starts background writeback at 40% dirty and synchronous writeback
at 50% dirty.

You can make 2.5 use the 2.4 settings with

cd /proc/sys/vm
echo 30 > dirty_background_ratio
echo 60 > dirty_async_ratio
echo 70 > dirty_sync_ratio

and I expect you'll find that fixes it up. Setting dirty_background_ratio
to 10% will make it even better. But it will hurt dbench numbers at
certain client counts, which is a national emergency.
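
To make the effect of the two thresholds concrete, here is a toy classifier. It is not the kernel's logic, only a model of "background writeback above one percentage, synchronous writeback above another", using the 30%/60% and 40%/50% figures quoted above. The enum and function names are made up for the example.

```c
#include <assert.h>

enum wb_action { WB_NONE, WB_BACKGROUND, WB_SYNC };

/*
 * Toy model only: classify a dirty-memory percentage against a background
 * threshold and a synchronous threshold, both given in percent.
 */
enum wb_action classify_dirty(unsigned dirty_pct,
			      unsigned background_pct, unsigned sync_pct)
{
	assert(background_pct <= sync_pct);
	if (dirty_pct > sync_pct)
		return WB_SYNC;        /* the writer itself is made to write back */
	if (dirty_pct > background_pct)
		return WB_BACKGROUND;  /* background (pdflush-style) writeout */
	return WB_NONE;
}
```

In this model, 35% dirty triggers background writeout with the 2.4 numbers but nothing with the 2.5 defaults, and 55% dirty is already synchronous with the 2.5 defaults but still background with the 2.4 numbers. The window between the two thresholds is much narrower in 2.5.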

Sigh. I don't know what the right numbers are. There aren't any; that's
the problem with magic numbers. That part of the kernel is making writeback
and throttling decisions in total ignorance of the overall state of
the system.

Worst comes to worst, we can set the 2.5 knobs at the same level as the
2.4 ones, but I'd much rather we come up with something dynamic.

In fact, I'd be inclined to set the background ratio much lower than
2.4, and to hell with dbench. Because the lower level is better for
real programs, as you've observed.

Care to tune and retest?

2002-08-13 00:22:30

by Adam Kropelin

Subject: Re: [patch 1/21] random fixes

On Sun, Aug 11, 2002 at 09:58:22PM -0700, Andrew Morton wrote:
> ext3 performs its own writeback alongside the core kernel's writeback
> decisions, so that complicates things.

I ran the test after mounting the partition as ext2 and saw a slight
decrease in performance (7-10 seconds over the duration of the test), but I
did not have time to run more than once so this could be a fluke. In general,
the `vmstat 1` output looked the same to me.

> > Results (average of 4 runs):
> >
> > 2.5.31-akpm: 2m 43s
> > 2.5.31: 2m 33s
> > 2.4.19: 2m 18s
>
> yes. For this workload (10 mbyte/sec ftp transfer onto a >20 meg/sec
> disk) the application should never block on IO - all writeback should
> happen via pdflush.

> You can make 2.5 use the 2.4 settings with
>
> cd /proc/sys/vm
> echo 30 > dirty_background_ratio
> echo 60 > dirty_async_ratio
> echo 70 > dirty_sync_ratio

These settings bring -akpm in line with stock 2.5.31, but they are both
still slower than 2.4.19 (which itself could do better, I think).

> and I expect you'll find that fixes it up. Setting dirty_background_ratio
> to 10% will make it even better. But it will hurt dbench numbers at

No real change at 10%. It's consistently a second or two faster than -akpm is at
30%, but not a drastic change.

> certain client counts, which is a national emergency.
>
> Sigh. I don't know what the right numbers are. There aren't any; that's
> the problem with magic numbers. That part of the kernel is making writeback
> and throttling decisions in total ignorance of the overall state of
> the system.

It certainly seems something is amiss. If we could actually manage to keep
the disk busy (and this is a fairly slow disk), we'd do wonderfully. But with
a 2-3 second pause every 4-5 seconds, we're transferring data barely 50% of the
time. (Yes, the pause is long enough the disk activity LED actually goes out.)
The short-term average transfer rate over the FTP connection is very
respectable for older hardware: 7-8 MB/s. But with the stalls, the overall
throughput is just over 4 MB/s.
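
The arithmetic behind that estimate can be written down as a small helper. This is a back-of-the-envelope sketch; `effective_rate()` and its parameter names are invented here, and it assumes the transfer runs at a steady rate outside the stalls and at zero during them.

```c
#include <assert.h>

/*
 * Average rate over one burst/stall cycle.
 * burst_rate: rate while data is flowing (MB/s)
 * cycle_secs: length of a full cycle, stall included
 * stall_secs: portion of the cycle with no transfer
 */
double effective_rate(double burst_rate, double cycle_secs, double stall_secs)
{
	assert(cycle_secs > 0.0 && stall_secs >= 0.0 && stall_secs <= cycle_secs);
	return burst_rate * (cycle_secs - stall_secs) / cycle_secs;
}
```

For instance, 8 MB/s with a 2 second stall in every 4 second cycle averages out to 4 MB/s, which is in line with the overall figure above.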

> In fact, I'd be inclined to set the background ratio much lower than
> 2.4, and to hell with dbench. Because the lower level is better for
> real programs, as you've observed.

> Care to tune and retest?

Absolutely. I'll try whatever ideas/patches you want to throw at me.

BTW, full `vmstat 1` logs are available for all these tests if you want them.

--Adam

2002-08-13 00:48:12

by Andrew Morton

Subject: Re: [patch 1/21] random fixes

Adam Kropelin wrote:
>
> ...
> > You can make 2.5 use the 2.4 settings with
> >
> > cd /proc/sys/vm
> > echo 30 > dirty_background_ratio
> > echo 60 > dirty_async_ratio
> > echo 70 > dirty_sync_ratio
>
> These settings bring -akpm in line with stock 2.5.31, but they are both
> still slower than 2.4.19 (which itself could do better, I think).

In that case I'm confounded. It worked sweetly for me. Just

wget ftp://other-machine/600-meg-file

on a machine booted with mem=160m. Took 63 seconds over 100bT,
steady column of writes in vmstat.

Which ftp client are you using? And can you strace it, to see how
much data it's writing per system call?

2002-08-13 02:22:04

by Adam Kropelin

Subject: Re: [patch 1/21] random fixes

On Mon, Aug 12, 2002 at 05:49:40PM -0700, Andrew Morton wrote:
> Adam Kropelin wrote:
> >
> > ...
> > > You can make 2.5 use the 2.4 settings with
> > >
> > > cd /proc/sys/vm
> > > echo 30 > dirty_background_ratio
> > > echo 60 > dirty_async_ratio
> > > echo 70 > dirty_sync_ratio
> >
> > These settings bring -akpm in line with stock 2.5.31, but they are both
> > still slower than 2.4.19 (which itself could do better, I think).
>
> In that case I'm confounded. It worked sweetly for me. Just

> Which ftp client are you using? And can you strace it, to see how
> much data it's writing per system call?

Actually, I'm running an FTP server on the testbed machine and pushing the
data from a client on another (much faster) machine. I straced the server
(redhat wu-ftpd2.6.1-20) and it looks like 8 KB reads/writes.

After the transfer gets going...

1329 read(8, "v&X\205:\327.+\310/a\335\24Sa\361c\243\r\244\260~\264z"..., 8192) = 8192
1329 write(7, "v&X\205:\327.+\310/a\335\24Sa\361c\243\r\244\260~\264z"..., 8192) = 8192
1329 rt_sigaction(SIGALRM, {0x804b030, [ALRM], SA_RESTART|0x4000000}, {0x804b030, [ALRM], SA_RESTART|0x4000000}, 8) = 0
1329 alarm(1200) = 1200
1329 read(8, "\335\235\335\35}\335]\375\17\373|\324VS[\r\266Af\333\246"..., 8192) = 8192
1329 write(7, "\335\235\335\35}\335]\375\17\373|\324VS[\r\266Af\333\246"..., 8192) = 8192
1329 rt_sigaction(SIGALRM, {0x804b030, [ALRM], SA_RESTART|0x4000000}, {0x804b030, [ALRM], SA_RESTART|0x4000000}, 8) = 0
1329 alarm(1200) = 1200
1329 read(8, "\302\365SV4\24{*\341\336\24\213\242\363\307\36\274\377"..., 8192) = 8192
1329 write(7, "\302\365SV4\24{*\341\336\24\213\242\363\307\36\274\377"..., 8192) = 8192
1329 rt_sigaction(SIGALRM, {0x804b030, [ALRM], SA_RESTART|0x4000000}, {0x804b030, [ALRM], SA_RESTART|0x4000000}, 8) = 0
1329 alarm(1200) = 1200

...etc.
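
For anyone wanting to reproduce the same I/O pattern without wu-ftpd, the read/write loop in the trace boils down to roughly the following. This is a minimal sketch using plain POSIX read(2) and write(2); `copy_stream()` is a made-up name, and the real server also re-arms a SIGALRM timer on every iteration, which is left out here.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

#define CHUNK 8192  /* matches the 8 KB transfers visible in the strace */

/* Copy in_fd to out_fd until EOF; returns bytes copied, or -1 on error. */
long copy_stream(int in_fd, int out_fd)
{
	char buf[CHUNK];
	long total = 0;

	assert(in_fd >= 0 && out_fd >= 0);
	for (;;) {
		ssize_t n = read(in_fd, buf, sizeof(buf));
		if (n == 0)
			return total;          /* end of input */
		if (n < 0) {
			if (errno == EINTR)
				continue;      /* interrupted, retry the read */
			return -1;
		}
		/* write() may accept fewer bytes than asked for; loop until done */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		total += n;
	}
}
```

Called with a socket as in_fd and a file on the test partition as out_fd, it dirties page cache 8 KB at a time, the same as the trace above.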

Following your method and wget'ting from a remote server seems to do
a bit better (just watching vmstat since I can't compare timings against
my original method). wget seems to read 8K and write it in two 4K writes.
Don't know if this has anything to do with things... Pauses are still
there and the disc activity light still goes out several times per minute
coincident with the pauses.

--Adam

2002-08-13 02:49:36

by Andrew Morton

Subject: Re: [patch 1/21] random fixes

Adam Kropelin wrote:
>
> On Mon, Aug 12, 2002 at 05:49:40PM -0700, Andrew Morton wrote:
> > Adam Kropelin wrote:
> > >
> > > ...
> > > > You can make 2.5 use the 2.4 settings with
> > > >
> > > > cd /proc/sys/vm
> > > > echo 30 > dirty_background_ratio
> > > > echo 60 > dirty_async_ratio
> > > > echo 70 > dirty_sync_ratio
> > >
> > > These settings bring -akpm in line with stock 2.5.31, but they are both
> > > still slower than 2.4.19 (which itself could do better, I think).
> >
> > In that case I'm confounded. It worked sweetly for me. Just
>
> > Which ftp client are you using? And can you strace it, to see how
> > much data it's writing per system call?
>
> Actually, I'm running an FTP server on the testbed machine and pushing the
> data from a client on another (much faster) machine. I straced the server
> (redhat wu-ftpd2.6.1-20) and it looks like 8 KB reads/writes.
>

OK, tried that against a slow disk (13 megs/sec write bandwidth). 2.5.31,
default writeback settings.

ext3 is misbehaving:

r b w swpd free buff cache si so bi bo in cs us sy id
0 2 2 5104 4376 0 134016 0 0 0 21620 2888 1966 0 5 95
0 0 2 5104 4448 0 134224 0 0 0 11420 4787 4004 0 8 92
1 0 0 5104 4464 0 134776 0 0 0 100 13133 12564 1 24 75
1 0 0 5104 4440 0 134716 0 0 8 0 13281 12660 1 23 76
0 0 0 5104 4480 0 134448 0 0 56 0 13272 13022 1 22 77
0 1 2 5104 4592 0 133880 0 0 0 27200 2598 1596 0 5 95
0 1 2 5104 4588 0 133880 0 0 0 11544 1127 128 0 2 98
0 0 1 5104 4356 0 134388 0 0 0 692 10383 9839 0 21 79
1 0 0 5104 4368 0 134836 0 0 0 108 13115 12912 1 25 74
0 0 0 5104 4360 0 134556 0 0 36 68 11829 11687 1 20 79

and takes 86 seconds.

When the server is writing to ext2, it is good:

1 0 0 5104 4364 0 135248 0 0 56 12380 13316 16547 1 17 82
0 0 0 5104 4388 0 135296 0 0 0 12324 13310 16488 1 16 83
1 0 0 5104 4056 0 135600 0 0 0 12344 13300 16521 1 15 84
0 0 0 5104 4368 0 135264 0 0 0 12324 13293 16480 0 16 84
1 0 0 5104 4428 0 135184 0 0 0 8216 13306 16514 1 16 83
0 0 0 5104 4396 0 135172 0 0 48 12380 13296 16444 1 16 83
0 0 0 5104 4392 0 135148 0 0 56 12324 13304 16461 1 16 82
1 0 0 5104 4396 0 135196 0 0 0 12324 13297 16468 1 17 82
1 0 0 5104 4444 0 135116 0 0 0 12348 13304 16511 1 18 81

and the transfer takes 54 seconds, which is wirespeed.

The ext3 stall is going to require some thought - it's waiting on a previous
transaction commit so it can get in and modify an inode block again.

Are you _sure_ it was bad with ext2? How long does

dd if=/dev/zero of=foo bs=1M count=600 ; sync

take against that disk?

2002-08-13 04:06:26

by Adam Kropelin

Subject: Re: [patch 1/21] random fixes

On Mon, Aug 12, 2002 at 08:03:34PM -0700, Andrew Morton wrote:
> Adam Kropelin wrote:
> > Actually, I'm running an FTP server on the testbed machine and pushing the
> > data from a client on another (much faster) machine. I straced the server
> > (redhat wu-ftpd2.6.1-20) and it looks like 8 KB reads/writes.
> >
>
> OK, tried that against a slow disk (13 megs/sec write bandwidth). 2.5.31,
> default writeback settings.
>
> ext3 is misbehaving:
> and takes 86 seconds.
>
> When the server is writing to ext2, it is good:
> and the transfer takes 54 seconds, which is wirespeed.
>
> Are you _sure_ it was bad with ext2?

Yes.

[root@devbox adk0212] mount
/dev/hda3 on / type ext2 (rw)
none on /proc type proc (rw)
/dev/hda1 on /boot type ext2 (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /dev/shm type tmpfs (rw)

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 1 1 120 4360 0 141132 0 0 0 9804 6775 564 0 45 55
0 1 1 120 4344 0 141132 0 0 0 0 1083 20 0 0 99
0 0 0 120 4364 0 141116 0 0 0 40 2098 156 0 11 89
0 0 0 120 4384 0 141368 0 0 0 4 7013 594 0 52 47
0 0 0 120 4360 0 141416 0 0 0 0 6914 589 1 56 43
0 1 1 120 4464 0 140856 0 0 0 15420 6235 520 0 42 58
0 1 1 120 4456 0 140856 0 0 0 3240 1094 36 0 2 98
1 0 0 120 4428 0 140844 0 0 0 52 1151 70 0 4 96
1 0 0 120 4440 0 141356 0 0 0 4 6810 541 1 42 57
0 0 0 120 4464 0 141320 0 0 0 0 6894 553 1 40 58
0 1 1 120 4396 0 140840 0 0 0 15508 6018 466 0 40 59
0 1 1 120 4388 0 140840 0 0 0 1608 1093 57 0 2 98
0 0 0 120 4404 0 140832 0 0 0 52 2350 165 0 12 87
0 0 0 120 4460 0 141380 0 0 0 4 7040 564 1 42 57
1 0 0 120 4356 0 141372 0 0 0 4 7073 570 1 45 54
0 1 1 120 4360 0 140916 0 0 0 15404 5541 437 1 36 63
0 1 1 120 4356 0 140916 0 0 0 2832 1084 55 0 1 99
0 0 0 120 4356 0 140904 0 0 0 48 1614 125 0 8 91
0 0 1 120 4380 0 141412 0 0 0 4 6888 552 1 43 56
1 0 0 120 4232 0 141476 0 0 4 0 6857 556 1 40 58
0 1 1 120 4352 0 140988 0 0 0 13700 5148 449 0 35 65

Is it possible that the darn thing is mounted ext3 even though fstab and mount
agree that it's ext2?
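
One way to narrow that down is to look at the on-disk superblock for the journal feature bit, which is the same bit the ext2 warning earlier in this thread tests. The sketch below only parses a superblock already read into memory (it starts at byte 1024 of the partition); the function name and return convention are invented for illustration. It says whether the filesystem has a journal, not which driver mounted it. /proc/mounts should show what the kernel is actually using.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SB_MAGIC_OFFSET            56      /* s_magic, 16-bit little-endian */
#define SB_FEATURE_COMPAT_OFFSET   92      /* s_feature_compat, 32-bit little-endian */
#define EXT2_SUPER_MAGIC_VALUE     0xEF53
#define FEATURE_COMPAT_HAS_JOURNAL 0x0004

/*
 * sb points at the raw superblock bytes, len is how many are valid.
 * Returns 1 if the journal feature is set, 0 if it is not,
 * and -1 if the buffer does not look like an ext2/ext3 superblock.
 */
int superblock_has_journal(const unsigned char *sb, size_t len)
{
	uint32_t compat;
	uint16_t magic;

	assert(sb != NULL);
	if (len < SB_FEATURE_COMPAT_OFFSET + 4)
		return -1;

	magic = (uint16_t)(sb[SB_MAGIC_OFFSET] | (sb[SB_MAGIC_OFFSET + 1] << 8));
	if (magic != EXT2_SUPER_MAGIC_VALUE)
		return -1;

	compat = (uint32_t)sb[SB_FEATURE_COMPAT_OFFSET]
	       | ((uint32_t)sb[SB_FEATURE_COMPAT_OFFSET + 1] << 8)
	       | ((uint32_t)sb[SB_FEATURE_COMPAT_OFFSET + 2] << 16)
	       | ((uint32_t)sb[SB_FEATURE_COMPAT_OFFSET + 3] << 24);
	return (compat & FEATURE_COMPAT_HAS_JOURNAL) ? 1 : 0;
}
```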

> How long does
>
> dd if=/dev/zero of=foo bs=1M count=600 ; sync
>
> take against that disk?

1m 23s (I said it was a slow disk ;)

Even during that, the writeout was inconsistent (but a lot better than during
the FTP transfer):

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 1 3 1784 2180 0 141072 0 0 0 5220 1070 19 0 6 93
0 1 2 1784 2248 0 141020 0 0 0 8064 1066 23 0 8 92
1 0 3 1784 2296 0 141008 0 0 0 8436 1132 36 0 12 87
0 1 3 1784 2300 0 141004 0 0 0 6828 1072 164 0 24 75
1 0 2 1784 2988 0 140336 0 0 0 4664 1071 144 0 21 79
1 0 2 1784 2616 0 140700 0 0 0 12944 2688 102 0 5 95
0 1 3 1784 2296 0 141036 0 0 0 10048 1076 125 1 21 78
0 1 1 1784 3284 0 140048 0 0 4 5504 1064 143 0 19 80
0 1 1 1784 3284 0 140048 0 0 0 0 1064 51 0 1 99
0 1 1 1784 3284 0 140048 0 0 0 0 1058 23 0 1 99
1 1 3 1812 2312 0 141236 0 28 0 22892 2495 131 0 10 90
0 2 3 1812 3204 0 140340 0 0 4 7736 1065 81 0 25 75
0 2 3 1812 3204 0 140340 0 0 0 3848 1062 52 0 9 90
0 2 3 1812 3204 0 140340 0 0 0 7696 1059 50 0 2 98
0 1 3 1812 3196 0 140336 0 0 4 3976 1061 58 0 20 80
0 1 3 1812 3312 0 140208 0 0 0 7944 1065 25 0 4 96
0 1 2 1812 3308 0 140208 0 0 0 3844 1065 32 0 1 99
0 1 2 1812 3308 0 140208 0 0 0 2956 1056 43 0 3 97
0 1 2 1812 3268 0 140248 0 0 4 5548 1059 64 0 5 94
0 1 2 1812 3268 0 140252 0 0 0 236 1065 56 0 4 96
0 1 2 1812 3268 0 140252 0 0 0 0 1058 42 0 1 99

(all of the above discussion was 2.5.31 stock with default writeout settings)

I've been trying these sorts of tests on this machine for over a year now,
with various disk subsystems, and I have *never* seen anything as nice and
consistent as the ext2 writeout you quoted. Maybe this machine is cursed.

--Adam

2002-08-13 05:18:18

by Andrew Morton

Subject: Re: [patch 1/21] random fixes

Adam Kropelin wrote:
>
> On Mon, Aug 12, 2002 at 08:03:34PM -0700, Andrew Morton wrote:
> > Adam Kropelin wrote:
> > > Actually, I'm running an FTP server on the testbed machine and pushing the
> > > data from a client on another (much faster) machine. I straced the server
> > > (redhat wu-ftpd2.6.1-20) and it looks like 8 KB reads/writes.
> > >
> >
> > OK, tried that against a slow disk (13 megs/sec write bandwidth). 2.5.31,
> > default writeback settings.
> >
> > ext3 is misbehaving:
> > and takes 86 seconds.
> >
> > When the server is writing to ext2, it is good:
> > and the transfer takes 54 seconds, which is wirespeed.
> >
> > Are you _sure_ it was bad with ext2?
>
> Yes.
>
> [root@devbox adk0212] mount
> /dev/hda3 on / type ext2 (rw)
> none on /proc type proc (rw)
> /dev/hda1 on /boot type ext2 (rw)
> none on /dev/pts type devpts (rw,gid=5,mode=620)
> none on /dev/shm type tmpfs (rw)
>
> procs memory swap io system cpu
> r b w swpd free buff cache si so bi bo in cs us sy id
> 0 1 1 120 4360 0 141132 0 0 0 9804 6775 564 0 45 55
> 0 1 1 120 4344 0 141132 0 0 0 0 1083 20 0 0 99
> 0 0 0 120 4364 0 141116 0 0 0 40 2098 156 0 11 89
> 0 0 0 120 4384 0 141368 0 0 0 4 7013 594 0 52 47
> 0 0 0 120 4360 0 141416 0 0 0 0 6914 589 1 56 43
> 0 1 1 120 4464 0 140856 0 0 0 15420 6235 520 0 42 58
> 0 1 1 120 4456 0 140856 0 0 0 3240 1094 36 0 2 98
> 1 0 0 120 4428 0 140844 0 0 0 52 1151 70 0 4 96
> 1 0 0 120 4440 0 141356 0 0 0 4 6810 541 1 42 57
> 0 0 0 120 4464 0 141320 0 0 0 0 6894 553 1 40 58
> 0 1 1 120 4396 0 140840 0 0 0 15508 6018 466 0 40 59
> 0 1 1 120 4388 0 140840 0 0 0 1608 1093 57 0 2 98
> 0 0 0 120 4404 0 140832 0 0 0 52 2350 165 0 12 87
> 0 0 0 120 4460 0 141380 0 0 0 4 7040 564 1 42 57
> 1 0 0 120 4356 0 141372 0 0 0 4 7073 570 1 45 54
> ...

Sure looks like ext3.

>
> Is it possible that the darn thing is mounted ext3 even though fstab and mount
> agree that it's ext2?

Yes. Although it's usually the other way round. "How come it keeps running
fsck even though mount says ext3?".

Take a look in /proc/mounts.

> > How long does
> >
> > dd if=/dev/zero of=foo bs=1M count=600 ; sync
> >
> > take against that disk?
>
> 1m 23s (I said it was a slow disk ;)

gack. I've seen pencils which can write faster than that.

So your wirespeed actually exceeds the disk speed. That changes things.
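As a rough check on the dd figure quoted above, the sustained write rate can be worked out directly. This is only a sketch: the helper name `throughput_mb_per_s` is made up for illustration, and it treats dd's `bs=1M count=600` as 600 MB written in the reported 1m 23s, sync included.

```python
def throughput_mb_per_s(megabytes, minutes, seconds):
    """Average transfer rate in MB/s for a run of the given duration."""
    elapsed = minutes * 60 + seconds
    return megabytes / elapsed


# 600 MB written in 1m 23s (the dd run quoted above)
rate = throughput_mb_per_s(600, 1, 23)
print(f"{rate:.1f} MB/s")  # prints "7.2 MB/s"
```

Any network source that delivers data faster than this has to be throttled by the kernel sooner or later, whatever the filesystem.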

The kernel *has* to stall the generator of dirty data. We can make
the stalls shorter, and more frequent. Go into drivers/block/ll_rw_blk.c
and see where it's initialising batch_requests. Just change it to

batch_requests = 1;

batch_requests needs to die anyhow...

And in fs/mpage.c, set RATELIMIT_PAGES to 16.

The application has to block, but the disk should certainly never
fall idle. I'll play with this a bit. IDE ceased to be an option
in 2.5.30, which does not aid this effort.

> I've been trying these sorts of tests on this machine for over a year now,
> with various disk subsystems, and I have *never* seen anything as nice and
> consistent as the ext2 writeout you quoted. Maybe this machine is cursed.
>

Lumpy writeback is pretty common. As is bad latency during writeout.
It's quite tricky to get these things balanced out, and it's easy to
fix one thing and break another. Not a lot of effort has been put into
fine tuning 2.5 for smoothness and latency thus far.

2002-08-13 05:24:58

by Andreas Dilger

Subject: Re: [patch 1/21] random fixes

On Aug 13, 2002 00:10 -0400, Adam Kropelin wrote:
> On Mon, Aug 12, 2002 at 08:03:34PM -0700, Andrew Morton wrote:
> > Are you _sure_ it was bad with ext2?
>
> Yes.
>
> [root@devbox adk0212] mount
> /dev/hda3 on / type ext2 (rw)
> /dev/hda1 on /boot type ext2 (rw)
>
> Is it possible that the darn thing is mounted ext3 even though fstab and mount
> agree that it's ext2?

Yes, if you have a journal on your root filesystem, then it will be mounted
as ext3 regardless of what it says in /etc/fstab. Since "mount" also
looks in /etc/fstab for writing the entry in /etc/mtab _after_ the root
filesystem is mounted, the output from "mount" can also be bogus. You
need to check /proc/mounts to see the real answer.
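A minimal sketch of that check, assuming the usual /proc/mounts line layout of "device mountpoint fstype options dump pass". The function name `mounted_fstype` and the sample text are illustrative only and not taken from any real machine. The root may be listed more than once (for example an initial rootfs entry followed by the real device), so the last match is taken.

```python
def mounted_fstype(mounts_text, mount_point="/"):
    """Return the filesystem type the kernel reports for mount_point.

    mounts_text is the content of /proc/mounts. Later entries shadow
    earlier ones on the same mount point, so the last match wins.
    Returns None if the mount point is not listed.
    """
    fstype = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            fstype = fields[2]
    return fstype


# Hypothetical /proc/mounts content, for illustration only.
SAMPLE_MOUNTS = """\
rootfs / rootfs rw 0 0
/dev/root / ext3 rw 0 0
none /proc proc rw 0 0
"""

if __name__ == "__main__":
    print(mounted_fstype(SAMPLE_MOUNTS))  # what the kernel says, not /etc/mtab
```

On a live system the same function can be fed `open("/proc/mounts").read()` instead of the sample text.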

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-08-13 12:33:53

by Adam Kropelin

Subject: Re: [patch 1/21] random fixes

On Mon, Aug 12, 2002 at 11:25:59PM -0600, Andreas Dilger wrote:
> On Aug 13, 2002 00:10 -0400, Adam Kropelin wrote:
> > On Mon, Aug 12, 2002 at 08:03:34PM -0700, Andrew Morton wrote:
> > > Are you _sure_ it was bad with ext2?
> >
> > Yes.
> >
> > [root@devbox adk0212] mount
> > /dev/hda3 on / type ext2 (rw)
> > /dev/hda1 on /boot type ext2 (rw)
> >
> > Is it possible that the darn thing is mounted ext3 even though fstab and mount
> > agree that it's ext2?
>
> Yes, if you have a journal on your root filesystem, then it will be mounted
> as ext3 regardless of what it says in /etc/fstab. Since "mount" also
> looks in /etc/fstab for writing the entry in /etc/mtab _after_ the root
> filesystem is mounted, the output from "mount" can also be bogus. You
> need to check /proc/mounts to see the real answer.

Ahhh, carp.

It's still ext3, precisely as you describe.

*/me hangs head in shame*

When I get home tonight I'll reboot with a rescue disk and blow away the
journal. *That* should fix its little red wagon.

--Adam

2002-08-13 17:20:33

by Andreas Dilger

Subject: Re: [patch 1/21] random fixes

On Aug 13, 2002 08:37 -0400, Adam Kropelin wrote:
> On Mon, Aug 12, 2002 at 11:25:59PM -0600, Andreas Dilger wrote:
> > Yes, if you have a journal on your root filesystem, then it will be mounted
> > as ext3 regardless of what it says in /etc/fstab. Since "mount" also
> > looks in /etc/fstab for writing the entry in /etc/mtab _after_ the root
> > filesystem is mounted, the output from "mount" can also be bogus. You
> > need to check /proc/mounts to see the real answer.
>
> Ahhh, carp.
>
> It's still ext3, precisely as you describe.
>
> */me hangs head in shame*
>
> When I get home tonight I'll reboot with a rescue disk and blow away the
> journal. *That* should fix its little red wagon.

Or, you can optionally use the "rootfstype=ext2" kernel option, to avoid
the need to remove and then re-create the journal.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-08-13 18:17:52

by Daniel Egger

Subject: Re: [patch 1/21] random fixes

On Tue, 2002-08-13 at 07:32, Andrew Morton wrote:

> > 1m 23s (I said it was a slow disk ;)
> gack. I've seen pencils which can write faster than that.

Interesting. Even up-to-date notebooks are not much faster on an ext3 fs:

egger@sonja:/localstuff/temp$ time dd if=/dev/zero of=foo bs=1M count=600 ; sync
600+0 records in
600+0 records out

real 0m58.375s
user 0m0.010s
sys 0m4.930s

> So your wirespeed actually exceeds the disk speed. That changes things.

This is trivial especially with mainstream machines being shipped with
GigE.

--
Servus,
Daniel

2002-08-13 23:57:48

by Adam Kropelin

Subject: Re: [patch 1/21] random fixes

On Mon, Aug 12, 2002 at 10:32:11PM -0700, Andrew Morton wrote:
> Adam Kropelin wrote:
> >
> > On Mon, Aug 12, 2002 at 08:03:34PM -0700, Andrew Morton wrote:
> > > OK, tried that against a slow disk (13 megs/sec write bandwidth). 2.5.31,
> > > default writeback settings.
> > >
> > > ext3 is misbehaving:
> > > and takes 86 seconds.
> > >
> > > When the server is writing to ext2, it is good:
> > > and the transfer takes 54 seconds, which is wirespeed.
> > >
> > > Are you _sure_ it was bad with ext2?
> >
> > Yes.

...but I was wrong.

> Sure looks like ext3.

...it was.

*Actually* switching to ext2 (rather than just pretending) made a
tremendous difference. New numbers:

2.5.31-stock: 1m 49s
2.5.31-akpm: 1m 50s
2.4.19-stock: 1m 34s

...but, applying the writeout threshold settings you suggested:

2.5.31-stock: 1m 34s
2.5.31-akpm: 1m 34s

(That's with dirty_background at 30%; 10% turned in the same numbers
as 30% did.)

Presumably with the disk as the bottleneck, the -akpm changes aren't
expected to do much. At least they're not degrading anything.

> So your wirespeed actually exceeds the disk speed. That changes things.
...
>
> batch_requests = 1;
> And in fs/mpage.c, set RATELIMIT_PAGES to 16.

These changes didn't have as much effect as the threshold tweaks:

2.5.31-stock: 1m 39s

...unless I added in the threshold tweaks as well, in which case:

2.5.31-stock: 1m 34s

...which is the same as the threshold tweaks alone.

> The application has to block, but the disk should certainly never
> fall idle. I'll play with this a bit. IDE ceased to be an option
> in 2.5.30, which does not aid this effort.

With ext2 and the threshold tweaks the disk never goes idle, so the idle
periods are clearly an ext3 issue now.

> fix one thing and break another. Not a lot of effort has been put into
> fine tuning 2.5 for smoothness and latency thus far.

Understandably. I think it says a lot already that an untuned development
kernel can match the current release kernel. I'm sure once 2.5 gets into
the tweak 'n tune cycle we'll see it beating 2.4 hands down.

Actually 2.5 writeout to ext2 is far smoother than 2.4 already:

2.4.19:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 2 0 4400 1788 140520 0 0 0 7776 7434 892 2 47 51
1 0 2 0 4408 1796 140492 0 0 0 7868 7315 873 0 50 50
1 0 3 0 4428 1804 140484 0 0 0 10496 7327 877 3 56 41
1 0 2 0 4372 1812 140516 0 0 0 8132 7239 872 0 53 47
1 0 0 0 4408 1816 140460 0 0 4 5876 2415 255 0 17 83
1 0 0 0 4376 1824 140528 0 0 0 0 7555 894 1 42 56
0 0 2 0 4376 1832 140512 0 0 0 4096 7589 858 1 52 47
1 0 1 0 4416 1840 140464 0 0 0 8052 7229 879 1 51 47
0 0 1 0 4380 1848 140496 0 0 0 10180 7183 863 1 49 50
1 0 1 0 4348 1856 140500 0 0 0 8080 7240 852 1 49 50
1 0 1 0 4464 1864 140408 0 0 0 4504 7309 886 1 47 51
0 0 1 0 4444 1872 140400 0 0 0 7284 7459 873 1 51 48
0 0 3 0 4380 1880 140440 0 0 0 10184 7428 895 1 50 49
1 0 1 0 4428 1888 140400 0 0 0 8092 7308 867 0 52 48

2.5.31:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 0 7404 0 137796 0 0 0 4108 6933 1176 1 43 56
1 0 0 0 4384 0 141048 0 0 0 8216 6918 1293 1 42 57
0 0 0 104 4392 0 141472 0 104 0 4212 6909 1211 1 53 45
0 0 1 120 4440 0 141488 0 16 0 8232 6860 1233 1 61 38
1 0 1 120 4352 0 141628 0 0 0 4108 6810 1137 2 38 60
0 0 0 120 4468 0 141508 0 0 0 8216 6848 1114 0 40 59
1 0 0 120 4352 0 141608 0 0 0 4108 6817 1091 1 39 60
0 0 1 120 4464 0 141528 0 0 0 8216 6846 1090 1 39 60
0 0 0 120 4412 0 141568 0 0 0 4108 6836 1056 1 39 60
0 0 1 120 4388 0 141588 0 0 0 8216 6863 1088 1 41 58
1 0 0 120 4392 0 141608 0 0 0 4108 6899 1162 1 41 58
0 0 0 120 4428 0 141572 0 0 0 8216 6917 1085 2 40 58
0 0 0 120 4416 0 141592 0 0 0 4208 6887 1097 1 40 59

The oscillation between 8 MB/s and 4 MB/s is a little odd, but it's very
consistent and averages out to about 6 MB/s, which is exactly what the FTP
session is doing.
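The averaging can be reproduced from the trace itself. The sketch below assumes vmstat reports `bo` in 1 KiB blocks per second (true of procps as far as I know, but worth checking against the local man page). The helper name `mean_column` is made up, and the sample holds four consecutive rows copied from the 2.5.31 trace above.

```python
def mean_column(vmstat_text, column="bo"):
    """Average one numeric column of vmstat output.

    The column is located by name in the header row, so the group
    title line ("procs memory swap ...") is skipped automatically.
    """
    rows = [line.split() for line in vmstat_text.splitlines() if line.strip()]
    header_idx = next(i for i, fields in enumerate(rows) if column in fields)
    col = rows[header_idx].index(column)
    values = [int(fields[col]) for fields in rows[header_idx + 1:]]
    return sum(values) / len(values)


# Four consecutive samples from the 2.5.31 trace above.
SAMPLE = """\
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 1 120 4352 0 141628 0 0 0 4108 6810 1137 2 38 60
0 0 0 120 4468 0 141508 0 0 0 8216 6848 1114 0 40 59
1 0 0 120 4352 0 141608 0 0 0 4108 6817 1091 1 39 60
0 0 1 120 4464 0 141528 0 0 0 8216 6846 1090 1 39 60
"""

avg_kib = mean_column(SAMPLE)
print(f"{avg_kib / 1024:.1f} MiB/s")  # prints "6.0 MiB/s"
```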

Thanks for your insight and patience. I'm always excited to see another batch
of akpm patches show up on the list. If I can run other tests to help you, let
me know.

--Adam

2002-08-14 08:33:26

by William Lee Irwin III

Subject: Re: [patch 1/21] random fixes

On Sun, Aug 11, 2002 at 12:38:19AM -0700, Andrew Morton wrote:
> Sorry, but there's a ton of stuff here. It ends up as a 4600 line
> diff. Some code dating back to 2.5.24. It's almost all performance
> work and it has been very painful getting its effectiveness tested
> on the big machines; the main problem has been getting them booting
> 2.5 at all. The results still are not as conclusive as I'd like,
> but the signs are good, and there are no other proposals around to
> fix these problems.

dbench 256 on a 16x/16G numaq:

Throughput 50.7526 MB/sec (NB=63.4408 MB/sec 507.526 MBit/sec) 256 procs


c013bf74 13251607 72.928 .text.lock.highmem
c013b7d0 1606972 8.84371 kunmap_high
c013b5dc 1211097 6.66507 kmap_high
c012f260 459420 2.52834 generic_file_write
c0114820 166854 0.918253 scheduler_tick
c012e53c 166773 0.917808 file_read_actor
c0105394 125561 0.691004 default_idle
c013bcbc 75623 0.416179 blk_queue_bounce
c013564c 72289 0.397831 rmqueue
c01113b8 69062 0.380071 smp_apic_timer_interrupt
c0143cec 64782 0.356517 block_prepare_write
c014330c 53426 0.294021 __block_prepare_write
c0142ee8 39892 0.219539 create_empty_buffers
c012dec0 39161 0.215516 unlock_page
c01143d8 38648 0.212693 load_balance
c013b558 34840 0.191736 flush_all_zero_pkmaps
c0135d28 33414 0.183888 page_cache_release
c013429c 22753 0.125217 lru_cache_add
c0135b10 20326 0.111861 __alloc_pages
c0143d98 19833 0.109148 generic_commit_write
c012dcb4 18150 0.0998855 add_to_page_cache
c0140044 17758 0.0977282 vfs_write