2001-11-26 22:02:55

by Nathan G. Grennan

Subject: Unresponiveness of 2.4.16

2.4.16 becomes very unresponsive for 30 seconds or so at a time during
large unarchiving of tarballs, like tar -zxf mozilla-src.tar.gz. The
file is about 36 MB. I run top in one window, run free repeatedly in
another window, and run the tar -zxf in a third window. I had many
suspects, but I am still not sure what it is. I have tried:

ext2 vs ext3
preemptive vs non-preemptive
tainted vs non-tainted

Nothing seems to help 2.4.16.

I tried switching to Redhat's 2.4.9-13 kernel and it acts a lot better.
Not only does 2.4.9-13 not get the 30-second delay, but it also seems to
take advantage of caching. 2.4.16 takes the same amount of time each
time, even though it should have cached it all into memory the first time.
2.4.9-13 takes a while the first time (without the 30-second new-process
freezing), but then takes almost no time on subsequent runs. One
interesting thing I noticed is that, with and without preemption, an
already-playing mp3 had no disruption even during the 30-second
windows where any new commands would get stuck with 2.4.16. I am not
using custom

I plan to do more testing to see how, say, 2.4.9, 2.4.13-ac7, etc. behave.

Any ideas of how to fix this for 2.4.16?

I have attached my .config.

My system:

Redhat 7.2 with all updates

Athlon Thunderbird 1.33 GHz
768 MB (512 MB + 256 MB) PC133 SDRAM
Abit KT7A-RAID v1.0 (KT133A chipset)
BIOS 64
HPT370 (BIOS v1.2.0604)
Primary Master Quantum Fireball AS40.0
Secondary Master IBM-DTLA-307045
VIA686B
Primary Master CREATIVE DVD-ROM DVD6240E
Secondary Master CR-2801TE


Attachments:
.config (17.07 kB)
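A rough shell sketch of the reproduction described above (not from the
original mail; the timing loop and filenames are only illustrative):

  # Extract a large tarball in the background while repeatedly timing
  # how long it takes to start a trivial new process from another shell.
  tar -zxf mozilla-src.tar.gz &
  TAR_PID=$!
  while kill -0 "$TAR_PID" 2>/dev/null; do
      /usr/bin/time -f "%e seconds to fork/exec 'true'" true
      sleep 1
  done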

2001-11-26 22:09:03

by Alan

Subject: Re: Unresponiveness of 2.4.16

> 2.4.16 becomes very unresponsive for 30 seconds or so at a time during
> large unarchiving of tarballs, like tar -zxf mozilla-src.tar.gz. The
> file is about 36mb. I run top in one window, run free repeatedly in

This seems to be one of the small, as yet unresolved problems with the newer
VM code in 2.4.16. I've not managed to prove whether it's the VM or the
differing I/O scheduling rules, however.

> Any ideas of how to fix this for 2.4.16?

If it is the VM, then watch for a patch from Rik for 2.4.16 + RielVM. If
that helps then we know it's VM related; if not, then we know to look at
other suspects.

Alan

2001-11-26 22:22:34

by Andrew Morton

Subject: Re: Unresponiveness of 2.4.16

"Nathan G. Grennan" wrote:
>
> 2.4.16 becomes very unresponsive for 30 seconds or so at a time during
> large unarchiving of tarballs, like tar -zxf mozilla-src.tar.gz. The
> file is about 36mb. I run top in one window, run free repeatedly in
> another window and run the tar -zxf in a third window. I had many
> suspects, but still not sure what it is. I have tried

Yes. I'm doing quite a lot of work in this area at present. There
are a couple of things which may help here.

1: The current code which is designed to throttle heavy writers
basically doesn't work under some workloads. It's designed to
block the writer when there are too many dirty buffers in the
machine. But in fact, all dirty data writeout occurs in the
context of shrink_cache(), so all tasks are penalised and if
the writing task doesn't happen to run shrink_cache(), it gets
to merrily continue stuffing the machine full of write data.
The fix is to account for locked buffers as well as dirty ones
in balance_dirty_state().

2: The current elevator design is downright cruel to humans in
the presence of heavy write traffic.

Please try this lot:



--- linux-2.4.16-pre1/fs/buffer.c Thu Nov 22 23:02:58 2001
+++ linux-akpm/fs/buffer.c Sun Nov 25 00:07:47 2001
@@ -1036,6 +1036,7 @@ static int balance_dirty_state(void)
unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;

dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
+ dirty += size_buffers_type[BUF_LOCKED] >> PAGE_SHIFT;
tot = nr_free_buffer_pages();

dirty *= 100;
--- linux-2.4.16-pre1/mm/filemap.c Sat Nov 24 13:14:52 2001
+++ linux-akpm/mm/filemap.c Sun Nov 25 00:07:47 2001
@@ -3023,7 +3023,18 @@ generic_file_write(struct file *file,con
unlock:
kunmap(page);
/* Mark it unlocked again and drop the page.. */
- SetPageReferenced(page);
+// SetPageReferenced(page);
+ ClearPageReferenced(page);
+#if 0
+ {
+ lru_cache_del(page);
+ TestSetPageLRU(page);
+ spin_lock(&pagemap_lru_lock);
+ list_add_tail(&(page)->lru, &inactive_list);
+ nr_inactive_pages++;
+ spin_unlock(&pagemap_lru_lock);
+ }
+#endif
UnlockPage(page);
page_cache_release(page);

--- linux-2.4.16-pre1/mm/vmscan.c Thu Nov 22 23:02:59 2001
+++ linux-akpm/mm/vmscan.c Sun Nov 25 00:08:03 2001
@@ -573,6 +573,9 @@ static int shrink_caches(zone_t * classz
nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
if (nr_pages <= 0)
return 0;
+ nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
+ if (nr_pages <= 0)
+ return 0;

shrink_dcache_memory(priority, gfp_mask);
shrink_icache_memory(priority, gfp_mask);
@@ -585,7 +588,7 @@ static int shrink_caches(zone_t * classz

int try_to_free_pages(zone_t *classzone, unsigned int gfp_mask, unsigned int order)
{
- int priority = DEF_PRIORITY;
+ int priority = DEF_PRIORITY - 2;
int nr_pages = SWAP_CLUSTER_MAX;

do {


--- linux-2.4.16-pre1/include/linux/elevator.h Thu Feb 15 16:58:34 2001
+++ linux-akpm/include/linux/elevator.h Sat Nov 24 19:58:43 2001
@@ -5,8 +5,9 @@ typedef void (elevator_fn) (struct reque
struct list_head *,
struct list_head *, int);

-typedef int (elevator_merge_fn) (request_queue_t *, struct request **, struct list_head *,
- struct buffer_head *, int, int);
+typedef int (elevator_merge_fn)(request_queue_t *, struct request **,
+ struct list_head *, struct buffer_head *bh,
+ int rw, int max_sectors, int max_bomb_segments);

typedef void (elevator_merge_cleanup_fn) (request_queue_t *, struct request *, int);

@@ -16,6 +17,7 @@ struct elevator_s
{
int read_latency;
int write_latency;
+ int max_bomb_segments;

elevator_merge_fn *elevator_merge_fn;
elevator_merge_cleanup_fn *elevator_merge_cleanup_fn;
@@ -24,13 +26,13 @@ struct elevator_s
unsigned int queue_ID;
};

-int elevator_noop_merge(request_queue_t *, struct request **, struct list_head *, struct buffer_head *, int, int);
-void elevator_noop_merge_cleanup(request_queue_t *, struct request *, int);
-void elevator_noop_merge_req(struct request *, struct request *);
-
-int elevator_linus_merge(request_queue_t *, struct request **, struct list_head *, struct buffer_head *, int, int);
-void elevator_linus_merge_cleanup(request_queue_t *, struct request *, int);
-void elevator_linus_merge_req(struct request *, struct request *);
+elevator_merge_fn elevator_noop_merge;
+elevator_merge_cleanup_fn elevator_noop_merge_cleanup;
+elevator_merge_req_fn elevator_noop_merge_req;
+
+elevator_merge_fn elevator_linus_merge;
+elevator_merge_cleanup_fn elevator_linus_merge_cleanup;
+elevator_merge_req_fn elevator_linus_merge_req;

typedef struct blkelv_ioctl_arg_s {
int queue_ID;
@@ -54,22 +56,6 @@ extern void elevator_init(elevator_t *,
#define ELEVATOR_FRONT_MERGE 1
#define ELEVATOR_BACK_MERGE 2

-/*
- * This is used in the elevator algorithm. We don't prioritise reads
- * over writes any more --- although reads are more time-critical than
- * writes, by treating them equally we increase filesystem throughput.
- * This turns out to give better overall performance. -- sct
- */
-#define IN_ORDER(s1,s2) \
- ((((s1)->rq_dev == (s2)->rq_dev && \
- (s1)->sector < (s2)->sector)) || \
- (s1)->rq_dev < (s2)->rq_dev)
-
-#define BHRQ_IN_ORDER(bh, rq) \
- ((((bh)->b_rdev == (rq)->rq_dev && \
- (bh)->b_rsector < (rq)->sector)) || \
- (bh)->b_rdev < (rq)->rq_dev)
-
static inline int elevator_request_latency(elevator_t * elevator, int rw)
{
int latency;
@@ -85,7 +71,7 @@ static inline int elevator_request_laten
((elevator_t) { \
0, /* read_latency */ \
0, /* write_latency */ \
- \
+ 0, /* max_bomb_segments */ \
elevator_noop_merge, /* elevator_merge_fn */ \
elevator_noop_merge_cleanup, /* elevator_merge_cleanup_fn */ \
elevator_noop_merge_req, /* elevator_merge_req_fn */ \
@@ -95,7 +81,7 @@ static inline int elevator_request_laten
((elevator_t) { \
8192, /* read passovers */ \
16384, /* write passovers */ \
- \
+ 6, /* max_bomb_segments */ \
elevator_linus_merge, /* elevator_merge_fn */ \
elevator_linus_merge_cleanup, /* elevator_merge_cleanup_fn */ \
elevator_linus_merge_req, /* elevator_merge_req_fn */ \
--- linux-2.4.16-pre1/drivers/block/elevator.c Thu Jul 19 20:59:41 2001
+++ linux-akpm/drivers/block/elevator.c Sat Nov 24 20:51:29 2001
@@ -74,36 +74,52 @@ inline int bh_rq_in_between(struct buffe
return 0;
}

+struct akpm_elv_stats {
+ int zapme;
+ int nr_read_sectors;
+ int nr_write_sectors;
+ int nr_read_requests;
+ int nr_write_requests;
+} akpm_elv_stats;

int elevator_linus_merge(request_queue_t *q, struct request **req,
struct list_head * head,
struct buffer_head *bh, int rw,
- int max_sectors)
+ int max_sectors, int max_bomb_segments)
{
- struct list_head *entry = &q->queue_head;
- unsigned int count = bh->b_size >> 9, ret = ELEVATOR_NO_MERGE;
+ struct list_head *entry;
+ unsigned int count = bh->b_size >> 9;
+ unsigned int ret = ELEVATOR_NO_MERGE;
+ int no_in_between = 0;

+ if (akpm_elv_stats.zapme)
+ memset(&akpm_elv_stats, 0, sizeof(akpm_elv_stats));
+
+ entry = &q->queue_head;
while ((entry = entry->prev) != head) {
struct request *__rq = blkdev_entry_to_request(entry);
-
- /*
- * simply "aging" of requests in queue
- */
- if (__rq->elevator_sequence-- <= 0)
- break;
-
+ if (__rq->elevator_sequence-- <= 0) {
+ /*
+ * OK, we've exceeded someone's latency limit.
+ * But we still continue to look for merges,
+ * because they're so much better than seeks.
+ */
+ no_in_between = 1;
+ }
if (__rq->waiting)
continue;
if (__rq->rq_dev != bh->b_rdev)
continue;
- if (!*req && bh_rq_in_between(bh, __rq, &q->queue_head))
+ if (!*req && !no_in_between &&
+ bh_rq_in_between(bh, __rq, &q->queue_head)) {
*req = __rq;
+ }
if (__rq->cmd != rw)
continue;
if (__rq->nr_sectors + count > max_sectors)
continue;
if (__rq->elevator_sequence < count)
- break;
+ no_in_between = 1;
if (__rq->sector + __rq->nr_sectors == bh->b_rsector) {
ret = ELEVATOR_BACK_MERGE;
*req = __rq;
@@ -116,6 +132,66 @@ int elevator_linus_merge(request_queue_t
}
}

+ /*
+ * If we failed to merge a read anywhere in the request
+ * queue, we really don't want to place it at the end
+ * of the list, behind lots of writes. So place it near
+ * the front.
+ *
+ * We don't want to place it in front of _all_ writes: that
+ * would create lots of seeking, and isn't tunable.
+ * We try to avoid promoting this read in front of existing
+ * reads.
+ *
+ * max_bomb_sectors becomes the maximum number of write
+ * requests which we allow to remain in place in front of
+ * a newly introduced read. We weight things a little bit,
+ * so large writes are more expensive than small ones, but it's
+ * requests which count, not sectors.
+ */
+ if (rw == READ && ret == ELEVATOR_NO_MERGE) {
+ int cur_latency = 0;
+ struct request * const cur_request = *req;
+
+ entry = head->next;
+ while (entry != &q->queue_head) {
+ struct request *__rq;
+
+ if (entry == &q->queue_head)
+ BUG();
+ if (entry == q->queue_head.next &&
+ q->head_active && !q->plugged)
+ BUG();
+ __rq = blkdev_entry_to_request(entry);
+
+ if (__rq == cur_request) {
+ /*
+ * This is where the old algorithm placed it.
+ * There's no point pushing it further back,
+ * so leave it here, in sorted order.
+ */
+ break;
+ }
+ if (__rq->cmd == WRITE) {
+ cur_latency += 1 + __rq->nr_sectors / 64;
+ if (cur_latency >= max_bomb_segments) {
+ *req = __rq;
+ break;
+ }
+ }
+ entry = entry->next;
+ }
+ }
+ if (ret == ELEVATOR_NO_MERGE) {
+ if (rw == READ)
+ akpm_elv_stats.nr_read_requests++;
+ else
+ akpm_elv_stats.nr_write_requests++;
+ }
+ if (rw == READ)
+ akpm_elv_stats.nr_read_sectors += count;
+ else
+ akpm_elv_stats.nr_write_sectors += count;
return ret;
}

@@ -144,7 +220,7 @@ void elevator_linus_merge_req(struct req
int elevator_noop_merge(request_queue_t *q, struct request **req,
struct list_head * head,
struct buffer_head *bh, int rw,
- int max_sectors)
+ int max_sectors, int max_bomb_segments)
{
struct list_head *entry;
unsigned int count = bh->b_size >> 9;
@@ -188,7 +264,7 @@ int blkelvget_ioctl(elevator_t * elevato
output.queue_ID = elevator->queue_ID;
output.read_latency = elevator->read_latency;
output.write_latency = elevator->write_latency;
- output.max_bomb_segments = 0;
+ output.max_bomb_segments = elevator->max_bomb_segments;

if (copy_to_user(arg, &output, sizeof(blkelv_ioctl_arg_t)))
return -EFAULT;
@@ -207,9 +283,12 @@ int blkelvset_ioctl(elevator_t * elevato
return -EINVAL;
if (input.write_latency < 0)
return -EINVAL;
+ if (input.max_bomb_segments < 0)
+ return -EINVAL;

elevator->read_latency = input.read_latency;
elevator->write_latency = input.write_latency;
+ elevator->max_bomb_segments = input.max_bomb_segments;
return 0;
}

--- linux-2.4.16-pre1/drivers/block/ll_rw_blk.c Mon Nov 5 21:01:11 2001
+++ linux-akpm/drivers/block/ll_rw_blk.c Sat Nov 24 22:25:47 2001
@@ -690,7 +690,8 @@ again:
} else if (q->head_active && !q->plugged)
head = head->next;

- el_ret = elevator->elevator_merge_fn(q, &req, head, bh, rw,max_sectors);
+ el_ret = elevator->elevator_merge_fn(q, &req, head, bh,
+ rw, max_sectors, elevator->max_bomb_segments);
switch (el_ret) {

case ELEVATOR_BACK_MERGE:

2001-11-26 22:50:07

by Lincoln Dale

Subject: Re: Unresponiveness of 2.4.16

At 10:17 PM 26/11/2001 +0000, Alan Cox wrote:
> > 2.4.16 becomes very unresponsive for 30 seconds or so at a time during
> > large unarchiving of tarballs, like tar -zxf mozilla-src.tar.gz. The
> > file is about 36mb. I run top in one window, run free repeatedly in
>
>This seems to be one of the small as yet unresolved problems with the newer
>VM code in 2.4.16. I've not managed to prove its the VM or the differing
>I/O scheduling rules however.

it is I/O scheduling.

i have a system with a large amount of RAM.
it has both 15K RPM SCSI disks (off a symbios controller) and some bog-slow
IDE/ATA disks which the system decides to use PIO for rather than DMA. (i
don't use them for anything other than bootup so don't really care about it
deciding to use PIO..).

a copy to/from the 15K RPM SCSI disks doesn't show any performance problems.
a copy to/from the PIO-based IDE disks has the same effect -- 20/30 seconds
of no interactiveness -- even a "vmstat 1" *stops* for 20-30 seconds while
200+MB of buffer-cache data gets written out to disk.

i'm guessing that:
(a) the i/o scheduler isn't taking into account "disk speed", and thus
slower disks show it more effectively than fast disks
(b) it's isolated to somewhere in the IDE drivers


cheers,

lincoln.

2001-11-26 23:35:09

by Nicolas Pitre

Subject: Re: Unresponiveness of 2.4.16

On Mon, 26 Nov 2001, Alan Cox wrote:

> > 2.4.16 becomes very unresponsive for 30 seconds or so at a time during
> > large unarchiving of tarballs, like tar -zxf mozilla-src.tar.gz. The
> > file is about 36mb. I run top in one window, run free repeatedly in
>
> This seems to be one of the small as yet unresolved problems with the newer
> VM code in 2.4.16. I've not managed to prove its the VM or the differing
> I/O scheduling rules however.

FWIW...

I experienced much the same unresponsiveness, but more on the order of 4-5
seconds, since I started to use ext3 with RH 7.2 (i.e. kernel 2.4.7 based).
I'm currently running 2.4.15-pre7 and the same momentary stalls are there,
just like with 2.4.7. It is much more visible when applying large patches to
a kernel source tree, as the patch output stops scrolling from time to time
for about 5 secs. I never saw such a thing while previously using reiserfs.
I've yet to try reiserfs on a 2.4.16 tree to see if this is actually an ext3
problem.


Nicolas

2001-11-27 00:00:20

by Rik van Riel

Subject: Re: Unresponiveness of 2.4.16

On Mon, 26 Nov 2001, Alan Cox wrote:

> > Any ideas of how to fix this for 2.4.16?
>
> If it is the VM then watch for a patch from Rik for 2.4.16 + RielVM.
> If that helps then we know its VM related , if not then we know to
> look at other suspects

The patch to 2.4.16 + rielvm (well, a merge between my VM and
Andrea's VM) is available on my home page and seems stable now.
FYI, my 64MB dual pentium test box seems to "happily" survive
a 'make -j bzImage' over NFS...

However, I suspect this unresponsiveness issue is related to
either IO scheduling or write throttling, and that code is
the same in both VMs. I'll take a look at smoothing out writes
so we can get this thing fixed in both VMs.

The patch is on http://www.surriel.com/patches/

regards,

Rik
--
Shortwave goes a long way: irc.starchat.net #swl

http://www.surriel.com/ http://distro.conectiva.com/

2001-11-27 00:02:51

by SLion

Subject: Re: Unresponiveness of 2.4.16

I'm running 2.4.13-ac7 with the preempt patch and ext3 on this box. I don't seem
to be encountering any unresponsiveness at all while untarring a kernel source.
Just some info for you guys.

-Steve
* Nicolas Pitre ([email protected]) wrote:
> On Mon, 26 Nov 2001, Alan Cox wrote:
>
> > > 2.4.16 becomes very unresponsive for 30 seconds or so at a time during
> > > large unarchiving of tarballs, like tar -zxf mozilla-src.tar.gz. The
> > > file is about 36mb. I run top in one window, run free repeatedly in
> >
> > This seems to be one of the small as yet unresolved problems with the newer
> > VM code in 2.4.16. I've not managed to prove its the VM or the differing
> > I/O scheduling rules however.
>
> FWIW...
>
> I experienced quite the same unresponsiveness but more in the order of 4-5
> seconds since I started to use ext3 with RH 7.2 (i.e. kernel 2.4.7 based).
> I'm currently running 2.4.15-pre7 and the same momentary stalls are there
> just like with 2.4.7. It is much more visible when applying large patches to
> a kernel source tree as the patch output stops scrolling from time to time
> for about 5 secs. I never saw such thing while previously using reiserfs.
> I've yet to try reiserfs on a 2.4.16 tree to see if this is actually an ext3
> problem.
>
>
> Nicolas
>

2001-11-27 00:38:04

by Andrew Morton

Subject: Re: Unresponiveness of 2.4.16

Rik van Riel wrote:
>
> However, I suspect this unresponsiveness issue is related to
> either IO scheduling or write throttling, and that code is
> the same in both VMs. I'll take a look at smoothing out writes
> so we can get this thing fixed in both VMs.
>

umm... What I said.

balance_dirty_state() is allowing writes to flood the machine
with locked buffers.

elevator is penalising reads horridly. Try this on your
64 megabyte box:

dd if=/dev/zero of=foo bs=1024k count=8000

and then try to log in to it. Be patient. Very patient. Five
minutes pass. Still being patient? In fact, with this test I've
never been able to get a login prompt before the filesystem which
holds `foo' (only 8 gigs) fills up, which finally permits the login
to happen.

What happens is this: sshd gets paged out. It wakes up, faults
and tries to read a page. That read gets stuck on the request
queue behind about 50 megabytes of write data. Eventually, it
gets read. Then sshd faults in another page. That gets stuck
on the request queue behind about 50 megabytes of data. By the time
this one gets read, the first page is probably paged out again. See
how this isn't getting us very far?

The patch I sent puts read requests near the head of the request
queue, and to hell with aggregate throughput. It's tunable with
`elvtune -b'. And it fixes it.
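For illustration, an invocation of the tunable described above (the value
and device are only examples; with the patch, the Linus elevator defaults
max_bomb_segments to 6):

  # allow at most ~2 (size-weighted) write requests to stay queued
  # ahead of a newly submitted read on /dev/hda
  elvtune -b 2 /dev/hda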

-

2001-11-27 00:45:26

by Brandon Low

Subject: Re: Unresponiveness of 2.4.16

I'm running 2.4.16 with 2 IDE UDMA mode 4 drives, and I have experienced
no such pausing no matter what I do. (which usually includes patching,
extracting, and generally messing with kernels from Eterm with XMMS
playing, and a couple mozillas open)

Nathan G. Grennan wrote:

>2.4.16 becomes very unresponsive for 30 seconds or so at a time during
>large unarchiving of tarballs, like tar -zxf mozilla-src.tar.gz. The
>file is about 36mb. I run top in one window, run free repeatedly in
>another window and run the tar -zxf in a third window. I had many
>suspects, but still not sure what it is. I have tried
>
>ext2 vs ext3
>preemptive vs non-preemptive
>tainted vs non-tainted
>
>Nothing seems to help 2.4.16.
>
>I tried switching to Redhat's 2.4.9-13 kernel and it acts Alot better.
>Not only does 2.4.9-13 not get the 30 second delay, but it also seems to
>take advantage of caching. 2.4.16 takes the same moment of time each
>time, even tho it should have cached it all into memory the first time.
>2.4.9-13 takes a while the first time(without the 30 second new process
>freezing), but then takes almost no time the times after that. One
>interesting thing I noticed is that with and without preemptive a
>already started mp3 playing had no disruption even during the 30 second
>windows where any new commands would get stuck with 2.4.16. I am not
>using custom
>
>I plan to do more testing to see how say 2.4.9, 2.4.13ac7, etc.
>
>Any ideas of how to fix this for 2.4.16?
>
>I have attached my .config.
>
>My system:
>
>Redhat 7.2 with all updates
>
>Athlon Thunderbird 1.33ghz
>768mb(512mb, 256mb) PC133 SDRAM
>Abit KT7A-RAID v1.0(KT133A chipset)
> Bios 64
> HPT370(bios v1.2.0604)
> Primary Master Quantum Fireball AS40.0
> Secondary Master IBM-DTLA-307045
> VIA686B
> Primary Master CREATIVE DVD-ROM DVD6240E
> Secondary Master CR-2801TE
>


2001-11-27 00:47:14

by Rik van Riel

Subject: Re: Unresponiveness of 2.4.16

On Mon, 26 Nov 2001, Andrew Morton wrote:

> umm... What I said.
>
> balance_dirty_state() is allowing writes to flood the machine
> with locked buffers.

Saw your patch, it's neat. I'm going to try it
first thing in the morning...

Rik
--
Shortwave goes a long way: irc.starchat.net #swl

http://www.surriel.com/ http://distro.conectiva.com/

2001-11-27 00:58:14

by Brandon Low

Subject: Re: Unresponiveness of 2.4.16

Lost Logic wrote:

> I'm running 2.4.16 with 2 IDE UDMA mode 4 drives, and I have
> experienced no such pausing no matter what I do. (which usually
> includes patching, extracting, and generally messing with kernels from
> Eterm with XMMS playing, and a couple mozillas open)

Ignore that; I know why I have no problems: I can extract kernels, make
kernels, etc. without paging...

>
> Nathan G. Grennan wrote:
>
>> 2.4.16 becomes very unresponsive for 30 seconds or so at a time during
>> large unarchiving of tarballs, like tar -zxf mozilla-src.tar.gz. The
>> file is about 36mb. I run top in one window, run free repeatedly in
>> another window and run the tar -zxf in a third window. I had many
>> suspects, but still not sure what it is. I have tried
>>
>> ext2 vs ext3
>> preemptive vs non-preemptive
>> tainted vs non-tainted
>>
>> Nothing seems to help 2.4.16.
>>
>> I tried switching to Redhat's 2.4.9-13 kernel and it acts Alot better.
>> Not only does 2.4.9-13 not get the 30 second delay, but it also seems to
>> take advantage of caching. 2.4.16 takes the same moment of time each
>> time, even tho it should have cached it all into memory the first time.
>> 2.4.9-13 takes a while the first time(without the 30 second new process
>> freezing), but then takes almost no time the times after that. One
>> interesting thing I noticed is that with and without preemptive a
>> already started mp3 playing had no disruption even during the 30 second
>> windows where any new commands would get stuck with 2.4.16. I am not
>> using custom
>>
>> I plan to do more testing to see how say 2.4.9, 2.4.13ac7, etc.
>> Any ideas of how to fix this for 2.4.16?
>>
>> I have attached my .config.
>>
>> My system:
>>
>> Redhat 7.2 with all updates
>>
>> Athlon Thunderbird 1.33ghz
>> 768mb(512mb, 256mb) PC133 SDRAM
>> Abit KT7A-RAID v1.0(KT133A chipset)
>> Bios 64
>> HPT370(bios v1.2.0604)
>> Primary Master Quantum Fireball AS40.0
>> Secondary Master IBM-DTLA-307045
>> VIA686B Primary Master CREATIVE DVD-ROM DVD6240E
>> Secondary Master CR-2801TE
>>
>
>



2001-11-27 01:45:37

by Andrea Arcangeli

Subject: Re: Unresponiveness of 2.4.16

On Mon, Nov 26, 2001 at 10:17:06PM +0000, Alan Cox wrote:
> > 2.4.16 becomes very unresponsive for 30 seconds or so at a time during
> > large unarchiving of tarballs, like tar -zxf mozilla-src.tar.gz. The
> > file is about 36mb. I run top in one window, run free repeatedly in
>
> This seems to be one of the small as yet unresolved problems with the newer
> VM code in 2.4.16. I've not managed to prove its the VM or the differing

can you reproduce on 2.4.15aa1?

Andrea

2001-11-27 03:50:52

by Sean Elble

Subject: Re: Unresponiveness of 2.4.16

> I tried switching to Redhat's 2.4.9-13 kernel and it acts Alot better.
> Not only does 2.4.9-13 not get the 30 second delay, but it also seems to
> take advantage of caching. 2.4.16 takes the same moment of time each
> time, even tho it should have cached it all into memory the first time.

Unless Red Hat has specifically added Andrea's new VM code to the 2.4.9
kernel, that kernel is still using the old VM. The 2.4.10 (?) and above
kernels all use Andrea's new VM, and this includes 2.4.16 (obviously :-). My
guess is that it is a small VM-related problem, but I am certainly not a
programmer; I did see other replies to this problem, but I accidentally
deleted them before I could fully read them. :-( My suggestion would
definitely be to try other kernels; I would personally try 2.4.10,
2.4.12, 2.4.14, and you have already tried 2.4.16. This would at the very
least tell you where the problem was introduced, and hopefully some of the
brilliant kernel people (not me) could take over from there. Hope that
helps.

-----------------------------------------------
Sean P. Elble
Editor, Writer, Co-Webmaster
ReactiveLinux.com (Formerly MaximumLinux.org)
http://www.reactivelinux.com/
[email protected]
-----------------------------------------------

----- Original Message -----
From: "Nathan G. Grennan" <[email protected]>
To: <[email protected]>
Sent: Monday, November 26, 2001 5:02 PM
Subject: Unresponiveness of 2.4.16


> 2.4.16 becomes very unresponsive for 30 seconds or so at a time during
> large unarchiving of tarballs, like tar -zxf mozilla-src.tar.gz. The
> file is about 36mb. I run top in one window, run free repeatedly in
> another window and run the tar -zxf in a third window. I had many
> suspects, but still not sure what it is. I have tried
>
> ext2 vs ext3
> preemptive vs non-preemptive
> tainted vs non-tainted
>
> Nothing seems to help 2.4.16.
>
> I tried switching to Redhat's 2.4.9-13 kernel and it acts Alot better.
> Not only does 2.4.9-13 not get the 30 second delay, but it also seems to
> take advantage of caching. 2.4.16 takes the same moment of time each
> time, even tho it should have cached it all into memory the first time.
> 2.4.9-13 takes a while the first time(without the 30 second new process
> freezing), but then takes almost no time the times after that. One
> interesting thing I noticed is that with and without preemptive a
> already started mp3 playing had no disruption even during the 30 second
> windows where any new commands would get stuck with 2.4.16. I am not
> using custom
>
> I plan to do more testing to see how say 2.4.9, 2.4.13ac7, etc.
>
> Any ideas of how to fix this for 2.4.16?
>
> I have attached my .config.
>
> My system:
>
> Redhat 7.2 with all updates
>
> Athlon Thunderbird 1.33ghz
> 768mb(512mb, 256mb) PC133 SDRAM
> Abit KT7A-RAID v1.0(KT133A chipset)
> Bios 64
> HPT370(bios v1.2.0604)
> Primary Master Quantum Fireball AS40.0
> Secondary Master IBM-DTLA-307045
> VIA686B
> Primary Master CREATIVE DVD-ROM DVD6240E
> Secondary Master CR-2801TE
>

2001-11-27 03:57:13

by Doug Ledford

Subject: Re: Unresponiveness of 2.4.16

Sean Elble wrote:

>>I tried switching to Redhat's 2.4.9-13 kernel and it acts Alot better.
>>Not only does 2.4.9-13 not get the 30 second delay, but it also seems to
>>take advantage of caching. 2.4.16 takes the same moment of time each
>>time, even tho it should have cached it all into memory the first time.
>>
>
> Unless Red Hat has specifically added Andrea's new VM code to the 2.4.9
> kernel, then that kernel is still using the old VM.


Not exactly. That kernel is -ac based (plus lots of other patches, some
of them VM tweaks) and is a Van Riel VM.




--

Doug Ledford <[email protected]> http://people.redhat.com/dledford
Please check my web site for aic7xxx updates/answers before
e-mailing me about problems

2001-11-27 04:01:36

by Sean Elble

Subject: Re: Unresponiveness of 2.4.16

> Not exactly. That kernel is -ac based (plus lots of other patches, some
> of them VM tweaks) and is a Van Riel VM.

Right; it's not the "stock" 2.4.9 VM, but it isn't Andrea's either . . . one
of those gray area things. :-) I guess we just have to wait until he posts
the results with the "stock" 2.4.9 kernel to see if Red Hat fixed the
problem or not. Have a good one!

-----------------------------------------------
Sean P. Elble
Editor, Writer, Co-Webmaster
ReactiveLinux.com (Formerly MaximumLinux.org)
http://www.reactivelinux.com/
[email protected]
-----------------------------------------------

----- Original Message -----
From: "Doug Ledford" <[email protected]>
To: "Sean Elble" <[email protected]>
Cc: "Nathan G. Grennan" <[email protected]>;
<[email protected]>
Sent: Monday, November 26, 2001 10:56 PM
Subject: Re: Unresponiveness of 2.4.16


> Sean Elble wrote:
>
> >>I tried switching to Redhat's 2.4.9-13 kernel and it acts Alot better.
> >>Not only does 2.4.9-13 not get the 30 second delay, but it also seems to
> >>take advantage of caching. 2.4.16 takes the same moment of time each
> >>time, even tho it should have cached it all into memory the first time.
> >>
> >
> > Unless Red Hat has specifically added Andrea's new VM code to the 2.4.9
> > kernel, then that kernel is still using the old VM.
>
>
> Not exactly. That kernel is -ac based (plus lots of other patches, some
> of them VM tweaks) and is a Van Riel VM.
>
>
>
>
> --
>
> Doug Ledford <[email protected]> http://people.redhat.com/dledford
> Please check my web site for aic7xxx updates/answers before
> e-mailing me about problems

2001-11-27 04:35:07

by Masanori Goto

Subject: Re: Unresponiveness of 2.4.16

At Mon, 26 Nov 2001 14:44:19 -0800,
Lincoln Dale <[email protected]> wrote:
> At 10:17 PM 26/11/2001 +0000, Alan Cox wrote:
> > > 2.4.16 becomes very unresponsive for 30 seconds or so at a time during
> > > large unarchiving of tarballs, like tar -zxf mozilla-src.tar.gz. The
> > > file is about 36mb. I run top in one window, run free repeatedly in
> >
> >This seems to be one of the small as yet unresolved problems with the newer
> >VM code in 2.4.16. I've not managed to prove its the VM or the differing
> >I/O scheduling rules however.
>
> it is I/O scheduling.
>
> i have a system with a large amount of RAM.
> it has both 15K RPM SCSI disks (off a symbios controller) and some bog-slow
> IDE/ATA disks which the system decides to use PIO for rather than DMA. (i
> don't use them for anything other than bootup so don't really care about it
> deciding to use PIO..).
>
> a copy to/from the 15K RPM SCSI disks doesn't show any performance problems.
> a copy to/from the PIO-based IDE disks has the same effect -- 20/30 seconds
> of no interactiveness -- even a "vmstat 1" *stops* for 20-30 seconds while
> 200+MB of buffer-cache data gets written out to disk.

I guess this problem has been posted repeatedly...
Is it related to the IDE chip or chipset code?
I use an Athlon on KT133A plus 2 IDE disks, and I'm also experiencing
this problem with only 1 disk. But I don't know whether it's PIO-based or not.

-- gotom



2001-11-27 04:39:27

by Mike Fedyk

Subject: Re: Unresponiveness of 2.4.16

On Mon, Nov 26, 2001 at 04:36:25PM -0800, Andrew Morton wrote:
> The patch I sent puts read requests near the head of the request
> queue, and to hell with aggregate throughput. It's tunable with
> `elvtune -b'. And it fixes it.

for i in `seq 9`; do elvtune -b $i /dev/hda; done

-b doesn't seem to change the "max_bomb_segments". Does your patch fix this?

Tested on 2.4.15-pre1.

MF

2001-11-27 04:46:47

by Andrew Morton

Subject: Re: Unresponiveness of 2.4.16

Mike Fedyk wrote:
>
> On Mon, Nov 26, 2001 at 04:36:25PM -0800, Andrew Morton wrote:
> > The patch I sent puts read requests near the head of the request
> > queue, and to hell with aggregate throughput. It's tunable with
> > `elvtune -b'. And it fixes it.
>
> for i in `seq 9`; do elvtune -b $i /dev/hda; done
>
> -b doesn't seem to change the "max_bomb_segments". Does your patch fix this?
>

Yes, it does.

Presumably, once upon a time, max_bomb_segments actually did
something. But it's a complete no-op at present, so I co-opted it.

Nice name, but I'd prefer max_cluster_bombs.

-

2001-11-27 07:45:18

by Jens Axboe

Subject: Re: Unresponiveness of 2.4.16

On Mon, Nov 26 2001, Andrew Morton wrote:
> 2: The current elevator design is downright cruel to humans in
> the presence of heavy write traffic.

max_bomb_segments logic was established to help absolutely _nothing_ a
long time ago.

I agree that the current i/o scheduler has really bad interactive
performance -- at first sight your changes look mostly like add-on
hacks though. Arjan's priority-based scheme is more promising.

--
Jens Axboe

2001-11-27 08:00:26

by Mike Fedyk

Subject: Re: Unresponiveness of 2.4.16

On Tue, Nov 27, 2001 at 08:42:34AM +0100, Jens Axboe wrote:
> On Mon, Nov 26 2001, Andrew Morton wrote:
> > 2: The current elevator design is downright cruel to humans in
> > the presence of heavy write traffic.
>
> max_bomb_segments logic was established to help absolutely _nothing_ a
> long time ago.
>
> I agree that the current i/o scheduler has really bad interactive
> performance -- at first sight your changes looks mostly like add-on
> hacks though. Arjan's priority based scheme is more promising.
>

Based on pid priority or niceness?

2001-11-27 08:03:27

by Jens Axboe

Subject: Re: Unresponiveness of 2.4.16

On Mon, Nov 26 2001, Mike Fedyk wrote:
> On Tue, Nov 27, 2001 at 08:42:34AM +0100, Jens Axboe wrote:
> > On Mon, Nov 26 2001, Andrew Morton wrote:
> > > 2: The current elevator design is downright cruel to humans in
> > > the presence of heavy write traffic.
> >
> > max_bomb_segments logic was established to help absolutely _nothing_ a
> > long time ago.
> >
> > I agree that the current i/o scheduler has really bad interactive
> > performance -- at first sight your changes looks mostly like add-on
> > hacks though. Arjan's priority based scheme is more promising.
> >
>
> Based on pid priority or niceness?

None of the above yet. It isn't hard to add process I/O priority and
inherit that once the support is there in the i/o scheduler / block
layer, though.

--
Jens Axboe

2001-11-27 08:33:25

by Andrew Morton

Subject: Re: Unresponiveness of 2.4.16

Jens Axboe wrote:
>
> I agree that the current i/o scheduler has really bad interactive
> performance -- at first sight your changes looks mostly like add-on
> hacks though.

Good hacks, or bad ones?

It keeps things localised. It works. It's tunable. It's the best
IO scheduler presently available.

> Arjan's priority based scheme is more promising.

If the IO priority becomes an attribute of the calling process
then an approach like that has value. For writes, the priority
should be driven by VM pressure and it's probably simpler just
to stick the priority into struct buffer_head -> struct request.
For reads, the priority could just be scooped out of *current.

If we're not going to push the IO priority all the way down from
userspace then you may as well keep the logic inside the elevator
and just say reads-go-here and writes-go-there.

But this has potential to turn into a great designfest. Are
we going to leave 2.4 as-is? Please say no.

-

2001-11-27 08:38:35

by Jens Axboe

Subject: Re: Unresponiveness of 2.4.16

On Tue, Nov 27 2001, Andrew Morton wrote:
> Jens Axboe wrote:
> >
> > I agree that the current i/o scheduler has really bad interactive
> > performance -- at first sight your changes looks mostly like add-on
> > hacks though.
>
> Good hacks, or bad ones?
>
> It keeps things localised. It works. It's tunable. It's the best
> IO scheduler presently available.

Hacks look ok on cursory glances :-)

> > Arjan's priority based scheme is more promising.
>
> If the IO priority becomes an attribute of the calling process
> then an approach like that has value. For writes, the priority
> should be driven by VM pressure and it's probably simpler just
> to stick the priority into struct buffer_head -> struct request.
> For reads, the priority could just be scooped out of *current.
>
> If we're not going to push the IO priority all the way down from
> userspace then you may as well keep the logic inside the elevator
> and just say reads-go-here and writes-go-there.

Priority will be passed down for reads as you suggest, at least that is
the intention I had as well. I've only worked on 2.5 with this, but I
guess we can find some space in the buffer_head to squeeze in some
priority bits.

> But this has potential to turn into a great designfest. Are

Oh yeah

> we going to leave 2.4 as-is? Please say no.

I'd be happy to review anything you come up with -- or in other words,
feel free to knock yourself out, I'm busy with other stuff currently :)

--
Jens Axboe

2001-11-27 09:14:26

by Ahmed Masud

Subject: RE: Unresponiveness of 2.4.16

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Nicolas
> Pitre Sent: Monday, November 26, 2001 6:34 PM
> To: Alan Cox
> Cc: Nathan G. Grennan; lkml
> Subject: Re: Unresponiveness of 2.4.16
>
>
> On Mon, 26 Nov 2001, Alan Cox wrote:
>
> > > 2.4.16 becomes very unresponsive for 30 seconds or so at a time
> > > during large unarchiving of tarballs, like tar -zxf
> > > mozilla-src.tar.gz. The file is about 36mb. I run top in
> one window,
> > > run free repeatedly in
> >
> > This seems to be one of the small as yet unresolved
> problems with the
> > newer VM code in 2.4.16. I've not managed to prove its the
> VM or the
> > differing I/O scheduling rules however.
>
> FWIW...
>
> I experienced quite the same unresponsiveness but more in the
> order of 4-5 seconds since I started to use ext3 with RH 7.2
> (i.e. kernel 2.4.7 based).
> I'm currently running 2.4.15-pre7 and the same momentary
> stalls are there just like with 2.4.7. It is much more
> visible when applying large patches to a kernel source tree
> as the patch output stops scrolling from time to time for
> about 5 secs. I never saw such thing while previously using
> reiserfs.
> I've yet to try reiserfs on a 2.4.16 tree to see if this is
> actually an ext3 problem.
>
>

Just to add to the above something I've experienced:

2.4.12 - 2.4.14 on a number of AMD Athlon 900 machines with 256 MB
RAM doing serial I/O would miss data while any disk writes would
occur.

Reads would be okay, but not writes of any significance, like untarring a
relatively large tarball (> 10 MB).

While turning on UDMA for the PROMISE PDC20265 chipset significantly
reduced the sluggishness (by an order of magnitude), the problem would
still crop up whenever there were more than three processes doing disk
writes.


CPU: AMD 900 Athlon
Chipset: VIA
IDE Controller: PROMISE PDC20265
Disks: IBM ATA100 IC35L020AVER07-0

I tried the same operations on reiserfs, ext2 and ext3; on direct
partitions, on software RAID 1 devices, and on LVM (1.0.1-rc4 patches from
Sistina).

All permutations with all kernels 2.4.12 through 2.4.14 yield
identical results ... loss of data while selecting on serial ports while
there are heavy writes to the filesystem.

Doing the same operation on the same hardware with 2.2.16 yields no loss
of data.

Perhaps if I can get some guidance as to what else to try to resolve
whether this is a VM-related problem or an I/O-subsystem-related problem,
I'll be more than happy to experiment and relay the results.

Ahmed

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 6.5.3 for non-commercial use <http://www.pgp.com>

iQA/AwUBPANZG+A+WVFT6/r4EQL7PgCg3dWSrBDxsxqCF6OY1YiKDiEd34sAnA4W
S6Zb2wfzBj6bXETTFNoYzTlW
=HFWs
-----END PGP SIGNATURE-----

2001-11-27 09:56:35

by willy tarreau

Subject: Re: Unresponiveness of 2.4.16

> Please try this lot:

Hi Andrew,

I just tried 2.4.16 with and without your patch. During the test, I wrote
a 640 MB file on an IDE disk at an average speed of 10 MB/s. Without your
patch, I could easily reproduce the sluggishness other people report,
mostly at the login prompt. But when I applied your patch, I could log in
immediately, so yes, I can say that your patch improves things
dramatically.

I can't say yet if there are side effects, but I keep testing.

Regards,
Willy



2001-11-27 10:53:38

by Heinz Diehl

Subject: Re: Unresponiveness of 2.4.16

On Tue Nov 27 2001, willy tarreau wrote:

> I just tried 2.4.16 with and without your patch.

I applied Andrew's patch to 2.5.1-pre1.

> Without your patch, I could easily
> reproduce the slugginess other people report, mostly
> at the login prompt. But when I applied your patch, I can
> log in immediately, so yes, I can say that your patch
> improves things dramatically.

The same thing here: with the patch applied, things improved; without it,
I can also easily reproduce the unresponsiveness. It definitely fixes
the problem...

--
# Heinz Diehl, 68259 Mannheim, Germany

2001-11-27 17:13:33

by Andrew Morton

Subject: Re: Unresponiveness of 2.4.16

Ahmed Masud wrote:
>
> Just to add to the above something I've experienced:
>
> 2.4.12 - 2.4.14 on a number of AMD Athelon 900 with 256 MB
> RAM doing serial I/O would miss data while any DISK writes would
> occure.

Two possibilities suggest themselves:

- Interrupt latency. Last time I checked (a year ago), the worst-case
interrupt latency of the IDE drivers was 80 microseconds on a 500MHz PII.
That was with `hdparm -u 1'. That's pretty good.

Could you please confirm that you're using `hdparm -u 1' against the
relevant disk?

- The serial port is working OK, but the application which is handling
serial IO is blocked on a disk read (something got paged out), and
that disk read fails to complete by the time the serial port buffer
fills up.
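
For the first possibility, the interrupt-unmask flag can be checked and set
with standard hdparm usage (shown here only as an example):

  hdparm -u /dev/hda      # query the current interrupt-unmask setting
  hdparm -u 1 /dev/hda    # allow other interrupts while the IDE interrupt is serviced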

I'll send you a patch which makes the VM less inclined to page things
out in the presence of heavy writes, and which decreases read
latencies.

Thanks.

2001-11-27 20:31:48

by Mike Fedyk

Subject: Re: Unresponiveness of 2.4.16

On Tue, Nov 27, 2001 at 09:12:13AM -0800, Andrew Morton wrote:
> Ahmed Masud wrote:
> >
> > Just to add to the above something I've experienced:
> >
> > 2.4.12 - 2.4.14 on a number of AMD Athelon 900 with 256 MB
> > RAM doing serial I/O would miss data while any DISK writes would
> > occure.
>
> Two possibilities suggest themselves:
>
> - Interrupt latency. Last time I checked (a year ago), the worst-case
> interrupt latency of the IDE drivers was 80 microseconds on a 500MHz PII.
> That was with `hdparm -u 1'. That's pretty good.
>
> Could you please confirm that you're using `hdparm -u 1' against the
> relevant disk?
>
> - The serial port is working OK, but the application which is handling
> serial IO is blocked on a disk read (something got paged out), and
> that disk read fails to complete by the time the serial port buffer
> fills up.
>
> I'll send you a patch which makes the VM less inclined to page things
> out in the presence of heavy writes, and which decreases read
> latencies.
>
Is this patch posted anywhere?

2001-11-27 21:00:18

by Andrew Morton

Subject: Re: Unresponiveness of 2.4.16

Mike Fedyk wrote:
>
> > I'll send you a patch which makes the VM less inclined to page things
> > out in the presence of heavy writes, and which decreases read
> > latencies.
> >
> Is this patch posted anywhere?

I sent it yesterday, in this thread. Here it is again.

Description:

- Account for locked as well as dirty buffers when deciding
to throttle writers.

- Tweak VM to make it work the inactive list harder, before starting
to evict pages or swap.

- Change the elevator so that once a request's latency has
expired, we can still perform merges in front of that
request. But we no longer will insert new requests in
front of that request.

- Modify elevator so that new read requests do not have
more than N write requests placed in front of them, where
N is tunable per-device with `elvtune -b'.

Theoretically, the last change needs significant alterations
to the readahead code. But a rewrite of readahead made negligible
difference (I wasn't able to trigger the failure scenario).
Still crunching on this.



--- linux-2.4.16-pre1/fs/buffer.c Thu Nov 22 23:02:58 2001
+++ linux-akpm/fs/buffer.c Sun Nov 25 00:07:47 2001
@@ -1036,6 +1036,7 @@ static int balance_dirty_state(void)
unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;

dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
+ dirty += size_buffers_type[BUF_LOCKED] >> PAGE_SHIFT;
tot = nr_free_buffer_pages();

dirty *= 100;
--- linux-2.4.16-pre1/mm/filemap.c Sat Nov 24 13:14:52 2001
+++ linux-akpm/mm/filemap.c Sun Nov 25 00:07:47 2001
@@ -3023,7 +3023,18 @@ generic_file_write(struct file *file,con
unlock:
kunmap(page);
/* Mark it unlocked again and drop the page.. */
- SetPageReferenced(page);
+// SetPageReferenced(page);
+ ClearPageReferenced(page);
+#if 0
+ {
+ lru_cache_del(page);
+ TestSetPageLRU(page);
+ spin_lock(&pagemap_lru_lock);
+ list_add_tail(&(page)->lru, &inactive_list);
+ nr_inactive_pages++;
+ spin_unlock(&pagemap_lru_lock);
+ }
+#endif
UnlockPage(page);
page_cache_release(page);

--- linux-2.4.16-pre1/mm/vmscan.c Thu Nov 22 23:02:59 2001
+++ linux-akpm/mm/vmscan.c Sun Nov 25 00:08:03 2001
@@ -573,6 +573,9 @@ static int shrink_caches(zone_t * classz
nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
if (nr_pages <= 0)
return 0;
+ nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
+ if (nr_pages <= 0)
+ return 0;

shrink_dcache_memory(priority, gfp_mask);
shrink_icache_memory(priority, gfp_mask);
@@ -585,7 +588,7 @@ static int shrink_caches(zone_t * classz

int try_to_free_pages(zone_t *classzone, unsigned int gfp_mask, unsigned int order)
{
- int priority = DEF_PRIORITY;
+ int priority = DEF_PRIORITY - 2;
int nr_pages = SWAP_CLUSTER_MAX;

do {


--- linux-2.4.16/include/linux/elevator.h Thu Feb 15 16:58:34 2001
+++ linux-akpm/include/linux/elevator.h Tue Nov 27 12:34:59 2001
@@ -5,8 +5,9 @@ typedef void (elevator_fn) (struct reque
struct list_head *,
struct list_head *, int);

-typedef int (elevator_merge_fn) (request_queue_t *, struct request **, struct list_head *,
- struct buffer_head *, int, int);
+typedef int (elevator_merge_fn)(request_queue_t *, struct request **,
+ struct list_head *, struct buffer_head *bh,
+ int rw, int max_sectors, int max_bomb_segments);

typedef void (elevator_merge_cleanup_fn) (request_queue_t *, struct request *, int);

@@ -16,6 +17,7 @@ struct elevator_s
{
int read_latency;
int write_latency;
+ int max_bomb_segments;

elevator_merge_fn *elevator_merge_fn;
elevator_merge_cleanup_fn *elevator_merge_cleanup_fn;
@@ -24,13 +26,13 @@ struct elevator_s
unsigned int queue_ID;
};

-int elevator_noop_merge(request_queue_t *, struct request **, struct list_head *, struct buffer_head *, int, int);
-void elevator_noop_merge_cleanup(request_queue_t *, struct request *, int);
-void elevator_noop_merge_req(struct request *, struct request *);
-
-int elevator_linus_merge(request_queue_t *, struct request **, struct list_head *, struct buffer_head *, int, int);
-void elevator_linus_merge_cleanup(request_queue_t *, struct request *, int);
-void elevator_linus_merge_req(struct request *, struct request *);
+elevator_merge_fn elevator_noop_merge;
+elevator_merge_cleanup_fn elevator_noop_merge_cleanup;
+elevator_merge_req_fn elevator_noop_merge_req;
+
+elevator_merge_fn elevator_linus_merge;
+elevator_merge_cleanup_fn elevator_linus_merge_cleanup;
+elevator_merge_req_fn elevator_linus_merge_req;

typedef struct blkelv_ioctl_arg_s {
int queue_ID;
@@ -54,22 +56,6 @@ extern void elevator_init(elevator_t *,
#define ELEVATOR_FRONT_MERGE 1
#define ELEVATOR_BACK_MERGE 2

-/*
- * This is used in the elevator algorithm. We don't prioritise reads
- * over writes any more --- although reads are more time-critical than
- * writes, by treating them equally we increase filesystem throughput.
- * This turns out to give better overall performance. -- sct
- */
-#define IN_ORDER(s1,s2) \
- ((((s1)->rq_dev == (s2)->rq_dev && \
- (s1)->sector < (s2)->sector)) || \
- (s1)->rq_dev < (s2)->rq_dev)
-
-#define BHRQ_IN_ORDER(bh, rq) \
- ((((bh)->b_rdev == (rq)->rq_dev && \
- (bh)->b_rsector < (rq)->sector)) || \
- (bh)->b_rdev < (rq)->rq_dev)
-
static inline int elevator_request_latency(elevator_t * elevator, int rw)
{
int latency;
@@ -85,7 +71,7 @@ static inline int elevator_request_laten
((elevator_t) { \
0, /* read_latency */ \
0, /* write_latency */ \
- \
+ 0, /* max_bomb_segments */ \
elevator_noop_merge, /* elevator_merge_fn */ \
elevator_noop_merge_cleanup, /* elevator_merge_cleanup_fn */ \
elevator_noop_merge_req, /* elevator_merge_req_fn */ \
@@ -95,7 +81,7 @@ static inline int elevator_request_laten
((elevator_t) { \
8192, /* read passovers */ \
16384, /* write passovers */ \
- \
+ 6, /* max_bomb_segments */ \
elevator_linus_merge, /* elevator_merge_fn */ \
elevator_linus_merge_cleanup, /* elevator_merge_cleanup_fn */ \
elevator_linus_merge_req, /* elevator_merge_req_fn */ \
--- linux-2.4.16/drivers/block/elevator.c Thu Jul 19 20:59:41 2001
+++ linux-akpm/drivers/block/elevator.c Tue Nov 27 12:35:20 2001
@@ -74,36 +74,41 @@ inline int bh_rq_in_between(struct buffe
return 0;
}

-
int elevator_linus_merge(request_queue_t *q, struct request **req,
struct list_head * head,
struct buffer_head *bh, int rw,
- int max_sectors)
+ int max_sectors, int max_bomb_segments)
{
- struct list_head *entry = &q->queue_head;
- unsigned int count = bh->b_size >> 9, ret = ELEVATOR_NO_MERGE;
+ struct list_head *entry;
+ unsigned int count = bh->b_size >> 9;
+ unsigned int ret = ELEVATOR_NO_MERGE;
+ int no_in_between = 0;

+ entry = &q->queue_head;
while ((entry = entry->prev) != head) {
struct request *__rq = blkdev_entry_to_request(entry);
-
- /*
- * simply "aging" of requests in queue
- */
- if (__rq->elevator_sequence-- <= 0)
- break;
-
+ if (__rq->elevator_sequence-- <= 0) {
+ /*
+ * OK, we've exceeded someone's latency limit.
+ * But we still continue to look for merges,
+ * because they're so much better than seeks.
+ */
+ no_in_between = 1;
+ }
if (__rq->waiting)
continue;
if (__rq->rq_dev != bh->b_rdev)
continue;
- if (!*req && bh_rq_in_between(bh, __rq, &q->queue_head))
+ if (!*req && !no_in_between &&
+ bh_rq_in_between(bh, __rq, &q->queue_head)) {
*req = __rq;
+ }
if (__rq->cmd != rw)
continue;
if (__rq->nr_sectors + count > max_sectors)
continue;
if (__rq->elevator_sequence < count)
- break;
+ no_in_between = 1;
if (__rq->sector + __rq->nr_sectors == bh->b_rsector) {
ret = ELEVATOR_BACK_MERGE;
*req = __rq;
@@ -116,6 +121,56 @@ int elevator_linus_merge(request_queue_t
}
}

+ /*
+ * If we failed to merge a read anywhere in the request
+ * queue, we really don't want to place it at the end
+ * of the list, behind lots of writes. So place it near
+ * the front.
+ *
+ * We don't want to place it in front of _all_ writes: that
+ * would create lots of seeking, and isn't tunable.
+ * We try to avoid promoting this read in front of existing
+ * reads.
+ *
+ * max_bomb_sectors becomes the maximum number of write
+ * requests which we allow to remain in place in front of
+ * a newly introduced read. We weight things a little bit,
+ * so large writes are more expensive than small ones, but it's
+ * requests which count, not sectors.
+ */
+ if (rw == READ && ret == ELEVATOR_NO_MERGE) {
+ int cur_latency = 0;
+ struct request * const cur_request = *req;
+
+ entry = head->next;
+ while (entry != &q->queue_head) {
+ struct request *__rq;
+
+ if (entry == &q->queue_head)
+ BUG();
+ if (entry == q->queue_head.next &&
+ q->head_active && !q->plugged)
+ BUG();
+ __rq = blkdev_entry_to_request(entry);
+
+ if (__rq == cur_request) {
+ /*
+ * This is where the old algorithm placed it.
+ * There's no point pushing it further back,
+ * so leave it here, in sorted order.
+ */
+ break;
+ }
+ if (__rq->cmd == WRITE) {
+ cur_latency += 1 + __rq->nr_sectors / 64;
+ if (cur_latency >= max_bomb_segments) {
+ *req = __rq;
+ break;
+ }
+ }
+ entry = entry->next;
+ }
+ }
return ret;
}

@@ -144,7 +199,7 @@ void elevator_linus_merge_req(struct req
int elevator_noop_merge(request_queue_t *q, struct request **req,
struct list_head * head,
struct buffer_head *bh, int rw,
- int max_sectors)
+ int max_sectors, int max_bomb_segments)
{
struct list_head *entry;
unsigned int count = bh->b_size >> 9;
@@ -188,7 +243,7 @@ int blkelvget_ioctl(elevator_t * elevato
output.queue_ID = elevator->queue_ID;
output.read_latency = elevator->read_latency;
output.write_latency = elevator->write_latency;
- output.max_bomb_segments = 0;
+ output.max_bomb_segments = elevator->max_bomb_segments;

if (copy_to_user(arg, &output, sizeof(blkelv_ioctl_arg_t)))
return -EFAULT;
@@ -207,9 +262,12 @@ int blkelvset_ioctl(elevator_t * elevato
return -EINVAL;
if (input.write_latency < 0)
return -EINVAL;
+ if (input.max_bomb_segments < 0)
+ return -EINVAL;

elevator->read_latency = input.read_latency;
elevator->write_latency = input.write_latency;
+ elevator->max_bomb_segments = input.max_bomb_segments;
return 0;
}

--- linux-2.4.16/drivers/block/ll_rw_blk.c Mon Nov 5 21:01:11 2001
+++ linux-akpm/drivers/block/ll_rw_blk.c Tue Nov 27 12:34:59 2001
@@ -690,7 +690,8 @@ again:
} else if (q->head_active && !q->plugged)
head = head->next;

- el_ret = elevator->elevator_merge_fn(q, &req, head, bh, rw,max_sectors);
+ el_ret = elevator->elevator_merge_fn(q, &req, head, bh,
+ rw, max_sectors, elevator->max_bomb_segments);
switch (el_ret) {

case ELEVATOR_BACK_MERGE:

2001-11-27 21:19:38

by Martin Eriksson

Subject: Re: Unresponiveness of 2.4.16

----- Original Message -----
From: "Andrew Morton" <[email protected]>
To: "Mike Fedyk" <[email protected]>
Cc: "Ahmed Masud" <[email protected]>; "'lkml'"
<[email protected]>
Sent: Tuesday, November 27, 2001 9:57 PM
Subject: Re: Unresponiveness of 2.4.16


> Mike Fedyk wrote:
> >
> > > I'll send you a patch which makes the VM less inclined to page
things
> > > out in the presence of heavy writes, and which decreases read
> > > latencies.
> > >
> > Is this patch posted anywhere?
>
> I sent it yesterday, in this thread. Here it is again.

<snip>

I have made it available at
http://www.cs.umu.se/~c97men/linux/am-response-2.4.16.patch

because I personally like a link or attachment, as that doesn't mess up the
whitespace... (goddamn OE). I hope you don't mind?

Btw, I'm happily running your patch with
2.4.16 (final)
preempt-kernel-rml-2.4.16-1
ide.2.4.16-p1.11242001

/Martin

2001-11-27 21:25:10

by Mike Fedyk

Subject: Re: Unresponiveness of 2.4.16

On Tue, Nov 27, 2001 at 12:57:19PM -0800, Andrew Morton wrote:
> Mike Fedyk wrote:
> >
> > > I'll send you a patch which makes the VM less inclined to page things
> > > out in the presence of heavy writes, and which decreases read
> > > latencies.
> > >
> > Is this patch posted anywhere?
>
> I sent it yesterday, in this thread. Here it is again.
>

Yep, saw it. I didn't realize (didn't read patch) that it modified the VM
swapping.

> Description:
>
> - Account for locked as well as dirty buffers when deciding
> to throttle writers.
>
> - Tweak VM to make it work the inactive list harder, before starting
> to evict pages or swap.
>
> - Change the elevator so that once a request's latency has
> expired, we can still perform merges in front of that
> request. But we no longer will insert new requests in
> front of that request.
>
> - Modify elevator so that new read requests do not have
> more than N write requests placed in front of them, where
> N is tunable per-device with `elvtune -b'.
>
> Theoretically, the last change needs significant alterations
> to the readhead code. But a rewrite of readhead made negligible
> difference (I wasn't able to trigger the failure scenario).
> Still crunching on this.
>

Sounds great.

I'll test it out.

MF

2001-11-28 00:35:08

by Torrey Hoffman

[permalink] [raw]
Subject: RE: Unresponiveness of 2.4.16


I've been running 2.4.16 with this VM patch combined with your
2.4.15-pre7-low-latency patch from http://www.zip.com.au (it applied with a
little fuzz, no rejects). Is this a combination that you would feel
comfortable with?

So far it hasn't blown up on me, and in fact seems very quick and
responsive.

Unless I hear a "No, don't do that!", I'm going to push this kernel into
testing for our video applications...

Thanks!

Torrey Hoffman
[email protected]

Andrew Morton wrote:
[...]
> Description:
>
> - Account for locked as well as dirty buffers when deciding
> to throttle writers.
>
> - Tweak VM to make it work the inactive list harder, before starting
> to evict pages or swap.
>
> - Change the elevator so that once a request's latency has
> expired, we can still perform merges in front of that
> request. But we no longer will insert new requests in
> front of that request.
>
> - Modify elevator so that new read requests do not have
> more than N write requests placed in front of them, where
> N is tunable per-device with `elvtune -b'.
>
> Theoretically, the last change needs significant alterations
> to the readhead code. But a rewrite of readhead made negligible
> difference (I wasn't able to trigger the failure scenario).
> Still crunching on this.

2001-11-28 00:49:48

by Andrew Morton

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16

Torrey Hoffman wrote:
>
> I've been running 2.4.16 with this VM patch combined with your
> 2.4.15-pre7-low-latency patch from http://www.zip.com.au (it applied with a
> little fuzz, no rejects). Is this a combination that you would feel
> comfortable with?

Should be OK. There is a possibility of livelock when you have
a lot of dirty buffers against multiple devices. It may
be a good idea to pick up the 2.4.16 low-latency patch.
http://www.zip.com.au/~akpm/linux/2.4.16-low-latency.patch.gz

> So far it hasn't blown up on me, and in fact seems very quick and
> responsive.
>
> Unless I hear a "No, don't do that!", I'm going to push this kernel into
> testing for our video applications...

If any quantitative results become available, please share...

-

2001-11-28 04:14:48

by Robert Love

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16

On Tue, 2001-11-27 at 22:53, Dieter Nützel wrote:

> To Robert Love:
> I get the following in dmesg:
> lock-break-rml-2.4.16-1.patch
>
> date: busy buffer
> lock_break: buffer.c:681: count was 2 not 551
> invalidate: busy buffer
> lock_break: buffer.c:681: count was 2 not 551
> invalidate: busy buffer

Thanks for the feedback, Dieter.

Robert Love

2001-11-28 18:57:38

by Torrey Hoffman

[permalink] [raw]
Subject: RE: Unresponiveness of 2.4.16

Hmm. Speaking of dbench, I tried the combination of 2.4.16,
your 2.4.16 low latency patch, and the IO scheduling patch
on my dual PIII.

After starting it up I did a dbench 32 on a 180 GB reiserfs
running on software RAID 5, just to see if it would
fall over, and during the run I got the following error/
warning message printed about 20 times on the console
and in the kernel log:

vs-4150: reiserfs_new_blocknrs, block not free<4>

Took it to single user mode after that and ran reiserfsck,
which printed a lot of stuff but I don't think it found any
problems.

Went back to 2.4.15-pre5 and could not reproduce the problem
on that kernel.

Torrey

2001-11-28 19:27:09

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16



On Tue, 27 Nov 2001, Andrew Morton wrote:

> Torrey Hoffman wrote:
> >
> > I've been running 2.4.16 with this VM patch combined with your
> > 2.4.15-pre7-low-latency patch from http://www.zip.com.au (it applied with a
> > little fuzz, no rejects). Is this a combination that you would feel
> > comfortable with?
>
> Should be OK. There is a possibility of livelock when you have
> a lot of dirty buffers against multiple devices.

Could you please describe this one ?

2001-11-28 19:33:09

by Andrew Morton

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16

Torrey Hoffman wrote:
>
> Hmm. Speaking of dbench, I tried the combination of 2.4.16,
> your 2.4.16 low latency patch, and the IO scheduling patch
> on my dual PIII.
>
> After starting it up I did a dbench 32 on a 180 GB reiserfs
> running on software RAID 5, just to see if it would
> fall over, and during the run I got the following error/
> warning message printed about 20 times on the console
> and in the kernel log:
>
> vs-4150: reiserfs_new_blocknrs, block not free<4>
>

uh-oh. I probably broke reiserfs in the low-latency patch.

It's fairly harmless - we drop the big kernel lock, schedule
away. Upon resumption, the block we had decided to allocate
has been allocated by someone else. The filesystem emits a
warning and goes off to find a different block.
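
To make the race concrete, the pattern being described looks roughly like the
sketch below. It is illustrative only, not fs/reiserfs/bitmap.c: the helpers
find_free_block(), still_free() and mark_block_in_use() are hypothetical,
while lock_kernel(), unlock_kernel() and schedule() are the usual 2.4
primitives.

retry:
        blocknr = find_free_block(bitmap);      /* chosen while holding the BKL */

        if (current->need_resched) {            /* low-latency rescheduling point */
                unlock_kernel();
                schedule();
                lock_kernel();
        }

        if (!still_free(bitmap, blocknr)) {     /* someone else claimed it meanwhile */
                printk(KERN_WARNING "block not free, retrying\n");
                goto retry;
        }
        mark_block_in_use(bitmap, blocknr);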

Will fix.

-

2001-11-28 19:39:59

by Andrew Morton

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16

Marcelo Tosatti wrote:
>
> On Tue, 27 Nov 2001, Andrew Morton wrote:
>
> > Torrey Hoffman wrote:
> > >
> > > I've been running 2.4.16 with this VM patch combined with your
> > > 2.4.15-pre7-low-latency patch from http://www.zip.com.au (it applied with a
> > > little fuzz, no rejects). Is this a combination that you would feel
> > > comfortable with?
> >
> > Should be OK. There is a possibility of livelock when you have
> > a lot of dirty buffers against multiple devices.
>
> Could you please describe this one ?

It's a recurring problem with the low-latency patch. Basically:

restart:
        spin_lock(some_lock);
        for (lots of data) {
                if (current->need_resched) {
                        /* drop the lock, yield, and start again from scratch */
                        spin_unlock(some_lock);
                        schedule();
                        goto restart;
                }
                if (something_which_is_often_true)
                        continue;
                other_stuff();
        }

If there is a realtime task which wants to be scheduled at,
say, one kilohertz, and the execution of that loop takes
more than one millisecond before it actually hits other_stuff()
and does any actual work, we make no progress at all, and we lock
up until the 1 kHz scheduling pressure is stopped.

In the 2.4.15-pre low-latency patch this can happen if we're
running fsync_dev(devA) and there are heaps of buffers for
devB on a list.

It's not a problem in your kernel ;)
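
A minimal sketch of one way to bound the restart problem, reusing the
schematic loop above: require a minimum batch of real work before honouring
need_resched, so high-frequency scheduling pressure can no longer starve the
loop. This is only an illustration of the idea (MIN_BATCH and the rest are
schematic), not the actual low-latency fix; the cost is that worst-case
latency grows with the batch size.

restart:
        spin_lock(some_lock);
        done = 0;
        for (lots of data) {
                /* yield only after at least MIN_BATCH items of real work */
                if (done >= MIN_BATCH && current->need_resched) {
                        spin_unlock(some_lock);
                        schedule();
                        goto restart;
                }
                if (something_which_is_often_true)
                        continue;
                other_stuff();
                done++;
        }
        spin_unlock(some_lock);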

-

2001-11-28 19:42:20

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16



On Tue, 27 Nov 2001, Andrew Morton wrote:

> Mike Fedyk wrote:
> >
> > > I'll send you a patch which makes the VM less inclined to page things
> > > out in the presence of heavy writes, and which decreases read
> > > latencies.
> > >
> > Is this patch posted anywhere?
>
> I sent it yesterday, in this thread. Here it is again.
>
> Description:
>
> - Account for locked as well as dirty buffers when deciding
> to throttle writers.

Just one thing: if we have lots of locked buffers due to reads, we may
unnecessarily block writes, and that's not good.

But well, I prefer to fix interactivity than to care about that one kind
of workload, so I'm ok with it.

> - Tweak VM to make it work the inactive list harder, before starting
> to evict pages or swap.

I would like to see the interactivity problems get fixed on the block layer
side first: it's not a VM issue initially. Actually, the thing is that if
you tweak the VM this way you're going to break some workloads.

> - Change the elevator so that once a request's latency has
> expired, we can still perform merges in front of that
> request. But we no longer will insert new requests in
> front of that request.

Sounds fine... I've received quite many success reports already, right ?

2001-11-28 19:42:59

by Torrey Hoffman

[permalink] [raw]
Subject: RE: Unresponiveness of 2.4.16


Yes, I just looked at the code in /fs/reiserfs/bitmap.c and
the comment block above the warning message specifically mentions
the low-latency patches.

I feel better now, looks like my filesystem is safe...

Torrey

Andrew Morton wrote:
[...]

> > fall over, and during the run I got the following error/
> > warning message printed about 20 times on the console
> > and in the kernel log:
> >
> > vs-4150: reiserfs_new_blocknrs, block not free<4>
> >
>
> uh-oh. I probably broke reiserfs in the low-latency patch.
>
> It's fairly harmless - we drop the big kernel lock, schedule
> away. Upon resumption, the block we had decided to allocate
> has been allocated by someone else. The filesystem emits a
> warning and goes off to find a different block.
>
> Will fix.

2001-11-28 20:14:49

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16



On Wed, 28 Nov 2001, Marcelo Tosatti wrote:

>
>
> On Tue, 27 Nov 2001, Andrew Morton wrote:
>
> > Mike Fedyk wrote:
> > >
> > > > I'll send you a patch which makes the VM less inclined to page things
> > > > out in the presence of heavy writes, and which decreases read
> > > > latencies.
> > > >
> > > Is this patch posted anywhere?
> >
> > I sent it yesterday, in this thread. Here it is again.
> >
> > Description:
> >
> > - Account for locked as well as dirty buffers when deciding
> > to throttle writers.
>
> Just one thing: if we have lots of locked buffers due to reads, we may
> unnecessarily block writes, and that's not good.
>
> But well, I prefer to fix interactivity than to care about that one kind
> of workload, so I'm ok with it.
>
> > - Tweak VM to make it work the inactive list harder, before starting
> > to evict pages or swap.
>
> I would like to see the interactivity problems get fixed on the block layer
> side first: it's not a VM issue initially. Actually, the thing is that if
> you tweak the VM this way you're going to break some workloads.
>
> > - Change the elevator so that once a request's latency has
> > expired, we can still perform merges in front of that
> > request. But we no longer will insert new requests in
> > front of that request.
>
> Sounds fine... I've received quite many success reports already, right ?

Err...

s/I/you/

2001-11-28 20:33:01

by Andrew Morton

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16

Marcelo Tosatti wrote:
>
> On Tue, 27 Nov 2001, Andrew Morton wrote:
>
> > Mike Fedyk wrote:
> > >
> > > > I'll send you a patch which makes the VM less inclined to page things
> > > > out in the presence of heavy writes, and which decreases read
> > > > latencies.
> > > >
> > > Is this patch posted anywhere?
> >
> > I sent it yesterday, in this thread. Here it is again.
> >
> > Description:
> >
> > - Account for locked as well as dirty buffers when deciding
> > to throttle writers.
>
> Just one thing: if we have lots of locked buffers due to reads, we may
> unnecessarily block writes, and that's not good.

True. I believe this change makes balance_dirty() work as it was
originally intended to work. But in so doing, lots of things change.
Various places which have been tuned for the broken balance_dirty()
behaviour may need to be retuned. It needs testing and thought, and
a comment from Linus would be helpful.
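
In fs/buffer.c terms, "account for locked as well as dirty buffers" amounts to
something like the sketch below. It is a rough reconstruction of the idea, not
the actual patch, and the identifiers (size_buffers_type[], bdf_prm,
nr_free_buffer_pages()) are assumed to match stock 2.4:

/*
 * -1: no throttling needed,  0: start async writeback,
 *  1: block the writer until some writeback completes.
 */
static int balance_dirty_state(void)
{
        unsigned long dirty, tot, soft_dirty_limit, hard_dirty_limit;

        /* previously only BUF_DIRTY was counted; adding BUF_LOCKED makes
           data already queued for writeout count against the writer too */
        dirty  = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
        dirty += size_buffers_type[BUF_LOCKED] >> PAGE_SHIFT;
        tot = nr_free_buffer_pages();

        dirty *= 100;
        soft_dirty_limit = tot * bdf_prm.b_un.nfract;
        hard_dirty_limit = soft_dirty_limit * 2;

        if (dirty > soft_dirty_limit) {
                if (dirty > hard_dirty_limit)
                        return 1;
                return 0;
        }
        return -1;
}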

> But well, I prefer to fix interactivity than to care about that one kind
> of workload, so I'm ok with it.
>
> > - Tweak VM to make it work the inactive list harder, before starting
> > to evict pages or swap.
>
> I would like to see the interactivity problems get fixed on the block layer
> side first: it's not a VM issue initially. Actually, the thing is that if
> you tweak the VM this way you're going to break some workloads.

Possibly. I have a feeling that the VM is a bit too swaphappy,
especially in the presence of heavy write() loads. I'd rather
see more aggressive dropbehind on the write() data, than see
useful cache data dropped. But I'm not sure yet.

> > - Change the elevator so that once a request's latency has
> > expired, we can still perform merges in front of that
> > request. But we no longer will insert new requests in
> > front of that request.
>
> Sounds fine... I've received quite many success reports already, right ?

A few people have reported success. Nathan Grennan didn't.

The elevator change also needs more testing and review.
There's a possibility that it could cause a seek-storm collapse
when interacting with readahead. Currently, readhead does this:

for (some pages) {
        alloc_page()
        page_cache_read()
}

See the potential here for the alloc_page() to get abducted
by shrink_cache(), to perform IO, and to not return until after
the previous page_cache_read() has been submitted to the device?
Ouch. Putting reads nearer the elevator head exposes this possibility.

It seems to not happen, due to the vagaries of the VM-of-the-minute,
and the workload. But it could.

So the obvious change is to allocate all the readhead pages up-front
before issuing the reads. I rewrote the readhead code to do this
(and dropped about 300 lines from filemap.c in the process), but given
that the condition doesn't trigger, it doesn't make much difference.
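
The "allocate up-front" structure looks roughly like the sketch below. It is
not the actual readhead rewrite; submit_page_read() is a made-up helper
standing in for whatever ends up calling ->readpage():

static void readahead_window(struct address_space *mapping, struct file *file,
                             unsigned long start, int nr)
{
        struct page *pages[32];         /* illustrative window limit */
        int i, got = 0;

        if (nr > 32)
                nr = 32;

        /* phase 1: do all the (possibly blocking) allocations first */
        for (i = 0; i < nr; i++) {
                struct page *page = alloc_page(GFP_HIGHUSER);
                if (!page)
                        break;          /* settle for a shorter window */
                pages[got++] = page;
        }

        /* phase 2: only now queue the reads, back to back */
        for (i = 0; i < got; i++)
                submit_page_read(mapping, file, pages[i], start + i);
}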

I've spent a week so far looking closely at various performance
and usability problems with 2.4. It's still a work-in-progress.
I don't feel ready to start offering anything for merging yet,
really. Some of these things interact, and I'd prefer to get
more off-stream testing done, as well as code review.

Current patchset is at http://www.zip.com.au/~akpm/linux/2.4/2.4.17-pre1/

The list so far is:

vm-fixes.patch
The balance_dirty() and less swap-happy changes
write-cluster.patch
ext2 metadata prereading and various other hacks which
prevent writes from stumbling over reads, and thus ruining
write clustering. This patch is in the early prototype stage
readhead.patch
VM readhead rewrite. Designed to avoid the above
problem, and to make readhead growth more aggressive,
and to make readhead shrinkth less aggressive. I
don't see why we should drop the readhead window on the
floor if someone has read a few megs from a file and then
seeks elsewhere within it. Also uses common code for
mmap readhead. The madvise explicit dropbehind code
accidentally died. Oh well.
Testing with paging-intensive workloads (start X11, staroffice6)
indicates that we indeed do more IO, in less requests. But
walltime doesn't change. I may not proceed with this.
mini-ll.patch
A kinder, gentler low-latency patch, based on the one which
Andrea is maintaining. Doesn't drop any locks. As far as
I'm concerned, this can be merged today (six months ago, in
fact). It gives practically all the perceived benefit of
the preemptive kernel patch and is clearly safe.
A number of vendors are shipping kernels which are patched
to add rescheduling points to copy_*_user(), which is
much less effective than this patch. They shouldn't
be doing this.
elevator.patch
The previously-described elevator changes
inline.patch
Drops a large number of ill-chosen `inline' qualifiers
from the kernel. Removes a total of about 12,000 bytes
of instructions, almost all from the very hottest parts of
the kernel. Should prove useful for computers which
have an L1 cache which is faster than main memory.
block-alloc.patch
My nemesis. Fixing the long- and short-term fragmentation
of ext2/ext3 blocks would be a more significant performance
boost than anything else in the 2.4 series. But it's just
proving intractable. I'll probably have to drop most of
this, and look at online defrag. There's potential for
a 3x to 5x speedup here.

Also need to do something about the stalls which Nathan Grennan
has reported. On ext3 it seems to be due to atime updates.
Not sure about ext2 yet.

2001-11-28 20:55:14

by Dieter Nützel

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16

On Wednesday, 28 November 2001 20:42, Torrey Hoffman wrote:
> Yes, I just looked at the code in /fs/reiserfs/bitmap.c and
> the comment block above the warning message specifically mentions
> the low-latency patches.
>
> I feel better now, looks like my filesystem is safe...
>
> Torrey

So may I ask you to give 2.4.16 + preempt + lock-break (an additional patch
which does the same as Andrew's low-latency) a try?

Please play an MP3 or Ogg Vorbis file while running dbench. As you have a dual
PIII I am very interested. I will buy a dual Athlon XP/MP soon.

Thanks,
Dieter

> Andrew Morton wrote:
> [...]
>
> > > fall over, and during the run I got the following error/
> > > warning message printed about 20 times on the console
> > > and in the kernel log:
> > >
> > > vs-4150: reiserfs_new_blocknrs, block not free<4>
> >
> > uh-oh. I probably broke reiserfs in the low-latency patch.
> >
> > It's fairly harmless - we drop the big kernel lock, schedule
> > away. Upon resumption, the block we had decided to allocate
> > has been allocated by someone else. The filesystem emits a
> > warning and goes off to find a different block.
> >
> > Will fix.

2001-11-28 20:57:14

by Andreas Dilger

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16

On Nov 28, 2001 12:31 -0800, Andrew Morton wrote:
> write-cluster.patch
> ext2 metadata prereading and various other hacks which
> prevent writes from stumbling over reads, and thus ruining
> write clustering. This patch is in the early prototype stage

Shouldn't the ext2_inode_preread() code use "ll_rw_block(READ_AHEAD,...)"
just to be proper?

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-28 21:13:34

by Andrew Morton

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16

Andreas Dilger wrote:
>
> On Nov 28, 2001 12:31 -0800, Andrew Morton wrote:
> > write-cluster.patch
> > ext2 metadata prereading and various other hacks which
> > prevent writes from stumbling over reads, and thus ruining
> > write clustering. This patch is in the early prototype stage
>
> Shouldn't the ext2_inode_preread() code use "ll_rw_block(READ_AHEAD,...)"
> just to be proper?
>

Yes, especially now the request queues are shorter than they have
historically been. READA also needs to be propagated through the
pagecache readhead, which may prove tricky.

But so little code is actually using READA at this stage that I didn't
bother - I first need to go through those paths and make sure that they
are in fact complete, working and useful...
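
For reference, switching a metadata preread over to READA is roughly the
fragment below, with dev, block and blocksize coming from the caller. It is a
sketch using the stock 2.4 buffer-cache calls, not the actual
ext2_inode_preread() from the write-cluster patch:

        struct buffer_head *bh = getblk(dev, block, blocksize);

        /* READA lets the block layer silently drop the read when no
           request slot is free, instead of making the caller wait */
        if (!buffer_uptodate(bh))
                ll_rw_block(READA, 1, &bh);
        brelse(bh);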

-

2001-11-28 21:22:14

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16



On Wed, 28 Nov 2001, Andrew Morton wrote:

> Andreas Dilger wrote:
> >
> > On Nov 28, 2001 12:31 -0800, Andrew Morton wrote:
> > > write-cluster.patch
> > > ext2 metadata prereading and various other hacks which
> > > prevent writes from stumbling over reads, and thus ruining
> > > write clustering. This patch is in the early prototype stage
> >
> > Shouldn't the ext2_inode_preread() code use "ll_rw_block(READ_AHEAD,...)"
> > just to be proper?
> >
>
> Yes, especially now the request queues are shorter than they have
> historically been. READA also needs to be propagated through the
> pagecache readhead, which may prove tricky.
>
> But so little code is actually using READA at this stage that I didn't
> bother - I first need to go through those paths and make sure that they
> are in fact complete, working and useful...

I've done some experiments in the past which have shown that doing this
will cause us to almost _never_ do readahead on IO-intensive workloads,
which ended up decreasing performance instead of increasing it.

Please make sure to extensively test the propagation of READA through the
pagecache when you do so...

2001-11-28 21:28:04

by Andrew Morton

[permalink] [raw]
Subject: Re: Unresponiveness of 2.4.16

Marcelo Tosatti wrote:
>
> > ...
> > But so little code is actually using READA at this stage that I didn't
> > bother - I first need to go through those paths and make sure that they
> > are in fact complete, working and useful...
>
> I've done some experiments in the past which have shown that doing this
> will cause us to almost _never_ do readahead on IO-intensive workloads,
> which ended up decreasing performance instead of increasing it.

Interesting. Thanks.

One _could_ make the first readahead page non-READA, and then
make the rest READA. That way, all block-contiguous requests
will be merged, and any non-contiguous requests will be dropped on
the floor if the request queue is full. Which is probably what
we want to happen anyway.

Of course the alternative is to slot a little bmap() call into
the readhead logic :)
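
Continuing the schematic readahead loop from earlier, the "first page blocks,
the rest are opportunistic" idea would look something like this (the helper is
hypothetical):

        for (i = 0; i < got; i++) {
                /* the first page is a normal READ so it always goes in;
                   contiguous followers merge into it, and non-contiguous
                   READA pages are dropped if the request queue is full */
                int rw = (i == 0) ? READ : READA;

                submit_page_read_rw(mapping, file, pages[i], start + i, rw);
        }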

> Please make sure to extensively test the propagation of READA through the
> pagecache when you do so...

Extensivelytest is my middle name.

-