2001-10-24 10:42:21

by Zlatko Calusic

Subject: xmm2 - monitor Linux MM active/inactive lists graphically

New version is out and can be found at the same URL:

<URL:http://linux.inet.hr/>

As Linus' MM lost the inactive dirty/clean lists in favour of just one
inactive list, the application needed to be modified to support that.

You can still use the older version for kernels <= 2.4.9 and/or Alan's
(-ac) kernels, which continue to use Rik's older VM system.

Enjoy and, as usual, all comments welcome!
--
Zlatko

P.S. BTW, 2.4.13 still has very suboptimal writeout performance and
[email protected] is redirected to /dev/null. <g>


2001-10-24 15:47:43

by Marcelo Tosatti

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically



On 24 Oct 2001, Zlatko Calusic wrote:

> P.S. BTW, 2.4.13 still has very suboptimal writeout performance and
> [email protected] is redirected to /dev/null. <g>

Zlatko,

Could you please show us your case of bad writeout performance?

Thanks

2001-10-25 00:26:00

by Zlatko Calusic

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

Marcelo Tosatti <[email protected]> writes:

> On 24 Oct 2001, Zlatko Calusic wrote:
>
> > P.S. BTW, 2.4.13 still has very suboptimal writeout performance and
> > [email protected] is redirected to /dev/null. <g>
>
> Zlatko,
>
> Could you please show us your case of bad writeout performance?
>
> Thanks
>

Sure. Output of 'vmstat 1' follows:

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 254552 5120 183476 0 0 12 24 178 438 2 37 60
0 1 0 0 137296 5232 297760 0 0 4 5284 195 440 3 43 54
1 0 0 0 126520 5244 308260 0 0 0 10588 215 230 0 3 96
0 2 0 0 117488 5252 317064 0 0 0 8796 176 139 1 3 96
0 2 0 0 107556 5264 326744 0 0 0 9704 174 78 0 3 97
0 2 0 0 99552 5268 334548 0 0 0 7880 174 67 0 3 97
0 2 0 0 89448 5280 344392 0 0 0 9804 175 76 0 4 96
0 1 0 0 79352 5288 354236 0 0 0 9852 176 87 0 5 95
0 1 0 0 71220 5300 362156 0 0 4 7884 170 120 0 4 96
0 1 0 0 63088 5308 370084 0 0 0 7936 174 76 0 3 97
0 2 0 0 52988 5320 379924 0 0 0 9920 175 77 0 4 96
0 2 0 0 43148 5328 389516 0 0 0 9548 174 97 0 4 95
0 2 0 0 35144 5336 397316 0 0 0 7820 176 73 0 3 97
0 2 0 0 25172 5344 407036 0 0 0 9724 188 183 0 4 96
0 2 1 0 17300 5352 414708 0 0 0 7744 174 78 0 4 96
0 1 0 0 7068 5360 424684 0 0 0 9920 175 93 0 3 97
0 1 0 0 3128 4132 430132 0 0 0 9920 174 81 0 4 96

Notice how there's plenty of RAM. I'm writing sequentially to a file
on the ext2 filesystem. The disk I'm writing on is a 7200rpm IDE,
capable of ~ 22 MB/s and I'm still getting only ~ 9 MB/s. Weird!
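
For reference, the test here is just a big streaming write; a minimal
stand-alone equivalent of the dd-style measurement might look like the
sketch below (the file name, size and block size are arbitrary - this only
illustrates what is being measured, it is not the exact command used in
this thread):

#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK (1024 * 1024)	/* 1 MB per write(), like dd bs=1024k */
#define COUNT 600		/* ~600 MB total */

int main(void)
{
	static char buf[BLOCK];		/* zero-filled data block */
	struct timeval t0, t1;
	double secs;
	int fd, i;

	fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	gettimeofday(&t0, NULL);
	for (i = 0; i < COUNT; i++)
		if (write(fd, buf, BLOCK) != BLOCK) {
			perror("write");
			return 1;
		}
	fsync(fd);			/* include the flush, like a final sync */
	gettimeofday(&t1, NULL);
	close(fd);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("Wrote %d MB in %.1f seconds -> %.2f MB/s\n",
	       COUNT, secs, COUNT / secs);
	return 0;
}
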
--
Zlatko

2001-10-25 01:50:21

by Simon Kirby

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

On Thu, Oct 25, 2001 at 02:25:45AM +0200, Zlatko Calusic wrote:

> Sure. Output of 'vmstat 1' follows:
>...
> 0 2 0 0 43148 5328 389516 0 0 0 9548 174 97 0 4 95
> 0 2 0 0 35144 5336 397316 0 0 0 7820 176 73 0 3 97
> 0 2 0 0 25172 5344 407036 0 0 0 9724 188 183 0 4 96
> 0 2 1 0 17300 5352 414708 0 0 0 7744 174 78 0 4 96
>...
> Notice how there's plenty of RAM. I'm writing sequentially to a file
> on the ext2 filesystem. The disk I'm writing on is a 7200rpm IDE,
> capable of ~ 22 MB/s and I'm still getting only ~ 9 MB/s. Weird!

Same here. But hey, at least it doesn't swap now! :)

Also, dd if=/dev/zero of=blah bs=1024k seems to totally kill everything
else on my box until I ^C it.

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-10-25 04:21:43

by Linus Torvalds

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically


On 25 Oct 2001, Zlatko Calusic wrote:
>
> Sure. Output of 'vmstat 1' follows:
>
> 1 0 0 0 254552 5120 183476 0 0 12 24 178 438 2 37 60
> 0 1 0 0 137296 5232 297760 0 0 4 5284 195 440 3 43 54
> 1 0 0 0 126520 5244 308260 0 0 0 10588 215 230 0 3 96
> 0 2 0 0 117488 5252 317064 0 0 0 8796 176 139 1 3 96
> 0 2 0 0 107556 5264 326744 0 0 0 9704 174 78 0 3 97

This does not look like a VM issue at all - at this point you're already
getting only 10MB/s, yet the VM isn't even involved (there's definitely no
VM pressure here).

> Notice how there's plenty of RAM. I'm writing sequentially to a file
> on the ext2 filesystem. The disk I'm writing on is a 7200rpm IDE,
> capable of ~ 22 MB/s and I'm still getting only ~ 9 MB/s. Weird!

Are you sure you haven't lost some DMA setting or something?

Linus

2001-10-25 04:58:39

by Linus Torvalds

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically


On Wed, 24 Oct 2001, Linus Torvalds wrote:
>
> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Sure. Output of 'vmstat 1' follows:
> >
> > 1 0 0 0 254552 5120 183476 0 0 12 24 178 438 2 37 60
> > 0 1 0 0 137296 5232 297760 0 0 4 5284 195 440 3 43 54
> > 1 0 0 0 126520 5244 308260 0 0 0 10588 215 230 0 3 96
> > 0 2 0 0 117488 5252 317064 0 0 0 8796 176 139 1 3 96
> > 0 2 0 0 107556 5264 326744 0 0 0 9704 174 78 0 3 97
>
> This does not look like a VM issue at all - at this point you're already
> getting only 10MB/s, yet the VM isn't even involved (there's definitely no
> VM pressure here).

I wonder if you're getting screwed by bdflush().. You do have a lot of
context switching going on, and you do have a clear pattern: once the
write-out gets going, you're filling new cached pages at about the same
pace that you're writing them out, which definitely means that the dirty
buffer balancing is nice and active.

So the problem is that you're obviously not actually getting the
throughput you should - it's not the VM, as the page cache grows nicely at
the same rate you're writing.

Try something for me: in fs/buffer.c make "balance_dirty_state()" never
return > 0, ie make the "return 1" be a "return 0" instead.
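
Spelled out, that experiment is just a one-liner; the fragment below is a
from-memory sketch of the relevant spot in balance_dirty_state(), so the
exact surrounding code and identifier names in 2.4.13 may differ a bit:

	if (dirty > soft_dirty_limit) {
		if (dirty > hard_dirty_limit)
			return 0;	/* was: return 1, which makes the
					 * caller wait and wake bdflush */
		return 0;
	}
	return -1;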

That will cause us to not wake up bdflush at all, and if you're just on
the "border" of 40% dirty buffer usage you'll have bdflush work in
lock-step with you, alternately writing out buffers and waiting for them.

Quite frankly, just the act of doing the "write_some_buffers()" in
balance_dirty() should cause us to block much better than the synchronous
waiting anyway, because then we will block when the request queue fills
up, not at random points.

Even so, considering that you have such a steady 9-10MB/s, please double-
check that it's not something even simpler and embarrassing, like just
having forgotten to enable auto-DMA in the kernel config ;)

Linus

2001-10-25 09:08:17

by Zlatko Calusic

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

Linus Torvalds <[email protected]> writes:

> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Sure. Output of 'vmstat 1' follows:
> >
> > 1 0 0 0 254552 5120 183476 0 0 12 24 178 438 2 37 60
> > 0 1 0 0 137296 5232 297760 0 0 4 5284 195 440 3 43 54
> > 1 0 0 0 126520 5244 308260 0 0 0 10588 215 230 0 3 96
> > 0 2 0 0 117488 5252 317064 0 0 0 8796 176 139 1 3 96
> > 0 2 0 0 107556 5264 326744 0 0 0 9704 174 78 0 3 97
>
> This does not look like a VM issue at all - at this point you're already
> getting only 10MB/s, yet the VM isn't even involved (there's definitely no
> VM pressure here).

That's true, I'll admit. Anyway, -ac kernels don't have the problem,
and I was misled by the fact that only the VM implementation differs
between those two branches (at least I think so).

>
> > Notice how there's plenty of RAM. I'm writing sequentially to a file
> > on the ext2 filesystem. The disk I'm writing on is a 7200rpm IDE,
> > capable of ~ 22 MB/s and I'm still getting only ~ 9 MB/s. Weird!
>
> Are you sure you haven't lost some DMA setting or something?
>

No. Setup is fine. I wouldn't make such a mistake. :)
If the disk were in some PIO mode, CPU usage would be much higher, but
it isn't.

This all definitely looks like a problem either in the bdflush daemon
or in the request queue/elevator, but unfortunately I don't have enough
knowledge of those areas to pinpoint it more precisely.
--
Zlatko

2001-10-25 12:48:33

by Zlatko Calusic

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

Linus Torvalds <[email protected]> writes:

> I wonder if you're getting screwed by bdflush().. You do have a lot of
> context switching going on, and you do have a clear pattern: once the
> write-out gets going, you're filling new cached pages at about the same
> pace that you're writing them out, which definitely means that the dirty
> buffer balancing is nice and active.
>

Yes, but things are similar when I finally fill the whole memory and
kswapd kicks in. Everything behaves in the same way, so it is
definitely not the VM, as you pointed out.

> So the problem is that you're obviously not actually getting the
> throughput you should - it's not the VM, as the page cache grows nicely at
> the same rate you're writing.
>

Yes.

> Try something for me: in fs/buffer.c make "balance_dirty_state()" never
> return > 0, ie make the "return 1" be a "return 0" instead.
>

Sure. I recompiled a fresh 2.4.13 at work and reran the tests. This time
on a different setup, so the numbers are even smaller (the tests were
performed on the last partition of the disk, where the disk is capable
of ~ 13 MB/s).


procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 6308 600 441592 0 0 0 7788 159 132 0 7 93
0 1 0 0 3692 580 444272 0 0 0 5748 169 197 1 4 95
0 1 0 0 3180 556 444804 0 0 0 5632 228 408 1 5 94
0 1 0 0 3720 556 444284 0 0 0 7672 226 418 3 4 93
0 1 0 0 3836 556 444148 0 0 0 5928 249 509 0 8 92
0 1 0 0 3204 388 444952 0 0 0 7828 156 139 0 6 94
1 1 0 0 3456 392 444692 0 0 0 5952 157 139 0 5 95
0 1 0 0 3728 400 444428 0 0 0 7840 312 750 0 7 93
0 1 0 0 3968 404 444168 0 0 0 5952 216 364 0 5 95


> That will cause us to not wake up bdflush at all, and if you're just on
> the "border" of 40% dirty buffer usage you'll have bdflush work in
> lock-step with you, alternately writing out buffers and waiting for them.
>
> Quite frankly, just the act of doing the "write_some_buffers()" in
> balance_dirty() should cause us to block much better than the synchronous
> waiting anyway, because then we will block when the request queue fills
> up, not at random points.
>
> Even so, considering that you have such a steady 9-10MB/s, please double-
> check that it's not something even simpler and embarrassing, like just
> having forgotten to enable auto-DMA in the kernel config ;)
>

Yes, I definitely have DMA turned ON. All parameters are OK. :)

# hdparm /dev/hda

/dev/hda:
multcount = 16 (on)
I/O support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 1650/255/63, sectors = 26520480, start = 0

--
Zlatko

2001-10-25 16:31:24

by Linus Torvalds

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically



On 25 Oct 2001, Zlatko Calusic wrote:
>
> Yes, I definitely have DMA turned ON. All parameters are OK. :)

I suspect it may just be that "queue_nr_requests"/"batch_count" is
different in -ac: what happens if you tweak them to the same values?

(See drivers/block/ll_rw_block.c)

I think -ac made the queues a bit deeper: the regular kernel does 128
requests and a batch-count of 16, while I _think_ -ac does something like "2
requests per megabyte" and batch_count=32, so if you have 512MB you should
try with

queue_nr_requests = 1024
batch_count = 32

Does that help?

Linus

2001-10-25 17:34:17

by Jens Axboe

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

On Thu, Oct 25 2001, Linus Torvalds wrote:
>
> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Yes, I definitely have DMA turned ON. All parameters are OK. :)
>
> I suspect it may just be that "queue_nr_requests"/"batch_count" is
> different in -ac: what happens if you tweak them to the same values?
>
> (See drivers/block/ll_rw_block.c)
>
> I think -ac made the queues a bit deeper: the regular kernel does 128
> requests and a batch-count of 16, while I _think_ -ac does something like "2
> requests per megabyte" and batch_count=32, so if you have 512MB you should
> try with
>
> queue_nr_requests = 1024
> batch_count = 32

Right, -ac keeps the elevator flow control and proper queue sizes.

--
Jens Axboe

2001-10-26 09:46:01

by Zlatko Calusic

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

Linus Torvalds <[email protected]> writes:

> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Yes, I definitely have DMA turned ON. All parameters are OK. :)
>
> I suspect it may just be that "queue_nr_requests"/"batch_count" is
> different in -ac: what happens if you tweak them to the same values?
>
> (See drivers/block/ll_rw_block.c)
>
> I think -ac made the queues a bit deeper: the regular kernel does 128
> requests and a batch-count of 16, while I _think_ -ac does something like "2
> requests per megabyte" and batch_count=32, so if you have 512MB you should
> try with
>
> queue_nr_requests = 1024
> batch_count = 32
>
> Does that help?
>

Unfortunately not. It makes the machine quite unresponsive while it's
writing to disk, and vmstat 1 reveals strange "spiky" behaviour. Average
throughput is ~ 8 MB/s (the disk is capable of ~ 13 MB/s).

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
2 0 0 0 3840 528 441900 0 0 0 34816 188 594 2 34 64
0 1 0 0 3332 536 442384 0 0 4 10624 187 519 2 8 90
0 1 0 0 3324 536 442384 0 0 0 0 182 499 0 0 100
2 1 0 0 3300 536 442384 0 0 0 0 198 486 0 1 99
1 1 0 0 3304 536 442384 0 0 0 0 186 513 0 0 100
0 1 1 0 3304 536 442384 0 0 0 0 193 473 0 1 99
0 1 1 0 3304 536 442384 0 0 0 0 191 508 1 1 98
0 1 0 0 3884 536 441840 0 0 4 44672 189 590 4 40 56
0 1 0 0 3860 536 441840 0 0 0 0 186 526 0 1 99
0 1 0 0 3852 536 441840 0 0 0 0 191 500 0 0 100
0 1 0 0 3844 536 441840 0 0 0 0 193 482 1 0 99
0 1 0 0 3844 536 441840 0 0 0 0 187 511 0 1 99
0 2 1 0 3832 540 441844 0 0 4 0 305 1004 3 2 95
0 3 1 0 3824 544 441844 0 0 4 0 410 1340 2 2 96
0 3 0 0 3764 552 441916 0 0 12 47360 346 915 6 41 53
0 3 0 0 3764 552 441916 0 0 0 0 373 887 0 0 100
0 3 0 0 3764 552 441916 0 0 0 0 278 692 1 2 97
1 3 0 0 3764 552 441916 0 0 0 0 221 579 0 3 97
0 3 0 0 3764 552 441916 0 0 0 0 286 704 0 2 98

I'll now test "batch_count = queue_nr_requests / 3", which I found in
2.4.14-pre2, but with queue_nr_requests still left at 1024, and report
the results after that.
--
Zlatko

2001-10-26 10:08:31

by Zlatko Calusic

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

Linus Torvalds <[email protected]> writes:

> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Yes, I definitely have DMA turned ON. All parameters are OK. :)
>
> I suspect it may just be that "queue_nr_requests"/"batch_count" is
> different in -ac: what happens if you tweak them to the same values?
>

Next test:

block: 1024 slots per queue, batch=341

Wrote 600.00 MB in 71 seconds -> 8.39 MB/s (7.5 %CPU)

Still very spiky, and during the write the disk is incapable of doing any
reads. IOW, no serious application can be started before the writing has
finished. Shouldn't we favour reads over writes? Or is it just that
the elevator is not doing its job right, so reads suffer?


procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 1 1 0 3600 424 453416 0 0 0 0 190 510 2 1 97
0 1 1 0 3596 424 453416 0 0 0 40468 189 508 2 2 96
0 1 1 0 3592 424 453416 0 0 0 0 189 541 1 0 99
0 1 1 0 3592 424 453416 0 0 0 0 190 513 1 0 99
1 1 1 0 3592 424 453416 0 0 0 0 192 511 0 1 99
0 1 1 0 3596 424 453416 0 0 0 0 188 528 0 0 100
0 1 1 0 3592 424 453416 0 0 0 0 188 510 1 0 99
0 1 1 0 3592 424 453416 0 0 0 41444 195 507 0 2 98
0 1 1 0 3592 424 453416 0 0 0 0 190 514 1 1 98
1 1 1 0 3588 424 453416 0 0 0 0 192 554 0 2 98
0 1 1 0 3584 424 453416 0 0 0 0 191 506 0 1 99
0 1 1 0 3584 424 453416 0 0 0 0 186 514 0 0 100
0 1 1 0 3584 424 453416 0 0 0 0 186 515 0 0 100
1 1 1 0 3576 424 453416 0 0 0 0 434 1493 3 2 95
1 1 1 0 3564 424 453416 0 0 0 40560 301 936 3 1 96
0 1 1 0 3564 424 453416 0 0 0 0 338 1050 1 2 97
0 1 1 0 3560 424 453416 0 0 0 0 286 893 1 2 97

--
Zlatko

2001-10-26 14:39:43

by Jens Axboe

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

On Fri, Oct 26 2001, Zlatko Calusic wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On 25 Oct 2001, Zlatko Calusic wrote:
> > >
> > > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> >
> > I suspect it may just be that "queue_nr_requests"/"batch_count" is
> > different in -ac: what happens if you tweak them to the same values?
> >
>
> Next test:
>
> block: 1024 slots per queue, batch=341

That's way too much, batch should just stay around 32, that is fine.

> Still very spiky, and during the write the disk is incapable of doing any
> reads. IOW, no serious application can be started before the writing has
> finished. Shouldn't we favour reads over writes? Or is it just that
> the elevator is not doing its job right, so reads suffer?

You are probably just seeing starvation due to the very long queues.

--
Jens Axboe

2001-10-26 14:57:23

by Zlatko Calusic

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

Jens Axboe <[email protected]> writes:

> On Fri, Oct 26 2001, Zlatko Calusic wrote:
> > Linus Torvalds <[email protected]> writes:
> >
> > > On 25 Oct 2001, Zlatko Calusic wrote:
> > > >
> > > > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> > >
> > > I suspect it may just be that "queue_nr_requests"/"batch_count" is
> > > different in -ac: what happens if you tweak them to the same values?
> > >
> >
> > Next test:
> >
> > block: 1024 slots per queue, batch=341
>
> That's way too much, batch should just stay around 32, that is fine.

OK. Anyway, neither configuration works well, so the problem might be
somewhere else.

While at it, could you give a short explanation of those two parameters?

>
> > Still very spiky, and during the write the disk is incapable of doing any
> > reads. IOW, no serious application can be started before the writing has
> > finished. Shouldn't we favour reads over writes? Or is it just that
> > the elevator is not doing its job right, so reads suffer?
>
> You are probably just seeing starvation due to the very long queues.
>

Is there anything we could do about that? I remember Linux once
favoured reads, but I'm not sure if we still do that these days.

When I find some time, I'll dig around that code. It is a very
interesting part of the kernel, I'm sure; I just haven't had enough
time so far to spend hacking on that part.
--
Zlatko

2001-10-26 15:01:43

by Jens Axboe

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

On Fri, Oct 26 2001, Zlatko Calusic wrote:
> > On Fri, Oct 26 2001, Zlatko Calusic wrote:
> > > Linus Torvalds <[email protected]> writes:
> > >
> > > > On 25 Oct 2001, Zlatko Calusic wrote:
> > > > >
> > > > > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> > > >
> > > > I suspect it may just be that "queue_nr_requests"/"batch_count" is
> > > > different in -ac: what happens if you tweak them to the same values?
> > > >
> > >
> > > Next test:
> > >
> > > block: 1024 slots per queue, batch=341
> >
> > That's way too much, batch should just stay around 32, that is fine.
>
> OK. Anyway, neither configuration works well, so the problem might be
> somewhere else.

Most likely, yes.

> While at it, could you give a short explanation of those two parameters?

Sure. queue_nr_requests is the total number of free request slots per
queue; there are queue_nr_requests / 2 free slots each for READ and WRITE.
Each request can hold anywhere from the fs block size up to 127kB of data
by default. 'batch' only matters once the request free list has been
depleted. In order to give the elevator some input to work with, we free
request slots in batches of 'batch' to get decent merging etc. That's
why numbers bigger than ~32 would not be such a good idea and would only
add latency.

> > > Still very spiky, and during the write the disk is incapable of doing any
> > > reads. IOW, no serious application can be started before the writing has
> > > finished. Shouldn't we favour reads over writes? Or is it just that
> > > the elevator is not doing its job right, so reads suffer?
> >
> > You are probably just seeing starvation due to the very long queues.
> >
>
> Is there anything we could do about that? I remember Linux once
> favoured reads, but I'm not sure if we still do that these days.

It still favors reads; take a look at the initial sequence numbers given
to reads and writes. We used to favor reads in the request slots too --
you could try and change the blk_init_freelist split so that you get a
1/3 - 2/3 ratio between WRITEs and READs and see if that makes the
system smoother.
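
Just to make the suggested split concrete, here is a tiny stand-alone
illustration (plain user-space C, not the real block-layer code; the slot
count is only an example value):

#include <stdio.h>

int main(void)
{
	int queue_nr_requests = 128;		/* example queue depth */
	int i, nread = 0, nwrite = 0;

	/*
	 * 1/3 - 2/3 experiment: hand only the first third of the slots
	 * to WRITE and the rest to READ (the stock split is half/half).
	 */
	for (i = 0; i < queue_nr_requests; i++) {
		if (i < queue_nr_requests / 3)
			nwrite++;
		else
			nread++;
	}
	printf("%d slots: %d for READ, %d for WRITE\n",
	       queue_nr_requests, nread, nwrite);
	return 0;
}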

> When I find some time, I'll dig around that code. It is a very
> interesting part of the kernel, I'm sure; I just haven't had enough
> time so far to spend hacking on that part.

Indeed it is.

--
Jens Axboe

2001-10-26 16:05:50

by Linus Torvalds

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically


On 26 Oct 2001, Zlatko Calusic wrote:
>
> OK. Anyway, neither configuration works well, so the problem might be
> somewhere else.
>
> While at it, could you give a short explanation of those two parameters?

Did you try the ones 2.4.14-2 does?

Basically, the "queue_nr_requests" means how many requests there can be
for this queue. Half of them are allocated to reads, half of them are
allocated to writes.

The "batch_requests" thing is something that kicks in when the queue has
emptied - we don't want to "trickle" requests to users, because if we do,
a new large write will not be able to merge its new requests sanely,
because it basically has to do them one at a time. So when we run out of
requests (ie "queue_nr_requests" isn't enough), we start putting the
freed-up requests on a "pending" list, and we release them only when the
pending list is bigger than "batch_requests".

Now, one thing to remember is that "queue_nr_requests" is for the whole
queue (half of them for reads, half for writes), and "batch_requests" is a
per-type thing (ie we batch reads and writes separately). So
"batch_requests" must be less than half of "queue_nr_requests", or we will
never release anything at all.

Now, in Alan's tree, there is a separate tuning thing, which is the "max
nr of _sectors_ in flight", which in my opinion is pretty bogus. It's
really a memory-management thing, but it also does something else: it has
low-and-high water-marks, and those might well be a good idea. It is
possible that we should just ditch the "batch_requests" thing, and use the
watermarks instead.

Side note: all of this is relevant really only for writes - reads pretty
much only care about the maximum queue-size, and it's very hard to get a
_huge_ queue-size with reads unless you do tons of read-ahead.

Now, the "batching" is technically equivalent to water-marking if there
is _one_ writer. But if there are multiple writers, water-marking may
actually have some advantages: it might allow the other writer to make some
progress when the first one has stopped, while the batching will stop
everybody until the batch is released. Who knows.
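
To make that difference concrete, here is a tiny user-space toy model of
the two release policies (this is not kernel code; the names, the BATCH
value and the scenario are made up purely for illustration). Writer A is
the streaming writer that filled the write half of the queue and went to
sleep; writer B is another process that wants just one request:

#include <stdio.h>

#define BATCH 32   /* batch size / water mark (illustrative value) */

int main(void)
{
	int free_batch = 0, pending = 0;   /* batching scheme  */
	int free_mark = 0;                 /* watermark scheme */
	int b_batch = 0, b_mark = 0;       /* completion at which B gets a slot */
	int a_batch = 0, a_mark = 0;       /* completion at which A may continue */
	int done;

	for (done = 1; done <= BATCH + 1; done++) {
		/* one more in-flight write request completes */
		pending++;
		if (pending >= BATCH) {        /* batching: release a whole batch */
			free_batch += pending;
			pending = 0;
		}
		free_mark++;                   /* watermarks: released immediately */

		/* writer B only wants a single request */
		if (!b_batch && free_batch > 0)
			b_batch = done;
		if (!b_mark && free_mark > 0) {
			b_mark = done;
			free_mark--;           /* B takes one slot */
		}

		/* writer A sleeps until a full batch / the water mark */
		if (!a_batch && free_batch >= BATCH)
			a_batch = done;
		if (!a_mark && free_mark >= BATCH)
			a_mark = done;
	}

	printf("batching  : B runs after %2d completions, A after %2d\n",
	       b_batch, a_batch);
	printf("watermarks: B runs after %2d completions, A after %2d\n",
	       b_mark, a_mark);
	return 0;
}

Under batching B cannot get a request until a whole batch of completions
has accumulated; under watermarking B gets one after the very first
completion, while A still waits for the mark - which is the fairness
difference described above.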

Anyway, the reason I think Alan's "max nr of sectors" is bogus is because:

- it's a global count, and if you have 10 controllers and want to write
to all 10, you _should_ be able to - you can write 10 times as many
requests in the same latency, so there is nothing "global" about it.

(It turns out that one advantage of the globalism is that it ends up
limiting MM write-outs, but I personally think that is a _MM_ thing, ie
we might want to have a "we have half of all our pages in flight, we
have to throttle now" thing in "writepage()", not in the queue)

- "nr of sectors" has very little to do with request latency on most
hardware. You can do 255 sectors (ie one request) almost as fast as you
can do just one, if you do them in one request. While just _two_
sectors might be much slower than the 255, if they are in separate
requests and cause seeking.

So from a latency standpoint, the "request" is a much better number.

So Alan almost never throttles on requests (on big machines, the -ac tree
allows thousands of requests in flight per queue), while he _does_ have
this water-marking for sectors.

So I have two suspicions:

- 128 requests (ie 64 for writes) like the default kernel should be
_plenty_ enough to keep the disks busy, especially for streaming
writes. It's small enough that you don't get the absolutely _huge_
spikes you get with thousands of requests, while being large enough for
fast writers that even if they _do_ block for 32 of the 64 requests,
they'll have time to refill the next 32 long before the 32 pending ones
have finished.

Also: limiting the write queue to 128 requests means that you can
pretty much guarantee that you can get at least a few read requests
per second, even if the write queue is constantly full, and even if
your reader is serialized.

BUT:

- the hard "batch" count is too harsh. It works as a watermark in the
degenerate case, but doesn't allow a second writer to use up _some_ of
the requests while the first writer is blocked due to watermarking.

So with batching, when the queue is full and another process wants
memory, that _OTHER_ process will also always block until the queue has
emptied.

With watermarks, when the writer has filled up the queue and starts
waiting, other processes can still do some writing as long as they
don't fill up the queue again. So if you have MM pressure but the
writer is blocked (and some requests _have_ completed, but the writer
waits for the low-water-mark), you can still push out requests.

That's also likely to be a lot more fair - batching tends to give the
whole batch to the big writer, while watermarking automatically allows
others to get a look at the queue.

I'll whip up a patch for testing (2.4.14-2 made the batching slightly
saner, but the same "hard" behaviour is pretty much unavoidable with
batching)

Linus

2001-10-26 17:20:49

by Linus Torvalds

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically


On Fri, 26 Oct 2001, Linus Torvalds wrote:
>
> Attached is a very untested patch (but hey, it compiles, so it must work,
> right?)

And it actually does seem to.

Zlatko, does this make a difference for your disk?

Linus

2001-10-26 16:58:47

by Linus Torvalds

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically


On 26 Oct 2001, Zlatko Calusic wrote:
>
> When I find some time, I'll dig around that code. It is a very
> interesting part of the kernel, I'm sure; I just haven't had enough
> time so far to spend hacking on that part.

Attached is a very untested patch (but hey, it compiles, so it must work,
right?) against 2.4.14-pre2, that makes the batching be a high/low
watermark thing instead. It actually simplified the code, but that is, of
course, assuming that it works at all ;)

(If I got the comparisons wrong, or if I update the counts wrong, your IO
queue will probably stop cold. So be careful. The code is obvious
enough, but typos and thinkos happen).

Linus


Attachments:
p2p3 (5.37 kB)

2001-10-27 13:28:02

by Giuliano Pochini

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically


> block: 1024 slots per queue, batch=341
>
> Wrote 600.00 MB in 71 seconds -> 8.39 MB/s (7.5 %CPU)
>
> Still very spiky, and during the write the disk is incapable of doing any
> reads. IOW, no serious application can be started before the writing has
> finished. Shouldn't we favour reads over writes? Or is it just that
> the elevator is not doing its job right, so reads suffer?
>
> procs memory swap io system cpu
> r b w swpd free buff cache si so bi bo in cs us sy id
> 0 1 1 0 3596 424 453416 0 0 0 40468 189 508 2 2 96

341*127K = ~40M.

Batch is too high. It doesn't explain why reads get delayed so much, anyway.

Bye.

2001-10-28 05:05:53

by Mike Fedyk

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

On Sat, Oct 27, 2001 at 03:14:44PM +0200, Giuliano Pochini wrote:
>
> > block: 1024 slots per queue, batch=341
> >
> > Wrote 600.00 MB in 71 seconds -> 8.39 MB/s (7.5 %CPU)
> >
> > Still very spiky, and during the write the disk is incapable of doing any
> > reads. IOW, no serious application can be started before the writing has
> > finished. Shouldn't we favour reads over writes? Or is it just that
> > the elevator is not doing its job right, so reads suffer?
> >
> > procs memory swap io system cpu
> > r b w swpd free buff cache si so bi bo in cs us sy id
> > 0 1 1 0 3596 424 453416 0 0 0 40468 189 508 2 2 96
>
> 341*127K = ~40M.
>
> Batch is too high. It doesn't explain why reads get delayed so much, anyway.
>

Try modifying the elevator queue length with elvtune.

BTW, 2.2.19 has the queue lengths in the hundreds, and 2.4.xx has them in
the thousands. I've set 2.4 kernels back to the 2.2 defaults, and
interactive performance has gone up considerably. These are subjective
tests, though.

Mike

2001-10-28 17:30:18

by Zlatko Calusic

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

Linus Torvalds <[email protected]> writes:

> On Fri, 26 Oct 2001, Linus Torvalds wrote:
> >
> > Attached is a very untested patch (but hey, it compiles, so it must work,
> > right?)
>
> And it actually does seem to.
>
> Zlatko, does this make a difference for your disk?
>

First, sorry for such a delay in answering, I was busy.

I compiled 2.4.14-pre3 as it seems to be identical to your p2p3 patch,
with regard to queue processing.

Unfortunately, things didn't change on my first disk (IBM 7200rpm
@home). I'm still getting low numbers, check the vmstat output at the
end of the email.

But now I found something interesting: the other two disks, which are on
the standard IDE controller, work correctly (writing is at 17-22
MB/sec). The disk which doesn't work well is on the HPT366 interface,
so that may be our culprit. Now I got the idea to go back through the
patches to see where it started behaving poorly.

Also, one more thing, I'm pretty sure that under strange circumstances
(specific alignment of stars) it behaves well (with appropriate
writing speed). I just haven't yet pinpointed what needs to be done to
get to that point.

I know I haven't supplied you with a lot of information, but I'll keep
investigating until I have some more solid data on the problem.

BTW, thank you and Jens for the nice explanation of the numbers, very
good reading.

0 2 0 13208 2924 516 450716 0 0 0 11808 179 113 0 6 93
0 1 0 13208 2656 524 450964 0 0 0 8432 174 86 1 6 93
0 1 0 13208 3676 532 449924 0 0 0 8432 174 91 1 4 95
0 1 0 13208 3400 540 450172 0 0 0 8432 231 343 1 4 94
0 2 0 13208 3520 548 450036 0 0 0 8440 180 179 2 5 93
0 1 0 20216 3544 728 456976 32 0 32 8432 175 94 0 4 95
0 2 0 20212 3280 728 457232 0 0 0 8440 174 88 0 5 95
0 2 0 20208 3032 728 457480 0 0 0 8364 174 84 1 4 95
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 2 0 20208 3412 732 457092 0 0 0 6964 175 111 0 4 96
0 2 0 20208 3272 728 457224 0 0 0 1216 207 89 0 1 99
0 2 0 20208 3164 728 457352 0 0 0 1300 256 77 1 2 97
0 2 1 20208 2928 732 457604 0 0 0 1444 283 77 1 0 99
0 2 1 20208 2764 732 457732 0 0 0 1316 278 73 1 1 98
0 2 1 20208 3420 728 457096 0 0 0 1652 273 117 0 1 99
0 2 1 20208 3180 732 457348 0 0 0 1404 240 90 0 0 99
0 2 1 20208 3696 728 456840 0 0 0 1784 247 80 0 1 98
0 2 1 20204 3432 728 457096 0 0 0 1404 237 77 1 0 99
0 2 1 20204 2896 732 457604 0 0 0 1672 255 77 1 1 98
0 1 0 20204 3284 728 457224 0 0 0 1976 257 112 0 2 98
0 1 0 20204 2772 728 457736 0 0 0 7628 260 100 0 4 96
0 1 0 20204 3540 728 456968 0 0 0 8492 178 83 1 4 95
0 2 0 20204 3584 736 456916 0 0 4 4848 175 88 0 2 97

Regards,
--
Zlatko

2001-10-28 17:36:08

by Linus Torvalds

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically


On 28 Oct 2001, Zlatko Calusic wrote:
>
> But now I found something interesting: the other two disks, which are on
> the standard IDE controller, work correctly (writing is at 17-22
> MB/sec). The disk which doesn't work well is on the HPT366 interface,
> so that may be our culprit. Now I got the idea to go back through the
> patches to see where it started behaving poorly.

Ok. That _is_ indeed a big clue.

Do the -ac patches have any hpt366-specific stuff? Although I suspect
you're right, and that it's just the driver (or controller itself) being
very, very sensitive to some random alignment of the stars, rather than
anything real in the code itself.

> 0 2 0 13208 2924 516 450716 0 0 0 11808 179 113 0 6 93
> 0 1 0 13208 2656 524 450964 0 0 0 8432 174 86 1 6 93
> 0 1 0 13208 3676 532 449924 0 0 0 8432 174 91 1 4 95
> 0 1 0 13208 3400 540 450172 0 0 0 8432 231 343 1 4 94
> 0 2 0 13208 3520 548 450036 0 0 0 8440 180 179 2 5 93
> 0 1 0 20216 3544 728 456976 32 0 32 8432 175 94 0 4 95
> 0 2 0 20212 3280 728 457232 0 0 0 8440 174 88 0 5 95
> 0 2 0 20208 3032 728 457480 0 0 0 8364 174 84 1 4 95
> procs memory swap io system cpu
> r b w swpd free buff cache si so bi bo in cs us sy id
> 0 2 0 20208 3412 732 457092 0 0 0 6964 175 111 0 4 96
> 0 2 0 20208 3272 728 457224 0 0 0 1216 207 89 0 1 99
> 0 2 0 20208 3164 728 457352 0 0 0 1300 256 77 1 2 97
> 0 2 1 20208 2928 732 457604 0 0 0 1444 283 77 1 0 99
> 0 2 1 20208 2764 732 457732 0 0 0 1316 278 73 1 1 98

So it actually slows down to just 1.5MB/s at times? That's just
disgusting. I wonder what the driver is doing..

Linus

2001-10-28 17:42:28

by Alan

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

> Do the -ac patches have any hpt366-specific stuff? Although I suspect
> you're right, and that it's just the driver (or controller itself) being

The IDE code matches between the two. It isn't a driver change.


Alan

2001-10-28 18:01:05

by Linus Torvalds

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically


On Sun, 28 Oct 2001, Alan Cox wrote:
>
> > Do the -ac patches have any hpt366-specific stuff? Although I suspect
> > you're right, and that it's just the driver (or controller itself) being
>
> The IDE code matches between the two. It isn't a driver change.

It might, of course, just be timing, but that sounds like a bit _too_ easy
an explanation. Even if it could easily be true.

The fact that -ac gets higher speeds, and -ac has a very different
request watermark strategy makes me suspect that that might be the cause.

In particular, the standard kernel _requires_ that in order to get good
performance you can merge many bh's onto one request. That's a very
reasonable assumption: it basically says that any high-performance driver
has to accept merging, because that in turn is required for the elevator
overhead to not grow without bounds. And if the driver doesn't accept big
requests, that driver cannot perform well because it won't have many
requests pending.

In contrast, the -ac logic says roughly "Who the hell cares if the driver
can merge requests or not, we can just give it thousands of small requests
instead, and cap the total number of _sectors_ instead of capping the
total number of requests earlier".

In my opinion, the -ac logic is really bad, but one thing it does allow is
for stupid drivers that look like high-performance drivers. Which may be
why it got implemented.

And it may be that the hpt366 IDE driver has always had this braindamage,
which the -ac code hides. Or something like this.

Does anybody know the hpt driver? Does it, for example, limit the maximum
number of sectors per merge somehow for some reason?

Jens?

Linus

2001-10-28 18:15:35

by Alan

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

> In contrast, the -ac logic says roughly "Who the hell cares if the driver
> can merge requests or not, we can just give it thousands of small requests
> instead, and cap the total number of _sectors_ instead of capping the
> total number of requests earlier"

If you think about it, the major resource constraint is sectors - or,
another way to think of it, the "number of pinned pages the VM cannot
rescue until the I/O is done". We also have many devices where the latency
is horribly important - IDE is one because it lacks sensible overlapping
I/O. I'm less sure what the latency trade-offs are. Fewer commands mean
fewer turnarounds, so there is a counterbalance.

In the case of IDE the -ac tree will do basically the same merging - the
limitations on IDE DMA are pretty reasonable. DMA IDE has scatter gather
tables and is actually smarter than many older scsi controllers. The IDE
layer supports up to 128 chunks of up to just under 64Kb (should be 64K
but some chipsets get 64K = 0 wrong and it's not pretty)

> In my opinion, the -ac logic is really bad, but one thing it does allow is
> for stupid drivers that look like high-performance drivers. Which may be
> why it got implemented.

Well I'm all for making dumb hardware go as fast as smart stuff but that
wasn't the original goal - the original goal was to fix the bad behaviour
with the base kernel and large I/O queues to slow devices like M/O disks.

2001-10-28 18:48:19

by Linus Torvalds

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically


On Sun, 28 Oct 2001, Alan Cox wrote:
>
> > In contrast, the -ac logic says roughly "Who the hell cares if the driver
> > can merge requests or not, we can just give it thousands of small requests
> > instead, and cap the total number of _sectors_ instead of capping the
> > total number of requests earlier"
>
> If you think about it the major resource constraint is sectors - or another
> way to think of it "number of pinned pages the VM cannot rescue until the
> I/O is done".

Yes. But that's a VM decision, and that's a decision the VM _can_ and does
make. At least in newer VM's.

So counting sectors is only hiding problems at a higher level, and it's
hiding problems that the higher level can know about.

In contrast, one thing that the higher level _cannot_ know about is the
latency of the request queue, because that latency depends on the layout
of the requests. Contiguous requests are fast, seeks are slow. So the
number of requests (as long as they aren't infinitely sized) fairly well
approximates the latency.

Note that you are certainly right that the Linux VM system did not use to
be very good at throttling, and you could make it try to write out all of
memory on small machines. But that's really a VM issue.

(And have we had VMs that tried to push all of memory onto the disk, and
then returned Out-of-Memory when all pages were locked? Sure we have. But
I know mine doesn't; I don't know about yours.)

> We also have many devices where the latency is horribly
> important - IDE is one because it lacks sensible overlapping I/O. I'm less
> sure what the latency trade-offs are. Fewer commands mean fewer turnarounds,
> so there is a counterbalance.

Note that from a latency standpoint, you only need to have enough requests
to fill the queue - and right now we have a total of 128 requests, of
which half are for reads, and half of the write half is for the
watermarking, so you end up having 32 requests "in flight" while you
refill the queue.

Which is _plenty_. Because each request can be 255 sectors (or 128,
depending on where the limit is today ;), which means that if you actually
have something throughput-limited, you can certainly keep the disk busy.

(And if the requests aren't localized enough to coalesce well, you cannot
keep the disk at platter-speed _anyway_, plus the requests will take
longer to process, so you'll have even more time to fill the queue).

The important part for real throughput is not to have thousands of
requests in flight, but to have _big_enough_ requests in flight. You can
keep even a fast disk busy with just a few requests, if you just keep
refilling them quickly enough and if they are _big_ enough.

> In the case of IDE the -ac tree will do basically the same merging - the
> limitations on IDE DMA are pretty reasonable. DMA IDE has scatter gather
> tables and is actually smarter than many older scsi controllers. The IDE
> layer supports up to 128 chunks of up to just under 64Kb (should be 64K
> but some chipsets get 64K = 0 wrong and it's not pretty)

Yes. My question is more: does the hpt366 thing limit the queueing in some
way?

> Well I'm all for making dumb hardware go as fast as smart stuff but that
> wasn't the original goal - the original goal was to fix the bad behaviour
> with the base kernel and large I/O queues to slow devices like M/O disks.

Now, that's a _latency_ issue, and should be fixed by having the max
number of requests (and the max _size_ of a request too) be a per-queue
thing.

But notice how that actually doesn't have anything to do with memory size,
and makes your "scale by max memory" thing illogical.

Linus

2001-10-28 19:02:01

by Andrew Morton

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

Linus Torvalds wrote:
>
> And it may be that the hpt366 IDE driver has always had this braindamage,
> which the -ac code hides. Or something like this.
>

My hpt366, running stock 2.4.14-pre3 performs OK.
time ( dd if=/dev/zero of=foo bs=10240k count=100 ; sync )
takes 35 seconds (30 megs/sec). The same on current -ac kernels.

Maybe Zlatko's drive stopped doing DMA?

2001-10-28 19:13:06

by Barry K. Nathan

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

> Unfortunately, things didn't change on my first disk (IBM 7200rpm
> @home). I'm still getting low numbers, check the vmstat output at the
> end of the email.
>
> But now I found something interesting: the other two disks, which are on
> the standard IDE controller, work correctly (writing is at 17-22
> MB/sec). The disk which doesn't work well is on the HPT366 interface,
> so that may be our culprit. Now I got the idea to go back through the
> patches to see where it started behaving poorly.
>
> Also, one more thing, I'm pretty sure that under strange circumstances
> (specific alignment of stars) it behaves well (with appropriate
> writing speed). I just haven't yet pinpointed what needs to be done to
> get to that point.

I didn't read the entire thread, so this is a bit of a stab in the dark,
but:

This really reminds me of a problem I once had with a hard drive of
mine. It would usually go at 15-20MB/sec, but sometimes (under both
Linux and Windows) would slow down to maybe 350KB/sec. The slowdown, or
lack thereof, did seem to depend on the alignment of the stars. I lived
with it for a number of months, then started getting intermittent I/O
errors as well, as if the drive had bad sectors on disk.

The problem turned out to be insufficient ventilation for the controller
board on the bottom of the drive -- it was in the lowest 3.5" drive bay
in my case, so the bottom of the drive was snuggled next to a piece of
metal with ventilation holes. The holes were rather large (maybe 0.5"
diameter) -- and so were the areas without holes. Guess where one of the
drive's controller chips happened to be positioned, relative to the
holes? :( Moving the drive up a bit in the case, so as to allow 0.5"-1"
of space for air beneath the drive, fixed the problem (both the slowdown
and the I/O errors).

I don't know if this is your problem, but I'm mentioning it just in
case it is...

-Barry K. Nathan <[email protected]>

2001-10-28 19:23:07

by Alan

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

> Yes. My question is more: does the hpt366 thing limit the queueing in some
> way?

Nope. The HPT366 is a bog standard DMA IDE controller. At least, unless
Andre can point out something I've forgotten, any behaviour seen on it
should be the same as seen on any other IDE controller with DMA support.

In practical terms that should mean you can observe the same HPT366 problem
he does on whatever random IDE controller is on your desktop box.

> But notice how that actually doesn't have anything to do with memory size,
> and makes your "scale by max memory" thing illogical.

When you are dealing with the VM limit which the limiter was originally
added for, then it makes a lot of sense. When you want to use it solely for
other purposes, then it doesn't.

2001-10-28 21:53:58

by Jonathan Morton

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

>> Unfortunately, things didn't change on my first disk (IBM 7200rpm
>> @home). I'm still getting low numbers, check the vmstat output at the
>> end of the email.
>>
>> But now I found something interesting: the other two disks, which are on
>> the standard IDE controller, work correctly (writing is at 17-22
>> MB/sec). The disk which doesn't work well is on the HPT366 interface,
>> so that may be our culprit. Now I got the idea to go back through the
>> patches to see where it started behaving poorly.

>This really reminds me of a problem I once had with a hard drive of
>mine. It would usually go at 15-20MB/sec, but sometimes (under both
>Linux and Windows) would slow down to maybe 350KB/sec. The slowdown, or
>lack thereof, did seem to depend on the alignment of the stars. I lived
>with it for a number of months, then started getting intermittent I/O
>errors as well, as if the drive had bad sectors on disk.
>
>The problem turned out to be insufficient ventilation for the controller
>board on the bottom of the drive

As an extra datapoint, my IBM Deskstar 60GXP (40Gb version) runs
slightly slower with writing than with reading. This is on a VIA
686a controller, UDMA/66 active. The drive also has plenty of air
around it, being in a 5.25" bracket with fans in front.

Writing 1GB from /dev/zero takes 34.27s = 29.88MB/sec, 19% CPU
Reading 1GB from test file takes 29.64s = 34.58MB/sec, 18% CPU

Hmm, that's almost as fast as the 10000rpm Ultrastar sited just above
it, but with higher CPU usage. Ultrastar gets 36MB/sec on reading
with hdparm, haven't tested write performance due to probable
fragmentation.

Both tests conducted using 'dd bs=1k' on my 1GHz Athlon with 256Mb
RAM. Test file is on a freshly-created ext2 filesystem starting at
10Gb into the 40Gb drive (knowing IBM's recent trend, this'll still
be fairly close to the outer rim). Write test includes a sync at the
end. Kernel is Linus 2.4.9, no relevant patches.

--
--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: [email protected] (not for attachments)
website: http://www.chromatix.uklinux.net/vnc/
geekcode: GCS$/E dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$
V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*)
tagline: The key to knowledge is not to rely on people to teach you it.

2001-10-28 22:06:20

by Linus Torvalds

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>> Yes. My question is more: does the dpt366 thing limit the queueing some
>> way?
>
>Nope. The HPT366 is a bog standard DMA IDE controller. At least, unless
>Andre can point out something I've forgotten, any behaviour seen on it
>should be the same as seen on any other IDE controller with DMA support.
>
>In practical terms that should mean you can observe the same HPT366 problem
>he does on whatever random IDE controller is on your desktop box.

Well, the thing is, I obviously _don't_ observe that problem. Neither
does anybody else I have heard about. I get a nice 20MB/s on my IDE
disks both at home and at work, whether reading or writing.

Which was why I was suspecting the hpt366 code. But considering that
others report good performance with the same controller, it might be
something even more localized, either in just Zlatko's setup (ie disk or
controller breakage), or some subtle timing issue that is general but
you have to have just the right timing to hit it.

Linus

2001-10-30 08:56:17

by Jens Axboe

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

On Sun, Oct 28 2001, Linus Torvalds wrote:
>
> On Sun, 28 Oct 2001, Alan Cox wrote:
> >
> > > Do the -ac patches have any hpt366-specific stuff? Although I suspect
> > > you're right, and that it's just the driver (or controller itself) being
> >
> > The IDE code matches between the two. It isn't a driver change.
>
> It might, of course, just be timing, but that sounds like a bit _too_ easy
> an explanation. Even if it could easily be true.
>
> The fact that -ac gets higher speeds, and -ac has a very different
> request watermark strategy makes me suspect that that might be the cause.
>
> In particular, the standard kernel _requires_ that in order to get good
> performance you can merge many bh's onto one request. That's a very
> reasonable assumption: it basically says that any high-performance driver
> has to accept merging, because that in turn is required for the elevator
> overhead to not grow without bounds. And if the driver doesn't accept big
> requests, that driver cannot perform well because it won't have many
> requests pending.

Nod

> In contrast, the -ac logic says roughly "Who the hell cares if the driver
> can merge requests or not, we can just give it thousands of small requests
> instead, and cap the total number of _sectors_ instead of capping the
> total number of requests earlier".

Not true, that was not the intended goal. We always want the driver to
get merged requests, even if we can have ridiculously large queue
lengths. The large queues were a benchmark win (blush), since they allowed
the elevator to reorder seeks across a big bench run efficiently. I've
later done more real-life testing and I don't think it matters too much
here; in fact it only seems to incur greater latency and starvation.

> In my opinion, the -ac logic is really bad, but one thing it does allow is
> for stupid drivers that look like high-performance drivers. Which may be
> why it got implemented.

Don't mix up the larger queues with a lack of will to merge; that is not
the case.

> And it may be that the hpt366 IDE driver has always had this braindamage,
> which the -ac code hides. Or something like this.
>
> Does anybody know the hpt driver? Does it, for example, limit the maximum
> number of sectors per merge somehow for some reason?

hpt366 has no special workarounds or stuff it disables; it can't be
anything like that.

--
Jens Axboe

2001-10-30 19:07:24

by Josh McKinney

Subject: Re: xmm2 - monitor Linux MM active/inactive lists graphically

On approximately Tue, Oct 30, 2001 at 10:26:32AM +0100, Zlatko Calusic wrote:
>
> Followup on the problem. Yesterday I was upgrading my Debian Linux. To
> do that I had to remount /usr read-write. After the update finished,
> I once again tested the disk writing speed. And there it was, full
> 22MB/sec (on the same partition). And once I get to that point, the disk
> remains performant. Then I thought (poor man's logic) that the poor
> performance might have something to do with my /usr mounted read-only
> (BTW, it's on the same disk I'm having problems with).
>
> Quick test: reboot (/usr is ro), check speed -> only 8MB/sec, remount
> /usr rw; unfortunately that didn't help, the writing speed remains low.
>
> So it was just an idea. I still don't know what can be done to return
> the speed to normal. I don't know if I have mentioned it, but reading
> from the same disk always goes at full speed.
>
> So, something might be wrong with my setup, but I'm still unable to
> find what.
>
> I'm compiling with 2.95.4 20011006 (Debian prerelease) from the Debian
> unstable distribution. The kernel is completely monolithic (no modules).
>

I am also seeing some not_so_great performance from my IDE drive. It
is an IBM 30GB 7200rpm drive on a Promise ATA/100 controller. I am
also using Debian unstable, but I hope that isn't really the problem.
The vmstat output seems very erratic: it has large bursts, then really
slow spots.

I am running 2.4.13-ac4, with Rik's swapoff patch.

This is the command I used.

time nice -n -20 dd if=/dev/zero of=/mp3/foobara bs=1024k count=1024
1024+0 records in
1024+0 records out
0.02s user 23.95s system 20% cpu 1:57.24 total

And here is the vmstat output...

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 0 112636 18728 78352 0 0 58 2139 145 90 5 10 85
0 0 3 0 112636 18728 78352 0 0 0 48 115 73 0 5 95
1 0 0 0 91052 18728 99856 0 0 0 65 135 81 0 40 60
1 0 1 0 41900 18728 149008 0 0 0 31474 198 32 1 99 0
0 1 1 0 6104 18756 184748 0 0 0 10608 193 35 1 78 21
0 1 1 0 3064 18768 187756 0 0 0 10296 189 52 0 35 65
0 1 5 0 3060 18796 187704 0 0 0 10808 205 62 2 21 77
0 1 2 0 3064 18948 187412 0 0 0 130132 2654 812 0 23 76
0 1 2 0 3060 18956 187404 0 0 0 10692 188 54 0 26 74
1 0 3 0 3060 18964 187392 0 0 0 9818 199 53 1 26 73
0 1 3 0 3060 18976 187368 0 0 0 7308 186 52 1 23 76
0 3 2 0 3068 19016 187296 0 0 0 73788 1459 417 0 23 77
0 3 3 0 3076 19444 186436 0 0 0 6392 188 31 1 9 90
0 3 3 0 3076 19444 186436 0 0 0 10832 188 15 0 2 98
0 3 3 0 3076 19444 186436 0 0 0 10536 188 17 0 4 96
0 3 2 0 3076 19444 186436 0 0 0 6556 191 31 0 1 99
2 0 0 0 3064 19444 186448 0 0 0 17424 724 120 0 9 91
1 0 1 0 3064 19444 186896 0 0 0 22724 201 33 2 98 0
0 1 1 0 3064 19444 186540 0 0 0 9872 187 47 1 46 53
0 1 1 0 3060 19444 186464 0 0 0 2840 193 57 0 15 85
0 1 1 0 3060 19444 186464 0 0 0 11088 201 54 0 19 81
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 2 0 3060 19500 186360 0 0 0 118288 2232 604 0 25 75
0 3 2 0 3060 19500 186360 0 0 0 7748 185 57 1 1 98
0 3 2 0 3060 19500 186360 0 0 0 11776 184 54 0 2 98
0 3 2 0 3060 19500 186360 0 0 0 7924 181 52 0 1 99
0 3 2 0 3060 19500 186360 0 0 0 4776 182 64 0 1 99
0 3 1 0 3060 19500 186360 0 0 0 952 183 50 0 2 98
0 3 1 0 3060 19500 186360 0 0 0 0 180 48 0 1 99
1 0 2 0 3064 19500 186356 0 0 0 296 242 80 0 12 88
0 1 0 0 3060 19500 186360 0 0 0 19668 730 282 1 7 93
0 1 0 0 3060 19500 186360 0 0 0 6764 200 133 1 2 97
1 0 0 0 3064 19500 186356 0 0 0 16680 171 123 1 64 35
1 0 1 0 3060 19500 186360 0 0 0 26903 193 33 1 99 0
0 1 1 0 3064 19500 186356 0 0 0 7756 184 40 1 54 45
0 1 5 0 3060 19500 186360 0 0 0 10915 182 66 0 23 77
0 3 3 0 3060 19668 186024 0 0 0 105102 2059 629 0 18 82
0 3 3 0 3060 19668 186024 0 0 0 7772 181 59 0 1 99
0 4 1 0 3060 19668 185948 0 0 1 7440 187 68 0 2 98
0 4 1 0 3060 19668 185948 0 0 0 260 187 57 0 2 98
0 4 1 0 3064 19668 185948 0 0 0 0 181 69 0 1 99
0 5 2 0 3276 19668 185944 0 0 5 522 180 69 3 5 92
1 1 2 0 3064 19668 185740 0 0 25 26593 190 96 0 78 22
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 4 4 0 3064 19788 184768 0 0 13 99062 2190 647 0 20 79
0 4 1 0 3064 19788 184768 0 0 0 2440 218 57 0 3 97
0 4 1 0 3064 19788 184768 0 0 1 0 217 51 0 0 100
1 1 5 0 3060 19788 185148 0 0 24 14692 214 108 2 46 52
0 4 2 0 3064 19824 185068 0 0 8 19965 940 240 0 24 75
0 4 2 0 3064 19824 185068 0 0 0 3224 293 64 0 0 100
0 4 2 0 3064 19824 185068 0 0 0 4624 310 64 0 3 97
0 4 2 0 3140 19824 184864 0 0 4 3316 245 86 1 4 95
0 4 2 0 3140 19824 184864 0 0 0 2608 252 58 0 1 99
0 4 2 0 3140 19824 184864 0 0 0 6208 262 58 0 0 100
1 3 2 0 3144 19824 184864 0 0 0 11864 228 58 1 1 98
0 4 1 0 3140 19824 184864 0 0 4 1200 241 86 1 5 94
0 4 3 0 3140 19824 184864 0 0 0 27 182 51 0 1 99
1 1 3 0 3064 19824 184936 0 0 16 24423 183 104 1 61 38
0 3 6 0 3064 19940 183904 0 0 69 44339 995 314 1 25 74
0 3 5 0 3064 19940 183904 0 0 0 100 180 51 0 0 100
0 3 5 0 3064 19940 183904 0 0 0 0 187 50 0 1 99



--
Linux, the choice | "Shelter," what a nice name for a place
of a GNU generation -o) | where you polish your cat.
Kernel 2.4.13-ac4 /\ |
on a i586 _\_v |
|

2001-11-02 05:53:57

by Andrea Arcangeli

Subject: Zlatko's I/O slowdown status

Hello Zlatko,

I'm not sure how the email thread ended, but I noticed different
unplugging of the I/O queues in mainline (mainline was a little more
overkill than -ac) and also wrong bdflush hysteresis (for example, the
pre-wakeup of bdflush to avoid blocking if the write flood could be
sustained by the bandwidth of the HD was missing).

So you may want to give pre6aa1 a spin and see if it makes any
difference; if it does, I'll know what your problem is
(see the buffer.c part of the vm-10 patch in pre6aa1 for more details).

thanks,

Andrea

2001-11-02 20:14:57

by Zlatko Calusic

Subject: Re: Zlatko's I/O slowdown status

Andrea Arcangeli <[email protected]> writes:

> Hello Zlatko,
>
> I'm not sure how the email thread ended, but I noticed different
> unplugging of the I/O queues in mainline (mainline was a little more
> aggressive than -ac) and also wrong bdflush hysteresis (for example,
> the pre-wakeup of bdflush, to avoid blocking when the write flood can
> be sustained by the bandwidth of the HD, was missing).

Thank God, today it is finally solved. Just two days ago, I was pretty
sure that the disk had started dying on me, and I didn't know of any
solution for that. Today, while I was about to try your patch, I got
another idea and finally pinpointed the problem.

It was write caching. Somehow the disk was running with the write cache
turned off and I was getting abysmal write performance. Then I found
hdparm -W0 /proc/ide/hd* in /etc/init.d/umountfs, which is run during
shutdown, but I don't understand how it survived reboots and restarts!
Nor why only two of the four disks I'm dealing with got confused by
the command. And finally, I don't understand how I could still get
full speed occasionally. Weird!

I would advise users of Debian unstable to comment out that part; I'm
sure it's useless on most if not all setups. You might be pleasantly
surprised by the performance gain (write speed doubles).
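
If anyone wants to check and fix this by hand, a minimal sketch follows,
assuming an IDE disk at /dev/hda; the device name and the exact wording
of the hdparm output are assumptions:

# see whether the drive reports write caching as enabled
# (the identify output format varies with hdparm and firmware versions)
hdparm -I /dev/hda | grep -i 'write cache'
# turn write caching back on
hdparm -W1 /dev/hda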

>
> So you may want to give pre6aa1 a spin and see if it makes any
> difference; if it does, I'll know what your problem is (see the
> buffer.c part of the vm-10 patch in pre6aa1 for more details).
>

Thanks for your concern. Eventually I compiled aa1 and it is running
correctly (the whole day at work, and the last hour at home - SMP),
although now I don't see any performance improvements from it.

I would like to thank all the others who spent time helping me,
especially Linus, Jens and Marcelo; sorry, guys, for taking up your time.
--
Zlatko

2001-11-02 20:20:58

by Jeffrey W. Baker

[permalink] [raw]
Subject: Re: Zlatko's I/O slowdown status



On 2 Nov 2001, Zlatko Calusic wrote:

> Andrea Arcangeli <[email protected]> writes:
>
> > Hello Zlatko,
> >
> > I'm not sure how the email thread ended, but I noticed different
> > unplugging of the I/O queues in mainline (mainline was a little more
> > aggressive than -ac) and also wrong bdflush hysteresis (for example,
> > the pre-wakeup of bdflush, to avoid blocking when the write flood can
> > be sustained by the bandwidth of the HD, was missing).
>
> Thank God, today it is finally solved. Just two days ago, I was pretty
> sure that the disk had started dying on me, and I didn't know of any
> solution for that. Today, while I was about to try your patch, I got
> another idea and finally pinpointed the problem.
>
> It was write caching. Somehow the disk was running with the write cache
> turned off and I was getting abysmal write performance. Then I found
> hdparm -W0 /proc/ide/hd* in /etc/init.d/umountfs, which is run during
> shutdown, but I don't understand how it survived reboots and restarts!
> Nor why only two of the four disks I'm dealing with got confused by
> the command. And finally, I don't understand how I could still get
> full speed occasionally. Weird!
>
> I would advise users of Debian unstable to comment out that part; I'm
> sure it's useless on most if not all setups. You might be pleasantly
> surprised by the performance gain (write speed doubles).

That's great if you don't mind losing all of your data in a power outage!
What do you think happens if the software thinks data is committed to
permanent storage when in fact it is only in DRAM on the drive?

-jwb

2001-11-02 20:37:30

by John Alvord

[permalink] [raw]
Subject: Re: Zlatko's I/O slowdown status

On Fri, 2 Nov 2001 12:16:40 -0800 (PST), "Jeffrey W. Baker"
<[email protected]> wrote:

>
>
>On 2 Nov 2001, Zlatko Calusic wrote:
>
>> Andrea Arcangeli <[email protected]> writes:
>>
>> > Hello Zlatko,
>> >
>> > I'm not sure how the email thread ended, but I noticed different
>> > unplugging of the I/O queues in mainline (mainline was a little more
>> > aggressive than -ac) and also wrong bdflush hysteresis (for example,
>> > the pre-wakeup of bdflush, to avoid blocking when the write flood can
>> > be sustained by the bandwidth of the HD, was missing).
>>
>> Thank God, today it is finally solved. Just two days ago, I was pretty
>> sure that the disk had started dying on me, and I didn't know of any
>> solution for that. Today, while I was about to try your patch, I got
>> another idea and finally pinpointed the problem.
>>
>> It was write caching. Somehow the disk was running with the write cache
>> turned off and I was getting abysmal write performance. Then I found
>> hdparm -W0 /proc/ide/hd* in /etc/init.d/umountfs, which is run during
>> shutdown, but I don't understand how it survived reboots and restarts!
>> Nor why only two of the four disks I'm dealing with got confused by
>> the command. And finally, I don't understand how I could still get
>> full speed occasionally. Weird!
>>
>> I would advise users of Debian unstable to comment out that part; I'm
>> sure it's useless on most if not all setups. You might be pleasantly
>> surprised by the performance gain (write speed doubles).
>
>That's great if you don't mind losing all of your data in a power outage!
>What do you think happens if the software thinks data is committed to
>permanent storage when in fact it is only in DRAM on the drive?

Sounds like switching write-caching off at shutdown is a valid way to
get the data out of the cache. But shouldn't it be switched back on
again later?

john

2001-11-02 20:57:52

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Zlatko's I/O slowdown status

On Fri, Nov 02, 2001 at 09:14:14PM +0100, Zlatko Calusic wrote:
> It was write caching. Somehow the disk was running with the write cache

Ah, I was going to ask you to try with:

/sbin/hdparm -d1 -u1 -W1 -c1 /dev/hda

(my settings; of course not safe for a journaling fs, it is only safe to
use with ext2, and I set it back to -W0 during /etc/init.d/halt), but I
assumed you were using the same hdparm settings in -ac and mainline.
Never mind, good that it's solved now :).
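
If someone wants to measure the effect of -W on their own drive, a rough
sequential write test is enough. A sketch only; the target file, its size
and the device name are placeholders:

# compare timed ~200 MB sequential writes with the cache off and on
hdparm -W0 /dev/hda
time sh -c 'dd if=/dev/zero of=/tmp/ddtest bs=1024k count=200 && sync'
hdparm -W1 /dev/hda
time sh -c 'dd if=/dev/zero of=/tmp/ddtest bs=1024k count=200 && sync'
rm /tmp/ddtest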

Andrea

2001-11-02 21:17:05

by Zlatko Calusic

[permalink] [raw]
Subject: Re: Zlatko's I/O slowdown status

"Jeffrey W. Baker" <[email protected]> writes:

> On 2 Nov 2001, Zlatko Calusic wrote:
>
> > Andrea Arcangeli <[email protected]> writes:
> >
> > > Hello Zlatko,
> > >
> > > I'm not sure how the email thread ended, but I noticed different
> > > unplugging of the I/O queues in mainline (mainline was a little more
> > > aggressive than -ac) and also wrong bdflush hysteresis (for example,
> > > the pre-wakeup of bdflush, to avoid blocking when the write flood can
> > > be sustained by the bandwidth of the HD, was missing).
> >
> > Thank God, today it is finally solved. Just two days ago, I was pretty
> > sure that the disk had started dying on me, and I didn't know of any
> > solution for that. Today, while I was about to try your patch, I got
> > another idea and finally pinpointed the problem.
> >
> > It was write caching. Somehow the disk was running with the write cache
> > turned off and I was getting abysmal write performance. Then I found
> > hdparm -W0 /proc/ide/hd* in /etc/init.d/umountfs, which is run during
> > shutdown, but I don't understand how it survived reboots and restarts!
> > Nor why only two of the four disks I'm dealing with got confused by
> > the command. And finally, I don't understand how I could still get
> > full speed occasionally. Weird!
> >
> > I would advise users of Debian unstable to comment out that part; I'm
> > sure it's useless on most if not all setups. You might be pleasantly
> > surprised by the performance gain (write speed doubles).
>
> That's great if you don't mind losing all of your data in a power outage!

That has nothing to do with a power outage; it is only run during
halt/poweroff.

> What do you think happens if the software thinks data is committed to
> permanent storage when in fact it is only in DRAM on the drive?
>

Bad things, of course. But -W0 won't save you from file corruption when
you have megabytes of data in the page cache, still not synced to disk,
and you suddenly lose power.

Of course, journalling filesystems will change things a bit...
--
Zlatko

2001-11-02 23:24:15

by Simon Kirby

[permalink] [raw]
Subject: Re: Zlatko's I/O slowdown status

On Fri, Nov 02, 2001 at 09:14:14PM +0100, Zlatko Calusic wrote:

> Thank God, today it is finally solved. Just two days ago, I was pretty
> sure that the disk had started dying on me, and I didn't know of any
> solution for that. Today, while I was about to try your patch, I got
> another idea and finally pinpointed the problem.
>
> It was write caching. Somehow the disk was running with the write cache
> turned off and I was getting abysmal write performance. Then I found
> hdparm -W0 /proc/ide/hd* in /etc/init.d/umountfs, which is run during
> shutdown, but I don't understand how it survived reboots and restarts!
> Nor why only two of the four disks I'm dealing with got confused by
> the command. And finally, I don't understand how I could still get
> full speed occasionally. Weird!
>
> I would advise users of Debian unstable to comment out that part; I'm
> sure it's useless on most if not all setups. You might be pleasantly
> surprised by the performance gain (write speed doubles).

Aha! That would explain why I was seeing it as well... and why I was
seeing errors from hdparm for /dev/hdc and /dev/hdd, which are CDROMs.

Argh. :)

If they have hdparm -W 0 at shutdown, there should be a -W 1 during
startup.
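
Something along these lines in a local boot script would do it. A minimal
sketch; the device glob is an assumption, and hdparm will simply fail on
non-disk devices such as CD-ROMs, so those errors are ignored:

# re-enable write caching for whatever IDE disks are present at boot
for d in /dev/hd[a-h]; do
    [ -b "$d" ] && hdparm -W1 "$d" 2>/dev/null || true
done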

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-11-02 23:37:35

by Miquel van Smoorenburg

[permalink] [raw]
Subject: Re: Zlatko's I/O slowdown status

In article <[email protected]>,
Simon Kirby <[email protected]> wrote:
>If they have hdparm -W 0 at shutdown, there should be a -W 1 during
>startup.

Well, no. It should be set back to the 'power-on default' on startup.
But there is no way to do that.

Mike.
--
"Only two things are infinite, the universe and human stupidity,
and I'm not sure about the former" -- Albert Einstein.