2001-12-13 18:40:10

by Roy Sigurd Karlsbakk

Subject: [BUG] RAID sub system

> did you actually try this with a bunch of parallel dd's?

I just did now. Same result:

# vmstat 2
r   b w swpd free buff  cache si  so    bi  bo  in  cs us sy id
0 200 1 1676 3200 3012 786004  0 292 42034 298 791 745  4 29 67
0 200 1 1676 3308 3136 785760  0   0 44304   0 748 758  3 15 82
0 200 1 1676 3296 3232 785676  0   0 44236   0 756 710  2 23 75
0 200 1 1676 3304 3356 785548  0   0 38662  70 778 791  3 19 78
0 200 1 1676 3200 3456 785552  0   0 33536   0 693 594  3 13 84
1 200 0 1676 3224 3528 785192  0   0 35330  24 794 712  3 16 81
0 200 0 1676 3304 3736 784324  0   0 30524  74 725 793 12 14 74
0 200 0 1676 3256 3796 783664  0   0 29984   0 718 826  4 10 86
0 200 0 1676 3288 3868 783592  0   0 25540 152 763 812  3 17 80
0 200 0 1676 3276 3908 783472  0   0 22820   0 693 731  0  7 92
0 200 0 1676 3200 3964 783540  0   0 23312   6 759 827  4 11 85
0 200 0 1676 3308 3984 783452  0   0 17506   0 687 697  0 11 89
0 200 0 1676 3388 4012 783888  0   0 14512   0 671 638  1  5 93
0 200 0 2188 3208 4048 784156  0 512 16104 548 707 833  2 10 88
0 200 0 3468 3204 4048 784788  0  66  8220  66 628 662  0  3 96
0 200 0 3468 3296 4060 784680  0   0  1036   6 687 714  1  6 93
0 200 0 3468 3316 4060 784668  0   0  1018   0 613 631  1  2 97
0 200 0 3468 3292 4060 784688  0   0  1034   0 617 638  0  3 97
0 200 0 3468 3200 4068 784772  0   0  1066   6 694 727  2  4 94

--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA

Computers are like air conditioners.
They stop working when you open Windows.



2001-12-13 19:15:50

by Roy Sigurd Karlsbakk

Subject: Re: [BUG] RAID sub system

> > > did you actually try this with a bunch of parallel dd's?
> >
> > I just did now. Same result:
> >
> > # vmstat 2
> > r   b w swpd free buff  cache si  so    bi  bo  in  cs us sy id
> > 0 200 1 1676 3200 3012 786004  0 292 42034 298 791 745  4 29 67
> > 0 200 1 1676 3308 3136 785760  0   0 44304   0 748 758  3 15 82
> > 0 200 1 1676 3296 3232 785676  0   0 44236   0 756 710  2 23 75
> > 0 200 1 1676 3304 3356 785548  0   0 38662  70 778 791  3 19 78
> > 0 200 1 1676 3200 3456 785552  0   0 33536   0 693 594  3 13 84
> > 1 200 0 1676 3224 3528 785192  0   0 35330  24 794 712  3 16 81
> > 0 200 0 1676 3304 3736 784324  0   0 30524  74 725 793 12 14 74
> > 0 200 0 1676 3256 3796 783664  0   0 29984   0 718 826  4 10 86
> > 0 200 0 1676 3288 3868 783592  0   0 25540 152 763 812  3 17 80
> > 0 200 0 1676 3276 3908 783472  0   0 22820   0 693 731  0  7 92
> > 0 200 0 1676 3200 3964 783540  0   0 23312   6 759 827  4 11 85
> > 0 200 0 1676 3308 3984 783452  0   0 17506   0 687 697  0 11 89
> > 0 200 0 1676 3388 4012 783888  0   0 14512   0 671 638  1  5 93
> > 0 200 0 2188 3208 4048 784156  0 512 16104 548 707 833  2 10 88
> > 0 200 0 3468 3204 4048 784788  0  66  8220  66 628 662  0  3 96
> > 0 200 0 3468 3296 4060 784680  0   0  1036   6 687 714  1  6 93
> > 0 200 0 3468 3316 4060 784668  0   0  1018   0 613 631  1  2 97
> > 0 200 0 3468 3292 4060 784688  0   0  1034   0 617 638  0  3 97
> > 0 200 0 3468 3200 4068 784772  0   0  1066   6 694 727  2  4 94
>
> so help me out here a little. the first dozen or so lines look good,
> but obviously slowing down (which is a little odd). then at line -6,
> the VM freaks, tries swapping out, and everything goes to sleep.
> it's the going to sleep you're worried about, right?

Yes. I really need all available I/O here, not nasty bugs.

hm

trying to time the problem:

# init 6
...
# free
             total       used       free     shared    buffers     cached
Mem:        899712      75576     824136          0       4836      29408
-/+ buffers/cache:      41332     858380
Swap:       674720          0     674720
# dd-test ; vmstat -n 2 > vmstat &
# tail -f vmstat
procs           memory           swap     io      system    cpu
r   b w swpd   free buff  cache si  so    bi  bo   in  cs us sy id
0 200 0    0 767440 5036  36616  0   0   565 153 4337 276  9 20 71
0 200 0    0 733012 5412  69180  0   0 16324   0  603 628  0  6 94
0 200 0    0 667196 5428 132408  0   0 31488  56  678 530  1 12 87
0 200 0    0 600792 5580 196472  0   0 32114   0  683 536  2 10 88
0 200 0    0 528956 5616 265956  0   0 34708  30  678 482  0 15 85
0 200 0    0 451496 5724 340508  0   0 37338   0  724 640  1 16 82
0 200 0    0 383020 5936 406512  0   0 33096   8  699 689  2 11 87
0 200 0    0 301368 6032 485432  0   0 39464   0  726 522  2 15 83
0 200 0    0 216412 6124 567552  0   0 41092   0  698 613  2 17 81
0 200 0    0 131364 6248 649732  0   0 41162   8  722 701  2 18 80
0 200 0    0  52740 6372 725696  0   0 38028   0  721 461  2 14 84
0 200 1 2676   3264 2944 778932  0 308 44816 380  766 804  0 23 76
0 200 1 2676   3272 3032 778844  0   0 45562   0  764 642  2 17 81
0 200 1 2676   3292 3136 778712  0   0 39156   0  721 767  1 20 78
0 200 1 2676   3264 3260 778620  0   0 40664   8  738 624  2 11 86
0 200 0 2676   3212 3368 778480  0   0 37056   0  727 614  1 15 84
0 200 1 2676   3228 3464 778468  0   0 32052   8  654 743  2 12 86
0 200 1 2676   3196 3492 778472  0   2 30882   2  713 721  0 12 88
0 200 1 2676   3220 3556 778368  0   0 26490   0  698 739  1 10 89
0 200 0 2676   3224 3640 778212  0   0 25194  36  709 706  0 11 89
0 200 0 2676   3304 3692 778136  0   0 20998   0  678 732  1  6 93
0 200 0 2676   3272 3748 778108  0   0 19734  16  689 768  1 11 88
0 200 1 2676   3236 3780 778060  0   0 13708   0  644 748  0  6 94
0 200 0 2676   3196 3800 778076  0   0  9492   0  629 644  0  8 92
0 200 0 2676   3308 3828 778020  0   0 12978   8  664 727  1  6 93
0 200 0 2676   3512 3848 777716  0  14 11130  14  664 698  1  6 93
0 200 0 2804   3256 3860 777896  0 870  7074 878  677 674  0  3 96
0 200 0 2804   3288 3860 777856  0   0  1068   0  625 665  1  3 96
0 200 0 2804   3320 3868 777816  0  16  1068  24  627 667  1  2 97
0 200 0 2804   3212 3868 777928  0  84  1080  84  600 671  1  2 97

...and so on

This gives a total read of a little less than 800MB before giving up. Is
there a cache timeout that needs to be set any lower?
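
One way to put a number on the timing from the log above: a minimal Python sketch that measures how long the read rate holds up after free memory runs out. The helper names and the two thresholds (FREE_LOW_KB, BI_SLOW) are made up for illustration, and the 2 second interval is taken from the "vmstat -n 2" invocation. The rows are copied from the tail -f output above, starting one sample before memory fills.

```python
# Sketch: time the gap between "memory is used up" and "reads stall"
# in a vmstat log. Thresholds below are assumptions, not kernel values.

INTERVAL_S = 2       # sampling interval, from "vmstat -n 2"
FREE_LOW_KB = 4000   # assumed: "free" below this means memory is used up
BI_SLOW = 2000       # assumed: "bi" below this means reads have stalled

LOG = """\
r b w swpd free buff cache si so bi bo in cs us sy id
0 200 0 0 52740 6372 725696 0 0 38028 0 721 461 2 14 84
0 200 1 2676 3264 2944 778932 0 308 44816 380 766 804 0 23 76
0 200 1 2676 3272 3032 778844 0 0 45562 0 764 642 2 17 81
0 200 1 2676 3292 3136 778712 0 0 39156 0 721 767 1 20 78
0 200 1 2676 3264 3260 778620 0 0 40664 8 738 624 2 11 86
0 200 0 2676 3212 3368 778480 0 0 37056 0 727 614 1 15 84
0 200 1 2676 3228 3464 778468 0 0 32052 8 654 743 2 12 86
0 200 1 2676 3196 3492 778472 0 2 30882 2 713 721 0 12 88
0 200 1 2676 3220 3556 778368 0 0 26490 0 698 739 1 10 89
0 200 0 2676 3224 3640 778212 0 0 25194 36 709 706 0 11 89
0 200 0 2676 3304 3692 778136 0 0 20998 0 678 732 1 6 93
0 200 0 2676 3272 3748 778108 0 0 19734 16 689 768 1 11 88
0 200 1 2676 3236 3780 778060 0 0 13708 0 644 748 0 6 94
0 200 0 2676 3196 3800 778076 0 0 9492 0 629 644 0 8 92
0 200 0 2676 3308 3828 778020 0 0 12978 8 664 727 1 6 93
0 200 0 2676 3512 3848 777716 0 14 11130 14 664 698 1 6 93
0 200 0 2804 3256 3860 777896 0 870 7074 878 677 674 0 3 96
0 200 0 2804 3288 3860 777856 0 0 1068 0 625 665 1 3 96
"""

def parse_vmstat(text):
    """Return a list of (free_kb, bi) tuples, one per vmstat data row."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 16 all-numeric columns; header rows are skipped.
        if len(fields) == 16 and all(f.isdigit() for f in fields):
            samples.append((int(fields[4]), int(fields[9])))
    return samples

def seconds_until_stall(samples):
    """Seconds from the first low-memory sample to the first stalled sample."""
    full = next((i for i, (free, _) in enumerate(samples)
                 if free < FREE_LOW_KB), None)
    if full is None:
        return None
    stall = next((i for i, (_, bi) in enumerate(samples)
                  if i > full and bi < BI_SLOW), None)
    if stall is None:
        return None
    return (stall - full) * INTERVAL_S

print(seconds_until_stall(parse_vmstat(LOG)))  # -> 32
```

With these thresholds the read rate keeps going for roughly half a minute after free memory bottoms out, then drops to about 1 MB/s.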

roy


--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA

Computers are like air conditioners.
They stop working when you open Windows.

2001-12-13 19:45:31

by Roy Sigurd Karlsbakk

Subject: Re: [BUG] RAID sub system

> This gives a total read of a little less than 800MB before giving up. Is
> there a cache timeout that needs to be set any lower?
>
> roy

more testing

[root@linuxserver root]# swapoff -a
[root@linuxserver root]# free
             total       used       free     shared    buffers     cached
Mem:        899712      74504     825208          0       4832      29408
-/+ buffers/cache:      40264     859448
Swap:            0          0          0
[root@linuxserver root]# vmstat -n 2

blah blah blah

same result.
--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA

Computers are like air conditioners.
They stop working when you open Windows.


2001-12-13 20:02:53

by Roy Sigurd Karlsbakk

Subject: Re: [BUG] RAID sub system


even more testing...

It seems like it's not the caching itself that is f..cked, as I can do
pretty good i/o from other devices, like /dev/hda. (/dev/md0 is /dev/hde
and /dev/hdg in RAID-0 - see my first email)

roy

--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA

Computers are like air conditioners.
They stop working when you open Windows.

2001-12-14 13:13:49

by Roy Sigurd Karlsbakk

Subject: Re: [BUG] RAID sub system

> > ...and so on
> >
> > This gives a total read of a little less than 800MB before giving up. Is
> > there a cache timeout that needs to be set any lower?
>
> my observation is this: once you use up all your free memory, you have
> 30 seconds of reasonable behavior. 30 seconds is the default dirty-buffer
> age. are you tweaking /proc/sys/vm/bdflush at all? and no, I don't see
> why your application would have dirty buffers in the first place -
> I'm just noticing the ominous 30 seconds.

no bdflush tweaking...

this is the same as I've observed... Use up all memory, and everything
stops.

> are you using elvtune?

er.. what's elvtune?

> also, which kernel?

2.4.16 + tux patches

--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA

Computers are like air conditioners.
They stop working when you open Windows.


2001-12-14 14:02:52

by Roy Sigurd Karlsbakk

Subject: Re: [BUG] RAID sub system

> my observation is this: once you use up all your free memory, you have
> 30 seconds of reasonable behavior. 30 seconds is the default dirty-buffer
> age. are you tweaking /proc/sys/vm/bdflush at all? and no, I don't see
> why your application would have dirty buffers in the first place -
> I'm just noticing the ominous 30 seconds.

More testing with bootparam mem=xxx shows this is it. When all memory is
used, it fails to re-use old cache, or something.
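
A rough back-of-the-envelope sketch in Python of why a smaller mem= should bring the slowdown on sooner: the time to fill memory with page cache is just free memory divided by the read rate. The free and used figures come from the free output posted earlier in the thread. The read rate is an eyeballed average of the bi column while memory was filling, assuming bi is in KB/s (which matches how fast the cache column grows). The mem= values and function names are hypothetical.

```python
# Sketch: estimate how long sequential reads take to fill free memory.
# All figures are rough; the mem= values below are only examples.

FREE_AFTER_BOOT_KB = 824136   # "free" column of free(1) right after reboot
USED_AFTER_BOOT_KB = 75576    # "used" column of the same output
READ_RATE_KB_S = 34000        # assumed average of vmstat "bi" while filling

def seconds_to_fill(free_kb, read_kb_per_s=READ_RATE_KB_S):
    """Time until reads have turned all free memory into page cache."""
    return free_kb / read_kb_per_s

def free_kb_for_mem(mem_mb, used_kb=USED_AFTER_BOOT_KB):
    """Free memory for a mem=<mem_mb>M boot, assuming the same usage after boot."""
    return mem_mb * 1024 - used_kb

print("full box:", round(seconds_to_fill(FREE_AFTER_BOOT_KB)), "s")
for mem_mb in (256, 512):     # hypothetical mem= settings
    print(f"mem={mem_mb}M:", round(seconds_to_fill(free_kb_for_mem(mem_mb))), "s")
```

If the slowdown really follows memory filling up, the point where bi collapses should move earlier by about the same factor as mem= shrinks.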

--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA

Computers are like air conditioners.
They stop working when you open Windows.

2001-12-14 16:53:55

by Jens Axboe

Subject: Re: [BUG] RAID sub system

On Fri, Dec 14 2001, Roy Sigurd Karlsbakk wrote:
> > my observation is this: once you use up all your free memory, you have
> > 30 seconds of reasonable behavior. 30 seconds is the default dirty-buffer
> > age. are you tweaking /proc/sys/vm/bdflush at all? and no, I don't see
> > why your application would have dirty buffers in the first place -
> > I'm just noticing the ominous 30 seconds.
>
> More testing with bootparam mem=xxx shows this is it. When all memory is
> used, it fails to re-use old cache, or something.

sysrq-t output from a "hung" system would give some valuable info as to
why it's stuck.

--
Jens Axboe

2001-12-15 18:17:30

by Roy Sigurd Karlsbakk

Subject: Re: [BUG] RAID sub system

> sysrq-t output from a "hung" system would give some valuable info as to
> why it's stuck.

It never hangs. It just slows down the i/o to ~ 1MB/s

also ... I started out with 200 'dd' processes (as the script I sent
earlier). The i/o gets stuck, and ... strange ...

# killall dd

watching vmstat

procs          memory         swap     io    system     cpu
r   b w swpd free buff  cache si so    bi bo  in  cs us sy  id
0 200 0    0 3280 4092 782072  0  0  1054  0 621 657  0  4  96
0 200 0    0 3284 4092 782108  0  0  1042  0 631 665  1  1  98
0 101 0    0 3672 4164 804600  0  0 17594  0 762 931  2  9  88
0  56 1    0 3248 4180 815332  0  0 32598  0 897 727  0 10  90
0  30 1    0 3276 4212 821640  0  0 33102 40 910 447  1  6  92
0   1 1    0 3252 4236 829080  0  0 31692  0 947 301  1  9  90
0   0 0    0 3296 4236 829320  0  0  2042  0 401 132  0  0 100
0   0 0    0 3288 4244 829320  0  0     0 26 360 132  0  0 100
0   0 0    0 3288 4244 829320  0  0     0  0 326 127  0  0 100
0   0 0    0 3288 4244 829320  0  0     0  0 358 127  0  0 100


It speeds up again when the processes start to die, although the number of
processes started has no effect on the outcome. After a while the i/o
all but stops (see the first two lines of the vmstat output above).
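
For anyone wanting to reproduce the load, here is a rough Python sketch of starting a batch of parallel dd readers. This is not the script referred to above, only a stand-in: the /raid/fileNNN paths, the 1M block size and the function names are assumptions.

```python
# Sketch: start many sequential dd readers at once.
# File layout and block size are hypothetical.
import subprocess

def dd_command(path, block_size="1M"):
    """Build the argv for one sequential read of `path`, discarding the data."""
    return ["dd", f"if={path}", "of=/dev/null", f"bs={block_size}"]

def start_parallel_reads(paths):
    """Start one dd per path and return the Popen handles without waiting."""
    return [subprocess.Popen(dd_command(p), stderr=subprocess.DEVNULL)
            for p in paths]

def wait_all(procs):
    """Wait for every reader to exit and return their exit codes."""
    return [p.wait() for p in procs]

# Hypothetical layout: 200 large files on the RAID-0 filesystem.
paths = [f"/raid/file{i:03d}" for i in range(200)]

# To run the load (left commented out so nothing starts on import):
# procs = start_parallel_reads(paths)
# print(wait_all(procs))
```

Watching vmstat in another terminal while this runs should show the same b column of 200 blocked processes as in the output above.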

I also noted quite a dramatic speed reduction in 2.4.17-rc1 compared to
2.4.16. 2.4.16 gave me up to 50 megs / sec, whereas 2.4.17-rc1 only gives
me around 35.

roy

--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA

Computers are like air conditioners.
They stop working when you open Windows.