> so did you actually try this with a bunch of parallel dd's?
I just did now. Same result:
# vmstat 2
r b w swpd free buff cache si so bi bo in cs us sy id
0 200 1 1676 3200 3012 786004 0 292 42034 298 791 745 4 29 67
0 200 1 1676 3308 3136 785760 0 0 44304 0 748 758 3 15 82
0 200 1 1676 3296 3232 785676 0 0 44236 0 756 710 2 23 75
0 200 1 1676 3304 3356 785548 0 0 38662 70 778 791 3 19 78
0 200 1 1676 3200 3456 785552 0 0 33536 0 693 594 3 13 84
1 200 0 1676 3224 3528 785192 0 0 35330 24 794 712 3 16 81
0 200 0 1676 3304 3736 784324 0 0 30524 74 725 793 12 14 74
0 200 0 1676 3256 3796 783664 0 0 29984 0 718 826 4 10 86
0 200 0 1676 3288 3868 783592 0 0 25540 152 763 812 3 17 80
0 200 0 1676 3276 3908 783472 0 0 22820 0 693 731 0 7 92
0 200 0 1676 3200 3964 783540 0 0 23312 6 759 827 4 11 85
0 200 0 1676 3308 3984 783452 0 0 17506 0 687 697 0 11 89
0 200 0 1676 3388 4012 783888 0 0 14512 0 671 638 1 5 93
0 200 0 2188 3208 4048 784156 0 512 16104 548 707 833 2 10 88
0 200 0 3468 3204 4048 784788 0 66 8220 66 628 662 0 3 96
0 200 0 3468 3296 4060 784680 0 0 1036 6 687 714 1 6 93
0 200 0 3468 3316 4060 784668 0 0 1018 0 613 631 1 2 97
0 200 0 3468 3292 4060 784688 0 0 1034 0 617 638 0 3 97
0 200 0 3468 3200 4068 784772 0 0 1066 6 694 727 2 4 94
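A parallel-dd test along these lines (a hypothetical reconstruction -- the actual test script was not included in the thread, so the device, process count, and sizes here are placeholders; on the system above the reads went to /dev/md0) might look like:

```shell
#!/bin/sh
# Hypothetical reconstruction of the parallel-dd read test.
# DEVICE and NPROCS are placeholders, not the thread's real values.
DEVICE=${DEVICE:-/dev/zero}
NPROCS=${NPROCS:-200}

i=0
while [ "$i" -lt "$NPROCS" ]; do
    # each reader streams sequentially and discards the data,
    # filling the page cache as it goes
    dd if="$DEVICE" of=/dev/null bs=1024k count=10 2>/dev/null &
    i=$((i + 1))
done
wait    # block until every dd has exited
echo "all $NPROCS readers done"
```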
--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA
Computers are like air conditioners.
They stop working when you open Windows.
> > > so did you actually try this with a bunch of parallel dd's?
> >
> > I just did now. Same result:
> >
> > [vmstat output identical to the listing above -- snipped]
>
> so help me out here a little. the first dozen or so lines look good,
> but obviously slowing down (which is a little odd). then, six lines from
> the end, the VM freaks out, tries swapping out, and everything goes to
> sleep. it's the going-to-sleep you're worried about, right?
Yes. I really need all available I/O here, not nasty bugs.
hm
trying to time the problem:
# init 6
...
# free
total used free shared buffers cached
Mem: 899712 75576 824136 0 4836 29408
-/+ buffers/cache: 41332 858380
Swap: 674720 0 674720
# dd-test ; vmstat -n 2 > vmstat &
# tail -f vmstat
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 200 0 0 767440 5036 36616 0 0 565 153 4337 276 9 20 71
0 200 0 0 733012 5412 69180 0 0 16324 0 603 628 0 6 94
0 200 0 0 667196 5428 132408 0 0 31488 56 678 530 1 12 87
0 200 0 0 600792 5580 196472 0 0 32114 0 683 536 2 10 88
0 200 0 0 528956 5616 265956 0 0 34708 30 678 482 0 15 85
0 200 0 0 451496 5724 340508 0 0 37338 0 724 640 1 16 82
0 200 0 0 383020 5936 406512 0 0 33096 8 699 689 2 11 87
0 200 0 0 301368 6032 485432 0 0 39464 0 726 522 2 15 83
0 200 0 0 216412 6124 567552 0 0 41092 0 698 613 2 17 81
0 200 0 0 131364 6248 649732 0 0 41162 8 722 701 2 18 80
0 200 0 0 52740 6372 725696 0 0 38028 0 721 461 2 14 84
0 200 1 2676 3264 2944 778932 0 308 44816 380 766 804 0 23 76
0 200 1 2676 3272 3032 778844 0 0 45562 0 764 642 2 17 81
0 200 1 2676 3292 3136 778712 0 0 39156 0 721 767 1 20 78
0 200 1 2676 3264 3260 778620 0 0 40664 8 738 624 2 11 86
0 200 0 2676 3212 3368 778480 0 0 37056 0 727 614 1 15 84
0 200 1 2676 3228 3464 778468 0 0 32052 8 654 743 2 12 86
0 200 1 2676 3196 3492 778472 0 2 30882 2 713 721 0 12 88
0 200 1 2676 3220 3556 778368 0 0 26490 0 698 739 1 10 89
0 200 0 2676 3224 3640 778212 0 0 25194 36 709 706 0 11 89
0 200 0 2676 3304 3692 778136 0 0 20998 0 678 732 1 6 93
0 200 0 2676 3272 3748 778108 0 0 19734 16 689 768 1 11 88
0 200 1 2676 3236 3780 778060 0 0 13708 0 644 748 0 6 94
0 200 0 2676 3196 3800 778076 0 0 9492 0 629 644 0 8 92
0 200 0 2676 3308 3828 778020 0 0 12978 8 664 727 1 6 93
0 200 0 2676 3512 3848 777716 0 14 11130 14 664 698 1 6 93
0 200 0 2804 3256 3860 777896 0 870 7074 878 677 674 0 3 96
0 200 0 2804 3288 3860 777856 0 0 1068 0 625 665 1 3 96
0 200 0 2804 3320 3868 777816 0 16 1068 24 627 667 1 2 97
0 200 0 2804 3212 3868 777928 0 84 1080 84 600 671 1 2 97
...and so on
This gives a total read of a little less than 800MB before giving up. Is
there a cache timeout that needs to be set any lower?
roy
> This gives a total read of a little less than 800MB before giving up. Is
> there a cache timeout that needs to be set any lower?
>
> roy
more testing
[root@linuxserver root]# swapoff -a
[root@linuxserver root]# free
total used free shared buffers cached
Mem: 899712 74504 825208 0 4832 29408
-/+ buffers/cache: 40264 859448
Swap: 0 0 0
[root@linuxserver root]# vmstat -n 2
blah blah blah
same result.
even more testing...
It seems like it's not the caching itself that is broken, as I can do
pretty good i/o from other devices, like /dev/hda. (/dev/md0 is /dev/hde
and /dev/hdg in RAID-0 - see my first email)
roy
> > ...and so on
> >
> > This gives a total read of a little less than 800MB before giving up. Is
> > there a cache timeout that needs to be set any lower?
>
> my observation is this: once you use up all your free memory, you have
> 30 seconds of reasonable behavior. 30 seconds is the default dirty-buffer
> age. are you tweaking /proc/sys/vm/bdflush at all? and no, I don't see
> why your application would have dirty buffers in the first place -
> I'm just noticing the ominous 30 seconds.
no bdflush tweaking...
this is the same as I've observed... use all memory, and everything
stalls.
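For reference, the bdflush tunables mentioned above live in a single proc file on 2.4 kernels; one of its nine fields, age_buffer, is the dirty-buffer age in jiffies (30 seconds by default -- the "ominous 30 seconds"). A guarded look, since the file only exists on 2.4-era kernels:

```shell
#!/bin/sh
# On 2.4 kernels /proc/sys/vm/bdflush holds nine numeric tunables;
# one of them, age_buffer, is the dirty-buffer age in jiffies.
# Guarded, since the file is gone on later kernels.
if [ -r /proc/sys/vm/bdflush ]; then
    cat /proc/sys/vm/bdflush
else
    echo "no /proc/sys/vm/bdflush here (2.4-era interface)"
fi
```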
> are you using elvtune?
er.. what's elvtune?
> also, which kernel?
2.4.16 + tux patches
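On the elvtune question above: it is a util-linux tool for the 2.4 I/O elevator that reads or sets per-device bounds on how long read and write requests may be postponed for merging and sorting. A guarded example (the device name and values are illustrative, not recommendations; the tool went away once 2.6 gained pluggable I/O schedulers):

```shell
#!/bin/sh
# elvtune (util-linux, 2.4-era) reads or sets the I/O elevator's
# read/write latency bounds per device; guard since it no longer
# ships on modern systems.
if command -v elvtune >/dev/null 2>&1; then
    elvtune /dev/hde                    # show current settings
    elvtune -r 128 -w 8192 /dev/hde     # illustrative values only
else
    echo "elvtune not available on this system"
fi
```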
> my observation is this: once you use up all your free memory, you have
> 30 seconds of reasonable behavior. 30 seconds is the default dirty-buffer
> age. are you tweaking /proc/sys/vm/bdflush at all? and no, I don't see
> why your application would have dirty buffers in the first place -
> I'm just noticing the ominous 30 seconds.
More testing with bootparam mem=xxx shows this is it. When all memory is
used, it fails to re-use old cache, or something.
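The mem= clamp is just a kernel boot parameter; with LILO it could be set along these lines (kernel path, label, and size here are illustrative, not the thread's actual values):

```
# /etc/lilo.conf fragment -- clamp RAM so the cache fills, and the
# problem reproduces, faster; run lilo after editing
image=/boot/vmlinuz-2.4.16
    label=lowmem
    append="mem=128M"
```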
On Fri, Dec 14 2001, Roy Sigurd Karlsbakk wrote:
> > my observation is this: once you use up all your free memory, you have
> > 30 seconds of reasonable behavior. 30 seconds is the default dirty-buffer
> > age. are you tweaking /proc/sys/vm/bdflush at all? and no, I don't see
> > why your application would have dirty buffers in the first place -
> > I'm just noticing the ominous 30 seconds.
>
> More testing with bootparam mem=xxx shows this is it. When all memory is
> used, it fails to re-use old cache, or something.
sysrq-t output from a "hung" system would give some valuable info as to
why it's stuck.
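On a 2.4.16 console, sysrq-t means enabling magic sysrq and pressing Alt-SysRq-T at the keyboard; later kernels also accept a write to /proc/sysrq-trigger. A guarded sketch (needs root, and the trigger file may not exist on a kernel this old):

```shell
#!/bin/sh
# Request a task-state dump (sysrq-t); the task list lands in the
# kernel log. /proc/sysrq-trigger is a later convenience, so guard
# for it -- on a 2.4.16 console, press Alt-SysRq-T instead.
if [ -w /proc/sys/kernel/sysrq ] && [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    dmesg 2>/dev/null | tail -n 20
else
    echo "need root (and /proc/sysrq-trigger) to do this from a shell"
fi
```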
--
Jens Axboe
> sysrq-t output from a "hung" system would give some valuable info as to
> why it's stuck.
It never hangs. It just slows down the i/o to ~ 1MB/s
also ... I started out with 200 'dd' processes (as the script I sent
earlier). The i/o gets stuck, and ... strange ...
# killall dd
watching vmstat
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 200 0 0 3280 4092 782072 0 0 1054 0 621 657 0 4 96
0 200 0 0 3284 4092 782108 0 0 1042 0 631 665 1 1 98
0 101 0 0 3672 4164 804600 0 0 17594 0 762 931 2 9 88
0 56 1 0 3248 4180 815332 0 0 32598 0 897 727 0 10 90
0 30 1 0 3276 4212 821640 0 0 33102 40 910 447 1 6 92
0 1 1 0 3252 4236 829080 0 0 31692 0 947 301 1 9 90
0 0 0 0 3296 4236 829320 0 0 2042 0 401 132 0 0 100
0 0 0 0 3288 4244 829320 0 0 0 26 360 132 0 0 100
0 0 0 0 3288 4244 829320 0 0 0 0 326 127 0 0 100
0 0 0 0 3288 4244 829320 0 0 0 0 358 127 0 0 100
It speeds up again when the processes start to die, although the number of
processes started has no effect on the outcome. After a while the i/o
stops up completely (see the first two lines of the output above).
I also noted quite a dramatic speed reduction in 2.4.17-rc1 compared to
2.4.16: 2.4.16 gave me up to 50 MB/s, whereas 2.4.17-rc1 only gives
me around 35.
roy