2003-01-24 16:39:12

by David Mansfield

Subject: 2.5.59mm5, raid1 resync speed regression.


Hi Andrew, list,

I'm booting 2.5.59mm5 to run a database workload benchmark that I've been
running against various kernels. I'll post those results later if they turn
out to be interesting, but I did notice that the raid1 resync is proceeding
at half the speed (at best) that it usually does (compared with 2.5.59, that
is).

It's currently running at about 4-8 MB/sec (and falling as the resync
progresses); it usually runs at 12-15 MB/sec.

System is SMP 2x PIII 866 MHz, 2GB RAM. The raid1 is two 15k U160 SCSI disks
(running at only Ultra speed :-( because the onboard controller sucks) on the
same aic7xxx channel.

Kernel is 2.5.59-mm5 compiled with gcc version 2.96 20000731 (Red Hat
Linux 7.3 2.96-112)

David

--
/==============================\
| David Mansfield |
| [email protected] |
\==============================/


2003-01-24 16:46:44

by Nick Piggin

Subject: Re: 2.5.59mm5, raid1 resync speed regression.

David Mansfield wrote:

>Hi Andrew, list,
>
>I'm booting 2.5.59mm5 to run a database workload benchmark that I've been
>running against various kernels. I'll post those results if they are
>interesting later, but I did notice that the raid1 resync is proceeding at
>half the speed (at best) that it usually does (vs. 2.5.59 that is).
>
>It currently at about 4-8 mb/sec (and falling as resync progresses),
>usually at 12-15 mb/sec.
>
>System is SMP 2xPIII 866mhz, 2GB ram, raid1 is two 15k U160 (running only
>an Ultra speed :-( because the onboard controller sucks) SCSI disks, same
>channel on aic7xxx.
>
>Kernel is 2.5.59-mm5 compiled with gcc version 2.96 20000731 (Red Hat
>Linux 7.3 2.96-112)
>
>David
>
Thanks for the report. Please do post any results you get.

What disk workload exactly does a RAID1 resync consist of?

Nick

2003-01-24 18:09:49

by David Mansfield

Subject: Re: 2.5.59mm5, raid1 resync speed regression.

> David Mansfield wrote:
>
> >Hi Andrew, list,
> >
> >I'm booting 2.5.59mm5 to run a database workload benchmark that I've been
> >running against various kernels. I'll post those results if they are
> >interesting later, but I did notice that the raid1 resync is proceeding at
> >half the speed (at best) that it usually does (vs. 2.5.59 that is).
> >
> >It currently at about 4-8 mb/sec (and falling as resync progresses),
> >usually at 12-15 mb/sec.
> >
> >System is SMP 2xPIII 866mhz, 2GB ram, raid1 is two 15k U160 (running only
> >an Ultra speed :-( because the onboard controller sucks) SCSI disks, same
> >channel on aic7xxx.
> >
> >Kernel is 2.5.59-mm5 compiled with gcc version 2.96 20000731 (Red Hat
> >Linux 7.3 2.96-112)
> >
> >David
> >
> Thanks for the report. Please do post any results you get.
>
> What disk workload exactly does a RAID1 resync consist of?
>

Well, I don't know the internals of it, but it goes something like this:

decide which half of the mirror is more current, read blocks from that
partition, and write them to the other. Periodically the raid superblock
gets updated, or something like that. The partitions in my case are on
separate SCSI disks.

The thing about it is, it attempts to throttle the sync speed so that it
doesn't interfere too much with normal operation of the system (a background
resync could suck up all the I/O 'cycles' and make the system unusable) by
monitoring the number of requests going through the raid device itself. The
sysadmin can set a 'speed limit' in /proc to control this, but I have it set
really high, so it *should* be syncing at max speed regardless of any I/O
happening to the raid device itself.
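
To give an idea of what I mean, here's a rough userspace sketch of how I
*imagine* the throttle decision works. This is a guess from the outside, not
the actual md code; the knobs do live in /proc/sys/dev/raid/speed_limit_min
and speed_limit_max (KB/sec), but everything else here is made up:

/* Illustrative only -- a guess at the shape of the throttle logic,
 * not the real md resync code.
 */
#include <stdio.h>

struct resync_sample {
        unsigned long kb_done;          /* KB resynced so far */
        unsigned long seconds;          /* elapsed wall-clock seconds */
        int other_io_seen;              /* any non-resync I/O on the array? */
};

static int should_throttle(const struct resync_sample *s,
                           unsigned long speed_min, unsigned long speed_max)
{
        unsigned long kb_per_sec = s->kb_done / (s->seconds ? s->seconds : 1);

        /* Above the minimum speed, back off whenever the maximum is
         * exceeded OR whenever other I/O is seen on the array.  The
         * second clause is where I suspect the interaction is: even a
         * tiny burst of normal I/O keeps knocking the resync back,
         * no matter how high speed_max is set.
         */
        return kb_per_sec > speed_min &&
               (kb_per_sec > speed_max || s->other_io_seen);
}

int main(void)
{
        struct resync_sample s = { 300000, 60, 1 };     /* ~5 MB/sec, busy array */

        printf("throttle? %s\n",
               should_throttle(&s, 1000, 200000) ? "yes" : "no");
        return 0;
}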

So it's a bit complicated. You'd have to look at the code or ask someone
(Neil Brown) who knows more about it.

.... I'm rebooting and looking at it again. Here's something strange: if I
let the system sit completely idle, the resync speed climbs back almost to
the 'normal' rate, but generating even minor disk activity in another window
makes the rate plummet for minutes.

I think there's some strange interaction with the speed-limit code in the
raid1 resync.

David

P.S. I'll post my benchmark data if/when it's available.


--
/==============================\
| David Mansfield |
| [email protected] |
\==============================/

2003-01-24 18:32:23

by Nick Piggin

Subject: Re: 2.5.59mm5, raid1 resync speed regression.

David Mansfield wrote:

>>David Mansfield wrote:
>>
>>
>>>Hi Andrew, list,
>>>
>>>I'm booting 2.5.59mm5 to run a database workload benchmark that I've been
>>>running against various kernels. I'll post those results if they are
>>>interesting later, but I did notice that the raid1 resync is proceeding at
>>>half the speed (at best) that it usually does (vs. 2.5.59 that is).
>>>
>>>It currently at about 4-8 mb/sec (and falling as resync progresses),
>>>usually at 12-15 mb/sec.
>>>
>>>System is SMP 2xPIII 866mhz, 2GB ram, raid1 is two 15k U160 (running only
>>>an Ultra speed :-( because the onboard controller sucks) SCSI disks, same
>>>channel on aic7xxx.
>>>
>>>Kernel is 2.5.59-mm5 compiled with gcc version 2.96 20000731 (Red Hat
>>>Linux 7.3 2.96-112)
>>>
>>>David
>>>
>>>
>>Thanks for the report. Please do post any results you get.
>>
>>What disk workload exactly does a RAID1 resync consist of?
>>
>>
>
>Well, I don't know the internals of it, but it goes something like:
>
>decide which half of the mirror is more current. Read blocks from this
>partition, write to other. Periodically update raid-superblock or
>something. The partitions in my case are on separate SCSI disks.
>
>The thing about it is, it attempts to throttle the sync speed to not
>interfere too much with operation of the system (background resync could
>suck up all i/o 'cycles' and make a system unusable) by monitoring the
>amount of requests through the raid device itself. The sysadmin can set a
>'speed limit' in /proc to control this, but I have it really high, so it
>*should* be syncing at max speed regardless of any i/o happening to the
>raid device itself.
>
>So it's a bit complicated. You'd have to look at the code or ask someone
>(Neil Brown) who knows more about it.
>
>.... I'm rebooting and looking at it again. Here's something strange, if
>I let the system sit completely idle, the resync speed increases almost to
>the 'normal' rate, but causing any (minor) disk activity in another window
>causes the rate to plummet for minutes.
>
>I think there's some strange interaction with the speed-limit code in the
>raid1 resync.
>
Perhaps. I think there is something up with request expiry that might
cause a disk to choke up like this. Especially writes. I'll fix that
over the weekend if I can.

>
>
>David
>
>P.S. I'll post my benchmark date if/when available.
>
>
>
>

2003-01-24 22:25:43

by David Mansfield

Subject: 2.5.59mm5 database 'benchmark' results


Hi Nick, Andrew, lists,

I've been testing some recent kernels to see how they compare with a
particular database workload. The workload is actually part of our
production process (last months run) but on a test server. I'll describe
the platform and the workload, but first, the results :-)

kernel              minutes  comment
------------------  -------  ----------------------------------
2.4.20-aa1          134      I consider this 'baseline'
2.5.59              124      woo-hoo
2.4.18-19.7.xsmp    128      not bad for Frankenstein's monster
2.5.59-mm5          157      uh-oh

Platform:
HP LH3000 U3. Dual 866 MHz Intel Pentium III, 2GB RAM. MegaRAID controller
with two channels; each channel is a RAID 5 PV on 6 15k SCSI disks, with one
LV per megaraid PV.

Two plain disks with pairs of partitions in raid1: one pair for the OS
(Red Hat 7.3), and a second pair for the Oracle redo log (in a log 'group').

Oracle version 8.1.7 (no aio support in this release) is accessing
datafiles on the two megaraid devices via /dev/raw stacked on top of
device-mapper.

Workload:
The workload consists of a few different phases.

1) Indexing: multiple indexes built against a 9 million row table. This
is mostly about sequential scans of a single table, with bursts of write
activity. 50 minutes or so.

2) Analyzing: The database scans tables and
builds statistics. Most of the time is spent analyzing the 9 million row
table. This is a completely cpu bound step on our underpowered system.
30 minutes.

3) Summarization: the large table is aggregated in about 100
different ways. Records are generated for each different summarization.
This is mixed read-write load. 50 minutes or so.

I'll test any kernel you throw my way.

David



--
/==============================\
| David Mansfield |
| [email protected] |
\==============================/

2003-01-24 22:31:39

by Mitchell Blank Jr

Subject: Re: 2.5.59mm5, raid1 resync speed regression.

David Mansfield wrote:
> decide which half of the mirror is more current. Read blocks from this
> partition, write to other. Periodically update raid-superblock or
> something.

Well, I haven't looked at the code (or the academic paper it's based on), but
this makes some sense based on Andrew's description of the new algorithm.
It sounds like the RAID resync is reading/writing the same block and
(I'm theorizing here) is doing the write back synchronously, so it gets
delayed by the anticipatory post-read delay. Does this sound possible?

(Or does it just blindly write over the entire "old" mirror? In that case I
don't know how that would affect the RAID resync. I think this scenario might
be possible under at least some workloads, though; read on...)

One idea (assuming it doesn't do it already) would be to cancel the post-read
delay if we get a synchronous (or maybe any) write for the same (or very near)
block. The rationale would be that it's likely from the same application and
the short seek will be cheap anyway (there's a high probability the drive's
track-buffer will take care of it).
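
Roughly what I'm picturing, as a sketch only (invented names, nothing here
is from the current scheduler):

/* Called while the elevator is idling in its post-read anticipation
 * window: should this new request break the window and be dispatched?
 */
#define NEARBY_SECTORS 128      /* "very near": probably still in the track buffer */

struct pending_request {
        unsigned long sector;
        int is_write;
        int is_sync;            /* e.g. an O_SYNC write, or a write-back of a just-read block */
};

static int should_break_anticipation(unsigned long last_read_sector,
                                     const struct pending_request *rq)
{
        unsigned long dist = rq->sector > last_read_sector
                           ? rq->sector - last_read_sector
                           : last_read_sector - rq->sector;

        /* A nearby read is what the anticipation was waiting for anyway. */
        if (!rq->is_write)
                return dist <= NEARBY_SECTORS;

        /* The tweak: a synchronous write right next to the block we just
         * read is probably from the same application (e.g. the resync
         * writing back what it just read), and the seek is nearly free,
         * so stop waiting and let it go.
         */
        return rq->is_sync && dist <= NEARBY_SECTORS;
}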

Questions:

* Has anyone tested what happens when an application is alternately doing
reads and sync-writes to the same large file/partition? (I'm thinking
about databases here.) It sounds like an algorithm like this could slow
them down; a rough test sketch follows this list.

* What are the current heuristics used to prevent write-starvation? Do
they need to be tuned now that reads are getting even more of an
advantage?

* Has anyone looked at other "hints" that the higher levels can give the
I/O scheduler to indicate that a post-read delay is not likely to be
fruitful (like from syscalls such as close() or exit())? Obviously
communicating these all the way down to the I/O scheduler would be
tricky, but it might be worth at least thinking about.
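
For the first question, the sort of test I have in mind would look something
like this (a quick, untested sketch; note it clobbers the file you point it
at):

/* Alternate a read with a synchronous write on the same large file and
 * time it -- a crude way to see whether the post-read delay penalizes
 * this pattern.  Sketch only; destroys the contents of the test file.
 */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK   65536
#define ROUNDS  1024

int main(int argc, char **argv)
{
        static char buf[BLOCK];
        time_t start;
        int fd, i;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <test-file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDWR | O_SYNC);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        memset(buf, 0xaa, sizeof(buf));

        start = time(NULL);
        for (i = 0; i < ROUNDS; i++) {
                /* read one block, then synchronously write the next one */
                if (pread(fd, buf, BLOCK, (off_t)i * 2 * BLOCK) < 0)
                        perror("pread");
                if (pwrite(fd, buf, BLOCK, (off_t)(i * 2 + 1) * BLOCK) < 0)
                        perror("pwrite");
        }
        printf("%d read/write pairs in %ld seconds\n",
               ROUNDS, (long)(time(NULL) - start));
        close(fd);
        return 0;
}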

Also, would it be possible to get profile data about these post-read delays?
Specifically it would be good to know:

1. How many expired without another nearby read happening
1a. How many of those had other I/O waiting to go when they expired
2. How many were cut short by another nearby read (i.e. a "success")
3. How many were cut short by some other heuristic (like described above)

That way we could see how much these delays are helping or hurting various
workloads.
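
Even four dumb counters would do; hypothetically (nothing like this exists
as far as I know, the names are made up):

/* Hypothetical stats, matching cases 1, 1a, 2 and 3 above. */
struct antic_stats {
        unsigned long expired_no_read;          /* 1:  window timed out, no nearby read */
        unsigned long expired_io_waiting;       /* 1a: ...and other I/O was queued behind it */
        unsigned long hit_by_nearby_read;       /* 2:  a nearby read arrived (a "success") */
        unsigned long broken_by_heuristic;      /* 3:  cut short by some other heuristic */
};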

Sorry if any of this is obvious or already implemented - these are just my
first thoughts after reading the announcement. Sounds like really
interesting work though.

-Mitch

2003-01-24 22:36:43

by Andrew Morton

Subject: Re: 2.5.59mm5 database 'benchmark' results

David Mansfield <[email protected]> wrote:
>
>
> Hi Nick, Andrew, lists,
>
> I've been testing some recent kernels to see how they compare with a
> particular database workload. The workload is actually part of our
> production process (last months run) but on a test server. I'll describe
> the platform and the workload, but first, the results :-)
>
> kernel minutes comment
> ------------- ----------- ---------------------------------
> 2.4.20-aa1 134 i consider this 'baseline'
> 2.5.59 124 woo-hoo
> 2.4.18-19.7.xsmp 128 not bad for frankenstein's montster
> 2.5.59-mm5 157 uh-oh
>
> Platform:
> HP LH3000 U3. Dual 866 Mhz Intel Pentium III, 2GB ram. megaraid
> controller with two channels, each channel raid 5 PV on 6 15k scsi disks,
> one megaraid LV per PV.
>
> Two plain disks w/pairs of partitions in raid 1 for OS (redhat 7.3), a
> second pair for Oracle redo-log (in a log 'group').
>
> Oracle version 8.1.7 (no aio support in this release) is accessing
> datafiles on the two megaraid devices via /dev/raw stacked on top of
> device-mapper

Rather impressed that you got all that to work ;)

It does appear that the IO scheduler change is not playing nicely with
software RAID.

> I'll test any kernel you throw my way.

Thanks. Could you please try 2.5.59-mm5, with

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/anticipatory_io_scheduling-2_5_59-mm3.patch

reverted?


2003-01-27 15:05:09

by David Mansfield

Subject: Re: 2.5.59mm5 database 'benchmark' results

On Fri, 24 Jan 2003, Andrew Morton wrote:

> > kernel minutes comment
> > ------------- ----------- ---------------------------------
> > 2.4.20-aa1 134 i consider this 'baseline'
> > 2.5.59 124 woo-hoo
> > 2.4.18-19.7.xsmp 128 not bad for frankenstein's montster
> > 2.5.59-mm5 157 uh-oh
> >

> > Oracle version 8.1.7 (no aio support in this release) is accessing
> > datafiles on the two megaraid devices via /dev/raw stacked on top of
> > device-mapper
>
> Rather impressed that you got all that to work ;)
>

Me too. It's still got some rough edges, but 2.5.59 was the first version
that made it all the way through; earlier ones failed for one reason or
another. I'm very impressed personally.

>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/broken-out/anticipatory_io_scheduling-2_5_59-mm3.patch

Ok. The results are basically the same as 2.5.59 vanilla:

kernel                          minutes
------------------------------  -------
2.5.59-mm5-no-anticipatory-io   125

Anything else?

David

--
/==============================\
| David Mansfield |
| [email protected] |
\==============================/