2001-07-11 00:45:38

by Brian Strand

Subject: 2x Oracle slowdown from 2.2.16 to 2.4.4

We are running 3 Oracle servers, each dual-CPU, one with 1GB and two with
2GB of memory, and between 36 and 180GB of RAID. On June 26, I upgraded all
boxes from SuSE 7.0 to SuSE 7.2 (going from kernel version 2.2.16-40 to
2.4.4-14). Reviewing Oracle job times (jobs range from a few minutes to 10
hours) before and after, jobs are taking almost exactly twice as long after
the upgrade as before. Nothing in the hardware or Oracle configuration has
changed on any server. Does anyone have any ideas as to what might cause
this?

Thanks,
Brian Strand
CTO Switch Management



2001-07-11 01:15:57

by Andrea Arcangeli

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Tue, Jul 10, 2001 at 05:45:16PM -0700, Brian Strand wrote:
> We are running 3 Oracle servers, each dual-CPU, one with 1GB and two with
> 2GB of memory, and between 36 and 180GB of RAID. On June 26, I upgraded all
> boxes from SuSE 7.0 to SuSE 7.2 (going from kernel version 2.2.16-40 to
> 2.4.4-14). Reviewing Oracle job times (jobs range from a few minutes to 10
> hours) before and after, jobs are taking almost exactly twice as long after
> the upgrade as before. Nothing in the hardware or Oracle configuration has
> changed on any server. Does anyone have any ideas as to what might cause
> this?

We need to narrow the problem down. How are you using Oracle? Through a
filesystem? If so, which one? Or with rawio? Is your workload cached
most of the time or not?

thanks,
Andrea

2001-07-11 01:54:59

by Jeff V. Merkey

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Tue, Jul 10, 2001 at 05:45:16PM -0700, Brian Strand wrote:
> We are running 3 Oracle servers, each dual-CPU, one with 1GB and two with
> 2GB of memory, and between 36 and 180GB of RAID. On June 26, I upgraded all
> boxes from SuSE 7.0 to SuSE 7.2 (going from kernel version 2.2.16-40 to
> 2.4.4-14). Reviewing Oracle job times (jobs range from a few minutes to 10
> hours) before and after, jobs are taking almost exactly twice as long after
> the upgrade as before. Nothing in the hardware or Oracle configuration has
> changed on any server. Does anyone have any ideas as to what might cause
> this?
>
> Thanks,
> Brian Strand
> CTO Switch Management
>
>

Oracle performance depends critically on fast disk access. Oracle is
virtually self-contained with regard to the subsystems it uses -- it
provides most of its own. Oracle slowdowns are related either to
problems in the networking software for remote SQL operations, or to
disk access with regard to jobs run locally. If it's slower for local
SQL processing as well as remote, I would suspect a problem with the
low-level disk interface.

Jeff



2001-07-11 01:55:59

by Jeff V. Merkey

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Tue, Jul 10, 2001 at 05:45:16PM -0700, Brian Strand wrote:

Van,

Can you help this person?

Jeff


> We are running 3 Oracle servers, each dual-CPU, one with 1GB and two with
> 2GB of memory, and between 36 and 180GB of RAID. On June 26, I upgraded all
> boxes from SuSE 7.0 to SuSE 7.2 (going from kernel version 2.2.16-40 to
> 2.4.4-14). Reviewing Oracle job times (jobs range from a few minutes to 10
> hours) before and after, jobs are taking almost exactly twice as long after
> the upgrade as before. Nothing in the hardware or Oracle configuration has
> changed on any server. Does anyone have any ideas as to what might cause
> this?
>
> Thanks,
> Brian Strand
> CTO Switch Management
>
>

2001-07-11 15:56:13

by Brian Strand

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4



Jeff V. Merkey wrote:

>Oracle performance depends critically on fast disk access. Oracle is
>virtually self-contained with regard to the subsystems it uses -- it
>provides most of its own. Oracle slowdowns are related either to
>problems in the networking software for remote SQL operations, or to
>disk access with regard to jobs run locally. If it's slower for local
>SQL processing as well as remote, I would suspect a problem with the
>low-level disk interface.
>
Our Oracle jobs are almost entirely local (we got rid of all network
access for performance reasons months ago). Before the upgrade to
2.4.4, they were running well enough, but now (with the only change
being the SuSE upgrade from 7.0 to 7.2) they are taking twice as long.
I am slightly suspicious of the kernel, as a lot of swapping is happening
now that was not happening before on an identical workload. I am
trying out 2.4.6-2 (from Hubert Mantel's builds) today to see if VM
behavior improves.

Many Thanks,
Brian Strand
CTO Switch Management


2001-07-11 16:44:39

by Brian Strand

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Andrea Arcangeli wrote:

>We need to narrow the problem down. How are you using Oracle? Through a
>filesystem? If so, which one? Or with rawio? Is your workload cached
>most of the time or not?
>
Our Oracle configuration is on reiserfs on lvm on Mylex. Our workload
is not entirely cached, as we are working against an 8GB table, Oracle
is configured to use slightly more than 1GB of memory, and there is
always several MB/s of IO going on during our queries. The "working
set" of the main table and indexes occupies over 2GB.

Many Thanks,
Brian Strand
CTO Switch Management


2001-07-11 17:08:21

by Andrea Arcangeli

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Wed, Jul 11, 2001 at 09:44:19AM -0700, Brian Strand wrote:
> Our Oracle configuration is on reiserfs on lvm on Mylex. Our workload
> is not entirely cached, as we are working against an 8GB table, Oracle
> is configured to use slightly more than 1GB of memory, and there is
> always several MB/s of IO going on during our queries. The "working
> set" of the main table and indexes occupies over 2GB.

As I suspected, the VM is in our way. Also reiserfs could be an issue,
but I am not aware of any regression on the reiserfs side, Chris?

I tend to believe it is a VM regression (and I admit, this is what I
would have bet on as soon as I read your report, before being sure the
VM was in our way).

One way to verify this could be to run Oracle on top of rawio and then
on ext2. If it's the VM you should still get the slowdown on ext2 too,
and you should run as fast as 2.2 with rawio. Most people use Oracle on
top of rawio on top of lvm, and incidentally this was the first
slowdown report I got about 2.4 when compared to 2.2.

Andrea

2001-07-11 17:25:02

by Chris Mason

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4



On Wednesday, July 11, 2001 07:08:21 PM +0200 Andrea Arcangeli
<[email protected]> wrote:

> On Wed, Jul 11, 2001 at 09:44:19AM -0700, Brian Strand wrote:
>> Our Oracle configuration is on reiserfs on lvm on Mylex. Our workload
>> is not entirely cached, as we are working against an 8GB table, Oracle
>> is configured to use slightly more than 1GB of memory, and there is
>> always several MB/s of IO going on during our queries. The "working
>> set" of the main table and indexes occupies over 2GB.
>
> As I suspected, the VM is in our way. Also reiserfs could be an issue,
> but I am not aware of any regression on the reiserfs side, Chris?

reiserfs has a big O_SYNC penalty right now, which can be fixed by a
transaction tracking patch I posted a month or so ago. A few people have
tested it and seen a large improvement. Brian, I'll update it to 2.4.6
and send it along.

-chris

2001-07-11 22:53:35

by Lance Larsh

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Wed, 11 Jul 2001, Brian Strand wrote:

> Our Oracle configuration is on reiserfs on lvm on Mylex.

I can pretty much tell you it's the reiser+lvm combination that is hurting
you here. At the 2.5 kernel summit a few months back, I reported that
some of our servers experienced as much as 10-15x slowdown after we moved
to 2.4. As it turned out, the problem was that the new servers (with
identical hardware to the old servers) were configured to use reiser+lvm,
whereas the older servers were using ext2 without lvm. When we rebuilt
the new servers with ext2 alone, the problem disappeared. (Note that we
also tried reiserfs without lvm, which was 5-6x slower than ext2 without
lvm.)

I ran lots of iozone tests which illustrated a huge difference in write
throughput between reiser and ext2. Chris Mason sent me a patch which
improved the reiser case (removing an unnecessary commit), but it was
still noticeably slower than ext2. Therefore I would recommend that
at this time reiser should not be used for Oracle database files.

Thanks,
Lance

2001-07-11 23:46:57

by Brian Strand

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Lance Larsh wrote:

>On Wed, 11 Jul 2001, Brian Strand wrote:
>
>>Our Oracle configuration is on reiserfs on lvm on Mylex.
>>
>I can pretty much tell you it's the reiser+lvm combination that is hurting
>you here. At the 2.5 kernel summit a few months back, I reported that
>
Why did it get so much worse going from 2.2.16 to 2.4.4, with an
otherwise-identical configuration? We had reiserfs+lvm under 2.2.16 too.

>
>some of our servers experienced as much as 10-15x slowdown after we moved
>to 2.4. As it turned out, the problem was that the new servers (with
>identical hardware to the old servers) were configured to use reiser+lvm,
>whereas the older servers were using ext2 without lvm. When we rebuilt
>the new servers with ext2 alone, the problem disappeared. (Note that we
>also tried reiserfs without lvm, which was 5-6x slower than ext2 without
>lvm.)
>
>I ran lots of iozone tests which illustrated a huge difference in write
>throughput between reiser and ext2. Chris Mason sent me a patch which
>improved the reiser case (removing an unnecessary commit), but it was
>still noticeably slower than ext2. Therefore I would recommend that
>at this time reiser should not be used for Oracle database files.
>
How do ext2+lvm, rawio+lvm, ext2 w/o lvm, and rawio w/o lvm compare in
terms of Oracle performance? I am going to try a migration if 2.4.6
doesn't make everything better; do you have any suggestions as to the
relative performance of each strategy?

Thanks,
Brian


2001-07-12 00:24:53

by Chris Mason

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4



On Wednesday, July 11, 2001 04:03:09 PM -0700 Lance Larsh
<[email protected]> wrote:

> I ran lots of iozone tests which illustrated a huge difference in write
> throughput between reiser and ext2. Chris Mason sent me a patch which
> improved the reiser case (removing an unnecessary commit), but it was
> still noticeably slower than ext2. Therefore I would recommend that
> at this time reiser should not be used for Oracle database files.
>

Hi Lance,

Could I get a copy of the results from the last benchmark you ran (with the
patch + noatime on reiserfs)? I'd like to close that gap...

-chris

2001-07-12 02:30:55

by Andrea Arcangeli

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Wed, Jul 11, 2001 at 04:03:09PM -0700, Lance Larsh wrote:
> some of our servers experienced as much as 10-15x slowdown after we moved
[..]
> also tried reiserfs without lvm, which was 5-6x slower than ext2 without

Hmm, so lvm introduced a significant slowdown too.

The only thing that scares me about lvm is the down() in the ll_rw_block
and submit_bh fast paths, which should *obviously* be converted to a rwsem
(the write lock is needed only while moving PVs around or while taking a
COW in a snapshotted device). This way the fast-path common cases will
never wait for a lock. We inherited those non-rw semaphores from the
latest lvm release (the only thing more recent than beta7 is the head CVS).
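
A minimal sketch of that conversion (the names here are illustrative, not
the actual lvm.c identifiers, and error handling is omitted):

#include <linux/rwsem.h>
#include <linux/fs.h>		/* struct buffer_head */

static DECLARE_RWSEM(pe_sem);	/* assumed per-VG semaphore name */

/* fast path: every remapped bh only takes the read side, so concurrent
 * remappers never serialize against each other and only sleep if a
 * writer is active */
static int lvm_map_sketch(struct buffer_head *bh, int rw)
{
	down_read(&pe_sem);
	/* ... remap bh->b_rdev / bh->b_rsector onto the right PV ... */
	up_read(&pe_sem);
	return 1;
}

/* slow path: only a PV move or a snapshot COW takes the write side */
static void pe_move_sketch(void)
{
	down_write(&pe_sem);
	/* ... update the LE->PE mapping or do the COW ... */
	up_write(&pe_sem);
}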

The down()s of beta7 fix race conditions present in previous releases,
so they weren't pointless, but they were obviously a suboptimal fix. When
I saw them I was scared, but it was hard to tell if they could hurt in
real life, and since until today nobody had said anything bad about lvm
performance I assumed it wasn't a problem; now something has changed
thanks to your feedback.

I will soon somehow make those changes in the lvm (based on beta7) in my
tree and it will be interesting to see if this will make a difference. I
will also have a look to see if I can improve a little more the lvm_map
but other than those non rw semaphores there should be not a significant
overhead to remove in the lvm fast path.

Andrea

PS. hint: if the down() were the problem you should also see a higher
context switching rate with lvm+ext2 than with plain ext2.

2001-07-12 06:13:24

by parviz dey

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Hey Lance,

Interesting stuff! Did you ever find out why this happens? Any ideas?

--- Lance Larsh <[email protected]> wrote:
> On Wed, 11 Jul 2001, Brian Strand wrote:
>
> > Our Oracle configuration is on reiserfs on lvm on Mylex.
>
> I can pretty much tell you it's the reiser+lvm combination that is hurting
> you here. At the 2.5 kernel summit a few months back, I reported that
> some of our servers experienced as much as 10-15x slowdown after we moved
> to 2.4. As it turned out, the problem was that the new servers (with
> identical hardware to the old servers) were configured to use reiser+lvm,
> whereas the older servers were using ext2 without lvm. When we rebuilt
> the new servers with ext2 alone, the problem disappeared. (Note that we
> also tried reiserfs without lvm, which was 5-6x slower than ext2 without
> lvm.)
>
> I ran lots of iozone tests which illustrated a huge difference in write
> throughput between reiser and ext2. Chris Mason sent me a patch which
> improved the reiser case (removing an unnecessary commit), but it was
> still noticeably slower than ext2. Therefore I would recommend that
> at this time reiser should not be used for Oracle database files.
>
> Thanks,
> Lance



2001-07-12 09:26:51

by Andi Kleen

Subject: Re: [lvm-devel] Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Thu, Jul 12, 2001 at 04:30:46AM +0200, Andrea Arcangeli wrote:
> I will soon somehow make those changes in the lvm (based on beta7) in my
> tree and it will be interesting to see if this will make a difference. I
> will also have a look to see if I can improve a little more the lvm_map
> but other than those non rw semaphores there should be not a significant
> overhead to remove in the lvm fast path.

Even if you fix the snapshot_sem you still have the down on the _pe_lock
in lvm_map. The part covered by the PE lock is only a few tens of cycles
shorter than the part covered by the snapshot semaphore; so it is unlikely
that you see much difference unless you change both to rwsems.

Wouldn't a single semaphore be enough BTW to cover both?


-Andi



2001-07-12 09:45:18

by Andrea Arcangeli

Subject: Re: [lvm-devel] Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Thu, Jul 12, 2001 at 11:26:13AM +0200, Andi Kleen wrote:
> On Thu, Jul 12, 2001 at 04:30:46AM +0200, Andrea Arcangeli wrote:
> > I will soon somehow make those changes in the lvm (based on beta7) in my
> > tree and it will be interesting to see if this will make a difference. I
> > will also have a look to see if I can improve a little more the lvm_map
> > but other than those non rw semaphores there should be not a significant
^ both
> > overhead to remove in the lvm fast path.
>
> Even if you fix the snapshot_sem you still have the down on the _pe_lock
> in lvm_map. The part covered by the PE lock is only a few tens of cycles
> shorter than the part covered by the snapshot semaphore; so it is unlikely
> that you see much difference unless you change both to rwsems.

See the 's' above -- plural. In case it was not obvious, I meant "all the
semaphores in the fast path", not just one; of course doing just one
would have been nearly useless.

Both semaphore_S_ have just been converted to rwsems in 2.4.7pre6aa1, so
the fast path *cannot* block any longer in my current tree.

> Wouldn't a single semaphore be enough BTW to cover both?

Actually the _pe_lock is global and it's held for a short time, so it
can make some sense. And if you look closely you'll see that _pe_lock
should _definitely_ be a rw_spinlock, not a rw_semaphore. I didn't
change that though, just to keep the patch smaller and to avoid changing
the semantics of the lock; the only thing that matters for us is to
never block and to have a fast read fast path, which is provided just
fine by the rwsem (I'll leave the s/sem/spinlock/ to the CVS).
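
For comparison, a rough sketch of the rw_spinlock variant (the function
and variable names here are made up, not the lvm.c ones):

#include <linux/spinlock.h>

static rwlock_t pe_rwlock = RW_LOCK_UNLOCKED;	/* assumed name */
static int pe_lock_state;			/* stands in for pe_lock_req.lock */

/* fast path: read side only, never sleeps and never schedules */
static int pe_move_in_progress_sketch(void)
{
	int locked;

	read_lock(&pe_rwlock);
	locked = pe_lock_state;
	read_unlock(&pe_rwlock);
	return locked;
}

/* PE lock/unlock ioctl: exclusive, still non-sleeping */
static void pe_set_lock_sketch(int state)
{
	write_lock(&pe_rwlock);
	pe_lock_state = state;
	write_unlock(&pe_rwlock);
}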

Andrea

2001-07-12 10:14:54

by Andi Kleen

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Lance Larsh <[email protected]> writes:
>
> I ran lots of iozone tests which illustrated a huge difference in write
> throughput between reiser and ext2. Chris Mason sent me a patch which
> improved the reiser case (removing an unnecessary commit), but it was
> still noticeably slower than ext2. Therefore I would recommend that
> at this time reiser should not be used for Oracle database files.

If I read the 2.4.6 reiserfs code correctly, reiserfs does not cause
any transactions for reads/writes to allocated blocks; i.e. when you're not
extending the file, not filling holes and not updating atimes.
My understanding is that this is normally true for Oracle, but probably
not for iozone, so it would be better if you benchmarked random writes
to an already allocated file.
The 2.4 page cache is more or less direct write-through in this case.
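
Something along these lines would exercise the rewrite case described
above -- a purely illustrative userspace sketch (file name, sizes and
iteration counts are made up) that preallocates a file and then times
O_SYNC rewrites of already-allocated blocks:

#define _XOPEN_SOURCE 500	/* for pwrite() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSZ	4096
#define NBLOCKS	2048		/* 8MB test file */
#define PASSES	512

int main(void)
{
	char buf[BLKSZ];
	struct timeval t0, t1;
	double secs;
	int fd, i;

	memset(buf, 0xab, sizeof(buf));

	/* preallocate with plain buffered writes, then push it all to disk */
	fd = open("rewrite.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) { perror("open"); exit(1); }
	for (i = 0; i < NBLOCKS; i++)
		if (write(fd, buf, BLKSZ) != BLKSZ) { perror("write"); exit(1); }
	fsync(fd);
	close(fd);

	/* now rewrite already-allocated blocks through O_SYNC, the pattern
	 * the Oracle datafile writes discussed in this thread boil down to */
	fd = open("rewrite.dat", O_RDWR | O_SYNC);
	if (fd < 0) { perror("open O_SYNC"); exit(1); }
	srand(42);
	gettimeofday(&t0, NULL);
	for (i = 0; i < PASSES; i++) {
		off_t off = (off_t)(rand() % NBLOCKS) * BLKSZ;
		if (pwrite(fd, buf, BLKSZ, off) != BLKSZ) {
			perror("pwrite");
			exit(1);
		}
	}
	gettimeofday(&t1, NULL);
	close(fd);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%d synchronous %d-byte rewrites in %.2fs (%.1f writes/s)\n",
	       PASSES, BLKSZ, secs, PASSES / secs);
	return 0;
}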

-Andi

2001-07-12 14:24:00

by Chris Mason

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4



On Thursday, July 12, 2001 12:14:16 PM +0200 Andi Kleen <[email protected]> wrote:

> Lance Larsh <[email protected]> writes:
>>
>> I ran lots of iozone tests which illustrated a huge difference in write
>> throughput between reiser and ext2. Chris Mason sent me a patch which
>> improved the reiser case (removing an unnecessary commit), but it was
>> still noticeably slower than ext2. Therefore I would recommend that
>> at this time reiser should not be used for Oracle database files.
>
> If I read the 2.4.6 reiserfs code correctly, reiserfs does not cause
> any transactions for reads/writes to allocated blocks; i.e. when you're not
> extending the file, not filling holes and not updating atimes.
> My understanding is that this is normally true for Oracle, but probably
> not for iozone, so it would be better if you benchmarked random writes
> to an already allocated file.
> The 2.4 page cache is more or less direct write-through in this case.
>

In general, yes. But atime updates trigger transactions, and
O_SYNC/fsync writes (in 2.4.x reiserfs) always force a commit of
the current transaction. The two patches I just posted should fix
that...

-chris





2001-07-12 16:10:04

by Lance Larsh

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Andi Kleen wrote:

> My understanding is that this is normally true for Oracle, but probably
> not for iozone so it would be better if you benchmarked random writes
> to an already allocated file.

You are correct that this is true for Oracle: we preallocate the file at db create
time, and we use O_DSYNC to avoid atime updates. The same is true for iozone: it
performs writes to all the blocks (creating the file and allocating blocks), then
rewrites all of the blocks. The write and rewrite times are measured and reported
in separate. Naturally, we only care about the rewrite times, and those are the
results I'm quoting when I casually use the term "writes". Also, we pass the "-o"
option to iozone, which causes it to open the file with O_SYNC (which on Linux is
really O_DSYNC), just like Oracle does. So, the mode I'm running iozone in really
does model Oracle i/o. Sorry if that wasn't clear.

Thanks,
Lance



Attachments:
Lance.Larsh.vcf (367.00 B)
Card for Lance Larsh

2001-07-12 16:35:25

by Lance Larsh

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4



On Wed, 11 Jul 2001, Chris Mason wrote:

> Could I get a copy of the results from the last benchmark you ran (with the
> patch + noatime on reiserfs)? I'd like to close that gap...

I have the results in an Excel spreadsheet, but I'm only attaching the
plot in postscript format to simplify things. If you'd like me to send
you the .xls file, let me know. Note that the results included here are
only for "rewrites", not "writes".

The most interesting things I see are:

1. the reiser patch you sent me made a noticeable improvement, but it
didn't matter whether I used the noatime mount option or not.

2. reiser has a reproducible spike in throughput at 4k i/o size, and it
even beats ext2 in that single case.

3. (and sort of off topic...) ext2
throughput drifts slightly down for i/o sizes >64k as we go from 2.4.0 ->
2.4.3 -> 2.4.4

Thanks,
Lance


Attachments:
reiser.ps (50.69 kB)

2001-07-12 17:06:19

by Andreas Dilger

Subject: Re: [lvm-devel] Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Andrea writes:
> > Wouldn't a single semaphore be enough BTW to cover both?
>
> Actually the _pe_lock is global and it's held for a short time, so it
> can make some sense. And if you look closely you'll see that _pe_lock
> should _definitely_ be a rw_spinlock, not a rw_semaphore. I didn't
> change that though, just to keep the patch smaller and to avoid changing
> the semantics of the lock; the only thing that matters for us is to
> never block and to have a fast read fast path, which is provided just
> fine by the rwsem (I'll leave the s/sem/spinlock/ to the CVS).

Actually, I have already fixed the _pe_lock problem in LVM CVS, so that
it is not acquired on the fast path. The cases where a PV is being moved
is very rare and only affects the write path, so I check rw == WRITE
and pe_lock_req.lock == LOCK_PE, before getting _pe_lock and re-checking
pe_lock_req.lock. This does not affect the semantics of the operation.

Note also that the current kernel LVM code holds the _pe_lock for the
entire time it is flushing write requests from the queue, when it does
not need to do so. My changes (also in LVM CVS) fix this as well.
I have attached the patch which should take beta7 to current CVS in
this regard. Please take a look.

Note that your current patch is broken by the use of rwsems, because
_pe_lock also protects the _pe_requests list, which you modify under
up_read() (you can't upgrade a read lock to a write lock, AFAIK), so
you always need a write lock whenever you get _pe_lock. With my changes
there will be very little contention on _pe_lock, as it is off the fast
path and only held for a few asm instructions at a time.

It is also a good thing that you fixed up lv_snapshot_sem, which was
also on the fast path, but at least that was a per-LV semaphore, unlike
_pe_lock which was global. But I don't think you can complain about it,
because I think you were the one that added it ;-).

Note, how does this all apply to 2.2 kernels? I don't think rwsems
existed then, nor rwspinlocks, did they?

Cheers, Andreas
====================== lvm-0.9.1b7-queue.diff ============================
diff -u -u -r1.7.2.96 lvm.c
--- kernel/lvm.c 2001/04/11 19:08:58 1.7.2.96
+++ kernel/lvm.c 2001/04/23 12:47:26
@@ -1267,29 +1271,30 @@
rsector_map, stripe_length, stripe_index);
}

- /* handle physical extents on the move */
- down(&_pe_lock);
- if((pe_lock_req.lock == LOCK_PE) &&
- (rdev_map == pe_lock_req.data.pv_dev) &&
- (rsector_map >= pe_lock_req.data.pv_offset) &&
- (rsector_map < (pe_lock_req.data.pv_offset + vg_this->pe_size)) &&
-#if LINUX_VERSION_CODE >= KERNEL_VERSION ( 2, 4, 0)
- (rw == WRITE)) {
-#else
- ((rw == WRITE) || (rw == WRITEA))) {
-#endif
- _queue_io(bh, rw);
- up(&_pe_lock);
- up(&lv->lv_snapshot_sem);
- return 0;
- }
- up(&_pe_lock);
+ /*
+ * Queue writes to physical extents on the move until move completes.
+ * Don't get _pe_lock until there is a reasonable expectation that
+ * we need to queue this request, because this is in the fast path.
+ */
+ if (rw == WRITE) {
+ if (pe_lock_req.lock == LOCK_PE) {
+ down(&_pe_lock);
+ if ((pe_lock_req.lock == LOCK_PE) &&
+ (rdev_map == pe_lock_req.data.pv_dev) &&
+ (rsector_map >= pe_lock_req.data.pv_offset) &&
+ (rsector_map < (pe_lock_req.data.pv_offset +
+ vg_this->pe_size))) {
+ _queue_io(bh, rw);
+ up(&_pe_lock);
+ up(&lv->lv_snapshot_sem);
+ return 0;
+ }
+ up(&_pe_lock);
+ }

- /* statistic */
- if (rw == WRITE || rw == WRITEA)
- lv->lv_current_pe[index].writes++;
- else
- lv->lv_current_pe[index].reads++;
+ lv->lv_current_pe[index].writes++; /* statistic */
+ } else
+ lv->lv_current_pe[index].reads++; /* statistic */

/* snapshot volume exception handling on physical device
address base */
@@ -1430,7 +1435,6 @@
{
pe_lock_req_t new_lock;
struct buffer_head *bh;
- int rw;
uint p;

if (vg_ptr == NULL) return -ENXIO;
@@ -1439,9 +1443,6 @@

switch (new_lock.lock) {
case LOCK_PE:
- if(pe_lock_req.lock == LOCK_PE)
- return -EBUSY;
-
for (p = 0; p < vg_ptr->pv_max; p++) {
if (vg_ptr->pv[p] != NULL &&
new_lock.data.pv_dev == vg_ptr->pv[p]->pv_dev)
@@ -1449,16 +1450,18 @@
}
if (p == vg_ptr->pv_max) return -ENXIO;

- pe_lock_req = new_lock;
-
- down(&_pe_lock);
- pe_lock_req.lock = UNLOCK_PE;
- up(&_pe_lock);
-
fsync_dev(pe_lock_req.data.lv_dev);

down(&_pe_lock);
+ if (pe_lock_req.lock == LOCK_PE) {
+ up(&_pe_lock);
+ return -EBUSY;
+ }
+ /* Should we do to_kdev_t() on the pv_dev and lv_dev??? */
pe_lock_req.lock = LOCK_PE;
+ pe_lock_req.data.lv_dev = new_lock.data.lv_dev;
+ pe_lock_req.data.pv_dev = new_lock.data.pv_dev;
+ pe_lock_req.data.pv_offset = new_lock.data.pv_offset;
up(&_pe_lock);
break;

@@ -1468,17 +1471,11 @@
pe_lock_req.data.lv_dev = 0;
pe_lock_req.data.pv_dev = 0;
pe_lock_req.data.pv_offset = 0;
- _dequeue_io(&bh, &rw);
+ bh = _dequeue_io();
up(&_pe_lock);

/* handle all deferred io for this PE */
- while(bh) {
- /* resubmit this buffer head */
- generic_make_request(rw, bh);
- down(&_pe_lock);
- _dequeue_io(&bh, &rw);
- up(&_pe_lock);
- }
+ _flush_io(bh);
break;

default:
@@ -2814,12 +2836,22 @@
_pe_requests = bh;
}

-static void _dequeue_io(struct buffer_head **bh, int *rw) {
- *bh = _pe_requests;
- *rw = WRITE;
- if(_pe_requests) {
- _pe_requests = _pe_requests->b_reqnext;
- (*bh)->b_reqnext = 0;
+/* Must hold _pe_lock when we dequeue this list of buffers */
+static inline struct buffer_head *_dequeue_io(void)
+{
+ struct buffer_head *bh = _pe_requests;
+ _pe_requests = NULL;
+ return bh;
+}
+
+static inline void _flush_io(struct buffer_head *bh)
+{
+ while (bh) {
+ struct buffer_head *next = bh->b_reqnext;
+ bh->b_reqnext = 0;
+ /* resubmit this buffer head */
+ generic_make_request(WRITE, bh);
+ bh = next;
}
}

--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-07-12 17:08:19

by Lance Larsh

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4



On Wed, 11 Jul 2001, Brian Strand wrote:

> Why did it get so much worse going from 2.2.16 to 2.4.4, with an
> otherwise-identical configuration? We had reiserfs+lvm under 2.2.16 too.

Don't have an answer to that. I never tried reiser on 2.2.

> How do ext2+lvm, rawio+lvm, ext2 w/o lvm, and rawio w/o lvm compare in
> terms of Oracle performance? I am going to try a migration if 2.4.6
> doesn't make everything better; do you have any suggestions as to the
> relative performance of each strategy?

The best answer I can give at the moment is to use either ext2 or rawio,
and you might want to avoid lvm for now.

I never ran any of the lvm configurations myself. What little I know
about lvm performance is conjecture based on comparing my reiser results
(5-6x slower than ext2) to the reiser+lvm results from one of our other
internal groups (10-15x slower than ext2). So, although it looks like lvm
throws in a factor of 2-3x slowdown when using reiser, I don't think we
can assume lvm slows down ext2 by the same amount or else someone probably
would have noticed by now. Perhaps there's something that sort of
resonates between reiser and lvm to cause the combination to be
particularly bad. Just guessing...

And while we're talking about comparing configurations, I'll mention that
I'm currently trying to compare raw and ext2 (no lvm in either case).
Although raw should be faster than fs, we're seeing some strange results:
it looks like ext2 can be as much as 2x faster than raw for reads, though
I'm not confident that these results are accurate. The fs might still be
getting a boost from the fs cache, even though we've tried to eliminate
that possibility by sizing things appropriately.

Has anyone else seen results like this, or can anyone think of a
possible explanation?

Thanks,
Lance

2001-07-12 18:18:42

by Andrea Arcangeli

Subject: Re: [lvm-devel] Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Thu, Jul 12, 2001 at 11:04:39AM -0600, Andreas Dilger wrote:
> Andrea writes:
> > > Wouldn't a single semaphore be enough BTW to cover both?
> >
> > Actually the _pe_lock is global and it's held for a short time, so it
> > can make some sense. And if you look closely you'll see that _pe_lock
> > should _definitely_ be a rw_spinlock, not a rw_semaphore. I didn't
> > change that though, just to keep the patch smaller and to avoid changing
> > the semantics of the lock; the only thing that matters for us is to
> > never block and to have a fast read fast path, which is provided just
> > fine by the rwsem (I'll leave the s/sem/spinlock/ to the CVS).
>
> Actually, I have already fixed the _pe_lock problem in LVM CVS, so that
> it is not acquired on the fast path. The cases where a PV is being moved

Ok, btw, if you care to write correct C code you should also declare
.lock as volatile, or gcc has the right to miscompile your code.
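
A minimal illustration of the point (the struct here is a simplified
stand-in, not the real pe_lock_req_t layout):

/* Without the volatile qualifier gcc may merge the two reads of the flag
 * -- the unlocked test in lvm_map() and the re-test done under _pe_lock --
 * into one, so the re-check might never observe an update made by the
 * ioctl path. */
typedef struct {
	volatile int lock;	/* LOCK_PE / UNLOCK_PE, read without _pe_lock */
	/* ... pv_dev / pv_offset / lv_dev data, protected by _pe_lock ... */
} pe_lock_req_sketch_t;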

> are very rare and only affect the write path, so I check rw == WRITE
> and pe_lock_req.lock == LOCK_PE, before getting _pe_lock and re-checking
> pe_lock_req.lock. This does not affect the semantics of the operation.
>
> Note also that the current kernel LVM code holds the _pe_lock for the
> entire time it is flushing write requests from the queue, when it does
> not need to do so. My changes (also in LVM CVS) fix this as well.
> I have attached the patch which should take beta7 to current CVS in
> this regard. Please take a look.

Ok, I will thanks.

> Note that your current patch is broken by the use of rwsems, because
> _pe_lock also protects the _pe_requests list, which you modify under
> up_read() (you can't upgrade a read lock to a write lock, AFAIK), so
> you always need a write lock whenever you get _pe_lock. With my changes
> there will be very little contention on _pe_lock, as it is off the fast
> path and only held for a few asm instructions at a time.

Yes, there's a race condition when people move PVs around, thanks for
noticing it.

> It is also a good thing that you fixed up lv_snapshot_sem, which was
> also on the fast path, but at least that was a per-LV semaphore, unlike
> _pe_lock which was global. But I don't think you can complain about it,
> because I think you were the one that added it ;-).

Definitely wrong: I added it only to the snapshot case; it definitely
wasn't in the fast path hit by Oracle. Somebody (not me) moved it
into the main fast path of lvm, and as I said in the earlier email, when
I found it used that way I was scared as soon as I saw it. Incidentally,
in fact, it is still called lv_snapshot_sem because it hasn't been renamed yet.

Go back to the old releases (or back to 2.2) and you will see where I
put the lv_snapshot_sem.

So definitely don't complain to me if the lv_snapshot_sem was hurting
the fast path.

> Note, how does this all apply to 2.2 kernels? I don't think rwsems
> existed then, nor rwspinlocks, did they?

2.2 has no semaphores in the fast path, so this doesn't apply to 2.2 at
all (it may have race conditions though). 2.2 only had the
lv_snapshot_sem in the snapshot I/O code, which is the one I added to
fix the race conditions, but it wasn't at all related to the fast path
hit by Oracle, as said above.

Andrea

2001-07-12 21:32:16

by Hans Reiser

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Lance Larsh wrote:
>
> On Wed, 11 Jul 2001, Brian Strand wrote:
>
> > Why did it get so much worse going from 2.2.16 to 2.4.4, with an
> > otherwise-identical configuration? We had reiserfs+lvm under 2.2.16 too.
>
> Don't have an answer to that. I never tried reiser on 2.2.
>
> > How do ext2+lvm, rawio+lvm, ext2 w/o lvm, and rawio w/o lvm compare in
> > terms of Oracle performance? I am going to try a migration if 2.4.6
> > doesn't make everything better; do you have any suggestions as to the
> > relative performance of each strategy?
>
> The best answer I can give at the moment is to use either ext2 or rawio,
> and you might want to avoid lvm for now.
>
> I never ran any of the lvm configurations myself. What little I know
> about lvm performance is conjecture based on comparing my reiser results

Lance, I would appreciate it if you would be more careful to identify that you are using O_SYNC,
which is a special case we are not optimized for, and which I am frankly skeptical should be used at
all by an application instead of using fsync judiciously. It is rare that an application is
inherently completely incapable of ever having two I/Os not be serialized, and using O_SYNC to force
every IO to be serialized rather than picking and choosing when to use fsync, well, I have my doubts
frankly. If a user really needs every operation to be synchronous, they should buy a system with an
SSD for the journal from applianceware.com (they sell them tuned to run ReiserFS), or else they are
just going to go real slow, no matter what the FS does.


> (5-6x slower than ext2) to the reiser+lvm results from one of our other
> internal groups (10-15x slower than ext2). So, although it looks like lvm
> throws in a factor of 2-3x slowdown when using reiser, I don't think we
> can assume lvm slows down ext2 by the same amount or else someone probably
> would have noticed by now. Perhaps there's something that sort of
> resonates between reiser and lvm to cause the combination to be
> particularly bad. Just guessing...
>
> And while we're talking about comparing configurations, I'll mention that
> I'm currently trying to compare raw and ext2 (no lvm in either case).
> Although raw should be faster than fs, we're seeing some strange results:
> it looks like ext2 can be as much as 2x faster than raw for reads, though
> I'm not confident that these results are accurate. The fs might still be
> getting a boost from the fs cache, even though we've tried to eliminate
> that possibility by sizing things appropriately.
>
> Has anyone else seen results like this, or can anyone think of a
> possible explanation?
>
> Thanks,
> Lance
>

2001-07-12 21:53:17

by Chris Mason

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4



On Friday, July 13, 2001 01:31:42 AM +0400 Hans Reiser <[email protected]> wrote:

> Lance, I would appreciate it if you would be more careful to identify that you are using O_SYNC,
> which is a special case we are not optimized for, and which I am frankly skeptical should be used at
> all by an application instead of using fsync judiciously. It is rare that an application is
> inherently completely incapable of ever having two I/Os not be serialized, and using O_SYNC to force
> every IO to be serialized rather than picking and choosing when to use fsync, well, I have my doubts
> frankly. If a user really needs every operation to be synchronous, they should buy a system with an
> SSD for the journal from applianceware.com (they sell them tuned to run ReiserFS), or else they are
> just going to go real slow, no matter what the FS does.
>

There is no reason for reiserfs to be 5 times slower than ext2 at anything ;-)
regardless of whether O_SYNC is a good idea or not. I should have optimized the
original code for this case, as Oracle is reason enough to do it.

-chris

2001-07-12 22:55:10

by Andrea Arcangeli

Subject: Re: [lvm-devel] Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Thu, Jul 12, 2001 at 08:18:19PM +0200, Andrea Arcangeli wrote:
> On Thu, Jul 12, 2001 at 11:04:39AM -0600, Andreas Dilger wrote:
> > Note that your current patch is broken by the use of rwsems, because
> > _pe_lock also protects the _pe_requests list, which you modify under
> > up_read() (you can't upgrade a read lock to a write lock, AFAIK), so
> > you always need a write lock whenever you get _pe_lock. With my changes
> > there will be very little contention on _pe_lock, as it is off the fast
> > path and only held for a few asm instructions at a time.
>
> Yes, there's a race condition when people move PVs around, thanks for
> noticing it.

While merging your patch that was supposed to fix the "race" in my
patch, I had a closer look, and the whole design behind the _pe_lock
thing seems totally broken; it has been able to race from the start.

With the current design of the pe_lock_req logic, when you return from
the ioctl(PE_LOCK) syscall you never have the guarantee that all the
in-flight writes are committed to disk; the
fsync_dev(pe_lock_req.data.lv_dev) is just worthless, and there's a huge
race window between the fsync_dev and the pe_lock_req.lock = LOCK_PE
where any I/O can be started without you finding it later in the
_pe_request list. Even apart from that window, we don't even wait for
the requests running just after the lock test to complete; the only lock
we have is in lvm_map, but we should really track which of those bh have
been committed successfully to the platter before we can actually copy
the pv under the lvm from userspace.

If the logic had been sane, your patch would also have been ok
(besides the C breakage of the missing volatile, but we abuse gcc this
way in other parts of the kernel too, after all).

Your patch just makes it much more obvious that the logic cannot be correct,
since you just do a plain lockless cmpl on a word in a fast path and the
other end (pe_lock()) will never know what's going on with such a request
any longer.

Of course the removal of the below crap in your patch was fine too:

- pe_lock_req = new_lock;
-
- down(&_pe_lock);
- pe_lock_req.lock = UNLOCK_PE;
- up(&_pe_lock);
-

The write of pe_lock_req without a down() was an additional race
condition per se, too; the lvm_map side in beta7 is reading incoherent
information coming from such an unlocked copy.

So, in short, nobody should ever have moved PVs around with lvm beta7 with
writes going on in the first place, due to the various races in beta7.

I think the whole pv_move logic needs to be redesigned and rewritten, if
you could rewrite it and send patches (possibly also against beta7 if
a new lvm release is not scheduled shortly) that would be more than
welcome! At the moment those pv_move races are a lower priority for me (I
think it's not a showstopper even if the user is required to stop the db
for a few seconds while moving PVs around to upgrade the hardware); the
lvm 2.4 slowdown during production use was a showstopper instead, but that
slowdown should be fixed now, and it was the first priority.

Maybe I'm missing something. Comments?

Andrea

2001-07-13 03:00:11

by Andrew Morton

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Lance Larsh wrote:
>
> And while we're talking about comparing configurations, I'll mention that
> I'm currently trying to compare raw and ext2 (no lvm in either case).

It would be interesting to see some numbers for ext3 with full
data journalling.

Some preliminary testing by Neil Brown shows that ext3 is 1.5x faster
than ext2 when used with knfsd, mounted synchronously. (This uses
O_SYNC internally).

The reason is that all the data and metadata are written to a
contiguous area of the disk: no seeks apart from the seek to the
journal are needed. Once the metadata and data are committed to
the journal, the O_SYNC (or fsync()) caller is allowed to continue.
Checkpointing of the data and metadata into the main filesystem is
allowed to proceed via normal writeback.

Make sure that you're using a *big* journal though. Use the
`-J size=400' option with tune2fs or mke2fs.

-

2001-07-13 04:17:21

by Andrew Morton

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Andrew Morton wrote:
>
> Lance Larsh wrote:
> >
> > And while we're talking about comparing configurations, I'll mention that
> > I'm currently trying to compare raw and ext2 (no lvm in either case).
>
> It would be interesting to see some numbers for ext3 with full
> data journalling.
>
> Some preliminary testing by Neil Brown shows that ext3 is 1.5x faster
> than ext2 when used with knfsd, mounted synchronously. (This uses
> O_SYNC internally).

I just did some testing with local filesystems - running `dbench 4'
on ext2-on-IDE and ext3-on-IDE, where dbench was altered to open
files O_SYNC. Journal size was 400 megs, mount options `data=journal'

ext2: Throughput 2.71849 MB/sec (NB=3.39812 MB/sec 27.1849 MBit/sec)
ext3: Throughput 12.3623 MB/sec (NB=15.4529 MB/sec 123.623 MBit/sec)

ext3 patches are at http://www.uow.edu.au/~andrewm/linux/ext3/

The difference will be less dramatic with large, individual writes.

Be aware though that ext3 breaks both RAID1 and RAID5. This
RAID patch should help:


--- linux-2.4.6/drivers/md/raid1.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid1.c Thu Jul 12 15:27:09 2001
@@ -46,6 +46,30 @@
#define PRINTK(x...) do { } while (0)
#endif

+#define __raid1_wait_event(wq, condition) \
+do { \
+ wait_queue_t __wait; \
+ init_waitqueue_entry(&__wait, current); \
+ \
+ add_wait_queue(&wq, &__wait); \
+ for (;;) { \
+ set_current_state(TASK_UNINTERRUPTIBLE); \
+ if (condition) \
+ break; \
+ run_task_queue(&tq_disk); \
+ schedule(); \
+ } \
+ current->state = TASK_RUNNING; \
+ remove_wait_queue(&wq, &__wait); \
+} while (0)
+
+#define raid1_wait_event(wq, condition) \
+do { \
+ if (condition) \
+ break; \
+ __raid1_wait_event(wq, condition); \
+} while (0)
+

static mdk_personality_t raid1_personality;
static md_spinlock_t retry_list_lock = MD_SPIN_LOCK_UNLOCKED;
@@ -83,7 +107,7 @@ static struct buffer_head *raid1_alloc_b
cnt--;
} else {
PRINTK("raid1: waiting for %d bh\n", cnt);
- wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
+ raid1_wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
}
}
return bh;
@@ -170,7 +194,7 @@ static struct raid1_bh *raid1_alloc_r1bh
memset(r1_bh, 0, sizeof(*r1_bh));
return r1_bh;
}
- wait_event(conf->wait_buffer, conf->freer1);
+ raid1_wait_event(conf->wait_buffer, conf->freer1);
} while (1);
}

--- linux-2.4.6/drivers/md/raid5.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid5.c Thu Jul 12 21:31:55 2001
@@ -66,10 +66,11 @@ static inline void __release_stripe(raid
BUG();
if (atomic_read(&conf->active_stripes)==0)
BUG();
- if (test_bit(STRIPE_DELAYED, &sh->state))
- list_add_tail(&sh->lru, &conf->delayed_list);
- else if (test_bit(STRIPE_HANDLE, &sh->state)) {
- list_add_tail(&sh->lru, &conf->handle_list);
+ if (test_bit(STRIPE_HANDLE, &sh->state)) {
+ if (test_bit(STRIPE_DELAYED, &sh->state))
+ list_add_tail(&sh->lru, &conf->delayed_list);
+ else
+ list_add_tail(&sh->lru, &conf->handle_list);
md_wakeup_thread(conf->thread);
} else {
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
@@ -1167,10 +1168,9 @@ static void raid5_unplug_device(void *da

raid5_activate_delayed(conf);

- if (conf->plugged) {
- conf->plugged = 0;
- md_wakeup_thread(conf->thread);
- }
+ conf->plugged = 0;
+ md_wakeup_thread(conf->thread);
+
spin_unlock_irqrestore(&conf->device_lock, flags);
}

2001-07-13 07:37:04

by Andreas Dilger

Subject: Re: [lvm-devel] Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Andrea writes:
> With the current design of the pe_lock_req logic, when you return from
> the ioctl(PE_LOCK) syscall you never have the guarantee that all the
> in-flight writes are committed to disk; the
> fsync_dev(pe_lock_req.data.lv_dev) is just worthless, and there's a huge
> race window between the fsync_dev and the pe_lock_req.lock = LOCK_PE
> where any I/O can be started without you finding it later in the
> _pe_request list.

Yes there is a slight window there, but fsync_dev() serves to flush out the
majority of outstanding I/Os to disk (it waits for I/O completion). All
of these buffers should be on disk, right?

> Even apart from that window, we don't even wait for the
> requests running just after the lock test to complete; the only lock we
> have is in lvm_map, but we should really track which of those bh have
> been committed successfully to the platter before we can actually copy
> the pv under the lvm from userspace.

As soon as we set LOCK_PE, any new I/Os coming in on the LV device will
be put on the queue, so we don't need to worry about those. We have to
do something like sync_buffers(PV, 1) for the PV that is underneath the
PE being moved, to ensure any buffers that arrived between fsync_dev()
and LOCK_PE are flushed (they are the only buffers that can be in flight).
Is there another problem you are referring to?

AFAICS, there would only be a large window for missed buffers if you
were doing two PE moves at once, and had contention for _pe_lock,
otherwise fsync_dev to LOCK_PE is a very small window, I think.
However, I think we are also protected by the global LVM lock from
doing multiple PE moves at one time.

> If the logic had been sane, your patch would also have been ok
> (besides the C breakage of the missing volatile, but we abuse gcc this
> way in other parts of the kernel too, after all).

Yes, I never thought about GCC optimizing away the two references to the
same var before and after making the check.

> I think the whole pv_move logic needs to be redesigned and rewritten, if
> you could rewrite it and send patches (possibly also against beta7 if
> a new lvm release is not scheduled shortly) that would be more than
> welcome!

Yes, well the correct solution is to do it all in a kernel thread, so
that you don't need to do kernel->user->kernel data copying. I already
discussed this with Joe Thornber (I think) and it was decided to be too
much for now (needs changes to user tools, IOP version, etc). Later.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-07-13 15:36:51

by Jeffrey W. Baker

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Fri, 13 Jul 2001, Andrew Morton wrote:

> Andrew Morton wrote:
> >
> > Lance Larsh wrote:
> > >
> > > And while we're talking about comparing configurations, I'll mention that
> > > I'm currently trying to compare raw and ext2 (no lvm in either case).
> >
> > It would be interesting to see some numbers for ext3 with full
> > data journalling.
> >
> > Some preliminary testing by Neil Brown shows that ext3 is 1.5x faster
> > than ext2 when used with knfsd, mounted synchronously. (This uses
> > O_SYNC internally).
>
> I just did some testing with local filesystems - running `dbench 4'
> on ext2-on-IDE and ext3-on-IDE, where dbench was altered to open
> files O_SYNC. Journal size was 400 megs, mount options `data=journal'
>
> ext2: Throughput 2.71849 MB/sec (NB=3.39812 MB/sec 27.1849 MBit/sec)
> ext3: Throughput 12.3623 MB/sec (NB=15.4529 MB/sec 123.623 MBit/sec)
>
> ext3 patches are at http://www.uow.edu.au/~andrewm/linux/ext3/
>
> The difference will be less dramatic with large, individual writes.

This is a totally transient effect, right? The journal acts as a faster
buffer, but if programs are writing a lot of data to the disk for a very
long time, the throughput will eventually be throttled by writing the
journal back into the filesystem.

For programs that write in bursts, it looks like a huge win!

-jwb

2001-07-13 15:49:02

by Andrew Morton

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

"Jeffrey W. Baker" wrote:
>
> > ...
> > ext2: Throughput 2.71849 MB/sec (NB=3.39812 MB/sec 27.1849 MBit/sec)
> > ext3: Throughput 12.3623 MB/sec (NB=15.4529 MB/sec 123.623 MBit/sec)
> >
> > ext3 patches are at http://www.uow.edu.au/~andrewm/linux/ext3/
> >
> > The difference will be less dramatic with large, individual writes.
>
> This is a totally transient effect, right? The journal acts as a faster
> buffer, but if programs are writing a lot of data to the disk for a very
> long time, the throughput will eventually be throttled by writing the
> journal back into the filesystem.

It varies a lot with workload. With large writes such as
'iozone -s 300m -a -i 0' it seems about the same throughput
as ext2. It would take some time to characterise fully.

> For programs that write in bursts, it looks like a huge win!

yes - lots of short writes (eg: mailspools) will benefit considerably.
The benefits come from the additional merging and sorting which
can be performed on the writeback data.

I suspect some of the dbench benefit comes from the fact that
the files are unlinked at the end of the test - if the data hasn't
been written back at that time the buffers are hunted down and
zapped - they *never* get written.

If anyone wants to test sync throughput, please be sure to use
0.9.3-pre - it fixes some rather sucky behaviour with large journals.

-

2001-07-13 16:07:59

by Andrea Arcangeli

Subject: Re: [lvm-devel] Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

On Fri, Jul 13, 2001 at 01:35:00AM -0600, Andreas Dilger wrote:
> Andrea writes:
> > With the current design of the pe_lock_req logic, when you return from
> > the ioctl(PE_LOCK) syscall you never have the guarantee that all the
> > in-flight writes are committed to disk; the
> > fsync_dev(pe_lock_req.data.lv_dev) is just worthless, and there's a huge
> > race window between the fsync_dev and the pe_lock_req.lock = LOCK_PE
> > where any I/O can be started without you finding it later in the
> > _pe_request list.
>
> Yes there is a slight window there, but fsync_dev() serves to flush out the
> majority of outstanding I/Os to disk (it waits for I/O completion). All
> of these buffers should be on disk, right?

Yes, however fsync_dev also has the problem that it cannot catch rawio.
Plus fsync_dev is useless since any I/O can be started between
fsync_dev and the pe_lock.

But the even bigger problem (regardless of fsync_dev) is that by the
time we set pe_lock and return from the ioctl(PE_LOCK) we only know that
all the requests passed the pe_lock check in lvm_map, but we never know
whether they have been committed to disk by the time we start moving the PVs.

>
> > Even apart from that window, we don't even wait for the
> > requests running just after the lock test to complete; the only lock we
> > have is in lvm_map, but we should really track which of those bh have
> > been committed successfully to the platter before we can actually copy
> > the pv under the lvm from userspace.
>
> As soon as we set LOCK_PE, any new I/Os coming in on the LV device will
> be put on the queue, so we don't need to worry about those. We have to

Correct; actually, as Joe noticed, that is broken too, but for other
resource management reasons (I also thought about the resource management
issue of too many bh queued waiting for the pv move, but I ignored that
problem for now since that part cannot silently generate corruption: at
worst it will deadlock the machine, which is much better than silently
corrupting the fs and letting the administrator think the pv_move worked ok).

> do something like sync_buffers(PV, 1) for the PV that is underneath the
> PE being moved, to ensure any buffers that arrived between fsync_dev()
> and LOCK_PE are flushed (they are the only buffers that can be in flight).

Correct, we need to ensure those buffers are flushed; however
sync_buffers(PV, 1) is a broken way to do that. It won't work out because
no buffer cache or bh in general lives on top of the PV, and secondly we must
handle anonymous bh, rawio etc. too before moving the PV around (anonymous
buffers in kiobufs are never visible in any lru list, they only tell
the blkdev where to write; they're not holding any memory or
data, the kiobuf does, but we don't know which kiobufs are writing to the
LV...)

One right way I can imagine to fix the in-flight-I/O race is to overload
the end_io callback with a private one for all the bh passing through
lvm_map, and to atomically count the number of in-flight bh per LV; then
you lock the device and wait for the count to go down to zero while you
unplug tq_disk. The fsync_dev basically only matters to try to
reduce the size of the bh queue while we hold the lock (so we avoid a
flood of bh coming from kupdate, for example), but the fsync_dev cannot
be part of the locking logic itself. You could use the bh_async patch
from IBM that I am maintaining in my tree while waiting for 2.5, to avoid
allocating a new bh for the callback overload.
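
A rough sketch of that scheme using 2.4-style interfaces (the names are
made up, the counter is global rather than per-LV for brevity, and
stashing the old callback in b_private is exactly the kind of thing the
bh_async patch or a wrapper bh would avoid in a real implementation):

#include <linux/fs.h>		/* struct buffer_head */
#include <linux/blkdev.h>	/* tq_disk */
#include <linux/tqueue.h>	/* run_task_queue */
#include <linux/sched.h>	/* wait_event */
#include <linux/wait.h>
#include <asm/atomic.h>

static atomic_t lv_inflight = ATOMIC_INIT(0);		/* per-LV in real life */
static DECLARE_WAIT_QUEUE_HEAD(lv_inflight_wait);

/* private completion: run the original b_end_io, then drop the count */
static void lv_end_io_sketch(struct buffer_head *bh, int uptodate)
{
	void (*orig)(struct buffer_head *, int) =
		(void (*)(struct buffer_head *, int))bh->b_private;

	orig(bh, uptodate);
	if (atomic_dec_and_test(&lv_inflight))
		wake_up(&lv_inflight_wait);
}

/* called from lvm_map() for every bh being remapped */
static void lv_account_bh_sketch(struct buffer_head *bh)
{
	atomic_inc(&lv_inflight);
	bh->b_private = (void *)bh->b_end_io;	/* may already be in use by
						 * the fs, hence bh_async or
						 * a wrapper bh in practice */
	bh->b_end_io = lv_end_io_sketch;
}

/* called with the PE lock already set, before copying the PV around */
static void lv_wait_inflight_sketch(void)
{
	while (atomic_read(&lv_inflight) != 0) {
		run_task_queue(&tq_disk);	/* unplug the queues */
		wait_event(lv_inflight_wait,
			   atomic_read(&lv_inflight) == 0);
	}
}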

> Is there another problem you are referring to?

yes that's it.

> AFAICS, there would only be a large window for missed buffers if you
> were doing two PE moves at once, and had contention for _pe_lock,

Ok, let's forget about two concurrent PE moves at once for now ;) Let's
get the single live pv_move case right first ;)

Andrea

2001-07-17 09:05:23

by Stephen C. Tweedie

Subject: Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Hi,

On Fri, Jul 13, 2001 at 08:36:01AM -0700, Jeffrey W. Baker wrote:

> > files O_SYNC. Journal size was 400 megs, mount options `data=journal'
> >
> > ext2: Throughput 2.71849 MB/sec (NB=3.39812 MB/sec 27.1849 MBit/sec)
> > ext3: Throughput 12.3623 MB/sec (NB=15.4529 MB/sec 123.623 MBit/sec)
> >
> > The difference will be less dramatic with large, individual writes.
>
> This is a totally transient effect, right? The journal acts as a faster
> buffer, but if programs are writing a lot of data to the disk for a very
> long time, the throughput will eventually be throttled by writing the
> journal back into the filesystem.

Not for O_SYNC. For ext2, *every* O_SYNC append to a file involves
seeking between inodes and indirect blocks and data blocks. With ext3
with data journaling enabled, the synchronous part of the IO is a
single sequential write to the journal. The async writeback will
affect throughput, yes, but since it is done in the background, it can
do tons of optimisations: if you extend a file a hundred times with
O_SYNC, then you are forced to journal the inode update a hundred
times but the writeback which occurs later need only be done once.
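
As a rough back-of-envelope illustration (the numbers are assumptions, not
measurements): at ~9 ms per random seek and three distinct on-disk locations
per ext2 O_SYNC append (inode, indirect block, data block), each append costs
on the order of 27 ms of head movement, i.e. roughly 35-40 appends per second.
With data=journal the synchronous part is a short, mostly sequential write to
a journal region the head tends to stay near, on the order of a few
milliseconds per append, which is consistent with the roughly 4.5x dbench
difference Andrew posted earlier in the thread.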

For async traffic, you're quite correct. For synchronous traffic, the
writeback later on is still async, and the synchronous costs really do
often dominate, so the net effect over time is still a big win.

Cheers,
Stephen