2002-06-20 21:50:47

by Griffiths, Richard A

Subject: RE: ext3 performance bottleneck as the number of spindles gets large

I should have mentioned the throughput we saw on 4 adapters and 6 drives was
126MB/s. The max theoretical bus bandwidth is 640MB/s.

-----Original Message-----
From: Andrew Morton [mailto:[email protected]]
Sent: Thursday, June 20, 2002 2:26 PM
To: [email protected]
Cc: Griffiths, Richard A; 'Jens Axboe'; Linux Kernel Mailing List;
[email protected]
Subject: Re: ext3 performance bottleneck as the number of spindles gets
large


mgross wrote:
>
> On Thursday 20 June 2002 04:18 pm, Andrew Morton wrote:
> > Yup. I take it back - high ext3 lock contention happens on 2.5
> > as well, which has block-highmem. With heavy write traffic onto
> > six disks, two controllers, six filesystems, four CPUs the machine
> > spends about 40% of the time spinning on locks in fs/ext3/inode.c
> > You're on a dual CPU, so the contention is less.
> >
> > Not very nice. But given that the longest spin time was some
> > tens of milliseconds, with the average much lower, it shouldn't
> > affect overall I/O throughput.
>
> How could losing 40% of your CPUs to spin locks NOT spank your
> throughput?

The limiting factor is usually disk bandwidth, seek latency, rotational
latency. That's why I want to know your bandwidth.

> Can you copy your lockmeter data from its kernel_flag section? I'd like to
> see it.

I don't find lockmeter very useful. Here's oprofile output for 2.5.23:

c013ec08 873 1.07487 rmqueue
c018a8e4 950 1.16968 do_get_write_access
c013b00c 969 1.19307 kmem_cache_alloc_batch
c018165c 1120 1.37899 ext3_writepage
c0193120 1457 1.79392 journal_add_journal_head
c0180e30 1458 1.79515 ext3_prepare_write
c0136948 6546 8.05969 generic_file_write
c01838ac 42608 52.4606 .text.lock.inode

So I lost two CPUs on the BKL in fs/ext3/inode.c. The remaining
two should be enough to saturate all but the most heroic disk
subsystems.
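
To see why that hurts, here is a minimal userspace sketch (purely
illustrative, not ext3 or kernel code; the thread and iteration counts
are invented) of writers all funnelling through one big lock:

/*
 * Hypothetical illustration only -- not ext3 or kernel code, and the
 * thread and iteration counts are made up.  Four writer threads all
 * serialise on one mutex, the same shape as every ext3 write path
 * taking the BKL: adding CPUs mostly adds time spent waiting on the
 * lock, not extra work completed per second.
 */
#include <pthread.h>
#include <stdio.h>

#define WRITERS         4
#define ITERATIONS      1000000

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long work_done;

static void *writer(void *unused)
{
        int i;

        for (i = 0; i < ITERATIONS; i++) {
                pthread_mutex_lock(&big_lock);  /* the single "BKL" */
                work_done++;            /* stands in for the journalled write */
                pthread_mutex_unlock(&big_lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t threads[WRITERS];
        int i;

        for (i = 0; i < WRITERS; i++)
                pthread_create(&threads[i], NULL, writer, NULL);
        for (i = 0; i < WRITERS; i++)
                pthread_join(threads[i], NULL);

        printf("work done: %lu\n", work_done);
        return 0;
}

(Build with gcc -O2 -pthread; time it with 1 thread versus 4 and the
wall-clock time barely improves.)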

A couple of possibilities come to mind:

1: Processes which should be submitting I/O against disk "A" are
instead spending tons of time asleep in the page allocator waiting
for I/O to complete against disk "B".

2: ext3 is just too slow for the rate of data which you're trying to
push at it. This exhibits as lock contention, but the root cause
is the cost of things like ext3_mark_inode_dirty(). And *that*
is something we can fix - we can shave 75% off the cost of that.

Need more data...


> >
> > Possibly something else is happening. Have you tested ext2?
>
> No. We're attempting to see if we can scale to large numbers of spindles
> with EXT3 at the moment. Perhaps we can effect positive changes to ext3
> before giving up on it and moving to another Journaled FS.

Have you tried *any* other fs?

-


2002-06-23 04:10:50

by Christopher E. Brown

Subject: RE: ext3 performance bottleneck as the number of spindles gets large

On Thu, 20 Jun 2002, Griffiths, Richard A wrote:

> I should have mentioned the throughput we saw on 4 adapters and 6 drives was
> 126MB/s. The max theoretical bus bandwidth is 640MB/s.


This is *NOT* correct. Assuming a 64bit 66Mhz PCI bus your MAX is
503MB/sec minus PCI overhead...

This of course assumes nothing else is using the PCI bus.


120 something MB/sec sounds a hell of a lot like topping out a 32bit
33Mhz PCI bus, but IIRC the earlier posting listed 39160 cards, PCI
64bit w/ backward compat to 32bit.

You do have *ALL* of these cards plugged into full PCI 64bit/66Mhz
slots, right? Not into 32bit/33Mhz slots?


32bit/33Mhz (32 * 33,000,000) / (1024 * 1024 * 8) = 125.89 MByte/sec
64bit/33Mhz (64 * 33,000,000) / (1024 * 1024 * 8) = 251.77 MByte/sec
64bit/66Mhz (64 * 66,000,000) / (1024 * 1024 * 8) = 503.54 MByte/sec


NOTE: PCI transfer rates are often listed as

32bit/33Mhz, 132 MByte/sec
64bit/33Mhz, 264 MByte/sec
64bit/66Mhz, 528 MByte/sec

This is somewhat true, but only if we start from decimal Mbit rates as
used for transmission (1,000,000 bits/sec) and work from there, instead
of 2^20 (1,048,576). I won't argue about PCI 32bit/33Mhz being 1056Mbit
when talking about line rate, but when we are talking about storage
media and transfers to/from them as measured by files, remember to
convert.
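
A throwaway sketch of that arithmetic (illustration only; the table of
widths and clocks just restates the three configurations above): it
prints each bus rate both in 2^20-byte MByte/sec and in the decimal
MByte/sec figures usually quoted.

/*
 * Illustration only: the three PCI configurations above, computed both
 * as 2^20-byte MByte/sec (the storage convention used in this mail) and
 * as decimal MByte/sec (the 132/264/528 figures usually quoted).
 */
#include <stdio.h>

int main(void)
{
        static const struct { int bits; long hz; } bus[] = {
                { 32, 33000000 },
                { 64, 33000000 },
                { 64, 66000000 },
        };
        int i;

        for (i = 0; i < 3; i++) {
                double bytes = (double)bus[i].bits / 8 * bus[i].hz;

                printf("%dbit/%ldMhz: %.2f MByte/sec (2^20), %.0f MByte/sec (10^6)\n",
                       bus[i].bits, bus[i].hz / 1000000,
                       bytes / (1024 * 1024), bytes / 1000000);
        }
        return 0;
}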

--
I route, therefore you are.


2002-06-23 04:34:59

by Andreas Dilger

Subject: Re: ext3 performance bottleneck as the number of spindles gets large

On Jun 22, 2002 22:02 -0600, Christopher E. Brown wrote:
> On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
>
> > I should have mentioned the throughput we saw on 4 adapters and 6 drives was
> > 126MB/s. The max theoretical bus bandwidth is 640MB/s.
>
> This is *NOT* correct. Assuming a 64bit 66Mhz PCI bus your MAX is
> 503MB/sec minus PCI overhead...

Assuming you only have a single PCI bus...

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-06-23 06:08:19

by Christopher E. Brown

Subject: Re: ext3 performance bottleneck as the number of spindles gets large

On Sat, 22 Jun 2002, Andreas Dilger wrote:

> On Jun 22, 2002 22:02 -0600, Christopher E. Brown wrote:
> > On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
> >
> > > I should have mentioned the throughput we saw on 4 adapters and 6 drives was
> > > 126MB/s. The max theoretical bus bandwidth is 640MB/s.
> >
> > This is *NOT* correct. Assuming a 64bit 66Mhz PCI bus your MAX is
> > 503MB/sec minus PCI overhead...
>
> Assuming you only have a single PCI bus...


Yes, we could (for example) assume a DP264 board, which features 2/4/8
way memory interleave, dual 21264 CPUs, and 2 separate PCI 64bit 66Mhz
buses.

However, multiple buses are *rare* on x86. There are a lot of chained
buses via PCI-to-PCI bridges, but few systems with 2 or more PCI
buses of any type with parallel access to the CPU.


--
I route, therefore you are.

2002-06-23 06:36:32

by William Lee Irwin III

Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large

On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> However, multiple buses are *rare* on x86. There are a lot of chained
> buses via PCI-to-PCI bridges, but few systems with 2 or more PCI
> buses of any type with parallel access to the CPU.

NUMA-Q has them.


Cheers,
Bill

2002-06-23 17:17:08

by Eric W. Biederman

Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large

William Lee Irwin III <[email protected]> writes:

> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> > However, multiple buses are *rare* on x86. There are a lot of chained
> > buses via PCI-to-PCI bridges, but few systems with 2 or more PCI
> > buses of any type with parallel access to the CPU.
>
> NUMA-Q has them.

As do the latest round of dual P4 Xeon chipsets: the Intel E7500 and
the ServerWorks Grand Champion.

So on new systems this is easy to get if you want it.

Eric