I've recently upgraded our company rsync/backup server and have been running
into performance slowdowns. The old system was a dual processor Pentium III
(Coppermine) 866MHz running Redhat 7.3 with IDE disks (ext2 filesystems).
We have since upgraded it to Redhat 9, added a 3Ware 7500-8 RAID controller
and more disks (w/ ext3 filesystems + external journal).
The primary use for this system is to provide live rsync snapshots of
several of our primary servers. For each system we maintain a "current"
snapshot, from which a hard-linked image is taken after each rsync update.
i.e., we rsync and then 'cp -Rl current snapshot-$DATE'. After the update
to Redhat 9, the rsync itself was faster, but the time to make the
hard-links became an order of magnitude slower (~4min -> ~50min for a tree
with 500,000 files). Not only was it slower, but it destroyed system
interactivity for minutes at a time.
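In sketch form, one such update cycle looks like this (host and paths are
illustrative, and the rsync flags are assumed; the cp invocation is the
real one):

    # One update cycle for one server.
    HOST=server1
    BASE=/backup/$HOST
    DATE=$(date +%Y%m%d)

    # Bring the "current" tree up to date from the live server.
    rsync -a --delete "$HOST:/" "$BASE/current/"

    # Freeze it: copy the directory tree, hard-linking every file,
    # so each snapshot shares file data and only costs metadata.
    cp -Rl "$BASE/current" "$BASE/snapshot-$DATE"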
Since these rsync backups are done in addition to traditional daily tape
backups, we've taken the system out of production use and opened the door
for experimentation. So, the next logical step was to try a 2.5 kernel.
After some work, I've gotten 2.5.70-mm2 booting and it is _much_ better than
the Redhat 2.4 kernels, and the system interactivity is flawless. However,
creating the hard links is still three and a half times slower than with
the old 2.2 kernel. It now takes ~14 minutes to create the links, and
from what I can tell, the bottleneck is not the CPU or the disk throughput.
I was wondering if you could suggest any ways to tweak, tune, or otherwise
get this heap moving a little faster.
Here is some detailed output of system activity that may help:
'vmstat 1' output (taken from near start of cp -Rl):
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 5680 267472 13436 0 0 316 510 925 207 1 5 94
1 0 1 0 6512 266992 13236 0 0 536 92 1140 290 0 14 86
1 0 0 0 5744 267776 13132 0 0 396 0 1119 233 1 13 86
0 1 0 0 7024 266784 13240 0 0 376 124 1119 241 1 17 82
0 1 0 0 5424 267768 13276 0 0 492 0 1130 267 1 8 91
0 1 0 0 5680 267896 13216 0 0 508 5120 1181 355 1 13 86
0 1 0 0 4200 269120 13216 0 0 596 88 1243 698 7 13 80
0 1 0 0 5992 268160 13156 0 0 496 28 1136 327 1 8 91
0 1 0 0 4072 269184 13152 0 0 516 0 1134 272 1 16 83
1 0 0 0 4840 268056 13260 0 0 408 8 1107 225 2 24 75
0 1 0 0 6248 267036 13056 0 0 328 5812 1162 271 1 6 93
0 1 0 0 5168 268236 13080 0 0 596 12 1157 322 1 10 89
0 1 1 0 5872 267576 13060 0 0 584 8 1154 325 0 6 94
0 1 0 0 5040 268552 13104 0 0 488 0 1132 289 1 7 92
0 1 1 0 5872 267980 12724 0 0 508 20 1133 283 1 11 88
0 1 0 0 4912 268976 12748 0 0 456 4928 1165 329 1 10 89
0 1 1 0 5104 269328 12260 0 0 540 4 1140 303 1 11 88
0 1 0 0 4080 270320 12220 0 0 496 0 1131 293 1 11 88
0 1 0 0 4912 269844 11812 0 0 540 12 1146 323 1 9 90
1 0 0 0 5744 269196 11712 0 0 612 0 1159 327 1 10 89
0 1 0 0 4656 270256 11740 0 0 476 5976 1261 733 6 11 83
0 1 0 0 5616 269424 11552 0 0 460 12 1124 282 1 8 91
0 1 0 0 4528 270472 11524 0 0 524 0 1137 280 1 7 92
0 1 0 0 5552 269500 11544 0 0 496 8 1134 295 0 7 93
0 1 0 0 4592 270412 11584 0 0 452 0 1118 243 1 11 88
0 1 1 0 5100 270000 11384 0 0 212 5804 1167 192 0 11 89
[...]
Oprofile output (from the whole cp -Rl):
vma samples % symbol name image
c0108bb0 113811 25.8026 default_idle /.../vmlinux
c016cb20 42919 9.73037 __d_lookup /.../vmlinux
c016e260 24583 5.57333 find_inode_fast /.../vmlinux
c0115c90 24269 5.50214 mark_offset_tsc /.../vmlinux
c0193bf0 21395 4.85056 ext3_find_entry /.../vmlinux
c01948a0 17107 3.87841 add_dirent_to_buf /.../vmlinux
c0162390 12258 2.77907 link_path_walk /.../vmlinux
42069670 11587 2.62694 getc /.../vmlinux
c018ccb0 10766 2.44081 ext3_check_dir_entry /.../vmlinux
c01116e0 6523 1.47886 timer_interrupt /.../vmlinux
00000000 5497 1.24625 (no symbol) /bin/cp
c01bb420 5202 1.17937 atomic_dec_and_lock /.../vmlinux
c0156830 3548 0.804384 __find_get_block /.../vmlinux
c01baf80 2991 0.678104 strncpy_from_user /.../vmlinux
c016c000 2449 0.555224 prune_dcache /.../vmlinux
c016c5c0 2441 0.553411 d_alloc /.../vmlinux
c019b110 2200 0.498772 do_get_write_access /.../vmlinux
c016bc00 2074 0.470206 dput /.../vmlinux
c0192000 2047 0.464085 ext3_read_inode /.../vmlinux
c013f250 1977 0.448215 kmem_cache_alloc /.../vmlinux
c0120800 1818 0.412167 profile_hook /.../vmlinux
c01622d0 1792 0.406273 do_lookup /.../vmlinux
c0162dd0 1738 0.39403 path_lookup /.../vmlinux
42074170 1726 0.39131 _int_malloc /.../vmlinux
c010b9a4 1692 0.383601 apic_timer_interrupt /.../vmlinux
4207a6d0 1668 0.37816 strlen /.../vmlinux
c018f900 1654 0.374986 ext3_get_block_handle /.../vmlinux
c011b1a0 1593 0.361157 rebalance_tick /.../vmlinux
c013f3f0 1540 0.349141 kmem_cache_free /.../vmlinux
c0233b30 1529 0.346647 tw_interrupt /.../vmlinux
c018fda0 1497 0.339392 ext3_getblk /.../vmlinux
08049204 1490 0.337805 fgetc_wrapped /.../vmlinux
c01a2c50 1448 0.328283 journal_add_journal_head /.../vmlinux
08049244 1419 0.321708 getline_wrapped /.../vmlinux
c0191df0 1384 0.313773 ext3_get_inode_loc /.../vmlinux
c019be30 1379 0.31264 journal_dirty_metadata /.../vmlinux
c0192330 1320 0.299263 ext3_do_update_inode /.../vmlinux
[...]
tune2fs -l /dev/sdb1:
tune2fs 1.32 (09-Nov-2002)
Filesystem volume name: /backup
Last mounted on: <not available>
Filesystem UUID: a8611f50-eb9d-4796-b182-84c8ce2cd0d3
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal filetype needs_recovery sparse_super
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 67862528
Block count: 135703055
Reserved block count: 1357030
Free blocks: 97698838
Free inodes: 66076067
First block: 0
Block size: 4096
Fragment size: 4096
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16384
Inode blocks per group: 512
Filesystem created: Fri Apr 18 15:16:46 2003
Last mount time: Thu May 29 07:41:09 2003
Last write time: Thu May 29 07:41:09 2003
Mount count: 3
Maximum mount count: 37
Last checked: Fri Apr 18 15:16:46 2003
Check interval: 15552000 (6 months)
Next check after: Wed Oct 15 15:16:46 2003
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal UUID: 0b11edcb-9c08-4030-bda7-141060aefd09
Journal inode: 0
Journal device: 0x0805
First orphan inode: 0
--
Kevin Jacobs
The OPAL Group - Enterprise Systems Architect
Voice: (216) 986-0710 x 19 E-mail: [email protected]
Fax: (216) 986-0714 WWW: http://www.theopalgroup.com
Kevin Jacobs wrote:
>[...]
>After some work, I've gotten 2.5.70-mm2 booting and it is _much_ better than
>the Redhat 2.4 kernels, and the system interactivity is flawless. However,
>creating the hard links is still three and a half times slower than with
>the old 2.2 kernel. It now takes ~14 minutes to create the links, and
>from what I can tell, the bottleneck is not the CPU or the disk throughput.
>
It's probably seek-bound.
Provide some more information about your disk/partition setup, external
journal, and data= mode. Remember ext3 will generally have to do more
work than ext2.
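Also, for reference, the options in effect can be checked on the running
system without unmounting (assuming the backup filesystem is /dev/sdb1):

    # Options the filesystem is currently mounted with.
    grep sdb1 /proc/mounts

    # ext3 logs its data mode at mount time, e.g.
    # "EXT3-fs: mounted filesystem with ordered data mode."
    dmesg | grep -i ext3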
If you want to play with the scheduler, try setting
/sys/block/blockdev*/queue/nr_requests = 8192
then try
/sys/block/blockdev*/queue/iosched/antic_expire = 0
Try the above combinations with and without a big TCQ depth. You should
be able to set them on the fly and see what happens to throughput during
the operation. Let me know what you see.
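Concretely, that would be something like (sdb assumed; both knobs are
writable at runtime):

    # Allow a much deeper request queue on the backup array.
    echo 8192 > /sys/block/sdb/queue/nr_requests

    # Then, as a separate experiment, disable anticipation in the
    # anticipatory io scheduler.
    echo 0 > /sys/block/sdb/queue/iosched/antic_expire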
Nice to see interactivity is good though.
On Thu, 29 May 2003, Nick Piggin wrote:
> It's probably seek-bound.
> Provide some more information about your disk/partition setup, external
> journal, and data= mode. Remember ext3 will generally have to do more
> work than ext2.
SCSI ID 1 3ware 7500-8 ATA RAID Controller
* Array Unit 0 Mirror (RAID 1) 40.01 GB OK
+ Port 0 WDC WD400BB-00DEA0 40.02 GB OK
+ Port 1 WDC WD400BB-00DEA0 40.02 GB OK
* Array Unit 4 Striped with Parity 64K (RAID 5) 555.84 GB OK
+ Port 4 IC35L180AVV207-1 185.28 GB OK
+ Port 5 IC35L180AVV207-1 185.28 GB OK
+ Port 6 IC35L180AVV207-1 185.28 GB OK
+ Port 7 IC35L180AVV207-1 185.28 GB OK
Disk /dev/sda: 40.0 GB, 40019615744 bytes
255 heads, 63 sectors/track, 4865 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 261 2096451 83 Linux
/dev/sda2 262 1566 10482412+ 83 Linux
/dev/sda3 1567 4570 24129630 83 Linux
/dev/sda4 4571 4865 2369587+ f Win95 Ext'd (LBA)
/dev/sda5 4571 4589 152586 83 Linux
/dev/sda6 4590 4734 1164681 83 Linux
/dev/sda7 4735 4865 1052226 83 Linux
Disk /dev/sdb: 555.8 GB, 555847581696 bytes
255 heads, 63 sectors/track, 67577 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 67577 542812221 83 Linux
Unit 0 is /dev/sda and the journal is /dev/sda5. Unit 1 is /dev/sdb and the
backup filesystem is /dev/sdb1. The data= mode is the default (data=ordered),
and /dev/sdb1 is mounted noatime. I've also applied the journal_refile_buffer
patch posted by AKPM yesterday morning.
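For reference, this layout is built roughly as follows (a sketch, not
the exact commands used):

    # Dedicate /dev/sda5 on the mirrored system array as a journal device.
    mke2fs -O journal_dev /dev/sda5

    # Create the backup filesystem with that as its external journal.
    mke2fs -j -J device=/dev/sda5 /dev/sdb1

    # Mount without atime updates.
    mount -o noatime /dev/sdb1 /backup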
> If you want to play with the scheduler, try set
> /sys/block/blockdev*/queue/nr_requests = 8192
This killed the entire system -- livelocking it with no disk activity to the
point that I had to hit the reset button. So did setting nr_requests on
sda and sdb from 128 to 256. The problem hit before the rsync, during an
'rm -Rf' of a previously copied tree.
> then try
> /sys/block/blockdev*/queue/iosched/antic_expire = 0
This seemed to make no difference.
> Try the above combinations with and without a big TCQ depth. You should
> be able to set them on the fly and see what happens to throughput during
> the operation. Let me know what you see.
I'm not sure how to change TCQ depth on the fly. Last I knew, it was a
compiled-in parameter.
I have some more time to experiment, so please let me know if there is
anything else you think I should try.
Thanks,
-Kevin
Kevin Jacobs wrote:
>[...]
>Unit 0 is /dev/sda and the journal is /dev/sda5. Unit 1 is /dev/sdb and the
>backup filesystem is /dev/sdb1. The data= mode is the default (data=ordered),
>and /dev/sdb1 is mounted noatime. I've also applied the journal_refile_buffer
>patch posted by AKPM yesterday morning.
>
I think you should have your journal on its own spindle if
you are going to the trouble of having an external one.
>
>>If you want to play with the scheduler, try set
>>/sys/block/blockdev*/queue/nr_requests = 8192
>>
>
>This killed the entire system -- livelocking it with no disk activity to the
>point that I had to hit the reset button. So did setting nr_requests on
>sda and sdb from 128 to 256. The problem hit before the rsync, during an
>'rm -Rf' of a previously copied tree.
>
OK, I'll have a look into that.
>
>>then try
>>/sys/block/blockdev*/queue/iosched/antic_expire = 0
>>
>
>This seemed to make no difference.
>
That's alright then.
>
>>Try the above combinations with and without a big TCQ depth. You should
>>be able to set them on the fly and see what happens to throughput during
>>the operation. Let me know what you see.
>>
>
>I'm not sure how to change TCQ depth on the fly. Last I knew, it was a
>compiled-in parameter.
>
Don't worry too much about this. It's probably not a big issue.
>
>I have some more time to experiment, so please let me know if there is
>anything else you think I should try.
>
Andrew might be able to suggest some worthwhile tests. If nothing
else, try mounting your filesystem as ext2 so you can get a
baseline figure.
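Something like this should do for the baseline (mount point assumed; it
only works if the filesystem was cleanly unmounted, with no journal
recovery pending):

    # Remount the backup filesystem as plain ext2 for a timing run;
    # the journal is simply ignored while mounted this way.
    umount /backup
    mount -t ext2 -o noatime /dev/sdb1 /backup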
Nick Piggin <[email protected]> wrote:
>
> >
> >I have some more time to experiment, so please let me know if there is
> >anything else you think I should try.
> >
> Andrew might be able to suggest some worthwhile tests, if nothing
> else, try mounting your filesystems as ext2, so you can get a
> baseline figure.
So the workload is a `cp -Rl' of a 500,000 file tree?
Vast amounts of metadata. ext2 will fly through that. Poor old ext3 has
to write everything twice, and has to keep on doing seek-intensive
checkpointing when the journal fills.
When we get Andreas's "don't bother reading empty inode blocks" speedup
going, it will help both filesystems quite a bit.
Increasing the journal size _may_ help: `tune2fs -J size=400' or `mke2fs
-j -J size=400'.
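Concretely, that means removing and recreating the journal while the
filesystem is unmounted (a sketch; note that -J size= makes an internal
journal, whereas with an external journal device the journal is simply
the size of that device):

    umount /backup
    tune2fs -O ^has_journal /dev/sdb1    # drop the existing journal
    tune2fs -J size=400 /dev/sdb1        # recreate it at 400MB
    mount /dev/sdb1 /backup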
The Orlov allocator will help this workload significantly, but you have to
give it time to settle in: it uses different decisions for the placement of
directories on disk. If the source directory of the copy was created under
a 2.4 kernel then we won't get any of that benefit.
On May 30, 2003 09:21 -0700, Andrew Morton wrote:
> So the workload is a `cp -Rl' of a 500,000 file tree?
>
> Vast amounts of metadata. ext2 will fly through that. Poor old ext3 has
> to write everything twice, and has to keep on doing seek-intensive
> checkpointing when the journal fills.
>
> When we get Andreas's "don't bother reading empty inode blocks" speedup
> going, it will help both filesystems quite a bit.
Yes, that code is specifically a win for creating lots of files at once
and also reading large chunks of itable at once, so it will help on both
sides. The difficulty is that it checks the inode bitmap to decide if
it should read/zero the inode table block, but with the advent of Alex's
no-lock inode allocation this is racy.
I'm thinking that what needs to be done is to lock the inode table buffer
head if we think all of the bits for that block are empty (before setting
a bit there). Then, if the itable block is not up-to-date, we re-check the
bitmap before making the read/zero decision; if we again find the
corresponding bits are zero, we zero the buffer, mark it up-to-date, and
mark one bit in-use for our current inode allocation. Other threads that
are trying to allocate in that region will wait on the buffer head when
they find it not up-to-date, and wake after it has been set up appropriately.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
On Fri, May 30, 2003 at 06:09:02AM -0400, Kevin Jacobs wrote:
> [...]
> SCSI ID 1 3ware 7500-8 ATA RAID Controller
> * Array Unit 0 Mirror (RAID 1) 40.01 GB OK
> + Port 0 WDC WD400BB-00DEA0 40.02 GB OK
> + Port 1 WDC WD400BB-00DEA0 40.02 GB OK
> * Array Unit 4 Striped with Parity 64K (RAID 5) 555.84 GB OK
> + Port 4 IC35L180AVV207-1 185.28 GB OK
> + Port 5 IC35L180AVV207-1 185.28 GB OK
> + Port 6 IC35L180AVV207-1 185.28 GB OK
> + Port 7 IC35L180AVV207-1 185.28 GB OK
This isn't on the Linux side of things, but since this is a backup
server and you've also got tape backups, why not just get rid of
the RAID5 costs and go with RAID0?
(You'd have to have a double fault to take you "offline" and a
triple fault to lose the data.)