Hi,
(please Cc: me, I'm not subscribed to the list)
we were performing some ext4 tests on a 16 TB GPT partition and ran into
this issue when writing a single large file with dd and syncing
afterwards.
The problem: dd is fast (cached) but the following sync is *very* slow.
# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.9423 seconds, 658 MB/s
0.01user 441.40system 7:26.10elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+794minor)pagefaults 0swaps
dd: ~16 seconds
sync: ~7 minutes
(The same test finishes in 57s with xfs!)
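The split can also be made explicit by simply timing the two phases on
their own:
# time dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000
# time sync
(If the installed coreutils dd supports it, "conv=fdatasync" makes dd
fdatasync() the output file before it reports its rate, so the flush time
shows up in dd's own numbers and the separate sync isn't needed.)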
Here's the "iostat -m /dev/sdb 1" output during dd write:
avg-cpu: %user %nice %system %iowait %steal %idle
0,00 0,00 6,62 19,35 0,00 74,03
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sdb 484,00 0,00 242,00 0 242
"iostat -m /dev/sdb 1" during the sync looks like this
avg-cpu: %user %nice %system %iowait %steal %idle
0,00 0,00 12,48 0,00 0,00 87,52
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sdb 22,00 0,00 8,00 0 8
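While the sync is crawling along like this, the amount of dirty data still
waiting for writeback can be watched in parallel (plain /proc/meminfo,
nothing ext4-specific):
# watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'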
However, the sync performance is fine if we ... (a short mkfs sketch for
these variants follows the list)
* use xfs or
* disable the ext4 journal or
* disable ext4 extents (but with the journal enabled)
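For reference, the journal-less and extent-less variants can be created by
turning the corresponding features off at mkfs time, roughly like this
(exact feature names depend on the installed e2fsprogs version):
# mkfs.ext4 -T ext4 -O ^has_journal /dev/sdb1   <- no journal
# mkfs.ext4 -T ext4 -O ^extent /dev/sdb1        <- journal, but no extents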
Here's a kernel profile of the test:
# readprofile -r
# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB_3 bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.8261 seconds, 663 MB/s
0.01user 448.55system 7:32.89elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+788minor)pagefaults 0swaps
# readprofile -m /boot/System.map-2.6.18-190.el5 | sort -nr -k 3 | head -15
3450304 default_idle 43128.8000
9532 mod_zone_page_state 733.2308
58594 find_get_pages 537.5596
58499 find_get_pages_tag 427.0000
72404 __set_page_dirty_nobuffers 310.7468
10740 __wake_up_bit 238.6667
7786 unlock_page 165.6596
1996 dec_zone_page_state 153.5385
12230 clear_page_dirty_for_io 63.0412
5938 page_waitqueue 60.5918
14440 release_pages 41.7341
12664 __mark_inode_dirty 34.6011
5281 copy_user_generic_unrolled 30.7035
323 redirty_page_for_writepage 26.9167
15537 write_cache_pages 18.9939
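(Note: readprofile needs the kernel to be booted with the "profile="
parameter, e.g. profile=2, otherwise /proc/profile has nothing to report;
whether it is set can be checked with
# grep profile /proc/cmdline
The counters were reset with "readprofile -r" before the run, as shown
above.)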
Here are three call traces from the "sync" command:
sync R running task 0 5041 5032 (NOTLB)
00000000ffffffff 000000000000000c ffff81022bb16510 00000000001ec2a3
ffff81022bb16548 ffffffffffffff10 ffffffff800d1964 0000000000000010
0000000000000286 ffff8101ff56bb48 0000000000000018 ffff81033686e970
Call Trace:
[<ffffffff800d1964>] page_mkclean+0x255/0x281
[<ffffffff8000eab5>] find_get_pages_tag+0x34/0x89
[<ffffffff800f4467>] write_cache_pages+0x21b/0x332
[<ffffffff886fad9d>] :ext4:__mpage_da_writepage+0x0/0x162
[<ffffffff886fc2f1>] :ext4:ext4_da_writepages+0x317/0x4fe
[<ffffffff8005b1f0>] do_writepages+0x20/0x2f
[<ffffffff8002fefc>] __writeback_single_inode+0x1ae/0x328
[<ffffffff8004a667>] wait_on_page_writeback_range+0xd6/0x12e
[<ffffffff80020f0d>] sync_sb_inodes+0x1b5/0x26f
[<ffffffff800f3b0b>] sync_inodes_sb+0x99/0xa9
[<ffffffff800f3b78>] __sync_inodes+0x5d/0xaa
[<ffffffff800f3bd6>] sync_inodes+0x11/0x29
[<ffffffff800e1bb0>] do_sync+0x12/0x5a
[<ffffffff800e1c06>] sys_sync+0xe/0x12
[<ffffffff8005e28d>] tracesys+0xd5/0xe0
sync R running task 0 5041 5032 (NOTLB)
ffff81022a3e4e88 ffffffff8000e930 ffff8101ff56bee8 ffff8101ff56bcd8
0000000000000000 ffffffff886fbdbc ffff8101ff56bc68 ffff8101ff56bc68
00000000002537c3 000000000001dfcd 0000000000000007 ffff8101ff56bc90
Call Trace:
[<ffffffff8000e9c4>] __set_page_dirty_nobuffers+0xde/0xe9
[<ffffffff8001b694>] find_get_pages+0x2f/0x6d
[<ffffffff886f8170>] :ext4:mpage_da_submit_io+0xd0/0x12c
[<ffffffff886fc31d>] :ext4:ext4_da_writepages+0x343/0x4fe
[<ffffffff8005b1f0>] do_writepages+0x20/0x2f
[<ffffffff8002fefc>] __writeback_single_inode+0x1ae/0x328
[<ffffffff8004a667>] wait_on_page_writeback_range+0xd6/0x12e
[<ffffffff80020f0d>] sync_sb_inodes+0x1b5/0x26f
[<ffffffff800f3b0b>] sync_inodes_sb+0x99/0xa9
[<ffffffff800f3b78>] __sync_inodes+0x5d/0xaa
[<ffffffff800f3bd6>] sync_inodes+0x11/0x29
[<ffffffff800e1bb0>] do_sync+0x12/0x5a
[<ffffffff800e1c06>] sys_sync+0xe/0x12
[<ffffffff8005e28d>] tracesys+0xd5/0xe0
sync R running task 0 5353 5348 (NOTLB)
ffff810426e04048 0000000000001200 0000000100000001 0000000000000001
0000000100000000 ffff8103dcecadf8 ffff810426dfd000 ffff8102ebd81b48
ffff810426e04048 ffff810239bbbc40 0000000000000008 00000000008447f8
Call Trace:
[<ffffffff801431db>] elv_merged_request+0x1e/0x26
[<ffffffff8000c02b>] __make_request+0x324/0x401
[<ffffffff8005c6cf>] cache_alloc_refill+0x106/0x186
[<ffffffff886f6bf9>] :ext4:walk_page_buffers+0x65/0x8b
[<ffffffff8000e9c4>] __set_page_dirty_nobuffers+0xde/0xe9
[<ffffffff886fbd42>] :ext4:ext4_writepage+0x9b/0x333
[<ffffffff886f8170>] :ext4:mpage_da_submit_io+0xd0/0x12c
[<ffffffff886fc31d>] :ext4:ext4_da_writepages+0x343/0x4fe
[<ffffffff8005b1f0>] do_writepages+0x20/0x2f
[<ffffffff8002fefc>] __writeback_single_inode+0x1ae/0x328
[<ffffffff8004a667>] wait_on_page_writeback_range+0xd6/0x12e
[<ffffffff80020f0d>] sync_sb_inodes+0x1b5/0x26f
[<ffffffff800f3b0b>] sync_inodes_sb+0x99/0xa9
[<ffffffff800f3b78>] __sync_inodes+0x5d/0xaa
[<ffffffff800f3bd6>] sync_inodes+0x11/0x29
[<ffffffff800e1bb0>] do_sync+0x12/0x5a
[<ffffffff800e1c06>] sys_sync+0xe/0x12
[<ffffffff8005e28d>] tracesys+0xd5/0xe0
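(For anyone who wants to reproduce such traces: they can be captured with
the sysrq task dump while the sync is running, e.g.
# echo 1 > /proc/sys/kernel/sysrq
# echo t > /proc/sysrq-trigger
and then read back from the kernel log with dmesg.)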
I've tried some more options (how they are switched is sketched after the
list). These do *not* influence the (bad) result:
* data=writeback and data=ordered
* disabling/enabling uninit_bg
* max_sectors_kb=512 or 4096
* io scheduler: cfq or noop
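For reference, roughly how these are switched between runs (sketch only;
device and mount point as above, and -t ext4 vs. ext4dev depends on the
kernel):
# mount -t ext4 -o noatime,data=ordered /dev/sdb1 /mnt/large
# mkfs.ext4 -T ext4 -O ^uninit_bg /dev/sdb1
# echo 4096 > /sys/block/sdb/queue/max_sectors_kb
# echo noop > /sys/block/sdb/queue/scheduler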
Some background information about the system:
OS: CentOS 5.4
Memory: 16 GB
CPUs: 2x Quad-Core Opteron 2356
IO scheduler: CFQ
Kernels:
* 2.6.18-164.11.1.el5 x86_64 (latest CentOS 5.4 kernel)
* 2.6.18-190.el5 x86_64 (latest Red Hat EL5 test kernel I've found at
  http://people.redhat.com/jwilson/el5/; according to the rpm's changelog,
  its ext4 code was updated from the 2.6.32 ext4 codebase)
* I did not try a vanilla kernel so far.
# df -h /dev/sdb{1,2}
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 15T 9.9G 14T 1% /mnt/large
/dev/sdb2 7.3T 179M 7.0T 1% /mnt/small
(parted) print
Model: easyRAID easyRAID_Q16PS (scsi)
Disk /dev/sdb: 24,0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number Start End Size File system Name Flags
1 1049kB 16,0TB 16,0TB ext3 large
2 16,0TB 24,0TB 8003GB ext3 small
("start 1049kB" is at sector 2048)
sdb is a FC HW-RAID (easyRAID_Q16PS) and consists of a RAID6 volume
created from 14 disks with a chunk size of 128 KiB.
QLogic Fibre Channel HBA Driver: 8.03.01.04.05.05-k
QLogic QLE2462 - PCI-Express Dual Channel 4Gb Fibre Channel HBA
ISP2432: PCIe (2.5Gb/s x4) @ 0000:01:00.1 hdma+, host#=8, fw=4.04.09 (486)
The ext4 filesystem was created with
mkfs.ext4 -T ext4,largefile4 -E stride=32,stripe-width=$((32*(14-2))) /dev/sdb1
or
mkfs.ext4 -T ext4 /dev/sdb1
Mount options: defaults,noatime,data=writeback
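The stride/stripe-width values are just the usual arithmetic, assuming the
default 4 KiB filesystem block size: stride = 128 KiB chunk / 4 KiB block
= 32, and stripe-width = stride * data disks = 32 * (14 - 2) = 384, which
is what the $((32*(14-2))) above expands to:
# echo $((128 / 4)) $((32 * (14 - 2)))
32 384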
Any ideas?
--
Karsten Weiss
On Fri, 26 Feb 2010, Karsten Weiss wrote:
> However, the sync performance is fine if we ...
>
> * use xfs or
> * disable the ext4 journal or
> * disable ext4 extents (but with the journal enabled)
ext3 is okay, too:
# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB_1 bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 39.7188 seconds, 264 MB/s
0.03user 21.00system 1:47.24elapsed 19%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+788minor)pagefaults 0swaps
Compare:
XFS: 57s
ext4: 7m26s
--
Karsten Weiss
Karsten Weiss <[email protected]> writes:
...
> Kernels:
> * 2.6.18-164.11.1.el5 x86_64 (latest CentOS 5.4 kernel)
> * 2.6.18-190.el5 x86_64 (latest Red Hat EL5 test kernel I've found at
>   http://people.redhat.com/jwilson/el5/; according to the rpm's changelog,
>   its ext4 code was updated from the 2.6.32 ext4 codebase)
Hmm... it is hard to predict the differences from the vanilla tree.
This is not only ext4 related; the writeback path has changed dramatically.
It is not easy to backport the writeback code to 2.6.18 with all the
performance improvements but without introducing new issues.
> * I did not try a vanilla kernel so far.
IMHO it would be really good to know the vanilla kernel's stats.
Hi Dmitry!
On Fri, 26 Feb 2010, Dmitry Monakhov wrote:
> > Kernels:
> > * 2.6.18-164.11.1.el5 x86_64 (latest CentOS 5.4 kernel)
> > * 2.6.18-190.el5 x86_64 (latest Red Hat EL5 test kernel I've found at
> >   http://people.redhat.com/jwilson/el5/; according to the rpm's changelog,
> >   its ext4 code was updated from the 2.6.32 ext4 codebase)
> Hmm... it is hard to predict the differences from the vanilla tree.
> This is not only ext4 related; the writeback path has changed dramatically.
> It is not easy to backport the writeback code to 2.6.18 with all the
> performance improvements but without introducing new issues.
> > * I did not try a vanilla kernel so far.
> IMHO it would be really good to know the vanilla kernel's stats.
I did a quick&dirty compilation of vanilla kernel 2.6.33 and repeated the
test:
# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 50.044 seconds, 210 MB/s
0.01user 13.76system 1:04.75elapsed 21%CPU (0avgtext+0avgdata 6224maxresident)k
0inputs+0outputs (0major+1049minor)pagefaults 0swaps
=> The problem shows up only with the CentOS / Red Hat 5.4 kernels (including
RH's test kernel 2.6.18-190.el5). Admittedly, ext4 is only a technology
preview in 5.4...
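For anyone wanting to repeat the vanilla test: a quick&dirty build can
start from the distro config, roughly like this (paths and the -j value
are only examples):
# cp /boot/config-2.6.18-190.el5 /usr/src/linux-2.6.33/.config
# cd /usr/src/linux-2.6.33 && yes "" | make oldconfig
# make -j8 && make modules_install install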
I've also tried the latest CentOS 5.3 kernel-2.6.18-128.7.1.el5 but
couldn't mount the device (with -t ext4dev).
2.6.18-164.el5 (the initial CentOS 5.4 kernel) has the bug, too.
I'm willing to test patches if somebody wants to debug the problem.
--
Karsten Weiss
Karsten Weiss wrote:
> Hi Dmitry!
>
> On Fri, 26 Feb 2010, Dmitry Monakhov wrote:
>
...
>>> * I did not try a vanilla kernel so far.
>> IMHO it would be really good to know the vanilla kernel's stats.
>
> I did a quick&dirty compilation of vanilla kernel 2.6.33 and repeated the
> test:
>
> # /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000 && sync"
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 50.044 seconds, 210 MB/s
> 0.01user 13.76system 1:04.75elapsed 21%CPU (0avgtext+0avgdata 6224maxresident)k
> 0inputs+0outputs (0major+1049minor)pagefaults 0swaps
>
> => The problem shows up only with the CentOS / Red Hat 5.4 kernels (including
> RH's test kernel 2.6.18-190.el5). Admittedly, ext4 is only a technology
> preview in 5.4...
>
> I've also tried the latest CentOS 5.3 kernel-2.6.18-128.7.1.el5 but
> couldn't mount the device (with -t ext4dev).
>
> 2.6.18-164.el5 (the initial CentOS 5.4 kernel) has the bug, too.
>
> I'm willing to test patches if somebody wants to debug the problem.
Ok, that's interesting. We've not had bona-fide RHEL customers report
the problem, but then maybe it hasn't been tested this way.
2.6.18-178.el5 and beyond is based on the 2.6.32 codebase for ext4.
Testing generic 2.6.32 might also be interesting as a datapoint,
if you're willing.
-Eric
Hi Eric,
On Fri, 26 Feb 2010, Eric Sandeen wrote:
> > => The problem shows up only with the CentOS / Red Hat 5.4 kernels (including
> > RH's test kernel 2.6.18-190.el5). Admittedly, ext4 is only a technology
> > preview in 5.4...
> >
> > I've also tried the latest CentOS 5.3 kernel-2.6.18-128.7.1.el5 but
> > couldn't mount the device (with -t ext4dev).
> >
> > 2.6.18-164.el5 (the initial CentOS 5.4 kernel) has the bug, too.
> >
> > I'm willing to test patches if somebody wants to debug the problem.
>
> Ok, that's interesting. We've not had bona-fide RHEL customers report
> the problem, but then maybe it hasn't been tested this way.
I think so because, as I mentioned, the issue can be reproduced with the
RH test kernel 2.6.18-190.el5 x86_64 (http://people.redhat.com/jwilson/el5/),
too.
> 2.6.18-178.el5 and beyond is based on the 2.6.32 codebase for ext4.
>
> Testing generic 2.6.32 might also be interesting as a datapoint,
> if you're willing.
Sorry for the delay, here's the (good) 2.6.32 result:
# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 46.3369 seconds, 226 MB/s
0.00user 14.17system 0:59.53elapsed 23%CPU (0avgtext+0avgdata 6224maxresident)k
0inputs+0outputs (0major+1045minor)pagefaults 0swaps
To summarize:
Bad: 2.6.18-164.el5 (CentOS)
Bad: 2.6.18-164.11.1.el5 (CentOS)
Bad: 2.6.18-190.el5 (RH)
Good: 2.6.32
Good: 2.6.33
--
Karsten Weiss
Karsten Weiss wrote:
> Hi Eric,
>
...
> To summarize:
>
> Bad: 2.6.18-164.el5 (CentOS)
> Bad: 2.6.18-164.11.1.el5 (CentOS)
> Bad: 2.6.18-190.el5 (RH)
> Good: 2.6.32
> Good: 2.6.33
>
Thanks, I'll have to investigate that. I guess something may have gotten lost
in translation in the 2.6.32->2.6.18 backport.....
-Eric
Hi Eric!
On Mon, 1 Mar 2010, Eric Sandeen wrote:
...
> > To summarize:
> >
> > Bad: 2.6.18-164.el5 (CentOS)
> > Bad: 2.6.18-164.11.1.el5 (CentOS)
> > Bad: 2.6.18-190.el5 (RH)
> > Good: 2.6.32
> > Good: 2.6.33
In the meantime I've also reproduced the problem on another machine with a
Red Hat 5.5 Beta (x86_64) installation and decided to open a bug on RH's
bugzilla:
Bad ext4 sync performance on 16 TB GPT partition
https://bugzilla.redhat.com/show_bug.cgi?id=572930
> Thanks, I'll have to investigate that. I guess something may have gotten lost
> in translation in the 2.6.32->2.6.18 backport.....
Did you come up with anything I could test?
Is anyone else able to reproduce the problem?
--
Karsten Weiss
Karsten Weiss wrote:
> Hi Eric!
>
...
> In the meantime I've also reproduced the problem on another machine with a
> Red Hat 5.5 Beta (x86_64) installation and decided to open a bug on RH's
> bugzilla:
>
> Bad ext4 sync performance on 16 TB GPT partition
> https://bugzilla.redhat.com/show_bug.cgi?id=572930
Thanks, and thanks for double-checking upstream.
>> Thanks, I'll have to investigate that. I guess something may have gotten lost
>> in translation in the 2.6.32->2.6.18 backport.....
>
> Did you come up with anything I could test?
I'll look into this...
> Is anyone else able to reproduce the problem?
Since it's not an upstream problem, this issue is probably best discussed
in the RHEL bug now.
Thanks,
-Eric