2014-07-07 21:13:50

by Benjamin LaHaise

Subject: ext4: first write to large ext3 filesystem takes 96 seconds

Hi folks,

I've just run into a bug with the ext4 codebase in 3.4.91 that doesn't seem
to exist in ext3, and was wondering if anyone has encountered this before.
I have a 7.4TB ext3 filesystem that has been filled with 1.8TB of data.
When this filesystem is freshly mounted, the first write to the filesystem
takes a whopping 96 seconds to complete, during which time the system is
reading about 1000 blocks per second. Subsequent writes are much quicker.
The problem seems to be that ext4 is loading all of the bitmaps on the
filesystem before the first write proceeds. The backtrace looks roughly as
follows:

[ 4480.921288] [<ffffffff81437472>] ? dm_request_fn+0x112/0x1c0
[ 4480.921292] [<ffffffff81244bf5>] ? __blk_run_queue+0x15/0x20
[ 4480.921294] [<ffffffff81246670>] ? queue_unplugged+0x20/0x50
[ 4480.921297] [<ffffffff8154ed05>] schedule+0x45/0x60
[ 4480.921299] [<ffffffff8154ef7c>] io_schedule+0x6c/0xb0
[ 4480.921301] [<ffffffff810fb7a9>] sleep_on_buffer+0x9/0x10
[ 4480.921303] [<ffffffff8154d435>] __wait_on_bit+0x55/0x80
[ 4480.921306] [<ffffffff810fb7a0>] ? unmap_underlying_metadata+0x40/0x40
[ 4480.921308] [<ffffffff810fb7a0>] ? unmap_underlying_metadata+0x40/0x40
[ 4480.921310] [<ffffffff8154d4d8>] out_of_line_wait_on_bit+0x78/0x90
[ 4480.921312] [<ffffffff8104d5b0>] ? autoremove_wake_function+0x40/0x40
[ 4480.921315] [<ffffffff810fb756>] __wait_on_buffer+0x26/0x30
[ 4480.921318] [<ffffffff81146258>] ext4_wait_block_bitmap+0x138/0x190
[ 4480.921321] [<ffffffff8116c816>] ext4_mb_init_cache+0x1e6/0x5f0
[ 4480.921324] [<ffffffff8109654a>] ? add_to_page_cache_locked+0x9a/0xd0
[ 4480.921327] [<ffffffff810965b1>] ? add_to_page_cache_lru+0x31/0x50
[ 4480.921330] [<ffffffff8116cd1f>] ext4_mb_init_group+0xff/0x1e0
[ 4480.921332] [<ffffffff8116ce9f>] ext4_mb_good_group+0x9f/0x130
[ 4480.921334] [<ffffffff8116e41f>] ext4_mb_regular_allocator+0x1bf/0x3d0
[ 4480.921337] [<ffffffff8116c0ac>] ? ext4_mb_normalize_request+0x26c/0x4d0
[ 4480.921339] [<ffffffff811716ce>] ext4_mb_new_blocks+0x2ee/0x490
[ 4480.921342] [<ffffffff81174c11>] ? ext4_get_branch+0x101/0x130
[ 4480.921345] [<ffffffff8117639c>] ext4_ind_map_blocks+0x9bc/0xc10
[ 4480.921347] [<ffffffff810fae11>] ? __getblk+0x21/0x2b0
[ 4480.921350] [<ffffffff8114c9a3>] ext4_map_blocks+0x293/0x390
[ 4480.921353] [<ffffffff8117a462>] ? do_get_write_access+0x1d2/0x450
[ 4480.921355] [<ffffffff810cb0b4>] ? kmem_cache_alloc+0xa4/0xc0
[ 4480.921358] [<ffffffff8114d9e9>] _ext4_get_block+0xa9/0x140
[ 4480.921360] [<ffffffff8114dab1>] ext4_get_block+0x11/0x20
[ 4480.921362] [<ffffffff810fbda5>] __block_write_begin+0x2b5/0x470
[ 4480.921365] [<ffffffff8114daa0>] ? noalloc_get_block_write+0x20/0x20
[ 4480.921368] [<ffffffff81096679>] ? grab_cache_page_write_begin+0xa9/0x100
[ 4480.921370] [<ffffffff8114c1e2>] ext4_write_begin+0x132/0x2f0
[ 4480.921373] [<ffffffff81095869>] generic_file_buffered_write+0x119/0x260
[ 4480.921376] [<ffffffff81096eef>] __generic_file_aio_write+0x27f/0x430
[ 4480.921379] [<ffffffff810cfbba>] ? do_huge_pmd_anonymous_page+0x1ea/0x2d0
[ 4480.921382] [<ffffffff81097101>] generic_file_aio_write+0x61/0xc0
[ 4480.921384] [<ffffffff81147c18>] ext4_file_write+0x68/0x2a0
[ 4480.921387] [<ffffffff8154e723>] ? __schedule+0x2c3/0x800
[ 4480.921389] [<ffffffff810d2a41>] do_sync_write+0xe1/0x120
[ 4480.921392] [<ffffffff8154eefa>] ? _cond_resched+0x2a/0x40
[ 4480.921395] [<ffffffff810d31c9>] vfs_write+0xc9/0x170
[ 4480.921397] [<ffffffff810d3910>] sys_write+0x50/0x90
[ 4480.921400] [<ffffffff8155155f>] sysenter_dispatch+0x7/0x1a

Any thoughts? Have there been any changes to this area of the ext4 code?

-ben
--
"Thought is the essence of where you are now."


2014-07-08 00:16:57

by Theodore Ts'o

Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds

On Mon, Jul 07, 2014 at 05:13:49PM -0400, Benjamin LaHaise wrote:
> Hi folks,
>
> I've just run into a bug with the ext4 codebase in 3.4.91 that doesn't seem
> to exist in ext3, and was wondering if anyone has encountered this before.
> I have a 7.4TB ext3 filesystem that has been filled with 1.8TB of data.
> When this filesystem is freshly mounted, the first write to the filesystem
> takes a whopping 96 seconds to complete, during which time the system is
> reading about 1000 blocks per second. Subsequent writes are much quicker.
> The problem seems to be that ext4 is loading all of the bitmaps on the
> filesystem before the first write proceeds. The backtrace looks roughly as
> follows:

So the issue is that ext3 will just allocate the first free block it
can find, even if it is a single free block in block group #1001,
followed by a single free block in block group #2002. Ext4 tries harder
to find contiguous blocks.

If you are using an ext3 file system format, the block allocation
bitmaps are scattered across the entire file system, so we end up
doing a lot of random 4k seeks.

We can try to be a bit smarter about how we try to search the file
system for free blocks.

Out of curiosity, can you send me a copy of the contents of:

/proc/fs/ext4/dm-XX/mb_groups

Thanks!!

- Ted

2014-07-08 01:35:11

by Benjamin LaHaise

Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds

Hi Ted,

On Mon, Jul 07, 2014 at 08:16:55PM -0400, Theodore Ts'o wrote:
> On Mon, Jul 07, 2014 at 05:13:49PM -0400, Benjamin LaHaise wrote:
> > Hi folks,
> >
> > I've just run into a bug with the ext4 codebase in 3.4.91 that doesn't seem
> > to exist in ext3, and was wondering if anyone has encountered this before.
> > I have a 7.4TB ext3 filesystem that has been filled with 1.8TB of data.
> > When this filesystem is freshly mounted, the first write to the filesystem
> > takes a whopping 96 seconds to complete, during which time the system is
> > reading about 1000 blocks per second. Subsequent writes are much quicker.
> > The problem seems to be that ext4 is loading all of the bitmaps on the
> > filesystem before the first write proceeds. The backtrace looks roughly as
> > follows:
>
> So the issue is that ext3 will just allocate the first free block it
> can find, even if it is a single free block in block group #1001,
> followed by a single free block in block group #2002. Ext4 tries harder
> to find contiguous blocks.
>
> If you are using an ext3 file system format, the block allocation
> bitmaps are scattered across the entire file system, so we end up
> doing a lot of random 4k seeks.

Yeah, we're kinda stuck with ext3 on disk for now due to a bunch of reasons.
The main reason for using the ext4 codebase instead of ext3 is the slightly
better performance for some metadata intensive operations (like unlink and
sync writes).

> We can try to be a bit smarter about how we try to search the file
> system for free blocks.
>
> Out of curiosity, can you send me a copy of the contents of:
>
> /proc/fs/ext4/dm-XX/mb_groups

Sure -- I put a copy at http://www.kvack.org/~bcrl/mb_groups as it's a bit
too big for the mailing list. The filesystem in question has a couple of
11GB files on it, with the remainder of the space being taken up by files
7200016 bytes in size. Cheers,

-ben

> Thanks!!
>
> - Ted

--
"Thought is the essence of where you are now."

2014-07-08 03:54:08

by Theodore Ts'o

Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds

On Mon, Jul 07, 2014 at 09:35:11PM -0400, Benjamin LaHaise wrote:
>
> Sure -- I put a copy at http://www.kvack.org/~bcrl/mb_groups as it's a bit
> too big for the mailing list. The filesystem in question has a couple of
> 11GB files on it, with the remainder of the space being taken up by files
> 7200016 bytes in size.

Right, so looking at mb_groups we see a number of problems. There
are a large number of block groups which look like this:

#group: free frags first [ 2^0 2^1 2^2 2^3 2^4 2^5 2^6 2^7 2^8 2^9 2^10 2^11 2^12 2^13 ]
#288 : 1540 7 13056 [ 0 0 1 0 0 0 0 0 6 0 0 0 0 0 ]

It would be very interesting to see what allocation pattern resulted
in so many block groups with this layout. Before we read in the
allocation bitmap, all we know from the block group descriptors is
that there are 1540 free blocks. What we don't know is that they are
broken up into six 256-block free regions plus one 4-block region.

If we try to allocate a 1024-block region, we'll end up searching a
large number of these block groups before finding one which is suitable.

Or there is a large collection of block groups that look like this:

#834 : 4900 39 514 [ 0 20 5 5 16 6 4 8 6 1 1 0 0 0 ]

Similarly, we could try to look for a contiguous 2048-block range, but even
though there are 4900 blocks available, we can't tell the difference
between a free block layout which looks like the above and one that
looks like this:

#834 : 4900 39 514 [ 0 6 0 1 3 5 1 4 0 0 0 2 0 0 ]

We could try going straight for the largely empty block groups, but
that's more likely to fragment the file system more quickly, and then,
once those largely empty block groups are partially used, we'll end up
taking a long time while we scan all of the block groups.
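
To make that concrete, here is a tiny stand-alone sketch (not kernel code,
and the names are made up) that takes the 2^k histogram for group #288
above and shows what it tells us that the bare free count in the group
descriptor does not:

#include <stdio.h>

/*
 * counts[k] = number of free buddy chunks of 2^k blocks, i.e. the
 * 2^0..2^13 columns from mb_groups for group #288.  The group descriptor
 * only gives us the total; the buddy data (which requires reading the
 * bitmap) also gives a lower bound on the largest contiguous free run.
 */
int main(void)
{
	int counts[14] = { 0, 0, 1, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0 };
	int goal = 1024;	/* blocks we would like contiguously */
	int nfree = 0, largest = 0;
	int k;

	for (k = 0; k < 14; k++) {
		nfree += counts[k] << k;
		if (counts[k])
			largest = 1 << k;
	}
	/* prints: free=1540 largest buddy chunk=256 */
	printf("free=%d largest buddy chunk=%d\n", nfree, largest);
	printf("a %d block goal %s be satisfied from a single chunk here\n",
	       goal, largest >= goal ? "can" : "cannot");
	return 0;
}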

- Ted



2014-07-08 05:12:01

by Andreas Dilger

Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds

The main problem here is that reading all of the block bitmaps takes
a huge amount of time for a large filesystem.

7.4TB / 128MB per group ~= 60000 groups
60000 bitmap reads at the ~1000 reads/sec you're seeing ~= 60s or more

So that is what is making things slow. Once the allocator has all the
blocks in memory there are no problems. There are some heuristics
to skip bitmaps that are totally full, but they don't work in your case.

This is why the flex_bg feature was created - to allow the bitmaps
to be read from disk without seeks. This also speeds up e2fsck by
the same 96s that would otherwise be wasted waiting for the disk.

Backporting flex_bg to ext3 would be fairly trivial - just disable the checks
for the location of the bitmaps at mount time. However, using it
requires that you reformat your filesystem with "-O flex_bg" to
get the improved layout.

The other option (if your runtime environment allows it) is to prefetch
the block bitmaps using "dumpe2fs /dev/XXX > /dev/null" before the
filesystem is in use. This still takes 90s, but can be started early in
the boot process on each disk in parallel.

Cheers, Andreas

> On Jul 7, 2014, at 18:16, Theodore Ts'o <[email protected]> wrote:
>
> On Mon, Jul 07, 2014 at 05:13:49PM -0400, Benjamin LaHaise wrote:
>> Hi folks,
>>
>> I've just run into a bug with the ext4 codebase in 3.4.91 that doesn't seem
>> to exist in ext3, and was wondering if anyone has encountered this before.
>> I have a 7.4TB ext3 filesystem that has been filled with 1.8TB of data.
>> When this filesystem is freshly mounted, the first write to the filesystem
>> takes a whopping 96 seconds to complete, during which time the system is
>> reading about 1000 blocks per second. Subsequent writes are much quicker.
>> The problem seems to be that ext4 is loading all of the bitmaps on the
>> filesystem before the first write proceeds. The backtrace looks roughly as
>> follows:
>
> So the issue is that ext3 will just allocate the first free block it
> can find, even if it is a single free block in block group #1001,
> followed by a single free block in block group #2002. Ext4 tries harder
> to find contiguous blocks.
>
> If you are using an ext3 file system format, the block allocation
> bitmaps are scattered across the entire file system, so we end up
> doing a lot of random 4k seeks.
>
> We can try to be a bit smarter about how we try to search the file
> system for free blocks.
>
> Out of curiosity, can you send me a copy of the contents of:
>
> /proc/fs/ext4/dm-XX/mb_groups
>
> Thanks!!
>
> - Ted

2014-07-08 14:53:54

by Benjamin LaHaise

Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds

On Mon, Jul 07, 2014 at 11:54:05PM -0400, Theodore Ts'o wrote:
> On Mon, Jul 07, 2014 at 09:35:11PM -0400, Benjamin LaHaise wrote:
> >
> > Sure -- I put a copy at http://www.kvack.org/~bcrl/mb_groups as it's a bit
> > too big for the mailing list. The filesystem in question has a couple of
> > 11GB files on it, with the remainder of the space being taken up by files
> > 7200016 bytes in size.
>
> Right, so looking at mb_groups we see a number of problems. There
> are a large number of block groups which look like this:
>
> #group: free frags first [ 2^0 2^1 2^2 2^3 2^4 2^5 2^6 2^7 2^8 2^9 2^10 2^11 2^12 2^13 ]
> #288 : 1540 7 13056 [ 0 0 1 0 0 0 0 0 6 0 0 0 0 0 ]
>
> It would be very interesting to see what allocation pattern resulted
> in so many block groups with this layout. Before we read in the
> allocation bitmap, all we know from the block group descriptors is
> that there are 1540 free blocks. What we don't know is that they are
> broken up into six 256-block free regions plus one 4-block region.

I did have to make a change to the ext4 inode allocator to bias things
towards allocating inodes at the beginning of the disk (see below).
Without that change the allocation pattern of writes to the filesystem
resulted in a significant performance regression relative to ext3,
owing mostly to the fact that fallocate() on ext4 is unimplemented for
indirect style metadata. (Note that we mount the filesystem with the
noorlov mount option added in the patch below.)

With that change, the workload essentially consists of writing 7200016-byte
files, each in a single write() operation, rotating between 100
subdirectories off the root of the filesystem.
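
For reference, the write path amounts to roughly the following (a sketch
only -- the directory layout and names here are illustrative, and error
handling is trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define FILE_SIZE 7200016	/* bytes per file */
#define NUM_DIRS  100		/* subdirectories off the fs root */

/* Write one file: preallocate, then a single write().  fallocate()
 * fails with EOPNOTSUPP for indirect-block files on this kernel, so it
 * does not actually help here. */
static int write_one(const char *root, unsigned long seq, const char *buf)
{
	char path[256];
	int fd;

	snprintf(path, sizeof(path), "%s/%lu/msg-%lu",
		 root, seq % NUM_DIRS, seq);
	fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	(void)fallocate(fd, 0, 0, FILE_SIZE);
	if (write(fd, buf, FILE_SIZE) != FILE_SIZE) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(int argc, char **argv)
{
	static char buf[FILE_SIZE];	/* payload omitted */
	unsigned long seq = argc > 2 ? strtoul(argv[2], NULL, 10) : 0;

	return write_one(argc > 1 ? argv[1] : "/mnt/data", seq, buf) ? 1 : 0;
}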

> If we try to allocate a 1024-block region, we'll end up searching a
> large number of these block groups before finding one which is suitable.
>
> Or there is a large collection of block groups that look like this:
>
> #834 : 4900 39 514 [ 0 20 5 5 16 6 4 8 6 1 1 0 0 0 ]
>
> Similarly, we could try to look for a contiguous 2048-block range, but even
> though there are 4900 blocks available, we can't tell the difference
> between a free block layout which looks like the above and one that
> looks like this:
>
> #834 : 4900 39 514 [ 0 6 0 1 3 5 1 4 0 0 0 2 0 0 ]
>
> We could try going straight for the largely empty block groups, but
> that's more likely to fragment the file system more quickly, and then,
> once those largely empty block groups are partially used, we'll end up
> taking a long time while we scan all of the block groups.

Fragmentation is not a significant concern for the workload in question.
Write performance is much more important to us than read performance, and
read performance tends to degrade to random reads owing to the fact that
the system can have many queues (~16k) issuing reads. Hence, getting the
block allocator to place writes as close to sequentially on disk as
possible is an important part of performance. Ext4 with indirect
blocks has a tendency to leave gaps between files, which degrades
performance for this workload, since files tend not to be packed as closely
together as they were with ext3. Ext4 with extents + fallocate() packs
files on disk without any gaps, but turning on extents is not an option
(unfortunately, as a 20+ minute fsck time / outage as part of an upgrade is
not viable).

-ben

> - Ted
>

--
"Thought is the essence of where you are now."

diff -pu ./fs/ext4/ext4.h /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/ext4.h
--- ./fs/ext4/ext4.h 2014-03-12 16:32:21.077386952 -0400
+++ /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/ext4.h 2014-07-03 14:05:14.000000000 -0400
@@ -962,6 +962,7 @@ struct ext4_inode_info {

#define EXT4_MOUNT2_EXPLICIT_DELALLOC 0x00000001 /* User explicitly
specified delalloc */
+#define EXT4_MOUNT2_NO_ORLOV 0x00000002 /* Disable orlov */

#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
diff -pu ./fs/ext4/ialloc.c /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/ialloc.c
--- ./fs/ext4/ialloc.c 2014-03-12 16:32:21.078386958 -0400
+++ /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/ialloc.c 2014-05-26 14:22:23.000000000 -0400
@@ -517,6 +517,9 @@ static int find_group_other(struct super
struct ext4_group_desc *desc;
int flex_size = ext4_flex_bg_size(EXT4_SB(sb));

+ if (test_opt2(sb, NO_ORLOV))
+ goto do_linear;
+
/*
* Try to place the inode is the same flex group as its
* parent. If we can't find space, use the Orlov algorithm to
@@ -589,6 +592,7 @@ static int find_group_other(struct super
return 0;
}

+do_linear:
/*
* That failed: try linear search for a free inode, even if that group
* has no free blocks.
@@ -655,7 +659,7 @@ struct inode *ext4_new_inode(handle_t *h
goto got_group;
}

- if (S_ISDIR(mode))
+ if (!test_opt2(sb, NO_ORLOV) && S_ISDIR(mode))
ret2 = find_group_orlov(sb, dir, &group, mode, qstr);
else
ret2 = find_group_other(sb, dir, &group, mode);
diff -pu ./fs/ext4/super.c /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/super.c
--- ./fs/ext4/super.c 2014-03-12 16:32:21.080386971 -0400
+++ /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/super.c 2014-05-26 14:22:23.000000000 -0400
@@ -1191,6 +1201,7 @@ enum {
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
+ Opt_noorlov
};

static const match_table_t tokens = {
@@ -1210,6 +1221,7 @@ static const match_table_t tokens = {
{Opt_debug, "debug"},
{Opt_removed, "oldalloc"},
{Opt_removed, "orlov"},
+ {Opt_noorlov, "noorlov"},
{Opt_user_xattr, "user_xattr"},
{Opt_nouser_xattr, "nouser_xattr"},
{Opt_acl, "acl"},
@@ -1376,6 +1388,7 @@ static const struct mount_opts {
int token;
int mount_opt;
int flags;
+ int mount_opt2;
} ext4_mount_opts[] = {
{Opt_minix_df, EXT4_MOUNT_MINIX_DF, MOPT_SET},
{Opt_bsd_df, EXT4_MOUNT_MINIX_DF, MOPT_CLEAR},
@@ -1444,6 +1457,7 @@ static const struct mount_opts {
{Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
+ {Opt_noorlov, 0, MOPT_SET, EXT4_MOUNT2_NO_ORLOV},
{Opt_err, 0, 0}
};

@@ -1562,6 +1576,7 @@ static int handle_mount_opt(struct super
} else {
clear_opt(sb, DATA_FLAGS);
sbi->s_mount_opt |= m->mount_opt;
+ sbi->s_mount_opt2 |= m->mount_opt2;
}
#ifdef CONFIG_QUOTA
} else if (m->flags & MOPT_QFMT) {
@@ -1585,10 +1600,13 @@ static int handle_mount_opt(struct super
WARN_ON(1);
return -1;
}
- if (arg != 0)
+ if (arg != 0) {
sbi->s_mount_opt |= m->mount_opt;
- else
+ sbi->s_mount_opt2 |= m->mount_opt2;
+ } else {
sbi->s_mount_opt &= ~m->mount_opt;
+ sbi->s_mount_opt2 &= ~m->mount_opt2;
+ }
}
return 1;
}
@@ -1736,11 +1754,15 @@ static int _ext4_show_options(struct seq
if (((m->flags & (MOPT_SET|MOPT_CLEAR)) == 0) ||
(m->flags & MOPT_CLEAR_ERR))
continue;
- if (!(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)))
+ if (!(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)) &&
+ !(m->mount_opt2 & sbi->s_mount_opt2))
continue; /* skip if same as the default */
- if ((want_set &&
- (sbi->s_mount_opt & m->mount_opt) != m->mount_opt) ||
- (!want_set && (sbi->s_mount_opt & m->mount_opt)))
+ if (want_set &&
+ (((sbi->s_mount_opt & m->mount_opt) != m->mount_opt) ||
+ ((sbi->s_mount_opt2 & m->mount_opt2) != m->mount_opt2)))
+ continue; /* select Opt_noFoo vs Opt_Foo */
+ if (!want_set && ((sbi->s_mount_opt & m->mount_opt) ||
+ (sbi->s_mount_opt2 & m->mount_opt2)))
continue; /* select Opt_noFoo vs Opt_Foo */
SEQ_OPTS_PRINT("%s", token2str(m->token));
}
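
For completeness, with the patch applied the new behaviour is enabled at
mount time, e.g.:

	mount -t ext4 -o noorlov /dev/XXX /mnt

which skips the Orlov allocator and sends find_group_other() straight to
its linear scan, giving the bias toward the beginning of the disk described
above.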

2014-07-30 14:49:29

by Benjamin LaHaise

Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds

Hi Andreas, Ted,

I've finally had some more time to dig into this problem, and it's worse
than I initially thought in that it occurs on normal ext4 filesystems.

On Mon, Jul 07, 2014 at 11:11:58PM -0600, Andreas Dilger wrote:
...
> The main problem here is that reading all of the block bitmaps takes
> a huge amount of time for a large filesystem.

Very true.

...
>
> 7.4TB / 128MB per group ~= 60000 groups
> 60000 bitmap reads at the ~1000 reads/sec you're seeing ~= 60s or more
>
> So that is what is making things slow. Once the allocator has all the
> blocks in memory there are no problems. There are some heuristics
> to skip bitmaps that are totally full, but they don't work in your case.
>
> This is why the flex_bg feature was created - to allow the bitmaps
> to be read from disk without seeks. This also speeds up e2fsck by
> the same 96s that would otherwise be wasted waiting for the disk.

Unfortunately, that isn't the case.

> Backporting flex_bg to ext3 would be fairly trivial - just disable the checks
> for the location of the bitmaps at mount time. However, using it
> requires that you reformat your filesystem with "-O flex_bg" to
> get the improved layout.

flex_bg is not sufficient to resolve this issue. Using a native ext4
formatted filesystem initialized with mke4fs 1.41.12, this problem still
occurs. I created a 7.1TB filesystem and filled it to about 92% with
8MB files. The time to create a new 8MB file after a fresh mount ranges
from 0.017 seconds to 13.2 seconds; the outliers correlate with bitmaps
being read from disk. A copy of /proc/fs/ext4/dm-2/mb_groups from this
92% full fs is available at http://www.kvack.org/~bcrl/mb_groups.ext4-92

Note that it isn't necessarily the first allocating write to the filesystem
that is the worst in terms of timing; it can end up being the 10th or even
the 100th attempt.
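
The measurement amounts to timing something like the following after each
fresh mount (a minimal sketch, not the actual test harness; the path is
made up):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define SZ (8 << 20)	/* 8MB test file */

/* Time a single allocating 8MB write on a freshly mounted filesystem. */
int main(int argc, char **argv)
{
	static char buf[SZ];
	struct timespec t0, t1;
	const char *path = argc > 1 ? argv[1] : "/mnt/test/newfile";
	int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0)
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (write(fd, buf, SZ) != SZ)
		return 1;
	fsync(fd);	/* make sure the allocation and I/O completed */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	close(fd);
	printf("%.3f seconds\n", (t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}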

> The other option (if your runtime environment allows it) is to prefetch
> the block bitmaps using "dumpe2fs /dev/XXX > /dev/null" before the
> filesystem is in use. This still takes 90s, but can be started early in
> the boot process on each disk in parallel.

That isn't a solution. Prefetching is impossible in my particular use-case,
as the filesystem is being mounted after a failover from another node --
any data prefetched prior to switching active nodes is not guaranteed to be
valid.

This seems like a pretty serious regression relative to ext3. Why can't
ext4's mballoc pick better block groups to attempt allocating from based
on the free block counts in the block group summaries?

-ben
--
"Thought is the essence of where you are now."

2014-07-31 13:03:39

by Theodore Ts'o

Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds

On Wed, Jul 30, 2014 at 10:49:28AM -0400, Benjamin LaHaise wrote:
> This seems like a pretty serious regression relative to ext3. Why can't
> ext4's mballoc pick better block groups to attempt allocating from based
> on the free block counts in the block group summaries?

Allocation algorithms are *always* tradeoffs. So I don't think
regression is necessarily the best way to think about things.
Unfortunately, your use case really doesn't work well with how we have
set things up with ext4 now. Sure, if your specific use case is one
where you are mostly allocating 8MB files, then we can add a
special case where if you are allocating 32768 blocks, we should
search for block groups that have 32768 blocks free. And if that's
what you are asking for, we can certainly do that.

The problem is that free block counts don't work well in general. If
I see that the free block count is 2048 blocks, that doesn't tell me
the free blocks are in a contiguous single chunk of 2048 blocks, or
2048 single block items. (We do actually pay attention to free
blocks, by the way, but it's in a nuanced way.)

If the only goal you have is fast block allocation after fail over,
you can always use the VFAT block allocation --- i.e., use the first
free block in the file system. Unfortunately, it will result in a
very badly fragmented file system, as Microsoft and its users
discovered.

I'm sure there are things we could do that would make things better for
your workload (if you want to tell us in great detail exactly what the
file/block allocation patterns are for your workload), and perhaps
even better in general, but the challenge is making sure we don't
regress for other workloads --- and this includes long-term
fragmentation resistance. This is a hard problem. Kvetching about
how it's so horrible just for you isn't really helpful for solving it.

(BTW, one of the problems is that ext4_mb_normalize_request caps large
allocations so that we use the same goal length for multiple passes as
we search for good block groups. We might want to use the original
goal length --- so long as it is less than 32768 blocks --- for the
first scan, or at least for goal lengths which are powers of two. So
if your application is regularly allocating files which are exactly
8MB, there are probably some optimizations that we could apply. But
if they aren't exactly 8MB, life gets a bit trickier.)
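
For the archives, the rounding in question is conceptually something like
the following -- a heavily simplified illustration with a made-up name, not
the real ext4_mb_normalize_request(), which also aligns the start offset
and consults preallocations and locality groups:

/* Illustration only: round the requested length up to a power-of-two
 * goal and cap it, so different original lengths end up chasing the
 * same normalized goal on every scan pass. */
static unsigned int normalize_goal_len(unsigned int req_blocks)
{
	unsigned int goal = 4;		/* 16K with 4K blocks */
	const unsigned int cap = 2048;	/* 8MB with 4K blocks */

	while (goal < req_blocks && goal < cap)
		goal <<= 1;
	return goal;
}

With 4K blocks this puts both an exactly-8MB request and a 7200016-byte
(1758 block) request on the same 2048 block goal.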

Regards,

- Ted

2014-07-31 14:04:35

by Benjamin LaHaise

Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds

On Thu, Jul 31, 2014 at 09:03:32AM -0400, Theodore Ts'o wrote:
> On Wed, Jul 30, 2014 at 10:49:28AM -0400, Benjamin LaHaise wrote:
> > This seems like a pretty serious regression relative to ext3. Why can't
> > ext4's mballoc pick better block groups to attempt allocating from based
> > on the free block counts in the block group summaries?
>
> Allocation algorithms are *always* tradeoffs. So I don't think
> regression is necessarily the best way to think about things.
> Unfortunately, your use case really doesn't work well with how we have
> set things up with ext4 now. Sure, if your specific use case is one
> where you are mostly allocating 8MB files, then we can add a
> special case where if you are allocating 32768 blocks, we should
> search for block groups that have 32768 blocks free. And if that's
> what you are asking for, we can certainly do that.

The workload targets allocating 8MB files, mostly because that is a size
that is large enough to perform fairly decently, but small enough to not
incur too much latency for each write. Depending on other dynamics in the
system, it's possible to end up with files as small as 8K, or as large as
30MB. The target file size can certainly be tuned up or down if that makes
life easier for the filesystem.

> The problem is that free block counts don't work well in general. If
> I see that the free block count is 2048 blocks, that doesn't tell me
> the free blocks are in a contiguous single chunk of 2048 blocks, or
> 2048 single block items. (We do actually pay attention to free
> blocks, by the way, but it's in a nuanced way.)
>
> If the only goal you have is fast block allocation after fail over,
> you can always use the VFAT block allocation --- i.e., use the first
> free block in the file system. Unfortunately, it will result in a
> very badly fragmented file system, as Microsoft and its users
> discovered.

Fragmentation is not a huge concern, and is a more acceptable tradeoff than
an increase in the time to perform an allocation. Time to perform a write is
hugely important, as the system will have more and more data coming in as
time progresses. At present under load the system has to be able to sustain
550MB/s of writes to disk for an extended period of time. With 8MB writes
that means roughly 70 files per second, so we can't tolerate very many
multi-second writes.
I am of the opinion that expecting the filesystem to be able to sustain
550MB/s is reasonable given that the underlying disk array can perform
sequential reads/writes at more than 1GB/s and has a reasonably large
amount of write back cache (512MB) on the RAID controller.

The use-case is essentially making use of the filesystem as an elastic
buffer for queues of messages. Under normal conditions all of the data
is received and then sent out within a fairly short period of time, but
sometimes there are receivers that are slow or offline which means that
the in memory buffers get filled and need to be spilled out to disk.
Many users of the system cycle this behaviour over the course of a single
day. They receive a lot of data during business hours, then process and
drain it over the course of the evening. Since everything is cyclic, and
reads are slow anyways, long term fragmentation of the filesystem isn't a
significant concern.

> I'm sure there are things we could do that would make things better for
> your workload (if you want to tell us in great detail exactly what the
> file/block allocation patterns are for your workload), and perhaps
> even better in general, but the challenge is making sure we don't
> regress for other workloads --- and this includes long-term
> fragmentation resistance. This is a hard problem. Kvetching about
> how it's so horrible just for you isn't really helpful for solving it.

I'm kvetching mostly because the mballoc code is hugely complicated and
easy to break (and oh have I broken it). If you can point me in the right
direction for possible improvements that you think might improve mballoc,
I'll certainly give them a try. Hopefully the above descriptions of the
workload make it a bit easier to understand what's going on in the big
picture.

I also don't think this problem is limited to my particular use-case.
Any ext4 filesystem that is 7TB or more and gets up into the 80-90%
utilization will probably start exhibiting this problem. I do wonder if
it is at all possible to fix this issue without replacing the bitmaps used
to track free space with something better suited to the task on such large
filesystems. Pulling in hundreds of megabytes of bitmap blocks is always
going to hurt. Fixing that would mean either compressing the bitmaps into
something that can be read more quickly, or wholesale replacement of the
bitmaps with something else.

> (BTW, one of the problems is that ext4_mb_normalize_request caps large
> allocations so that we use the same goal length for multiple passes as
> we search for good block groups. We might want to use the original
> goal length --- so long as it is less than 32768 blocks --- for the
> first scan, or at least for goal lengths which are powers of two. So
> if your application is regularly allocating files which are exactly
> 8MB, there are probably some optimizations that we could apply. But
> if they aren't exactly 8MB, life gets a bit trickier.)

And sadly, they're not always 8MB. If there's anything I can do on the
application side to make the filesystem's life easier, I would happily
do so, but we're already doing fallocate() and making the writes in a
single write() operation. There's not much more I can think of that's
low hanging fruit.

Cheers,

-ben

> Regards,
>
> - Ted

--
"Thought is the essence of where you are now."

2014-07-31 15:27:14

by Theodore Ts'o

Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds

On Thu, Jul 31, 2014 at 10:04:34AM -0400, Benjamin LaHaise wrote:
>
> I'm kvetching mostly because the mballoc code is hugely complicated and
> easy to break (and oh have I broken it). If you can point me in the right
> direction for possible improvements that you think might improve mballoc,
> I'll certainly give them a try. Hopefully the above descriptions of the
> workload make it a bit easier to understand what's going on in the big
> picture.

Yes, the mballoc code is hugely complicated. A lot of this was
because there are a lot of special case hacks added over the years to
fix various corner cases that have shown up. In particular, some of
the magic in normalize_request is probably there for Lustre, and it
gives *me* headaches.

One of the things which is clearly impacting you is that you need fast
failover, whereas for most of us, we're either (a) not trying to use
large file systems (whenever possible I recommend the use of single
disk file systems), or (b) we are less worried about what happens
immediately after the file system is mounted, and more about the
steady state.

> I also don't think this problem is limited to my particular use-case.
> Any ext4 filesystem that is 7TB or more and gets up into the 80-90%
> utilization will probably start exhibiting this problem. I do wonder if
> it is at all possible to fix this issue without replacing the bitmaps used
> to track free space with something better suited to the task on such large
> filesystems. Pulling in hundreds of megabytes of bitmap blocks is always
> going to hurt. Fixing that would mean either compressing the bitmaps into
> something that can be read more quickly, or wholesale replacement of the
> bitmaps with something else.

Yes, it may be that the only solution, if you really want to stick
with ext4, is to move to some kind of extent-based tracking system
for block allocation. I wouldn't be horribly against that,
since the nice thing about block allocation is that if we lose the
extent tree, we can always regenerate the information very easily as
part of e2fsck pass 5. So moving to a tree-based allocation tracking
system is much easier from a file system recovery perspective than,
say, going to a fully dynamic inode table.

So if someone were willing to do the engineering work, I wouldn't be
opposed to having that be added to ext4.
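
To make the idea concrete, the in-memory side could start out as simple as
something like this (purely a strawman with made-up names -- nothing like
it exists in ext4 today):

/* Track free space per group as extents instead of scanning bitmaps,
 * so the allocator can ask "where is a contiguous run of N blocks?"
 * without reading every block bitmap at mount time.  If the on-disk
 * copy is lost, e2fsck pass 5 can rebuild it from the bitmaps. */
struct free_extent {
	struct rb_node	node;		/* indexed by starting block */
	ext4_grpblk_t	start;		/* first free block in the group */
	ext4_grpblk_t	len;		/* length of the free run */
};

struct group_free_info {
	struct rb_root	extents;	/* free extents in this group */
	ext4_grpblk_t	largest;	/* cached largest run, for the
					 * "good group" check */
};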

I do have to ask, though: while I always like to see more people
using ext4, and I love to have more people contributing to ext4, have
you considered using some other file system? It might be that
something like xfs is a much closer match to your requirements. Or,
perhaps more radically, have you considered going to some cluster file
system, not from a size perspective (7TB is very cute from a cluster fs
perspective), but from a reliability and robustness against server
failure perspective?

Cheers,

- Ted