Hi,
The first patch fixes an incorrect assert in ext4_mb_normalize_request();
the second patch fixes a kernel panic occurring when using a filesystem
with a low number of blocks per group.
Maurizio Lombardi (2):
ext4: fix wrong assert in ext4_mb_normalize_request()
ext4: fix bug in ext4_mb_normalize_request()
fs/ext4/mballoc.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
--
Maurizio Lombardi
the variable "size" is expressed as number of blocks and not as
number of clusters, this could trigger a kernel panic when using
ext4 with the size of a cluster different from the size of a block.
Signed-off-by: Maurizio Lombardi <[email protected]>
---
fs/ext4/mballoc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 04a5c75..08ddfda 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3135,7 +3135,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
}
BUG_ON(start + size <= ac->ac_o_ex.fe_logical &&
start > ac->ac_o_ex.fe_logical);
- BUG_ON(size <= 0 || size > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
+ BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
/* now prepare goal request */
--
Maurizio Lombardi
When normalizing the data requests, the number of blocks to allocate
must not be higher than the number of blocks per group.
The current implementation does not take care of that and it may
hit a kernel panic if the number of blocks per group is very low.
This patch fixes the bug by ensuring that the number of blocks to allocate
is always less or equal to the number of blocks per group.
How to reproduce the bug:
#mkfs.ext4 -g 1024 /dev/sdX
#mount /dev/sdX /mnt
#dd if=/dev/zero of=/mnt/test bs=1M count=10
[ 147.779177] ------------[ cut here ]------------
[ 147.780015] kernel BUG at fs/ext4/mballoc.c:3145!
[ 147.780015] invalid opcode: 0000 [#1] SMP
[ 147.780015] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc loop snd_pcm cirrus snd_timer ttm snd drm_kms_helper soundcore drm parport_pc parport i2c_piix4 pcspkr i2c_core xfs libcrc32c e1000 floppy
[ 147.780015] CPU: 0 PID: 66 Comm: kworker/u8:3 Not tainted 3.14.0-rc4+ #12
[ 147.780015] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 147.780015] Workqueue: writeback bdi_writeback_workfn (flush-7:0)
[ 147.780015] task: ffff88002ec16300 ti: ffff88002ed20000 task.ti: ffff88002ed20000
[ 147.780015] RIP: 0010:[<ffffffff812b779c>] [<ffffffff812b779c>] ext4_mb_normalize_request+0x60c/0x660
[ 147.780015] RSP: 0018:ffff88002ed21778 EFLAGS: 00010206
[ 147.780015] RAX: ffff88002e3bb000 RBX: 0000000000000800 RCX: 0000000000000006
[ 147.780015] RDX: 0000000000000800 RSI: 0000000000000046 RDI: ffff88002e3bb800
[ 147.780015] RBP: ffff88002ed217e8 R08: 000000000000000a R09: 00000000000003a2
[ 147.780015] R10: 0000000000000000 R11: 00000000000003a1 R12: ffff880000c17000
[ 147.780015] R13: 0000000000000000 R14: 0000000000000800 R15: ffff88003d1fc2f8
[ 147.780015] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[ 147.780015] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 147.780015] CR2: ffffffffff600400 CR3: 000000001c218000 CR4: 00000000000006f0
[ 147.780015] Stack:
[ 147.780015] 00000000014ca000 0000000000000800 ffff88002e3bb000 ffff88003d1fc0b0
[ 147.780015] ffff88002ed21980 0000080000000800 ffffffff812bd912 ffff88003d1fc2f8
[ 147.780015] ffff88002ed217f8 ffff88002ed21980 ffff88002e3bb000 ffff88002ed21970
[ 147.780015] Call Trace:
[ 147.780015] [<ffffffff812bd912>] ? ext4_mb_new_blocks+0x122/0x8d0
[ 147.780015] [<ffffffff812bdbe3>] ext4_mb_new_blocks+0x3f3/0x8d0
[ 147.780015] [<ffffffff8116df7e>] ? free_hot_cold_page_list+0x4e/0xa0
[ 147.780015] [<ffffffff811bc72a>] ? __kmalloc+0x1ea/0x230
[ 147.780015] [<ffffffff812af4a8>] ? ext4_ext_find_extent+0x228/0x2b0
[ 147.780015] [<ffffffff812af4a8>] ? ext4_ext_find_extent+0x228/0x2b0
[ 147.780015] [<ffffffff812b38c1>] ext4_ext_map_blocks+0x611/0xfd0
[ 147.780015] [<ffffffff81284f55>] ext4_map_blocks+0x2b5/0x4d0
[ 147.780015] [<ffffffff81289dd1>] ext4_writepages+0x621/0xd00
[ 147.780015] [<ffffffff81171bbe>] do_writepages+0x1e/0x40
[ 147.780015] [<ffffffff811fecb0>] __writeback_single_inode+0x40/0x200
[ 147.780015] [<ffffffff811ff5d1>] writeback_sb_inodes+0x1c1/0x410
[ 147.780015] [<ffffffff811ff9e4>] wb_writeback+0xf4/0x2c0
[ 147.780015] [<ffffffff810a0f2f>] ? set_worker_desc+0x6f/0x80
[ 147.780015] [<ffffffff81202d98>] bdi_writeback_workfn+0x118/0x440
[ 147.780015] [<ffffffff8109d99a>] process_one_work+0x17a/0x410
[ 147.780015] [<ffffffff8109ed9c>] worker_thread+0x11c/0x370
[ 147.780015] [<ffffffff8109ec80>] ? manage_workers.isra.21+0x2b0/0x2b0
[ 147.780015] [<ffffffff810a55b9>] kthread+0xc9/0xe0
[ 147.780015] [<ffffffff81010000>] ? ftrace_raw_event_xen_mc_flush+0x50/0x180
[ 147.780015] [<ffffffff810a54f0>] ? flush_kthread_worker+0x80/0x80
[ 147.780015] [<ffffffff816ffc3c>] ret_from_fork+0x7c/0xb0
[ 147.780015] [<ffffffff810a54f0>] ? flush_kthread_worker+0x80/0x80
[ 147.780015] Code: 1a a4 81 31 c0 e8 05 50 43 00 49 8b 44 24 08 8b 75 b8 48 c7 c7 c3 1a a4 81 48 8b 80 f8 02 00 00 48 8b 50 18 31 c0 e8 e4 4f 43 00 <0f> 0b 44 89 ee 48 c7 c7 b7 1a a4 81 31 c0 e8 d1 4f 43 00 49 8b
[ 147.780015] RIP [<ffffffff812b779c>] ext4_mb_normalize_request+0x60c/0x660
[ 147.780015] RSP <ffff88002ed21778>
[ 147.830356] ---[ end trace b82d39f39fe4e04a ]---
[ 147.831058] Kernel panic - not syncing: Fatal exception
Signed-off-by: Maurizio Lombardi <[email protected]>
---
fs/ext4/mballoc.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 08ddfda..546575a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3059,6 +3059,21 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
size = ac->ac_o_ex.fe_len << bsbits;
}
size = size >> bsbits;
+
+ /* In any case, the size cannot be greater than the number
+ * of maximum free blocks per group.
+ */
+ if (size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)) {
+ int sz_log2;
+
+ size = EXT4_BLOCKS_PER_GROUP(ac->ac_sb);
+
+ /* Recalculate the start offset */
+ sz_log2 = __fls(size << bsbits);
+ start_off = ((loff_t) ac->ac_o_ex.fe_logical >>
+ (sz_log2 - bsbits)) << sz_log2;
+ }
+
start = start_off >> bsbits;
/* don't cover already allocated blocks in selected range */
--
Maurizio Lombardi
On Mon, Mar 03, 2014 at 03:00:28PM +0100, Maurizio Lombardi wrote:
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 08ddfda..546575a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3059,6 +3059,21 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
> size = ac->ac_o_ex.fe_len << bsbits;
> }
> size = size >> bsbits;
> +
> + /* In any case, the size cannot be greater than the number
> + * of maximum free blocks per group.
> + */
> + if (size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)) {
> + int sz_log2;
> +
> + size = EXT4_BLOCKS_PER_GROUP(ac->ac_sb);
> +
> + /* Recalculate the start offset */
> + sz_log2 = __fls(size << bsbits);
> + start_off = ((loff_t) ac->ac_o_ex.fe_logical >>
> + (sz_log2 - bsbits)) << sz_log2;
> + }
> +
> start = start_off >> bsbits;
>
> /* don't cover already allocated blocks in selected range */
This definitely fixes the bug. However, there will be some cases
where if the blocks per group is sufficiently small, where for smaller
files, start_off would have been 0 instead of that complicated
expression.
Looking at ext4_mb_normalize_request(), exactly what this code is
trying to do is actually a bit opaque to me, and every time I look at
it I get a headache.
Andreas, can you take a look at this? I think you may know this code
better --- and it's somewhere I've been waiting to do some cleanup, or
at least some improved code comments.
Thanks!!
- Ted
On Thu, Mar 06, 2014 at 10:44:07AM -0500, Theodore Ts'o wrote:
> On Mon, Mar 03, 2014 at 03:00:28PM +0100, Maurizio Lombardi wrote:
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 08ddfda..546575a 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -3059,6 +3059,21 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
> > size = ac->ac_o_ex.fe_len << bsbits;
> > }
> > size = size >> bsbits;
> > +
> > + /* In any case, the size cannot be greater than the number
> > + * of maximum free blocks per group.
> > + */
> > + if (size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)) {
> > + int sz_log2;
> > +
> > + size = EXT4_BLOCKS_PER_GROUP(ac->ac_sb);
> > +
> > + /* Recalculate the start offset */
> > + sz_log2 = __fls(size << bsbits);
> > + start_off = ((loff_t) ac->ac_o_ex.fe_logical >>
> > + (sz_log2 - bsbits)) << sz_log2;
> > + }
> > +
> > start = start_off >> bsbits;
> >
> > /* don't cover already allocated blocks in selected range */
>
> This definitely fixes the bug. However, there will be some cases
> where if the blocks per group is sufficiently small, where for smaller
> files, start_off would have been 0 instead of that complicated
> expression.
Mmmm... if I correctly understood how ext4_normalize_request() works, everytime
you truncate the value of "size" you have to recalculate the correct start offset.
Note that start_off is zero only in those case where size is left untouched or
increased.
>
> Looking at ext4_mb_normalize_request(), exactly what this code is
> trying to do is actually a bit opaque to me, and every time I look at
> it I get a headache.
Yes unfortunately the code is not very easy to understand.
I may be missing something and it would be nice to have someone who knows it
better to shed some light on it.
Regards,
Maurizio Lombardi
On Thu, 6 Mar 2014, Maurizio Lombardi wrote:
> Date: Thu, 6 Mar 2014 17:54:16 +0100
> From: Maurizio Lombardi <[email protected]>
> To: Theodore Ts'o <[email protected]>
> Cc: [email protected], [email protected],
> [email protected]
> Subject: Re: [PATCH 2/2] ext4: fix bug in ext4_mb_normalize_request()
>
> On Thu, Mar 06, 2014 at 10:44:07AM -0500, Theodore Ts'o wrote:
> > On Mon, Mar 03, 2014 at 03:00:28PM +0100, Maurizio Lombardi wrote:
> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > index 08ddfda..546575a 100644
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -3059,6 +3059,21 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
> > > size = ac->ac_o_ex.fe_len << bsbits;
> > > }
> > > size = size >> bsbits;
> > > +
> > > + /* In any case, the size cannot be greater than the number
> > > + * of maximum free blocks per group.
> > > + */
> > > + if (size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)) {
> > > + int sz_log2;
> > > +
> > > + size = EXT4_BLOCKS_PER_GROUP(ac->ac_sb);
> > > +
> > > + /* Recalculate the start offset */
> > > + sz_log2 = __fls(size << bsbits);
> > > + start_off = ((loff_t) ac->ac_o_ex.fe_logical >>
> > > + (sz_log2 - bsbits)) << sz_log2;
> > > + }
> > > +
> > > start = start_off >> bsbits;
> > >
> > > /* don't cover already allocated blocks in selected range */
> >
> > This definitely fixes the bug. However, there will be some cases
> > where if the blocks per group is sufficiently small, where for smaller
> > files, start_off would have been 0 instead of that complicated
> > expression.
>
> Mmmm... if I correctly understood how ext4_normalize_request() works, everytime
> you truncate the value of "size" you have to recalculate the correct start offset.
> Note that start_off is zero only in those case where size is left untouched or
> increased.
(ignoring the fact that the ext4_mb_normalize_request() is broken
for now)
Yes it tries to align down the start_off of the bigger requests to the 512,
1024, 2048, or 4096 respectively. What the reason for it is really I have
no idea. The fact is however that this will only affect file systems
with bs smaller than 4k since the start_off will be always aligned to
block size afterwards (obviously).
That said this alignment is only done when the request is "big
enough". With your change we also do it when the block group is
"small enough" which is the change in behaviour which I think was
not really intended.
Honestly I do not think this really matters a lot but this alignment
you've added is not necessary.
All that said, I was getting to rewrite this mess a long time ago,
it's just a reminder that it's something that needs to be done.
Especially since the bigger requests are getting split unnecessarily
which hurts especially in fallocate case.
Thanks!
-Lukas
>
> >
> > Looking at ext4_mb_normalize_request(), exactly what this code is
> > trying to do is actually a bit opaque to me, and every time I look at
> > it I get a headache.
>
> Yes unfortunately the code is not very easy to understand.
> I may be missing something and it would be nice to have someone who knows it
> better to shed some light on it.
>
> Regards,
> Maurizio Lombardi
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On Thu, Mar 06, 2014 at 06:54:05PM +0100, Lukáš Czerner wrote:
>
> All that said, I was getting to rewrite this mess a long time ago,
> it's just a reminder that it's something that needs to be done.
> Especially since the bigger requests are getting split unnecessarily
> which hurts especially in fallocate case.
Hey Lukas,
If you are going to try to do some clean up work in mballoc, that
would be really really great! (Although hopefully you can still work
on writing down some thoughts about the data block checksum feature
before the ext4 developer's workshop. :-)
We should try to get input from Andreas about what some of the more
interesting hueristics in mballoc were trying to accomplish, since
there's a lot going on that's not obvious, and one of the reasons why
I've always been worried about trying to do cleanups was because
something that looks ugly might be papering over some other dark
corner of mballoc.c ---- and so I was fairly certain that one we
started opening up mballoc.c, we'd have to do a lot of work on it, and
a lot of performance measurements to make sure we didn't accidentally
introduce some performance regression.
Thanks for being willing to look at this!
Cheers,
- Ted
On Mar 6, 2014, at 11:32 AM, Theodore Ts'o <[email protected]> wrote:
> On Thu, Mar 06, 2014 at 06:54:05PM +0100, Luk?? Czerner wrote:
>>
>> All that said, I was getting to rewrite this mess a long time ago,
>> it's just a reminder that it's something that needs to be done.
>> Especially since the bigger requests are getting split unnecessarily
>> which hurts especially in fallocate case.
>
> We should try to get input from Andreas about what some of the more
> interesting hueristics in mballoc were trying to accomplish, since
> there's a lot going on that's not obvious, and one of the reasons why
> I've always been worried about trying to do cleanups was because
> something that looks ugly might be papering over some other dark
> corner of mballoc.c ---- and so I was fairly certain that one we
> started opening up mballoc.c, we'd have to do a lot of work on it, and
> a lot of performance measurements to make sure we didn't accidentally
> introduce some performance regression.
There is actually quite a lengthy description of mballoc at the start
of the file. I guess it would make sense to turn anything in this
thread into comments for ext4_mb_normalize_request() once verified.
So, below is hopefully a summary of what ext4_mb_normalize_request()
is actually doing. I've CC'd Alex to correct my mistakes. I think
the first few cases are commented accurately and self explanatorily:
* don't prealloc blocks for non-regular files (!EXT4_MB_HINT_DATA)
- should we reconsider this for larger directories?
* don't use prealloc if caller wants exact (EXT4_MB_HINT_GOAL_ONLY)
- currently unused, but would be useful for defrag
* don't reserve blocks if caller doesn't want it (EXT4_MB_HINT_NOPREALLOC)
- used for small files or if requested data fits exactly into extent
* if write is a small file, use group prealloc (EXT4_MB_HINT_GROUP_ALLOC)
- this combines multiple small writes into a single prealloc region
and avoids read-modify-write of RAID stripes
The rest of the function is about handling large file writes efficiently.
* round up small writes to a power-of-two value for better alignment
- we have a patch that makes the preallocation region sizes tunable,
if that is something of interest. That said, we don't really use it.
* if the request is large, align it to a power-of-two boundary
- the allocation goal is based on the logical file offset, so that if
a file is written sparsely by multiple threads, it can coalesce into
a densely packed file in the end. This is common for HPC jobs, or
applications like bittorrent.
* the list_for_each() loops align the prealloc region with other regions
- this helps when the file becomes fully allocated that the regions
will be contiguous on disk
I'm pretty sure some of this is not 100% accurate, hopefully Alex can
comment and correct any inconsistencies.
Cheers, Andreas
On Mon, Mar 03, 2014 at 03:00:27PM +0100, Maurizio Lombardi wrote:
> the variable "size" is expressed as number of blocks and not as
> number of clusters, this could trigger a kernel panic when using
> ext4 with the size of a cluster different from the size of a block.
>
> Signed-off-by: Maurizio Lombardi <[email protected]>
Thanks, applied.
- Ted
On Thu, Mar 06, 2014 at 06:54:05PM +0100, Lukáš Czerner wrote:
> Yes it tries to align down the start_off of the bigger requests to the 512,
> 1024, 2048, or 4096 respectively. What the reason for it is really I have
> no idea. The fact is however that this will only affect file systems
> with bs smaller than 4k since the start_off will be always aligned to
> block size afterwards (obviously).
>
> That said this alignment is only done when the request is "big
> enough". With your change we also do it when the block group is
> "small enough" which is the change in behaviour which I think was
> not really intended.
>
> Honestly I do not think this really matters a lot but this alignment
> you've added is not necessary.
>
> All that said, I was getting to rewrite this mess a long time ago,
> it's just a reminder that it's something that needs to be done.
> Especially since the bigger requests are getting split unnecessarily
> which hurts especially in fallocate case.
Hey Lukas, where are we with respect to your plans to fix up this
code?
I'm trying to figure out what we should do with Maurizio's bug fix.
Should we apply it even though it's making some changes to the
existing behavior.
As far as to why the existing code is trying to align the starting
offset to a power of two, I believe the idea is to avoid
fragmentation, since the normalize_request function will tend to round
up allocaiton requests to the same power of two.
- Ted
On Mon, 26 May 2014, Theodore Ts'o wrote:
> Date: Mon, 26 May 2014 12:50:10 -0400
> From: Theodore Ts'o <[email protected]>
> To: Lukáš Czerner <[email protected]>
> Cc: Maurizio Lombardi <[email protected]>, [email protected],
> [email protected], [email protected]
> Subject: Re: [PATCH 2/2] ext4: fix bug in ext4_mb_normalize_request()
>
> On Thu, Mar 06, 2014 at 06:54:05PM +0100, Lukáš Czerner wrote:
> > Yes it tries to align down the start_off of the bigger requests to the 512,
> > 1024, 2048, or 4096 respectively. What the reason for it is really I have
> > no idea. The fact is however that this will only affect file systems
> > with bs smaller than 4k since the start_off will be always aligned to
> > block size afterwards (obviously).
> >
> > That said this alignment is only done when the request is "big
> > enough". With your change we also do it when the block group is
> > "small enough" which is the change in behaviour which I think was
> > not really intended.
> >
> > Honestly I do not think this really matters a lot but this alignment
> > you've added is not necessary.
> >
> > All that said, I was getting to rewrite this mess a long time ago,
> > it's just a reminder that it's something that needs to be done.
> > Especially since the bigger requests are getting split unnecessarily
> > which hurts especially in fallocate case.
>
> Hey Lukas, where are we with respect to your plans to fix up this
> code?
>
> I'm trying to figure out what we should do with Maurizio's bug fix.
> Should we apply it even though it's making some changes to the
> existing behavior.
>
> As far as to why the existing code is trying to align the starting
> offset to a power of two, I believe the idea is to avoid
> fragmentation, since the normalize_request function will tend to round
> up allocaiton requests to the same power of two.
Hi Ted,
I think that leaving the alignment of the start offset for the small
files/allocation is not good idea. We might end up with suboptimal
file layout for smaller files. While this is not a big deal for
bigger files, with smaller ones it might cause some troubles.
Maurizio, can you resend the patch without the alignment ?
Also I started looking into normalize_request and hopefully I'll
have a patch soon. Ted, do you have any suggestion for a test to
make sure that I do not make things worse ? You mentioned something
simple on LSF, but I do not remember what it was.
Thanks!
-Lukas
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On Tue, Jun 03, 2014 at 08:43:40PM +0200, Lukáš Czerner wrote:
>
> I think that leaving the alignment of the start offset for the small
> files/allocation is not good idea. We might end up with suboptimal
> file layout for smaller files. While this is not a big deal for
> bigger files, with smaller ones it might cause some troubles.
I thought we were only aligning the start offset for files > 2MB?
> Also I started looking into normalize_request and hopefully I'll
> have a patch soon. Ted, do you have any suggestion for a test to
> make sure that I do not make things worse ? You mentioned something
> simple on LSF, but I do not remember what it was.
The two mechanisms which we have currently are:
1) e2freefrag to measure free space fragmentation
2) e2fsck -fE fragcheck /dev/sdXX
The latter will give you a reports such as this:
142618(f): expecting 950605 actual extent phys 950800 log 89 len 26
142618(f): expecting 950826 actual extent phys 950550 log 282 len 10
Which would correspond to the following:
debugfs: stat <142618>
Inode: 142618 Type: regular Mode: 0666 Flags: 0x80000
Generation: 2697220261 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 1194112
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 360
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x53495656:aa328a78 -- Sat Apr 12 11:05:58 2014
atime: 0x53495643:c20a0ea4 -- Sat Apr 12 11:05:39 2014
mtime: 0x53495653:43ad6c78 -- Sat Apr 12 11:05:55 2014
crtime: 0x53495641:1332f128 -- Sat Apr 12 11:05:37 2014
Size of extra inode fields: 28
EXTENTS:
(68-76):950596-950604, (89-114):950800-950825, (282-291):950550-950559
This is admittedly imperfect. First of all, it gives false positives
when the file has holes (as in the above case). And even if it didn't
what we should do is to print the previous extent before the
contiunity, since what's interesting is how big was the previous
extent before we had a discontinuity, and because the length of the
current extent isn't all that interesting in the case of last extent
(the "tail") of the file.
Hmm... if I have time I might look into patching e2fsck so that e2fsck
-E fragcheck is more useful.
What's also missing is some way of taking all of this fine-grained
information is turning it into one to three figures of merit that
could be used to compare different tweaks to the block allocation
algorithm.
Cheers,
- Ted
On Tue, 3 Jun 2014, Theodore Ts'o wrote:
> Date: Tue, 3 Jun 2014 16:36:39 -0400
> From: Theodore Ts'o <[email protected]>
> To: Lukáš Czerner <[email protected]>
> Cc: Maurizio Lombardi <[email protected]>, [email protected],
> [email protected], [email protected]
> Subject: Re: [PATCH 2/2] ext4: fix bug in ext4_mb_normalize_request()
>
> On Tue, Jun 03, 2014 at 08:43:40PM +0200, Lukáš Czerner wrote:
> >
> > I think that leaving the alignment of the start offset for the small
> > files/allocation is not good idea. We might end up with suboptimal
> > file layout for smaller files. While this is not a big deal for
> > bigger files, with smaller ones it might cause some troubles.
>
> I thought we were only aligning the start offset for files > 2MB?
>
> > Also I started looking into normalize_request and hopefully I'll
> > have a patch soon. Ted, do you have any suggestion for a test to
> > make sure that I do not make things worse ? You mentioned something
> > simple on LSF, but I do not remember what it was.
>
> The two mechanisms which we have currently are:
>
> 1) e2freefrag to measure free space fragmentation
Yes, that's what I am using along with e4defrag and debugfs -R
"dump_extents" to figure out what changed.
>
> 2) e2fsck -fE fragcheck /dev/sdXX
I was not aware of that, thanks. However I was more interested in
workload to test on. Right now I have a simple script which is
doing:
# Test fallocate time
echo "[+] Fallocate test"
time fallocate -l70G ${MNT}/file
do_freefrag
do_defrag ${MNT}/file
do_debugfs /file
rm -f ${MNT}/file
sync
# Copy linux source
echo "[+] Copy linux source"
cp -r $linux_source ${MNT}/linux1 &
cp -r $linux_source ${MNT}/linux2 &
time wait
do_freefrag
# Tar linux source
echo "[+] Tar linux source"
tar -cf ${MNT}/linux1_1.tar ${MNT}/linux1 &
tar -cf ${MNT}/linux1_2.tar ${MNT}/linux1 &
time wait
do_freefrag
do_defrag ${MNT}/linux1_1.tar
do_debugfs /linux1_1.tar
do_defrag ${MNT}/linux1_2.tar
do_debugfs /linux1_2.tar
# Untar linux source
echo "[+] Untar linux source"
mkdir ${MNT}/tt1
mkdir ${MNT}/tt2
tar -xf ${MNT}/linux1_1.tar -C ${MNT}/tt1 &
tar -xf ${MNT}/linux1_2.tar -C ${MNT}/tt2 &
time wait
do_freefrag
# Singe dd
echo "[+] Single dd"
time dd if=/dev/zero of=${MNT}/file bs=512 count=4194304
do_freefrag
do_defrag ${MNT}/file
do_debugfs /file
# Multiple dd test
echo "[+] Multiple dd"
# 2G
dd if=/dev/zero of=${MNT}/file0 bs=4096 count=524288 &
# 2G
dd if=/dev/urandom of=${MNT}/file1 bs=512 count=4194304 &
# 2G
dd if=/dev/zero of=${MNT}/file2 bs=1M count=2048 &
# 4M
dd if=/dev/zero of=${MNT}/file3 bs=4096 count=1024 &
# 16M
dd if=/dev/zero of=${MNT}/file4 bs=4096 count=4096 &
time wait
do_freefrag
do_defrag ${MNT}/file0
do_debugfs /file0
do_defrag ${MNT}/file1
do_debugfs /file1
do_defrag ${MNT}/file2
do_debugfs /file2
do_defrag ${MNT}/file3
do_debugfs /file3
do_defrag ${MNT}/file4
do_debugfs /file4
# Run fsstress
echo "[+] Run fsstress"
time $fsstress -s 123456 -p24 -n 999 -d $MNT
do_freefrag
do_defrag ${MNT}
And then do the same thing on the loop device and investigate the
underlying image file.
Thanks!
-Lukas
On Tue, Jun 03, 2014 at 04:36:39PM -0400, Theodore Ts'o wrote:
> On Tue, Jun 03, 2014 at 08:43:40PM +0200, Lukáš Czerner wrote:
> >
> > I think that leaving the alignment of the start offset for the small
> > files/allocation is not good idea. We might end up with suboptimal
> > file layout for smaller files. While this is not a big deal for
> > bigger files, with smaller ones it might cause some troubles.
>
> I thought we were only aligning the start offset for files > 2MB?
>
That's true.
With my patch the behaviour changes slightly because it doesn't depend
only from the file size but also from the number of blocks per group.
So, if you create a filesystem with blocksize==1Kb and 1024 blocks per group, we will start
aligning the start offset for files with size > 1Mb (1Kb * 1024)
The alignment is not performed on small files unless you have
a very low number of blocks per group.
Maurizio Lombardi