2006-11-09 11:09:48

by Takashi Sato

[permalink] [raw]
Subject: [RFC][PATCH 0/3] Extent base online defrag

Hi,

>I am considering the online defrag function for ext4 and thinking
>that your following patch set for multi-block allocation is useful
>to search contiguous free blocks for the defragmentation.
>
>"[RFC] extents,mballoc,delalloc for 2.6.16.8"
>http://marc.theaimsgroup.com/?l=linux-ext4&m=114669168616780&w=2
>
>I will send the patch of simple defrag implementation for ext4 later.

I have written the patches of ioctl for extent base online defragment
and the command which call it.
These patches are at the experimental stage so they need many
improvements. But they work well so far as basic defragmenter,
which means they are worth enough to examine my trial.

- Specify the target area in a file using the following structure:
struct ext3_ext_defrag_data {
loff_t start_offset; /* start offset to defrag in bytes */
loff_t defrag_size; /* size of defrag in bytes */
}
It uses loff_t so that the size of the structure is identical on
both 32 bits and 64 bits architecture.
Block allocation, including searching for the free contiguous
blocks, is implemented in kernel.

- The procedures for the defragment in kernel are as follows:
Blocks are allocated for the temporary inode by 16384 pages
and the block on the temporary inode is moved to the original inode
by a page. I think I need to tune the above pages unit
for the performance. It's in my TODO list.
1. Allocate blocks for the temporary inode.
2. Move the blocks on the temporary inode to the original inode
by a page.
2.1 Read the file data from the old blocks to the page
2.2 Move the block on the temporary inode to the original inode
2.3 Write the file data on the page into the new blocks

- Currently, this patch works only for ext3 because it needs
Alex Tomas's ext3 multi-block allocation patch which is for 2.6.16.8.
"[RFC] extents,mballoc,delalloc for 2.6.16.8"
http://marc.theaimsgroup.com/?l=linux-ext4&m=114669168616780&w=2
But, my target is the defragment for ext4. So I hope to start
the work for ext4 soon.

- On block allocation for the temporary inode(ext3_ext_new_extent_tree()),
the number of the modified blocks for metadata(extent block) is
calculated in ext3_ext_writepage_trans_blocks(). As the resulting
value can exceed the max blocks for the transaction(2048), passing
2048 directly to ext3_journal_start() for the provisional solution.
It's in my TODO list.

- They don't support the following:
- Not support the indirect block file(only for the extent file).
- Not optimize the depth of extent tree and the number of
extent blocks after defragmentation.
- Not support quota.
- Not support a hole file.
These are also in my TODO list.

Summary Of Patches:
*These patches apply on top of above Alex's patches.

[PATCH 1/3] Allocate new contiguous blocks with Alex's mballoc
- Search contiguous free blocks and allocate them for the temporary
inode with Alex's multi-block allocation.

[PATCH 2/3] Move the file data to the new blocks
- Move the blocks on the temporary inode to the original inode
by a page.

[PATCH 3/3] Online defrag command
- The defrag command. Usage is as follows:
o Defrag for a file.
# e4defrag file-name
o Defrag for all files on ext3.
# e4defrag device-name

I created 50 fragmented files of 1GB and ran e4defrag for each of them.
As a result, I got the following improvement.
"Fragments" is the total number of fragments on 50 files.
"I/O performance" is the elapsed time for reading 50 files
with "cat" command(cat file* > /dev/null).

Before defrag After defrag
---------------------------------------------------------------------
Fragments 12175 800
I/O performance(Second) 618.3 460.6(25% improvement!!)

Any comments are welcome.
Cheers, Takashi


2006-11-09 12:46:32

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] Extent base online defrag

[email protected] wrote:
> Hi,
>
>> I am considering the online defrag function for ext4 and thinking
>> that your following patch set for multi-block allocation is useful
>> to search contiguous free blocks for the defragmentation.
>>
>> "[RFC] extents,mballoc,delalloc for 2.6.16.8"
>> http://marc.theaimsgroup.com/?l=linux-ext4&m=114669168616780&w=2
>>
>> I will send the patch of simple defrag implementation for ext4 later.
>
> I have written the patches of ioctl for extent base online defragment
> and the command which call it.
> These patches are at the experimental stage so they need many
> improvements. But they work well so far as basic defragmenter,
> which means they are worth enough to examine my trial.
>
> - Specify the target area in a file using the following structure:
> struct ext3_ext_defrag_data {
> loff_t start_offset; /* start offset to defrag in bytes */
> loff_t defrag_size; /* size of defrag in bytes */
> }
> It uses loff_t so that the size of the structure is identical on
> both 32 bits and 64 bits architecture.
> Block allocation, including searching for the free contiguous
> blocks, is implemented in kernel.

NAK the ioctl approach.

People who like ioctls are just holdovers from non-Linux OS's.

Jeff




2006-11-09 14:11:43

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] Extent base online defrag

On Thu, 2006-11-09 at 07:46 -0500, Jeff Garzik wrote:
> [email protected] wrote:

> > I have written the patches of ioctl for extent base online defragment
> > and the command which call it.
> > These patches are at the experimental stage so they need many
> > improvements. But they work well so far as basic defragmenter,
> > which means they are worth enough to examine my trial.
> >
> > - Specify the target area in a file using the following structure:
> > struct ext3_ext_defrag_data {
> > loff_t start_offset; /* start offset to defrag in bytes */
> > loff_t defrag_size; /* size of defrag in bytes */
> > }
> > It uses loff_t so that the size of the structure is identical on
> > both 32 bits and 64 bits architecture.
> > Block allocation, including searching for the free contiguous
> > blocks, is implemented in kernel.
>
> NAK the ioctl approach.

I agree it shouldn't go into mainline this way, but while the details of
the proper interface are debated, this implementation at least allows
the core function to be tested & reviewed.

> People who like ioctls are just holdovers from non-Linux OS's.
>
> Jeff

Shaggy
--
David Kleikamp
IBM Linux Technology Center


2006-11-17 12:39:50

by Takashi Sato

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] Extent base online defrag

Hi,

>> > - Specify the target area in a file using the following structure:
>> > struct ext3_ext_defrag_data {
>> > loff_t start_offset; /* start offset to defrag in bytes */
>> > loff_t defrag_size; /* size of defrag in bytes */
>> > }
>> > It uses loff_t so that the size of the structure is identical on
>> > both 32 bits and 64 bits architecture.
>> > Block allocation, including searching for the free contiguous
>> > blocks, is implemented in kernel.
>>
>> NAK the ioctl approach.
>
> I agree it shouldn't go into mainline this way, but while the details of
> the proper interface are debated, this implementation at least allows
> the core function to be tested & reviewed.
>
>> People who like ioctls are just holdovers from non-Linux OS's.

Thank you for your comments.
My patches are at the experimental phase and the ioctl approach is
the provisional solution.
But I intend to continue this work with ioctl approach, if there are no
actual problems.

Cheers, Takashi

2006-11-29 10:22:31

by Girish Shilamkar

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] Extent base online defrag

Hi,
I tried to use defrag utility. But these are the results that, I got:

[root@metis e4defrag]# ./defrag /dev/hda8
Start defragment for device(/dev/hda8)
Total: 393
Success: 0
Failure: 393
[root@metis e4defrag]# ./defrag /mnt/test1/root
Start defragment for directory(/mnt/test1/root)
Total: 392
Success: 0
Failure: 392
I tried same thing with different directories and files, but the result
was the same.
By doing strace on defrag utility I found that ioctl always returned
ENOSPC. So I decreased the no. files, but still that didn't help, I came
down till filesystem was only 9% full.

[root@metis test2]# df -k
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda3 15116868 9762924 4586040 69% /
tmpfs 224880 0 224880 0% /dev/shm
/dev/hda1 2016016 17940 1895664 1% /boot
/dev/hda2 20161204 1364752 17772312 8% /mnt/store
/dev/hda9 15116868 9762924 4586040 69% /mnt/test1
/dev/hda10 972532 78136 844196 9% /mnt/test2
[root@metis ~]# ./e4defrag/defrag /dev/hda10
Start defragment for device(/dev/hda10)
Total: 4426
Success: 0
Failure: 4426
[root@metis ~]# ./e4defrag/defrag /mnt/test2/
dir/ linux-2.6.12/
[root@metis ~]# ./e4defrag/defrag /mnt/test2/linux-2.6.12/
arch/ crypto/ Documentation/ drivers/ MAINTAINERS
Makefile
[root@metis ~]# ./e4defrag/defrag /mnt/test2/linux-2.6.12/Makefile
Start defragment for /mnt/test2/linux-2.6.12/Makefile
e4defrag : defrag fail.: No space left on device
"/mnt/test2/linux-2.6.12/Makefile"


I was using linux kernel 2.6.16.8. I applied Alex's mballoc patches and
then your defrag patches..
Another thing, I changed one line in the defrag utility code.
-#define FS_EXT3 "ext4dev"
+#define FS_EXT3 "ext3"

I did mount the filesystem with mballoc and extents options.

Any ideas ?

Thanks & Regards,
Girish.



On Thu, 2006-11-09 at 20:09 +0900, [email protected] wrote:
> Hi,
>
> >I am considering the online defrag function for ext4 and thinking
> >that your following patch set for multi-block allocation is useful
> >to search contiguous free blocks for the defragmentation.
> >
> >"[RFC] extents,mballoc,delalloc for 2.6.16.8"
> >http://marc.theaimsgroup.com/?l=linux-ext4&m=114669168616780&w=2
> >
> >I will send the patch of simple defrag implementation for ext4 later.
>
> I have written the patches of ioctl for extent base online defragment
> and the command which call it.
> These patches are at the experimental stage so they need many
> improvements. But they work well so far as basic defragmenter,
> which means they are worth enough to examine my trial.
>
> - Specify the target area in a file using the following structure:
> struct ext3_ext_defrag_data {
> loff_t start_offset; /* start offset to defrag in bytes */
> loff_t defrag_size; /* size of defrag in bytes */
> }
> It uses loff_t so that the size of the structure is identical on
> both 32 bits and 64 bits architecture.
> Block allocation, including searching for the free contiguous
> blocks, is implemented in kernel.
>
> - The procedures for the defragment in kernel are as follows:
> Blocks are allocated for the temporary inode by 16384 pages
> and the block on the temporary inode is moved to the original inode
> by a page. I think I need to tune the above pages unit
> for the performance. It's in my TODO list.
> 1. Allocate blocks for the temporary inode.
> 2. Move the blocks on the temporary inode to the original inode
> by a page.
> 2.1 Read the file data from the old blocks to the page
> 2.2 Move the block on the temporary inode to the original inode
> 2.3 Write the file data on the page into the new blocks
>
> - Currently, this patch works only for ext3 because it needs
> Alex Tomas's ext3 multi-block allocation patch which is for 2.6.16.8.
> "[RFC] extents,mballoc,delalloc for 2.6.16.8"
> http://marc.theaimsgroup.com/?l=linux-ext4&m=114669168616780&w=2
> But, my target is the defragment for ext4. So I hope to start
> the work for ext4 soon.
>
> - On block allocation for the temporary inode(ext3_ext_new_extent_tree()),
> the number of the modified blocks for metadata(extent block) is
> calculated in ext3_ext_writepage_trans_blocks(). As the resulting
> value can exceed the max blocks for the transaction(2048), passing
> 2048 directly to ext3_journal_start() for the provisional solution.
> It's in my TODO list.
>
> - They don't support the following:
> - Not support the indirect block file(only for the extent file).
> - Not optimize the depth of extent tree and the number of
> extent blocks after defragmentation.
> - Not support quota.
> - Not support a hole file.
> These are also in my TODO list.
>
> Summary Of Patches:
> *These patches apply on top of above Alex's patches.
>
> [PATCH 1/3] Allocate new contiguous blocks with Alex's mballoc
> - Search contiguous free blocks and allocate them for the temporary
> inode with Alex's multi-block allocation.
>
> [PATCH 2/3] Move the file data to the new blocks
> - Move the blocks on the temporary inode to the original inode
> by a page.
>
> [PATCH 3/3] Online defrag command
> - The defrag command. Usage is as follows:
> o Defrag for a file.
> # e4defrag file-name
> o Defrag for all files on ext3.
> # e4defrag device-name
>
> I created 50 fragmented files of 1GB and ran e4defrag for each of them.
> As a result, I got the following improvement.
> "Fragments" is the total number of fragments on 50 files.
> "I/O performance" is the elapsed time for reading 50 files
> with "cat" command(cat file* > /dev/null).
>
> Before defrag After defrag
> ---------------------------------------------------------------------
> Fragments 12175 800
> I/O performance(Second) 618.3 460.6(25% improvement!!)
>
> Any comments are welcome.
> Cheers, Takashi
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

2006-11-29 14:39:49

by Takashi Sato

[permalink] [raw]
Subject: Re:[RFC][PATCH 0/3] Extent base online defrag


Hi,

Thank you for your trial and report of the result!

> Hi,
> I tried to use defrag utility. But these are the results that, I got:
>
> [root@metis e4defrag]# ./defrag /dev/hda8
> Start defragment for device(/dev/hda8)
> Total: 393
> Success: 0
> Failure: 393
> [root@metis e4defrag]# ./defrag /mnt/test1/root
> Start defragment for directory(/mnt/test1/root)
> Total: 392
> Success: 0
> Failure: 392
> I tried same thing with different directories and files, but the result
> was the same.
> By doing strace on defrag utility I found that ioctl always returned
> ENOSPC. So I decreased the no. files, but still that didn't help, I came
> down till filesystem was only 9% full.

I think that your files didn't have many fragments, so ioctl returned
ENOSPC. e4defrag iterates ioctl per 64MB. If the specified 64MB
range has only one fragment(extent), the kernel returns ENOSPC
at the current implement.
So I'll change this behaviour to realize whether there was actual
error. I intend to update my patches in early December.

If you need to check the number of fragments(extents),
you can use the following simple program using Alex's debug code.
Example:
# ./scan_ext -e data0
Extents of data0:
start = 10240 len = 1024 block = 0
start = 212992 len = 1536 block = 1024
start = 197632 len = 1024 block = 2560
start = 450560 len = 1536 block = 3584
start = 662019 len = 1535 block = 5120
start = 721424 len = 1 block = 6655
start = 722944 len = 1024 block = 6656
start = 726016 len = 512 block = 7680

start: The physical block number.
len: The length of the extent.
block: The logical block number.

The program is following.
-----
#include <stdio.h>
#include <sys/ioctl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>


#define GET_EXTENTS 0
#define GET_STATUS 1
#define GET_DEPTH 2

#define EXT3_IOC_GET_EXTENTS _IOR('f', 7, long)
#define EXT3_IOC_GET_TREE_DEPTH _IOR('f', 8, long)
#define EXT3_IOC_GET_TREE_STATS _IOR('f', 9, long)

#define BUF_LEN 1024

/*
* this structure is used to gather extents from the tree via ioctl
*/
struct ext3_extent_buf {
unsigned long start;
int buflen;
void *buffer;
void *cur;
int err;
};

/*
* storage for cached extent
*/
struct ext3_ext_cache {
unsigned int ec_start;
unsigned int ec_block;
unsigned int ec_len;
unsigned int ec_type;
};

/*
* this structure is used to collect stats info about the tree
*/
struct ext3_extent_tree_stats {
int depth;
int extents_num;
int leaf_num;
};

struct ext3_ext_cache ec[BUF_LEN];

int
main(int argc, char **argv) {

int cmd = EXT3_IOC_GET_EXTENTS;
char *file;
int fd;
struct ext3_extent_buf ebuf;
void *arg;
struct ext3_extent_tree_stats estat;
int ret;
int i;

if (argc < 3) {
fprintf(stderr, "Usage: scan_ext -e|-s|-d file\n");
exit(1);
}

if (!strcmp(argv[1], "-e")) {
file = argv[2];
} else if (!strcmp(argv[1], "-d")) {
file = argv[2];
cmd = EXT3_IOC_GET_TREE_DEPTH;
} else if (!strcmp(argv[1], "-s")) {
file = argv[2];
cmd = EXT3_IOC_GET_TREE_STATS;
} else {
file = argv[1];
}

if ((fd = open(file, O_RDONLY)) < 0) {
perror("open");
exit(1);
}

switch (cmd) {
case EXT3_IOC_GET_EXTENTS:
printf("Extents of %s:\n", argv[2]);
ebuf.start = 0;
ebuf.buflen = BUF_LEN * sizeof(struct ext3_ext_cache);
ebuf.buffer = ebuf.cur = ec;
ebuf.err = 0;
arg = &ebuf;
break;
case EXT3_IOC_GET_TREE_DEPTH:
arg = NULL;
break;
case EXT3_IOC_GET_TREE_STATS:
arg = &estat;
break;
}

if ((ret = ioctl(fd, cmd, arg)) < 0) {
perror("ioctl");
exit(1);
}

switch (cmd) {
case EXT3_IOC_GET_EXTENTS:
for(i = 0; i < ret; i++) {
printf("start = %u len = %u block = %u\n",
ec[i].ec_start, ec[i].ec_len,
ec[i].ec_block);
}
break;
case EXT3_IOC_GET_TREE_DEPTH:
printf("depth = %d\n", ret);
break;
case EXT3_IOC_GET_TREE_STATS:
printf("depth = %d\n", estat.depth);
printf("extents = %d\n", estat.extents_num);
printf("leaf blocks = %d\n", estat.leaf_num);
break;
}

exit(0);
}

Cheers, Takashi

2006-11-30 07:10:51

by Girish Shilamkar

[permalink] [raw]
Subject: Re: Re:[RFC][PATCH 0/3] Extent base online defrag

Hello,


On Wed, 2006-11-29 at 23:39 +0900, [email protected] wrote:
> Hi,
>
> Thank you for your trial and report of the result!
>
> > Hi,
> > I tried to use defrag utility. But these are the results that, I got:
> >
> > [root@metis e4defrag]# ./defrag /dev/hda8
> > Start defragment for device(/dev/hda8)
> > Total: 393
> > Success: 0
> > Failure: 393
> > [root@metis e4defrag]# ./defrag /mnt/test1/root
> > Start defragment for directory(/mnt/test1/root)
> > Total: 392
> > Success: 0
> > Failure: 39
> > I tried same thing with different directories and files, but the result
> > was the same.
> > By doing strace on defrag utility I found that ioctl always retunarned
> > ENOSPC. So I decreased the no. files, but still that didn't help, I came
> > down till filesystem was only 9% full.
>
> I think that your files didn't have many fragments, so ioctl returned
> ENOSPC. e4defrag iterates ioctl per 64MB. If the specified 64MB
> range has only one fragment(extent), the kernel returns ENOSPC
> at the current implement.
> So I'll change this behaviour to realize whether there was actual
> error. I intend to update my patches in early December.
>
Thanks for responding and for the scan_ext3 program.
I did further testing, i checked if the file is fragmented or not :

[root@metis e4defrag]# ./scan_ext3 /mnt/test2/root/file3
Extents of /mnt/test2/root/file3:
start = 117919 len = 11316 block = 0
start = 136356 len = 2 block = 11316
start = 165920 len = 24 block = 11318
start = 165913 len = 7 block = 11342
start = 221200 len = 48 block = 11349
start = 221282 len = 1 block = 11397
start = 221295 len = 12 block = 11398
start = 221328 len = 1 block = 11410
start = 221381 len = 3 block = 11411
start = 221496 len = 1 block = 11414
start = 221517 len = 57 block = 11415
start = 221585 len = 53 block = 11472
start = 221673 len = 6 block = 11525
start = 221681 len = 4 block = 11531
start = 221707 len = 1 block = 11535
start = 221711 len = 5 block = 11536
start = 221725 len = 25 block = 11541

>From the test the file seems to be pretty fragmented, still the defrag
utility gives ENOSPC error.

[root@metis e4defrag]# strace -e
trace=ioctl ./defrag /mnt/test2/root/file3
Start defragment for /mnt/test2/root/file3
ioctl(3, 0x4010660a, 0xbf8747c0) = -1 ENOSPC (No space left on
device)
e4defrag : defrag fail.: No space left on device
"/mnt/test2/root/file3"
[root@metis ~]# ll -h /mnt/test2/root/file3
-rw-r--r-- 1 root root 159M Sep 2 10:34 /mnt/test2/root/file3

Most probably Rupesh(cc'ed him) will be looking into it further, really
appreciate if you can add something to it.

Thanks & Regards,
Girish.

2006-12-01 10:29:59

by Takashi Sato

[permalink] [raw]
Subject: Re:[RFC][PATCH 0/3] Extent base online defrag

Hi,

Thank you for your further testing.

>From the test the file seems to be pretty fragmented, still the defrag
>utility gives ENOSPC error.

I found the problem of the workaround code related to journal in my
patches. 2048 blocks was always passed to ext3_journal_start so that
the journal blocks in transaction didn't exceed j_max_transaction_buffers.
However, j_max_transaction_buffers could be adjusted to the number of
filesystem blocks. Maybe your filesystem's j_max_transaction_buffers
would be less than 2048, so the error would occur.

When the above problem occurs, the following message will be output
in syslog. Could you check your syslog?

"JBD: e4defrag wants too many credits (2048 > XXXX)"

I am working for fixing the workaround code, but the work isn't
finished yet. I will update my patches next week.
If you need to run my defrag soon, you can use the following
patch to avoid the above problem as the provisional solution.

After Alex's patches and my previous patches are applied,
you can apply the following patch.
---
diff -uprN -X linux-2.6.16.8-tnes-org/Documentation/dontdiff linux-2.6.16.8-tnes-org/fs/ext3/extents.c linux-2.6.16.8-work/fs/ext3/extents.c
--- linux-2.6.16.8-tnes-org/fs/ext3/extents.c 2006-12-01 10:37:28.000000000 +0900
+++ linux-2.6.16.8-work/fs/ext3/extents.c 2006-12-01 17:23:38.000000000 +0900
@@ -2738,6 +2738,7 @@ ext3_ext_replace_branches(struct ext3_ex
handle_t *handle = NULL;
unsigned jnum;
struct inode *inode;
+ journal_t *journal;


from = from_page << (PAGE_CACHE_SHIFT - dest_tree->inode->i_blkbits);
@@ -2752,10 +2753,11 @@ ext3_ext_replace_branches(struct ext3_ex
*/
/* TODO:
* Need to consider the way of calculating journal blocks
- * because j_max_transaction_buffer may exceed 2048
+ * because the journal blocks may exceed j_max_transaction_buffer
* if we have a deep depth.
*/
- jnum = 2048;
+ journal = EXT3_JOURNAL(inode);
+ jnum = journal->j_max_transaction_buffers;
handle = ext3_journal_start(inode, jnum);
if (IS_ERR(handle)) {
err = PTR_ERR(handle);
@@ -3093,6 +3095,7 @@ ext3_ext_new_extent_tree(struct inode *t
unsigned jnum;
int ret = 0, err = 0, depth;
int last_extent = 0;
+ journal_t *journal;

/* (blocks_per_page * count) * (extent_blocks + index_blocks)
* + super_block + block_bitmap + group_descriptor
@@ -3100,10 +3103,11 @@ ext3_ext_new_extent_tree(struct inode *t
*/
/* TODO:
* Need to consider the way of calculating journal blocks
- * because j_max_transaction_buffer may exceed 2048
+ * because the journal blocks may exceed j_max_transaction_buffer
* if we have a deep depth.
*/
- jnum = 2048;
+ journal = EXT3_JOURNAL(tmp_inode);
+ jnum = journal->j_max_transaction_buffers;
eh = EXT_ROOT_HDR(dest_tree);
eh->eh_depth = 0;
handle = ext3_journal_start(tmp_inode, jnum);

Cheers, Takashi