LinuxLists.cc - [PATCH V1 00/17] ext4: Add inline data support.

2011-10-26 07:32:24

Subject: [PATCH V1 00/17] ext4: Add inline data support.

Hi Ted, Andreas and list,
This is my 1st attempt to add inline data support to ext4 inode. For
more information about the background, please refer to the thread
http://marc.info/?l=linux-ext4&m=131715205428067&w=2
When I sent out the RFC on Sep.27, Andreas suggested that we can use the
space of xattr to put inline data. So this is the 1st version using that
method. It should be easy to change if we decide to use other places in
inode(e.g the unused extent space) since all the inline data
manipulation function is wrapped with function like ext4_*_inline_data.

Currently I use all the space between i_extra_isize and inode_size if
inode_size = 256. For inode_size > 256, half of that space is used so as
to leave some space for other xattrs.

This is only a V1 and there are still something to do(e.g. I am thinking
of using unused extent space), but I'd like to send it out earlier so
that it can be reviewed ASAP.

Some simple tests shows that with a linux-3.0 vanilla source, the new
dir can save 1% disk space. For my "/usr", it can save about 3.2%
spaces. I guess for volume with future bigalloc support, it should save
more space for us for small dir. I also run some other tests and it
seems the code is OK for a try. I haven't found a good test cases that
can test the small file/dir(tens of bytes in my case) performance where
inline data should have some good number.

Any comments are welcomed.

git diff --stat

fs/ext4/dir.c | 117 ++++++++++-
fs/ext4/ext4.h | 15 +-
fs/ext4/ialloc.c | 4 +
fs/ext4/inode.c | 583 +++++++++++++++++++++++++++++++++++++++++++++++---
fs/ext4/namei.c | 629
+++++++++++++++++++++++++++++++++++++++++++++---------
fs/ext4/xattr.c | 216 +++++++++++++++++++
fs/ext4/xattr.h | 68 ++++++
7 files changed, 1489 insertions(+), 143 deletions(-)

Thanks
Tao

2011-10-26 07:34:49

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 01/17] ext4: Move extra inode read to a new function.

From: Tao Ma <[email protected]>

Currently, in ext4_iget we do a simple check to see whether
there does exists some information starting from the end
of i_extra_size. With inline data added, this procedure
is more complicated. So move it to a new function named
ext4_iget_extra_inode.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/inode.c | 19 ++++++++++++-------
1 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 986e238..6638f0e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3384,6 +3384,16 @@ static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
}
}

+static inline void ext4_iget_extra_inode(struct inode *inode,
+ struct ext4_inode *raw_inode,
+ struct ext4_inode_info *ei)
+{
+ __le32 *magic = (void *)raw_inode +
+ EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize;
+ if (*magic == cpu_to_le32(EXT4_XATTR_MAGIC))
+ ext4_set_inode_state(inode, EXT4_STATE_XATTR);
+}
+
struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
{
struct ext4_iloc iloc;
@@ -3494,13 +3504,8 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
/* The extra space is currently unused. Use it. */
ei->i_extra_isize = sizeof(struct ext4_inode) -
EXT4_GOOD_OLD_INODE_SIZE;
- } else {
- __le32 *magic = (void *)raw_inode +
- EXT4_GOOD_OLD_INODE_SIZE +
- ei->i_extra_isize;
- if (*magic == cpu_to_le32(EXT4_XATTR_MAGIC))
- ext4_set_inode_state(inode, EXT4_STATE_XATTR);
- }
+ } else
+ ext4_iget_extra_inode(inode, raw_inode, ei);
} else
ei->i_extra_isize = 0;

--
1.7.0.4

2011-10-26 07:34:15

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 04/17] ext4: Add normal write support for inline data.

From: Tao Ma <[email protected]>

For a normal write case(not journalled write, not delayed allocation),
we write to the inline if the file is small and convert it to an extent
based file when the write is larger than the max inline size.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 208 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2fa8cf4..876cdfd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -741,6 +741,163 @@ static int do_journal_get_write_access(handle_t *handle,

static int ext4_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
+static int ext4_read_inline_page(struct inode *inode, struct page *page);
+
+static int ext4_convert_inline_data_to_extent(struct address_space *mapping,
+ struct inode *inode,
+ unsigned flags)
+{
+ int ret, needed_blocks;
+ handle_t *handle = NULL;
+ int retries = 0;
+ struct page *page = NULL;
+ unsigned from, to;
+ struct ext4_iloc iloc;
+
+ if (!ext4_has_inline_data(inode)) {
+ /*
+ * clear the flag so that no new write
+ * will trap here again.
+ */
+ ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
+ return 0;
+ }
+
+ needed_blocks = ext4_writepage_trans_blocks(inode);
+ from = 0;
+ to = ext4_get_max_inline_size(inode);
+
+ ret = ext4_get_inode_loc(inode, &iloc);
+ if (ret)
+ return ret;
+
+retry:
+ handle = ext4_journal_start(inode, needed_blocks);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ /* We cannot recurse into the filesystem as the transaction is already
+ * started */
+ flags |= AOP_FLAG_NOFS;
+
+ page = grab_cache_page_write_begin(mapping, 0, flags);
+ if (!page) {
+ ext4_journal_stop(handle);
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ if (!PageUptodate(page)) {
+ ret = ext4_read_inline_page(inode, page);
+ if (ret)
+ goto out;
+ }
+
+ ret = ext4_destroy_inline_data(handle, inode);
+ if (ret)
+ goto out;
+
+ if (ext4_should_dioread_nolock(inode))
+ ret = __block_write_begin(page, from, to, ext4_get_block_write);
+ else
+ ret = __block_write_begin(page, from, to, ext4_get_block);
+
+ if (!ret && ext4_should_journal_data(inode)) {
+ ret = walk_page_buffers(handle, page_buffers(page),
+ from, to, NULL, do_journal_get_write_access);
+ }
+
+ if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+
+ block_commit_write(page, from, to);
+out:
+ if (page) {
+ unlock_page(page);
+ page_cache_release(page);
+ }
+ if (handle)
+ ext4_journal_stop(handle);
+ brelse(iloc.bh);
+ return ret;
+}
+
+/*
+ * Try to write data in the inode.
+ * If the inode has inline data, check whether the new write can be
+ * in the inode also. If not, create the page the handle, move the data
+ * to the page make it update and let the later codes create extent for it.
+ */
+static int ext4_try_to_write_inline_data(struct address_space *mapping,
+ struct inode *inode,
+ loff_t pos, unsigned len,
+ unsigned flags,
+ struct page **pagep)
+{
+ int ret;
+ handle_t *handle;
+ struct page *page;
+ struct ext4_iloc iloc;
+
+ if (pos + len > ext4_get_max_inline_size(inode))
+ return ext4_convert_inline_data_to_extent(mapping,
+ inode, flags);
+
+ ret = ext4_get_inode_loc(inode, &iloc);
+ if (ret)
+ return ret;
+
+ /*
+ * The possible write could happen in the inode,
+ * so try to reserve the space in inode first.
+ */
+ handle = ext4_journal_start(inode, 1);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ handle = NULL;
+ goto out;
+ }
+
+ if (!ext4_has_inline_data(inode)) {
+ ret = ext4_init_inline_data(handle, inode, &iloc);
+ if (ret && ret != -ENOSPC)
+ goto out;
+
+ if (ret == -ENOSPC) {
+ ret = 0;
+ goto out;
+ }
+ }
+
+ /* We cannot recurse into the filesystem as the transaction is already
+ * started */
+ flags |= AOP_FLAG_NOFS;
+
+ page = grab_cache_page_write_begin(mapping, 0, flags);
+ if (!page) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ *pagep = page;
+
+ if (!PageUptodate(page)) {
+ ret = ext4_read_inline_page(inode, page);
+ if (ret)
+ goto out;
+ }
+
+ ret = 1;
+ handle = NULL;
+out:
+ if (handle)
+ ext4_journal_stop(handle);
+ brelse(iloc.bh);
+ return ret;
+}
+
static int ext4_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
@@ -763,6 +920,17 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;

+ if (ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) {
+ ret = ext4_try_to_write_inline_data(mapping, inode, pos, len,
+ flags, pagep);
+ if (ret < 0)
+ goto out;
+ if (ret == 1) {
+ ret = 0;
+ goto out;
+ }
+ }
+
retry:
handle = ext4_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
@@ -780,6 +948,7 @@ retry:
ret = -ENOMEM;
goto out;
}
+
*pagep = page;

if (ext4_should_dioread_nolock(inode))
@@ -826,6 +995,37 @@ out:
return ret;
}

+static int ext4_write_inline_data_end(struct inode *inode,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page)
+{
+ int ret;
+ void *kaddr;
+ struct ext4_iloc iloc;
+
+ if (unlikely(copied < len)) {
+ if (!PageUptodate(page)) {
+ copied = 0;
+ goto out;
+ }
+ }
+
+ ret = ext4_get_inode_loc(inode, &iloc);
+ if (ret) {
+ ext4_std_error(inode->i_sb, ret);
+ copied = 0;
+ goto out;
+ }
+
+ kaddr = kmap_atomic(page, KM_USER0);
+ ext4_write_inline_data(inode, &iloc, kaddr, pos, len);
+ kunmap_atomic(kaddr, KM_USER0);
+
+ brelse(iloc.bh);
+out:
+ return copied;
+}
+
/* For write_end() in data=journal mode */
static int write_end_fn(handle_t *handle, struct buffer_head *bh)
{
@@ -844,7 +1044,12 @@ static int ext4_generic_write_end(struct file *file,
struct inode *inode = mapping->host;
handle_t *handle = ext4_journal_current_handle();

- copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+ if (ext4_has_inline_data(inode))
+ copied = ext4_write_inline_data_end(inode, pos, len,
+ copied, page);
+ else
+ copied = block_write_end(file, mapping, pos,
+ len, copied, page, fsdata);

/*
* No need to use i_size_read() here, the i_size
@@ -3462,7 +3667,8 @@ static inline void ext4_iget_extra_inode(struct inode *inode,
if (*magic == cpu_to_le32(EXT4_XATTR_MAGIC)) {
ext4_set_inode_state(inode, EXT4_STATE_XATTR);
ext4_find_inline_data(inode, iloc);
- }
+ } else
+ EXT4_I(inode)->i_inline_off = 0;
}

struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
--
1.7.0.4

2011-10-26 07:34:14

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 03/17] ext4: Add read support for inline data.

From: Tao Ma <[email protected]>

Let readpage and readpages handle the case when we
want to read an inlined file.

Signed-off-by: Tao Ma <[email protected].
---
fs/ext4/inode.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 68 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 017e119..2fa8cf4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -540,6 +540,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
int ret = 0, started = 0;
int dio_credits;

+ if (ext4_has_inline_data(inode))
+ return -ERANGE;
+
map.m_lblk = iblock;
map.m_len = bh->b_size >> inode->i_blkbits;

@@ -2483,6 +2486,12 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
journal_t *journal;
int err;

+ /*
+ * XXX: Can we arrive here for a inline file? Maybe not.
+ */
+ if (ext4_has_inline_data(inode))
+ return 0;
+
if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
test_opt(inode->i_sb, DELALLOC)) {
/*
@@ -2526,9 +2535,58 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
return generic_block_bmap(mapping, block, ext4_get_block);
}

+static int ext4_read_inline_page(struct inode *inode, struct page *page)
+{
+ void *kaddr;
+ loff_t size;
+ int ret;
+ struct ext4_iloc iloc;
+
+ BUG_ON(!PageLocked(page));
+ BUG_ON(!ext4_has_inline_data(inode));
+
+ ret = ext4_get_inode_loc(inode, &iloc);
+ if (ret)
+ return ret;
+
+ size = i_size_read(inode);
+
+ if (size > PAGE_CACHE_SIZE ||
+ size > ext4_get_max_inline_size(inode)) {
+ ext4_error_inode(inode, __func__, __LINE__, 0,
+ "bad size %ld for a inline file", (long)size);
+ return -EROFS;
+ }
+
+ kaddr = kmap_atomic(page, KM_USER0);
+ ext4_read_inline_data(inode, &iloc, kaddr, size);
+ flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
+ SetPageUptodate(page);
+
+ brelse(iloc.bh);
+ return 0;
+}
+
+static int ext4_readpage_inline(struct inode *inode, struct page *page)
+{
+ int ret;
+
+ ret = ext4_read_inline_page(inode, page);
+
+ unlock_page(page);
+ return ret;
+}
+
static int ext4_readpage(struct file *file, struct page *page)
{
+ struct inode *inode = file->f_mapping->host;
+
trace_ext4_readpage(page);
+
+ if (ext4_has_inline_data(inode))
+ return ext4_readpage_inline(inode, page);
+
return mpage_readpage(page, ext4_get_block);
}

@@ -2536,6 +2594,12 @@ static int
ext4_readpages(struct file *file, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages)
{
+ struct inode *inode = mapping->host;
+
+ /* If the file has inline data, no need to do readpages. */
+ if (ext4_has_inline_data(inode))
+ return 0;
+
return mpage_readpages(mapping, pages, nr_pages, ext4_get_block);
}

@@ -2854,6 +2918,10 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
struct inode *inode = file->f_mapping->host;
ssize_t ret;

+ /* Let buffer I/O handle the inline data case. */
+ if (ext4_has_inline_data(inode))
+ return 0;
+
trace_ext4_direct_IO_enter(inode, offset, iov_length(iov, nr_segs), rw);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
ret = ext4_ext_direct_IO(rw, iocb, iov, offset, nr_segs);
--
1.7.0.4

2011-10-26 07:34:13

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 02/17] ext4: Add the basic function for inline data support.

From: Tao Ma <[email protected]>

Implement inline data with xattr. This idea is inspired by Andreas.
So now we use "system.data" to store xattr.
For inode_size = 256, currently we uses all the space between i_extra_isize
and inode_size. For inode_size > 256, we use half of that space.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/ext4.h | 5 ++
fs/ext4/inode.c | 9 ++-
fs/ext4/xattr.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/xattr.h | 68 +++++++++++++++++
4 files changed, 295 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b7d7bd0..9a60193 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -846,6 +846,10 @@ struct ext4_inode_info {
/* on-disk additional length */
__u16 i_extra_isize;

+ /* Indicate the inline data space. */
+ u16 i_inline_off;
+ u16 i_inline_size;
+
#ifdef CONFIG_QUOTA
/* quota space reservation, managed internally by quota code */
qsize_t i_reserved_quota;
@@ -1261,6 +1265,7 @@ enum {
EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
EXT4_STATE_DELALLOC_RESERVED, /* blks already reserved for delalloc */
+ EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */
};

#define EXT4_INODE_BIT_FNS(name, field, offset) \
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6638f0e..017e119 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3386,12 +3386,15 @@ static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,

static inline void ext4_iget_extra_inode(struct inode *inode,
struct ext4_inode *raw_inode,
- struct ext4_inode_info *ei)
+ struct ext4_inode_info *ei,
+ struct ext4_iloc *iloc)
{
__le32 *magic = (void *)raw_inode +
EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize;
- if (*magic == cpu_to_le32(EXT4_XATTR_MAGIC))
+ if (*magic == cpu_to_le32(EXT4_XATTR_MAGIC)) {
ext4_set_inode_state(inode, EXT4_STATE_XATTR);
+ ext4_find_inline_data(inode, iloc);
+ }
}

struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
@@ -3505,7 +3508,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
ei->i_extra_isize = sizeof(struct ext4_inode) -
EXT4_GOOD_OLD_INODE_SIZE;
} else
- ext4_iget_extra_inode(inode, raw_inode, ei);
+ ext4_iget_extra_inode(inode, raw_inode, ei, &iloc);
} else
ei->i_extra_isize = 0;

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index c757adc..37418a9 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -944,6 +944,222 @@ ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
return 0;
}

+#define EXT4_XATTR_SYSTEM_DATA_NAME "data"
+
+/*
+ * the max inline size we can have for an inline inode.
+ *
+ * Currently we uses the maximum free space if inode size is 256
+ * and half of the space left if inode size > 256.
+ * This can be tuned later and the codes should work with
+ * the old size since if an inode has been initialized already,
+ * it will uses the value it detected.
+ */
+int ext4_get_max_inline_size(struct inode *inode)
+{
+ if (EXT4_SB(inode->i_sb)->s_inode_size == EXT4_GOOD_OLD_INODE_SIZE)
+ return 0;
+
+ if (EXT4_I(inode)->i_inline_off)
+ return EXT4_I(inode)->i_inline_size;
+
+ if (EXT4_SB(inode->i_sb)->s_inode_size == 256)
+ return EXT4_XATTR_SIZE(EXT4_SB(inode->i_sb)->s_inode_size -
+ EXT4_GOOD_OLD_INODE_SIZE -
+ EXT4_I(inode)->i_extra_isize -
+ sizeof(struct ext4_xattr_ibody_header) -
+ EXT4_XATTR_LEN(sizeof(EXT4_XATTR_SYSTEM_DATA_NAME)) -
+ EXT4_XATTR_ROUND);
+
+ return EXT4_XATTR_SIZE((EXT4_SB(inode->i_sb)->s_inode_size -
+ EXT4_GOOD_OLD_INODE_SIZE - EXT4_I(inode)->i_extra_isize) / 2);
+}
+
+int ext4_has_inline_data(struct inode *inode)
+{
+ return EXT4_I(inode)->i_inline_off;
+}
+
+int ext4_find_inline_data(struct inode *inode, struct ext4_iloc *iloc)
+{
+ struct ext4_xattr_ibody_find is = {
+ .s = { .not_found = -ENODATA, },
+ .iloc = *iloc,
+ };
+ struct ext4_xattr_info i = {
+ .name_index = EXT4_XATTR_INDEX_SYSTEM_DATA,
+ .name = EXT4_XATTR_SYSTEM_DATA_NAME,
+ };
+ int error;
+
+ if (EXT4_I(inode)->i_extra_isize == 0)
+ return 0;
+
+ error = ext4_xattr_ibody_find(inode, &i, &is);
+ if (error)
+ return error;
+ if (!is.s.not_found) {
+ EXT4_I(inode)->i_inline_off = (u16)((void *)is.s.here -
+ (void *)ext4_raw_inode(iloc));
+ EXT4_I(inode)->i_inline_size =
+ le32_to_cpu(is.s.here->e_value_size);
+ ext4_set_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
+ }
+ return 0;
+}
+
+void* ext4_get_inline_data_pos(struct inode *inode, struct ext4_iloc *iloc)
+{
+ struct ext4_xattr_entry *entry;
+ struct ext4_xattr_ibody_header *header;
+
+ BUG_ON(!EXT4_I(inode)->i_inline_off);
+
+ header = IHDR(inode, ext4_raw_inode(iloc));
+ entry = (struct ext4_xattr_entry *)((void *)ext4_raw_inode(iloc) +
+ EXT4_I(inode)->i_inline_off);
+
+ return (void *)IFIRST(header) + le16_to_cpu(entry->e_value_offs);
+
+}
+
+void ext4_read_inline_data(struct inode *inode, struct ext4_iloc *iloc,
+ void *buffer, size_t len)
+{
+ struct ext4_xattr_entry *entry;
+ struct ext4_xattr_ibody_header *header;
+
+ BUG_ON(!EXT4_I(inode)->i_inline_off);
+
+ header = IHDR(inode, ext4_raw_inode(iloc));
+ entry = (struct ext4_xattr_entry *)((void *)ext4_raw_inode(iloc) +
+ EXT4_I(inode)->i_inline_off);
+ BUG_ON(len > EXT4_I(inode)->i_inline_size);
+
+ memcpy(buffer,
+ (void *)IFIRST(header) + le16_to_cpu(entry->e_value_offs), len);
+}
+
+void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
+ void *buffer, loff_t pos, unsigned len)
+{
+ struct ext4_xattr_entry *entry;
+ struct ext4_xattr_ibody_header *header;
+
+ BUG_ON(!EXT4_I(inode)->i_inline_off);
+ BUG_ON(pos + len > EXT4_I(inode)->i_inline_size);
+
+ header = IHDR(inode, ext4_raw_inode(iloc));
+ entry = (struct ext4_xattr_entry *)((void *)ext4_raw_inode(iloc) +
+ EXT4_I(inode)->i_inline_off);
+ memcpy((void *)IFIRST(header) + le16_to_cpu(entry->e_value_offs) + pos,
+ buffer + pos, len);
+}
+
+int ext4_init_inline_data(handle_t *handle, struct inode *inode,
+ struct ext4_iloc *iloc)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ int size, error;
+ char *value;
+ struct ext4_xattr_ibody_find is = {
+ .s = { .not_found = -ENODATA, },
+ .iloc = *iloc,
+ };
+ struct ext4_xattr_info i = {
+ .name_index = EXT4_XATTR_INDEX_SYSTEM_DATA,
+ .name = EXT4_XATTR_SYSTEM_DATA_NAME,
+ };
+
+ if (!ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
+ return -ENOSPC;
+
+ BUG_ON(ei->i_inline_off != 0);
+
+ size = ext4_get_max_inline_size(inode);
+ /*
+ * XXX: We'd better try to check whether we can insert it into inode
+ * before allocating the value.
+ */
+ value = kzalloc(size, GFP_NOFS);
+ if (!value)
+ return -ENOMEM;
+
+ i.value = value;
+ i.value_len = size;
+
+ error = ext4_xattr_ibody_find(inode, &i, &is);
+ if (error)
+ goto out;
+
+ BUG_ON(!is.s.not_found);
+
+ error = ext4_journal_get_write_access(handle, iloc->bh);
+ if (error)
+ goto out;
+
+ error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+ if (error) {
+ if (error == -ENOSPC)
+ ext4_clear_inode_state(inode,
+ EXT4_STATE_MAY_INLINE_DATA);
+ goto out;
+ }
+
+ get_bh(iloc->bh);
+ error = ext4_mark_iloc_dirty(handle, inode, iloc);
+
+ EXT4_I(inode)->i_inline_off = (u16)((void *)is.s.here -
+ (void *)ext4_raw_inode(iloc));
+ EXT4_I(inode)->i_inline_size = size;
+ ext4_set_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
+out:
+ kfree(value);
+ return error;
+}
+
+int ext4_destroy_inline_data(handle_t *handle, struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_xattr_ibody_find is = {
+ .s = { .not_found = 0, },
+ };
+ struct ext4_xattr_info i = {
+ .name_index = EXT4_XATTR_INDEX_SYSTEM_DATA,
+ .name = EXT4_XATTR_SYSTEM_DATA_NAME,
+ .value = NULL,
+ .value_len = 0,
+ };
+ int error;
+
+ if (!ei->i_inline_off)
+ return 0;
+
+ error = ext4_get_inode_loc(inode, &is.iloc);
+ if (error)
+ return error;
+
+ error = ext4_xattr_ibody_find(inode, &i, &is);
+ if (error)
+ goto out;
+
+ error = ext4_journal_get_write_access(handle, is.iloc.bh);
+ if (error)
+ goto out;
+
+ error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+ if (error)
+ goto out;
+
+ error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
+
+ EXT4_I(inode)->i_inline_off = 0;
+ EXT4_I(inode)->i_inline_size = 0;
+ ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
+out:
+ return error;
+}
+
/*
* ext4_xattr_set_handle()
*
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 25b7387..ca3b05b 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -21,6 +21,7 @@
#define EXT4_XATTR_INDEX_TRUSTED 4
#define EXT4_XATTR_INDEX_LUSTRE 5
#define EXT4_XATTR_INDEX_SECURITY 6
+#define EXT4_XATTR_INDEX_SYSTEM_DATA 7

struct ext4_xattr_header {
__le32 h_magic; /* magic number for identification */
@@ -88,6 +89,18 @@ extern void ext4_exit_xattr(void);

extern const struct xattr_handler *ext4_xattr_handlers[];

+extern int ext4_has_inline_data(struct inode *inode);
+extern int ext4_get_max_inline_size(struct inode *inode);
+extern int ext4_find_inline_data(struct inode *inode, struct ext4_iloc *iloc);
+extern void* ext4_get_inline_data_pos(struct inode *inode,
+ struct ext4_iloc *iloc);
+extern void ext4_read_inline_data(struct inode *inode, struct ext4_iloc *iloc,
+ void *buffer, size_t len);
+extern void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
+ void *buffer, loff_t pos, unsigned len);
+extern int ext4_init_inline_data(handle_t *handle, struct inode *inode,
+ struct ext4_iloc *iloc);
+extern int ext4_destroy_inline_data(handle_t *handle, struct inode *inode);
# else /* CONFIG_EXT4_FS_XATTR */

static inline int
@@ -141,6 +154,61 @@ ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,

#define ext4_xattr_handlers NULL

+static inline int ext4_find_inline_data(struct inode *inode,
+ struct ext4_iloc *iloc)
+{
+ return 0;
+}
+
+static inline void* ext4_get_inline_data_pos(struct inode *inode,
+ struct ext4_iloc *iloc)
+{
+ return NULL;
+}
+
+static inline int ext4_has_inline_data(struct inode *inode)
+{
+ return 0;
+}
+
+static inline int ext4_get_max_inline_size(struct inode *inode)
+{
+ return 0;
+}
+
+static inline int ext4_find_inline_data(struct inode *inode,
+ struct ext4_iloc *iloc)
+{
+ return 0;
+}
+
+static inline void ext4_read_inline_data(struct inode *inode,
+ struct ext4_iloc *iloc,
+ void *buffer, size_t len)
+{
+ return;
+}
+
+static inline void ext4_write_inline_data(struct inode *inode,
+ struct ext4_iloc *iloc,
+ void *buffer, loff_t pos,
+ unsigned len)
+{
+ return;
+}
+
+static inline int ext4_init_inline_data(handle_t *handle,
+ struct inode *inode,
+ struct ext4_iloc *iloc)
+{
+ return 0;
+}
+
+static inline int ext4_destroy_inline_data(handle_t *handle,
+ struct inode *inode)
+{
+ return 0;
+}
# endif /* CONFIG_EXT4_FS_XATTR */

#ifdef CONFIG_EXT4_FS_SECURITY
--
1.7.0.4

2011-10-26 07:34:17

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 06/17] ext4: Add delalloc support for inline data.

From: Tao Ma <[email protected]>

For delayed allocation mode, we write to inline data if the file
is small enough. And in case of we write to some offset larger
than the inline size, the 1st page is dirtied, so that
ext4_da_writepages can handle the conversion. When the 1st page
is initialized with blocks, the inline part is removed.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/inode.c | 204 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 198 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bbd4b3c..a7da980 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2121,7 +2121,8 @@ static int ext4_da_writepages_trans_blocks(struct inode *inode)
* mpage_da_map_and_submit to map a single contiguous memory region
* and then write them.
*/
-static int write_cache_pages_da(struct address_space *mapping,
+static int write_cache_pages_da(handle_t *handle,
+ struct address_space *mapping,
struct writeback_control *wbc,
struct mpage_da_data *mpd,
pgoff_t *done_index)
@@ -2200,6 +2201,14 @@ static int write_cache_pages_da(struct address_space *mapping,
wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));

+ /*
+ * If we have inline data and arrive here, it means that
+ * we will soon create the block for the 1st page, so
+ * we'd better clear the inline data here.
+ */
+ if (!page->index && ext4_has_inline_data(inode))
+ ext4_destroy_inline_data(handle, inode);
+
if (mpd->next_page != page->index)
mpd->first_page = page->index;
mpd->next_page = page->index + 1;
@@ -2403,7 +2412,8 @@ retry:
* contiguous region of logical blocks that need
* blocks to be allocated by ext4 and submit them.
*/
- ret = write_cache_pages_da(mapping, wbc, &mpd, &done_index);
+ ret = write_cache_pages_da(handle, mapping,
+ wbc, &mpd, &done_index);
/*
* If we have a contiguous extent of pages and we
* haven't done the I/O yet, map the blocks and submit
@@ -2466,6 +2476,7 @@ out_writepages:
}

#define FALL_BACK_TO_NONDELALLOC 1
+#define CONVERT_INLINE_DATA 2
static int ext4_nonda_switch(struct super_block *sb)
{
s64 free_blocks, dirty_blocks;
@@ -2499,6 +2510,135 @@ static int ext4_nonda_switch(struct super_block *sb)
return 0;
}

+/*
+ * Try to make the page cache and handle ready for the inline data case.
+ * We can call this function in 2 cases:
+ * 1. The inode is created and the first write exceeds inine size. We can
+ * clear the inode state safely.
+ * 2. The inode has inline data, then we need to read the data, make it
+ * update and dirty so that ext4_da_writepages can handle it. We don't
+ * need to start the journal since the file's metatdata isn't changed now.
+ */
+static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
+ struct inode *inode,
+ unsigned flags,
+ void **fsdata)
+{
+ int ret, inline_size;
+ struct page *page;
+
+ if (!ext4_has_inline_data(inode)) {
+ ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
+ return 0;
+ }
+
+ inline_size = ext4_get_max_inline_size(inode);
+
+ page = grab_cache_page_write_begin(mapping, 0, flags);
+ if (!page)
+ return -ENOMEM;
+
+ if (!PageUptodate(page)) {
+ ret = ext4_read_inline_page(inode, page);
+ if (ret)
+ goto out;
+ }
+
+ ret = __block_write_begin(page, 0, inline_size,
+ ext4_da_get_block_prep);
+ if (ret) {
+ ext4_truncate_failed_write(inode);
+ goto out;
+ }
+
+ SetPageDirty(page);
+ SetPageUptodate(page);
+ ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
+ *fsdata = (void *)CONVERT_INLINE_DATA;
+
+out:
+ unlock_page(page);
+ page_cache_release(page);
+ return ret;
+}
+/*
+ * Prepare the write for the inline data.
+ * If the the data can be written into the inode, we just read
+ * the page and make it uptodate, and start the journal.
+ * Otherwise read the page, makes it dirty so that it can be
+ * handle in writepages(the i_disksize update is left to the
+ * normal ext4_da_write_end.
+ */
+static int ext4_da_write_inline_data_begin(struct address_space *mapping,
+ struct inode *inode,
+ loff_t pos, unsigned len,
+ unsigned flags,
+ struct page **pagep,
+ void **fsdata)
+{
+ int ret, inline_size;
+ handle_t *handle;
+ struct page *page;
+ struct ext4_iloc iloc;
+
+ ret = ext4_get_inode_loc(inode, &iloc);
+ if (ret)
+ return ret;
+
+ inline_size = ext4_get_max_inline_size(inode);
+
+ if (pos + len > inline_size)
+ return ext4_da_convert_inline_data_to_extent(mapping,
+ inode, flags,
+ fsdata);
+
+ handle = ext4_journal_start(inode, 1);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ handle = NULL;
+ goto out;
+ }
+
+ if (!ext4_has_inline_data(inode)) {
+ ret = ext4_init_inline_data(handle, inode, &iloc);
+ if (ret && ret != -ENOSPC)
+ goto out;
+
+ if (ret == -ENOSPC) {
+ ret = 0;
+ goto out;
+ }
+ }
+
+ /*
+ * We cannot recurse into the filesystem as the transaction
+ * is already started.
+ */
+ flags |= AOP_FLAG_NOFS;
+
+ page = grab_cache_page_write_begin(mapping, 0, flags);
+ if (!page) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ *pagep = page;
+
+ if (!PageUptodate(page)) {
+ ret = ext4_read_inline_page(inode, page);
+ if (ret)
+ goto out;
+ }
+
+ ret = 1;
+ handle = NULL;
+out:
+ if (handle)
+ ext4_journal_stop(handle);
+ brelse(iloc.bh);
+ return ret;
+}
+
static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
@@ -2518,6 +2658,19 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
}
*fsdata = (void *)0;
trace_ext4_da_write_begin(inode, pos, len, flags);
+
+ if (ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) {
+ ret = ext4_da_write_inline_data_begin(mapping, inode,
+ pos, len, flags,
+ pagep, fsdata);
+ if (ret < 0)
+ goto out;
+ if (ret == 1) {
+ ret = 0;
+ goto out;
+ }
+ }
+
retry:
/*
* With delayed allocation, we don't log the i_disksize update
@@ -2585,6 +2738,41 @@ static int ext4_da_should_update_i_disksize(struct page *page,
return 1;
}

+static int ext4_da_write_inline_data_end(struct inode *inode, loff_t pos,
+ unsigned len, unsigned copied,
+ struct page *page)
+{
+ int i_size_changed = 0;
+
+ copied = ext4_write_inline_data_end(inode, pos, len, copied, page);
+
+ /*
+ * No need to use i_size_read() here, the i_size
+ * cannot change under us because we hold i_mutex.
+ *
+ * But it's important to update i_size while still holding page lock:
+ * page writeout could otherwise come in and zero beyond i_size.
+ */
+ if (pos+copied > inode->i_size) {
+ i_size_write(inode, pos+copied);
+ i_size_changed = 1;
+ }
+
+ unlock_page(page);
+ page_cache_release(page);
+
+ /*
+ * Don't mark the inode dirty under page lock. First, it unnecessarily
+ * makes the holding time of page lock longer. Second, it forces lock
+ * ordering of page lock and transaction start for journaling
+ * filesystems.
+ */
+ if (i_size_changed)
+ mark_inode_dirty(inode);
+
+ return copied;
+}
+
static int ext4_da_write_end(struct file *file,
struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
@@ -2618,10 +2806,10 @@ static int ext4_da_write_end(struct file *file,
* changes. So let's piggyback the i_disksize mark_inode_dirty
* into that.
*/
-
new_i_size = pos + copied;
if (new_i_size > EXT4_I(inode)->i_disksize) {
- if (ext4_da_should_update_i_disksize(page, end)) {
+ if (ext4_has_inline_data(inode) ||
+ ext4_da_should_update_i_disksize(page, end)) {
down_write(&EXT4_I(inode)->i_data_sem);
if (new_i_size > EXT4_I(inode)->i_disksize) {
/*
@@ -2642,8 +2830,12 @@ static int ext4_da_write_end(struct file *file,
ext4_mark_inode_dirty(handle, inode);
}
}
- ret2 = generic_write_end(file, mapping, pos, len, copied,
- page, fsdata);
+ if (write_mode != CONVERT_INLINE_DATA && ext4_has_inline_data(inode))
+ ret2 = ext4_da_write_inline_data_end(inode, pos, len, copied,
+ page);
+ else
+ ret2 = generic_write_end(file, mapping, pos, len, copied,
+ page, fsdata);
copied = ret2;
if (ret2 < 0)
ret = ret2;
--
1.7.0.4

2011-10-26 07:34:19

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 08/17] ext4: Refactor __ext4_check_dir_entry to accepts start and size.

From: Tao Ma <[email protected]>

__ext4_check_dir_entry is used to check whether the de is over
the block boundary. Now with inline data, it could be within the
block boundary while exceeds the inode size. So check this function
to check the overflow more precisely.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/dir.c | 16 ++++++++--------
fs/ext4/ext4.h | 7 ++++---
fs/ext4/namei.c | 13 +++++++++----
3 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 164c560..6c2199e 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -68,7 +68,7 @@ static unsigned char get_dtype(struct super_block *sb, int filetype)
int __ext4_check_dir_entry(const char *function, unsigned int line,
struct inode *dir, struct file *filp,
struct ext4_dir_entry_2 *de,
- struct buffer_head *bh,
+ struct buffer_head *bh, char *buf, int size,
unsigned int offset)
{
const char *error_msg = NULL;
@@ -81,9 +81,8 @@ int __ext4_check_dir_entry(const char *function, unsigned int line,
error_msg = "rec_len % 4 != 0";
else if (unlikely(rlen < EXT4_DIR_REC_LEN(de->name_len)))
error_msg = "rec_len is too small for name_len";
- else if (unlikely(((char *) de - bh->b_data) + rlen >
- dir->i_sb->s_blocksize))
- error_msg = "directory entry across blocks";
+ else if (unlikely(((char *) de - buf) + rlen > size))
+ error_msg = "directory entry across range";
else if (unlikely(le32_to_cpu(de->inode) >
le32_to_cpu(EXT4_SB(dir->i_sb)->s_es->s_inodes_count)))
error_msg = "inode out of bounds";
@@ -94,14 +93,14 @@ int __ext4_check_dir_entry(const char *function, unsigned int line,
ext4_error_file(filp, function, line, bh ? bh->b_blocknr : 0,
"bad entry in directory: %s - offset=%u(%u), "
"inode=%u, rec_len=%d, name_len=%d",
- error_msg, (unsigned) (offset%bh->b_size),
+ error_msg, (unsigned) (offset%size),
offset, le32_to_cpu(de->inode),
rlen, de->name_len);
else
ext4_error_inode(dir, function, line, bh ? bh->b_blocknr : 0,
"bad entry in directory: %s - offset=%u(%u), "
"inode=%u, rec_len=%d, name_len=%d",
- error_msg, (unsigned) (offset%bh->b_size),
+ error_msg, (unsigned) (offset%size),
offset, le32_to_cpu(de->inode),
rlen, de->name_len);

@@ -210,8 +209,9 @@ revalidate:
while (!error && filp->f_pos < inode->i_size
&& offset < sb->s_blocksize) {
de = (struct ext4_dir_entry_2 *) (bh->b_data + offset);
- if (ext4_check_dir_entry(inode, filp, de,
- bh, offset)) {
+ if (ext4_check_dir_entry(inode, filp, de, bh,
+ bh->b_data, bh->b_size,
+ offset)) {
/*
* On error, skip the f_pos to the next block
*/
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9a60193..585ccca 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1762,10 +1762,11 @@ ext4_fsblk_t ext4_inode_to_goal_block(struct inode *);
extern int __ext4_check_dir_entry(const char *, unsigned int, struct inode *,
struct file *,
struct ext4_dir_entry_2 *,
- struct buffer_head *, unsigned int);
-#define ext4_check_dir_entry(dir, filp, de, bh, offset) \
+ struct buffer_head *, char *, int,
+ unsigned int);
+#define ext4_check_dir_entry(dir, filp, de, bh, buf, size, offset) \
unlikely(__ext4_check_dir_entry(__func__, __LINE__, (dir), (filp), \
- (de), (bh), (offset)))
+ (de), (bh), (buf), (size), (offset)))
extern int ext4_htree_store_dirent(struct file *dir_file, __u32 hash,
__u32 minor_hash,
struct ext4_dir_entry_2 *dirent);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index c2ca5f4..d7661cb 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -583,6 +583,7 @@ static int htree_dirblock_to_tree(struct file *dir_file,
EXT4_DIR_REC_LEN(0));
for (; de < top; de = ext4_next_entry(de, dir->i_sb->s_blocksize)) {
if (ext4_check_dir_entry(dir, NULL, de, bh,
+ bh->b_data, bh->b_size,
(block<<EXT4_BLOCK_SIZE_BITS(dir->i_sb))
+ ((char *)de - bh->b_data))) {
/* On error, skip the f_pos to the next block. */
@@ -821,7 +822,8 @@ static inline int search_dirblock(struct buffer_head *bh,
if ((char *) de + namelen <= dlimit &&
ext4_match (namelen, name, de)) {
/* found a match - just to be sure, do a full check */
- if (ext4_check_dir_entry(dir, NULL, de, bh, offset))
+ if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data,
+ bh->b_size, offset))
return -1;
*res_dir = de;
return 1;
@@ -1267,7 +1269,8 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
de = (struct ext4_dir_entry_2 *)bh->b_data;
top = bh->b_data + blocksize - reclen;
while ((char *) de <= top) {
- if (ext4_check_dir_entry(dir, NULL, de, bh, offset))
+ if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data,
+ bh->b_size, offset))
return -EIO;
if (ext4_match(namelen, name, de))
return -EEXIST;
@@ -1650,7 +1653,8 @@ static int ext4_delete_entry(handle_t *handle,
pde = NULL;
de = (struct ext4_dir_entry_2 *) bh->b_data;
while (i < bh->b_size) {
- if (ext4_check_dir_entry(dir, NULL, de, bh, i))
+ if (ext4_check_dir_entry(dir, NULL, de, bh,
+ bh->b_data, bh->b_size, i))
return -EIO;
if (de == de_del) {
BUFFER_TRACE(bh, "get_write_access");
@@ -1964,7 +1968,8 @@ static int empty_dir(struct inode *inode)
}
de = (struct ext4_dir_entry_2 *) bh->b_data;
}
- if (ext4_check_dir_entry(inode, NULL, de, bh, offset)) {
+ if (ext4_check_dir_entry(inode, NULL, de, bh,
+ bh->b_data, bh->b_size, offset)) {
de = (struct ext4_dir_entry_2 *)(bh->b_data +
sb->s_blocksize);
offset = (offset | (sb->s_blocksize - 1)) + 1;
--
1.7.0.4

2011-10-26 07:34:21

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 10/17] ext4: let add_dir_entry handle inline data properly.

From: Tao Ma <[email protected]>

This patch let add_dir_entry handle the inline data case. So the
dir is initialized as inline dir first and then we can try to add
some files to it, when the inline space can't hold all the entries,
a dir block will be created and the dir entry will be moved to it.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/ext4.h | 2 +
fs/ext4/namei.c | 193 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 195 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 585ccca..4319c95 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2178,6 +2178,8 @@ static inline void ext4_mark_super_dirty(struct super_block *sb)

/* dir.c */
extern const struct file_operations ext4_dir_operations;
+extern void ext4_show_inline_dir(struct inode *dir, struct buffer_head *bh,
+ void *inline_start, int inline_size);

/* file.c */
extern const struct inode_operations ext4_file_inode_operations;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2e83b92..4f36137 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -23,6 +23,7 @@
* Hash Tree Directory indexing cleanup
* Theodore Ts'o, 2002
*/
+#define INLINE_DIR_DEBUG

#include <linux/fs.h>
#include <linux/pagemap.h>
@@ -1467,6 +1468,147 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
return retval;
}

+void ext4_show_inline_dir(struct inode *dir, struct buffer_head *bh,
+ void *inline_start, int inline_size)
+{
+ int offset;
+ unsigned short de_len;
+ struct ext4_dir_entry_2 *de = inline_start;
+ void *dlimit = inline_start + inline_size;
+
+ trace_printk("inode %lu %u\n", dir->i_ino, EXT4_I(dir)->i_inline_off);
+ offset = 0;
+ while ((void *)de < dlimit) {
+ de_len = ext4_rec_len_from_disk(de->rec_len, inline_size);
+ trace_printk("de: offset %u reclen %u name %*.s "
+ "namelen %u ino %u\n",
+ offset, de_len, de->name_len, de->name,
+ de->name_len, le32_to_cpu(de->inode));
+ if (ext4_check_dir_entry(dir, NULL, de, bh,
+ inline_start, inline_size, offset))
+ BUG();
+
+ offset += de_len;
+ de = (struct ext4_dir_entry_2 *) ((char *) de + de_len);
+ }
+}
+
+/*
+ * Add a new entry into a inline dir.
+ * It will return -ENOSPC if no space is available, and -EIO
+ * and -EEXIST if directory entry already exists.
+ */
+static int ext4_add_dirent_to_inline(handle_t *handle,
+ struct dentry *dentry,
+ struct inode *inode,
+ struct ext4_iloc *iloc,
+ void *inline_start, int inline_size)
+{
+ struct inode *dir = dentry->d_parent->d_inode;
+ const char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned short reclen;
+ int err;
+ struct ext4_dir_entry_2 *de;
+
+ reclen = EXT4_DIR_REC_LEN(namelen);
+ err = __ext4_find_dest_de(dir, inode, iloc->bh,
+ inline_start, inline_size,
+ name, namelen, &de);
+ if (err)
+ return err;
+
+ err = ext4_journal_get_write_access(handle, iloc->bh);
+ if (err)
+ return err;
+ __ext4_insert_dentry(inode, de, inline_size, name, namelen);
+
+ ext4_show_inline_dir(dir, iloc->bh, inline_start, inline_size);
+
+ /*
+ * XXX shouldn't update any times until successful
+ * completion of syscall, but too many callers depend
+ * on this.
+ *
+ * XXX similarly, too many callers depend on
+ * ext4_new_inode() setting the times, but error
+ * recovery deletes the inode, so the worst that can
+ * happen is that the times are slightly out of date
+ * and/or different from the directory change time.
+ */
+ dir->i_mtime = dir->i_ctime = ext4_current_time(dir);
+ ext4_update_dx_flag(dir);
+ dir->i_version++;
+ ext4_mark_inode_dirty(handle, dir);
+ return 1;
+}
+
+/*
+ * Try to add the new entry to the inline data.
+ * If succeeds, return 0. If not, extended the inline dir and copied data to
+ * the new created block.
+ */
+static int ext4_try_add_inline_entry(handle_t *handle, struct dentry *dentry,
+ struct inode *inode)
+{
+ int ret, inline_size, de_len;
+ void *inline_start, *limit, *de_buf;
+ struct buffer_head *dir_block;
+ struct ext4_dir_entry_2 *de, *prev_de;
+ struct ext4_iloc iloc;
+ int blocksize = inode->i_sb->s_blocksize;
+ struct inode *dir = dentry->d_parent->d_inode;
+
+ ret = ext4_get_inode_loc(dir, &iloc);
+ if (ret)
+ return ret;
+
+ inline_start = ext4_get_inline_data_pos(dir, &iloc);
+ inline_size = ext4_get_max_inline_size(dir);
+
+ ret = ext4_add_dirent_to_inline(handle, dentry, inode, &iloc,
+ inline_start, inline_size);
+ if (ret != -ENOSPC)
+ goto out;
+
+ /* The inline space is filled up, so create a new block for it. */
+ dir->i_size = EXT4_I(dir)->i_disksize = blocksize;
+ dir_block = ext4_bread(handle, dir, 0, 1, &ret);
+ if (!dir_block)
+ goto out;
+
+ BUFFER_TRACE(dir_block, "get_write_access");
+ ret = ext4_journal_get_write_access(handle, dir_block);
+ if (ret)
+ goto out;
+ memcpy(dir_block->b_data, inline_start, inline_size);
+
+ /* Set the final de to cover the whole block. */
+ de_buf = dir_block->b_data;
+ de = (struct ext4_dir_entry_2 *)de_buf;
+ limit = de_buf + inline_size;
+ do {
+ prev_de = de;
+ de_len = ext4_rec_len_from_disk(de->rec_len, inline_size);
+ de_buf += de_len;
+ de = (struct ext4_dir_entry_2 *)de_buf;
+ } while (de_buf < limit);
+
+ prev_de->rec_len = ext4_rec_len_to_disk(de_len + blocksize -
+ inline_size, blocksize);
+
+ BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
+ ret = ext4_handle_dirty_metadata(handle, dir, dir_block);
+
+ /* clear the entry and the flag in inode now. */
+ ret = ext4_destroy_inline_data(handle, dir);
+out:
+ brelse(dir_block);
+ ext4_mark_inode_dirty(handle, dir);
+ brelse(iloc.bh);
+ return ret;
+}
+
/*
* ext4_add_entry()
*
@@ -1493,6 +1635,17 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
blocksize = sb->s_blocksize;
if (!dentry->d_name.len)
return -EINVAL;
+
+ if (ext4_has_inline_data(dir)) {
+ retval = ext4_try_add_inline_entry(handle, dentry, inode);
+ if (retval < 0)
+ return retval;
+ if (retval == 1) {
+ retval = 0;
+ return retval;
+ }
+ }
+
if (is_dx(dir)) {
retval = ext4_dx_add_entry(handle, dentry, inode);
if (!retval || (retval != ERR_BAD_DX_DIR))
@@ -1862,6 +2015,38 @@ static void ext4_init_dot_dotdot(struct inode *parent, struct inode *inode,
inode->i_nlink = 2;
}

+/*
+ * Try to create the inline data for the new dir.
+ * If it succeeds, return 0, otherwise return the error.
+ * In case of ENOSPC, the caller should create the normal disk layout dir.
+ */
+static int ext4_try_create_inline_dir(handle_t *handle, struct inode *parent,
+ struct inode *inode)
+{
+ int ret, inline_size;
+ struct ext4_iloc iloc;
+ void *inline_pos;
+ struct ext4_dir_entry_2 *de;
+
+ ret = ext4_get_inode_loc(inode, &iloc);
+ if (ret)
+ return ret;
+
+ ret = ext4_init_inline_data(handle, inode, &iloc);
+ if (ret)
+ goto out;
+
+ inline_pos = ext4_get_inline_data_pos(inode, &iloc);
+ inline_size = ext4_get_max_inline_size(inode);
+
+ de = (struct ext4_dir_entry_2 *)inline_pos;
+ ext4_init_dot_dotdot(parent, inode, de, inline_size);
+ inode->i_size = EXT4_I(inode)->i_disksize = inline_size;
+out:
+ brelse(iloc.bh);
+ return ret;
+}
+
static int ext4_init_new_dir(handle_t *handle, struct inode *parent,
struct inode *inode)
{
@@ -1870,6 +2055,14 @@ static int ext4_init_new_dir(handle_t *handle, struct inode *parent,
int err;
int blocksize = inode->i_sb->s_blocksize;

+ if (ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) {
+ err = ext4_try_create_inline_dir(handle, parent, inode);
+ if (err < 0 && err != ENOSPC)
+ goto out;
+ if (!err)
+ goto out;
+ }
+
inode->i_size = EXT4_I(inode)->i_disksize = blocksize;
dir_block = ext4_bread(handle, inode, 0, 1, &err);
if (!dir_block)
--
1.7.0.4

2011-10-26 07:34:20

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 09/17] ext4: Create __ext4_insert_dentry for dir entry insertion.

From: Tao Ma <[email protected]>

The old add_dirent_to_buf handles all the work related to the
work of adding dir entry to a dir block. Now we have inline data,
so create 2 new function __ext4_find_dest_de and __ext4_insert_dentry
that do the real work and let add_dirent_to_buf call them.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/namei.c | 104 ++++++++++++++++++++++++++++++++++++-------------------
1 files changed, 68 insertions(+), 36 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index d7661cb..2e83b92 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1243,6 +1243,66 @@ errout:
return NULL;
}

+static inline int __ext4_find_dest_de(struct inode *dir, struct inode *inode,
+ struct buffer_head *bh,
+ void *buf, int buf_size,
+ const char *name, int namelen,
+ struct ext4_dir_entry_2 **dest_de)
+{
+ struct ext4_dir_entry_2 *de;
+ unsigned short reclen = EXT4_DIR_REC_LEN(namelen);
+ int nlen, rlen;
+ unsigned int offset = 0;
+ char *top;
+
+ de = (struct ext4_dir_entry_2 *)buf;
+ top = buf + buf_size - reclen;
+ while ((char *) de <= top) {
+ if (ext4_check_dir_entry(dir, NULL, de, bh,
+ buf, buf_size, offset))
+ return -EIO;
+ if (ext4_match(namelen, name, de))
+ return -EEXIST;
+ nlen = EXT4_DIR_REC_LEN(de->name_len);
+ rlen = ext4_rec_len_from_disk(de->rec_len, buf_size);
+ if ((de->inode ? rlen - nlen : rlen) >= reclen)
+ break;
+ de = (struct ext4_dir_entry_2 *)((char *)de + rlen);
+ offset += rlen;
+ }
+ if ((char *) de > top)
+ return -ENOSPC;
+
+ *dest_de = de;
+ return 0;
+}
+
+static inline void __ext4_insert_dentry(struct inode *inode,
+ struct ext4_dir_entry_2 *de,
+ int buf_size,
+ const char *name, int namelen)
+{
+
+ int nlen, rlen;
+
+ nlen = EXT4_DIR_REC_LEN(de->name_len);
+ rlen = ext4_rec_len_from_disk(de->rec_len, buf_size);
+ if (de->inode) {
+ struct ext4_dir_entry_2 *de1 =
+ (struct ext4_dir_entry_2 *)((char *)de + nlen);
+ de1->rec_len = ext4_rec_len_to_disk(rlen - nlen, buf_size);
+ de->rec_len = ext4_rec_len_to_disk(nlen, buf_size);
+ de = de1;
+ }
+ de->file_type = EXT4_FT_UNKNOWN;
+ if (inode) {
+ de->inode = cpu_to_le32(inode->i_ino);
+ ext4_set_de_type(inode->i_sb, de, inode->i_mode);
+ } else
+ de->inode = 0;
+ de->name_len = namelen;
+ memcpy(de->name, name, namelen);
+}
/*
* Add a new entry into a directory (leaf) block. If de is non-NULL,
* it points to a directory entry which is guaranteed to be large
@@ -1258,31 +1318,17 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
struct inode *dir = dentry->d_parent->d_inode;
const char *name = dentry->d_name.name;
int namelen = dentry->d_name.len;
- unsigned int offset = 0;
unsigned int blocksize = dir->i_sb->s_blocksize;
unsigned short reclen;
- int nlen, rlen, err;
- char *top;
+ int err;

reclen = EXT4_DIR_REC_LEN(namelen);
if (!de) {
- de = (struct ext4_dir_entry_2 *)bh->b_data;
- top = bh->b_data + blocksize - reclen;
- while ((char *) de <= top) {
- if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data,
- bh->b_size, offset))
- return -EIO;
- if (ext4_match(namelen, name, de))
- return -EEXIST;
- nlen = EXT4_DIR_REC_LEN(de->name_len);
- rlen = ext4_rec_len_from_disk(de->rec_len, blocksize);
- if ((de->inode? rlen - nlen: rlen) >= reclen)
- break;
- de = (struct ext4_dir_entry_2 *)((char *)de + rlen);
- offset += rlen;
- }
- if ((char *) de > top)
- return -ENOSPC;
+ err = __ext4_find_dest_de(dir, inode,
+ bh, bh->b_data, blocksize,
+ name, namelen, &de);
+ if (err)
+ return err;
}
BUFFER_TRACE(bh, "get_write_access");
err = ext4_journal_get_write_access(handle, bh);
@@ -1292,22 +1338,8 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
}

/* By now the buffer is marked for journaling */
- nlen = EXT4_DIR_REC_LEN(de->name_len);
- rlen = ext4_rec_len_from_disk(de->rec_len, blocksize);
- if (de->inode) {
- struct ext4_dir_entry_2 *de1 = (struct ext4_dir_entry_2 *)((char *)de + nlen);
- de1->rec_len = ext4_rec_len_to_disk(rlen - nlen, blocksize);
- de->rec_len = ext4_rec_len_to_disk(nlen, blocksize);
- de = de1;
- }
- de->file_type = EXT4_FT_UNKNOWN;
- if (inode) {
- de->inode = cpu_to_le32(inode->i_ino);
- ext4_set_de_type(dir->i_sb, de, inode->i_mode);
- } else
- de->inode = 0;
- de->name_len = namelen;
- memcpy(de->name, name, namelen);
+ __ext4_insert_dentry(inode, de, blocksize, name, namelen);
+
/*
* XXX shouldn't update any times until successful
* completion of syscall, but too many callers depend
--
1.7.0.4

2011-10-26 07:34:16

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 05/17] ext4: Add journalled write support for inline data.

From: Tao Ma <[email protected]>

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/inode.c | 81 ++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 63 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 876cdfd..bbd4b3c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1196,16 +1196,21 @@ static int ext4_journalled_write_end(struct file *file,

BUG_ON(!ext4_handle_valid(handle));

- if (copied < len) {
- if (!PageUptodate(page))
- copied = 0;
- page_zero_new_buffers(page, from+copied, to);
- }
+ if (ext4_has_inline_data(inode))
+ copied = ext4_write_inline_data_end(inode, pos, len,
+ copied, page);
+ else {
+ if (copied < len) {
+ if (!PageUptodate(page))
+ copied = 0;
+ page_zero_new_buffers(page, from+copied, to);
+ }

- ret = walk_page_buffers(handle, page_buffers(page), from,
- to, &partial, write_end_fn);
- if (!partial)
- SetPageUptodate(page);
+ ret = walk_page_buffers(handle, page_buffers(page), from,
+ to, &partial, write_end_fn);
+ if (!partial)
+ SetPageUptodate(page);
+ }
new_i_size = pos + copied;
if (new_i_size > inode->i_size)
i_size_write(inode, pos+copied);
@@ -1893,20 +1898,50 @@ static int bput_one(handle_t *handle, struct buffer_head *bh)
return 0;
}

+static struct buffer_head *
+ext4_journalled_write_inline_data(struct inode *inode,
+ unsigned len,
+ struct page *page)
+{
+ int ret;
+ void *kaddr;
+ struct ext4_iloc iloc;
+
+ ret = ext4_get_inode_loc(inode, &iloc);
+ if (ret) {
+ ext4_std_error(inode->i_sb, ret);
+ return NULL;
+ }
+
+ kaddr = kmap_atomic(page, KM_USER0);
+ ext4_write_inline_data(inode, &iloc, kaddr, 0, len);
+ kunmap_atomic(kaddr, KM_USER0);
+
+ return iloc.bh;
+}
+
static int __ext4_journalled_writepage(struct page *page,
unsigned int len)
{
struct address_space *mapping = page->mapping;
struct inode *inode = mapping->host;
- struct buffer_head *page_bufs;
+ struct buffer_head *page_bufs = NULL;
handle_t *handle = NULL;
int ret = 0;
int err;
+ struct buffer_head *inode_bh = NULL;

ClearPageChecked(page);
- page_bufs = page_buffers(page);
- BUG_ON(!page_bufs);
- walk_page_buffers(handle, page_bufs, 0, len, NULL, bget_one);
+
+ if (ext4_has_inline_data(inode)) {
+ BUG_ON(page->index != 0);
+ BUG_ON(len > ext4_get_max_inline_size(inode));
+ inode_bh = ext4_journalled_write_inline_data(inode, len, page);
+ } else {
+ page_bufs = page_buffers(page);
+ BUG_ON(!page_bufs);
+ walk_page_buffers(handle, page_bufs, 0, len, NULL, bget_one);
+ }
/* As soon as we unlock the page, it can go away, but we have
* references to buffers so we are safe */
unlock_page(page);
@@ -1919,11 +1954,19 @@ static int __ext4_journalled_writepage(struct page *page,

BUG_ON(!ext4_handle_valid(handle));

- ret = walk_page_buffers(handle, page_bufs, 0, len, NULL,
- do_journal_get_write_access);
+ if (ext4_has_inline_data(inode) && inode_bh) {
+ ret = ext4_journal_get_write_access(handle, inode_bh);
+
+ err = ext4_handle_dirty_metadata(handle, inode, inode_bh);
+
+ } else {
+ ret = walk_page_buffers(handle, page_bufs, 0, len, NULL,
+ do_journal_get_write_access);
+
+ err = walk_page_buffers(handle, page_bufs, 0, len, NULL,
+ write_end_fn);
+ }

- err = walk_page_buffers(handle, page_bufs, 0, len, NULL,
- write_end_fn);
if (ret == 0)
ret = err;
EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
@@ -1931,9 +1974,11 @@ static int __ext4_journalled_writepage(struct page *page,
if (!ret)
ret = err;

- walk_page_buffers(handle, page_bufs, 0, len, NULL, bput_one);
+ if (!ext4_has_inline_data(inode))
+ walk_page_buffers(handle, page_bufs, 0, len, NULL, bput_one);
ext4_set_inode_state(inode, EXT4_STATE_JDATA);
out:
+ brelse(inode_bh);
return ret;
}

--
1.7.0.4

2011-10-26 07:34:18

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 07/17] ext4: Create a new function ext4_init_new_dir.

From: Tao Ma <[email protected]>

Currently, the initialization of dot and dotdot are encapsulated in
ext4_mkdir and also bond with dir_block. So create a new function
named ext4_init_new_dir and the initialization is moved to
ext4_init_dot_dotdot which only accepts a 'de'.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/namei.c | 74 +++++++++++++++++++++++++++++++++---------------------
1 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 6e7779c..c2ca5f4 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1807,13 +1807,54 @@ retry:
return err;
}

+static void ext4_init_dot_dotdot(struct inode *parent, struct inode *inode,
+ struct ext4_dir_entry_2 *de, int blocksize)
+{
+ de->inode = cpu_to_le32(inode->i_ino);
+ de->name_len = 1;
+ de->rec_len = ext4_rec_len_to_disk(EXT4_DIR_REC_LEN(de->name_len),
+ blocksize);
+ strcpy(de->name, ".");
+ ext4_set_de_type(parent->i_sb, de, S_IFDIR);
+ de = ext4_next_entry(de, blocksize);
+ de->inode = cpu_to_le32(parent->i_ino);
+ de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(1),
+ blocksize);
+ de->name_len = 2;
+ strcpy(de->name, "..");
+ ext4_set_de_type(parent->i_sb, de, S_IFDIR);
+ inode->i_nlink = 2;
+}
+
+static int ext4_init_new_dir(handle_t *handle, struct inode *parent,
+ struct inode *inode)
+{
+ struct buffer_head *dir_block = NULL;
+ struct ext4_dir_entry_2 *de;
+ int err;
+ int blocksize = inode->i_sb->s_blocksize;
+
+ inode->i_size = EXT4_I(inode)->i_disksize = blocksize;
+ dir_block = ext4_bread(handle, inode, 0, 1, &err);
+ if (!dir_block)
+ goto out;
+ BUFFER_TRACE(dir_block, "get_write_access");
+ err = ext4_journal_get_write_access(handle, dir_block);
+ if (err)
+ goto out;
+ de = (struct ext4_dir_entry_2 *)dir_block->b_data;
+ ext4_init_dot_dotdot(parent, inode, de, blocksize);
+ BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
+ err = ext4_handle_dirty_metadata(handle, inode, dir_block);
+out:
+ brelse(dir_block);
+ return err;
+}
+
static int ext4_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
handle_t *handle;
struct inode *inode;
- struct buffer_head *dir_block = NULL;
- struct ext4_dir_entry_2 *de;
- unsigned int blocksize = dir->i_sb->s_blocksize;
int err, retries = 0;

if (EXT4_DIR_LINK_MAX(dir))
@@ -1839,31 +1880,7 @@ retry:

inode->i_op = &ext4_dir_inode_operations;
inode->i_fop = &ext4_dir_operations;
- inode->i_size = EXT4_I(inode)->i_disksize = inode->i_sb->s_blocksize;
- dir_block = ext4_bread(handle, inode, 0, 1, &err);
- if (!dir_block)
- goto out_clear_inode;
- BUFFER_TRACE(dir_block, "get_write_access");
- err = ext4_journal_get_write_access(handle, dir_block);
- if (err)
- goto out_clear_inode;
- de = (struct ext4_dir_entry_2 *) dir_block->b_data;
- de->inode = cpu_to_le32(inode->i_ino);
- de->name_len = 1;
- de->rec_len = ext4_rec_len_to_disk(EXT4_DIR_REC_LEN(de->name_len),
- blocksize);
- strcpy(de->name, ".");
- ext4_set_de_type(dir->i_sb, de, S_IFDIR);
- de = ext4_next_entry(de, blocksize);
- de->inode = cpu_to_le32(dir->i_ino);
- de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(1),
- blocksize);
- de->name_len = 2;
- strcpy(de->name, "..");
- ext4_set_de_type(dir->i_sb, de, S_IFDIR);
- inode->i_nlink = 2;
- BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, inode, dir_block);
+ err = ext4_init_new_dir(handle, dir, inode);
if (err)
goto out_clear_inode;
err = ext4_mark_inode_dirty(handle, inode);
@@ -1885,7 +1902,6 @@ out_clear_inode:
d_instantiate(dentry, inode);
unlock_new_inode(inode);
out_stop:
- brelse(dir_block);
ext4_journal_stop(handle);
if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
goto retry;
--
1.7.0.4

2011-10-26 07:34:27

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 16/17] ext4: let ext4_rename handle inline dir.

From: Tao Ma <[email protected]>

In case of we rename a dir, ext4_rename has to read the dir block
and change its dotdot's information. The old ext4_rename encapsulated
the dir_block read into itself. So this patch try to add a new function
ext4_get_dir_block which get the dir buffer information so the
ext4_rename can handle it properly.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/namei.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 343af6c..cb1b396 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2777,6 +2777,43 @@ retry:
(ext4_next_entry((struct ext4_dir_entry_2 *)(buffer), size)->inode)

/*
+ * Try to find buffer head where contains the parent block.
+ * It should be the inode block if it is inlined or the 1st block
+ * if it is a normal dir.
+ */
+static struct buffer_head *ext4_get_dir_block(handle_t *handle,
+ struct inode *inode,
+ void **buf,
+ int *buf_size,
+ int *retval)
+{
+ struct buffer_head *bh;
+ void *inline_start;
+ int inline_size;
+ struct ext4_iloc iloc;
+
+ if (!ext4_has_inline_data(inode)) {
+ bh = ext4_bread(handle, inode, 0, 0, retval);
+ if (!bh)
+ return NULL;
+ *buf = bh->b_data;
+ *buf_size = inode->i_sb->s_blocksize;
+ return bh;
+ }
+
+ *retval = ext4_get_inode_loc(inode, &iloc);
+ if (*retval)
+ return NULL;
+
+ inline_start = ext4_get_inline_data_pos(inode, &iloc);
+ inline_size = ext4_get_max_inline_size(inode);
+
+ *buf = inline_start;
+ *buf_size = inline_size;
+ return iloc.bh;
+}
+
+/*
* Anybody can rename anything with this: the permission checks are left to the
* higher-level routines.
*/
@@ -2787,7 +2824,8 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
struct inode *old_inode, *new_inode;
struct buffer_head *old_bh, *new_bh, *dir_bh;
struct ext4_dir_entry_2 *old_de, *new_de;
- int retval, force_da_alloc = 0;
+ int buf_size, retval, force_da_alloc = 0;
+ void *dir_buf = NULL;

dquot_initialize(old_dir);
dquot_initialize(new_dir);
@@ -2834,11 +2872,12 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
goto end_rename;
}
retval = -EIO;
- dir_bh = ext4_bread(handle, old_inode, 0, 0, &retval);
+ dir_bh = ext4_get_dir_block(handle, old_inode,
+ &dir_buf, &buf_size, &retval);
if (!dir_bh)
goto end_rename;
- if (le32_to_cpu(PARENT_INO(dir_bh->b_data,
- old_dir->i_sb->s_blocksize)) != old_dir->i_ino)
+ if (le32_to_cpu(PARENT_INO(dir_buf,
+ buf_size)) != old_dir->i_ino)
goto end_rename;
retval = -EMLINK;
if (!new_inode && new_dir != old_dir &&
@@ -2918,8 +2957,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
old_dir->i_ctime = old_dir->i_mtime = ext4_current_time(old_dir);
ext4_update_dx_flag(old_dir);
if (dir_bh) {
- PARENT_INO(dir_bh->b_data, new_dir->i_sb->s_blocksize) =
- cpu_to_le32(new_dir->i_ino);
+ PARENT_INO(dir_buf, buf_size) = cpu_to_le32(new_dir->i_ino);
BUFFER_TRACE(dir_bh, "call ext4_handle_dirty_metadata");
retval = ext4_handle_dirty_metadata(handle, old_inode, dir_bh);
if (retval) {
--
1.7.0.4

2011-10-26 07:34:23

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 12/17] ext4: Create a new function search_dir.

From: Tao Ma <[email protected]>

search_dirblock is used to search a dir block, but the code is
almost the same for searching an inline dir.

So create a new fuction search_dir and let search_dirblock
call it.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/namei.c | 26 +++++++++++++++++++-------
1 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 4f36137..2c388ad 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -802,11 +802,13 @@ static inline int ext4_match (int len, const char * const name,
/*
* Returns 0 if not found, -1 on failure, and 1 on success
*/
-static inline int search_dirblock(struct buffer_head *bh,
- struct inode *dir,
- const struct qstr *d_name,
- unsigned int offset,
- struct ext4_dir_entry_2 ** res_dir)
+static inline int search_dir(struct buffer_head *bh,
+ char *search_buf,
+ int buf_size,
+ struct inode *dir,
+ const struct qstr *d_name,
+ unsigned int offset,
+ struct ext4_dir_entry_2 **res_dir)
{
struct ext4_dir_entry_2 * de;
char * dlimit;
@@ -814,8 +816,8 @@ static inline int search_dirblock(struct buffer_head *bh,
const char *name = d_name->name;
int namelen = d_name->len;

- de = (struct ext4_dir_entry_2 *) bh->b_data;
- dlimit = bh->b_data + dir->i_sb->s_blocksize;
+ de = (struct ext4_dir_entry_2 *)search_buf;
+ dlimit = search_buf + buf_size;
while ((char *) de < dlimit) {
/* this code is executed quadratically often */
/* do minimal checking `by hand' */
@@ -840,6 +842,16 @@ static inline int search_dirblock(struct buffer_head *bh,
return 0;
}

+static inline int search_dirblock(struct buffer_head *bh,
+ struct inode *dir,
+ const struct qstr *d_name,
+ unsigned int offset,
+ struct ext4_dir_entry_2 **res_dir)
+{
+ return search_dir(bh, bh->b_data, dir->i_sb->s_blocksize, dir,
+ d_name, offset, res_dir);
+}
+

/*
* ext4_find_entry()
--
1.7.0.4

2011-10-26 07:34:26

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 15/17] ext4: let empty_dir handle inline dir.

From: Tao Ma <[email protected]>

empty_dir is used when deleting a dir. So it should handle
inline dir properly.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/namei.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index fefd061..343af6c 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2235,6 +2235,50 @@ out_stop:
return err;
}

+static int empty_inline_dir(struct inode *dir)
+{
+ int err, inline_size;
+ struct ext4_iloc iloc;
+ void *inline_start;
+ unsigned int offset;
+ struct ext4_dir_entry_2 *de, *de1;
+ int ret = 0;
+
+ err = ext4_get_inode_loc(dir, &iloc);
+ if (err) {
+ EXT4_ERROR_INODE(dir, "error %d getting inode %lu block",
+ err, dir->i_ino);
+ return 1;
+ }
+
+ inline_start = ext4_get_inline_data_pos(dir, &iloc);
+ inline_size = ext4_get_max_inline_size(dir);
+
+ de = (struct ext4_dir_entry_2 *)inline_start;
+ de1 = ext4_next_entry(de, inline_size);
+ if (le32_to_cpu(de->inode) != dir->i_ino ||
+ !le32_to_cpu(de1->inode) ||
+ strcmp(".", de->name) ||
+ strcmp("..", de1->name)) {
+ ext4_warning(dir->i_sb,
+ "bad directory (dir #%lu) - no `.' or `..'",
+ dir->i_ino);
+ ret = 1;
+ goto out;
+ }
+
+ offset = ext4_rec_len_from_disk(de->rec_len, inline_size) +
+ ext4_rec_len_from_disk(de1->rec_len, inline_size);
+ de = ext4_next_entry(de1, inline_size);
+ /* We should only have 2 entries: . and ... */
+
+ if ((void *)de == inline_start + inline_size)
+ ret = 1;
+
+out:
+ brelse(iloc.bh);
+ return ret;
+}
/*
* routine to check that the specified directory is empty (for rmdir)
*/
@@ -2246,6 +2290,9 @@ static int empty_dir(struct inode *inode)
struct super_block *sb;
int err = 0;

+ if (ext4_has_inline_data(inode))
+ return empty_inline_dir(inode);
+
sb = inode->i_sb;
if (inode->i_size < EXT4_DIR_REC_LEN(1) + EXT4_DIR_REC_LEN(2) ||
!(bh = ext4_bread(NULL, inode, 0, 0, &err))) {
--
1.7.0.4

2011-10-26 07:34:24

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 13/17] ext4: let ext4_find_entry handle inline data.

From: Tao Ma <[email protected]>

Create a new ext4_find_inline_entry to handle the case
of inline data.

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/namei.c | 27 +++++++++++++++++++++++++++
1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2c388ad..55498a8 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -852,6 +852,29 @@ static inline int search_dirblock(struct buffer_head *bh,
d_name, offset, res_dir);
}

+static struct buffer_head *ext4_find_inline_entry(struct inode *dir,
+ const struct qstr *d_name,
+ struct ext4_dir_entry_2 **res_dir)
+{
+ int ret;
+ struct ext4_iloc iloc;
+ void *inline_start;
+ int inline_size;
+
+ if (ext4_get_inode_loc(dir, &iloc))
+ return NULL;
+
+ inline_start = ext4_get_inline_data_pos(dir, &iloc);
+ inline_size = ext4_get_max_inline_size(dir);
+
+ ret = search_dir(iloc.bh, inline_start, inline_size,
+ dir, d_name, 0, res_dir);
+ if (ret == 1)
+ return iloc.bh;
+
+ brelse(iloc.bh);
+ return NULL;
+}

/*
* ext4_find_entry()
@@ -887,6 +910,10 @@ static struct buffer_head * ext4_find_entry (struct inode *dir,
namelen = d_name->len;
if (namelen > EXT4_NAME_LEN)
return NULL;
+
+ if (ext4_has_inline_data(dir))
+ return ext4_find_inline_entry(dir, d_name, res_dir);
+
if ((namelen <= 2) && (name[0] == '.') &&
(name[1] == '.' || name[1] == '\0')) {
/*
--
1.7.0.4

2011-10-26 07:34:25

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 14/17] ext4: let ext4_delete_entry handle inline data.

From: Tao Ma <[email protected]>

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/namei.c | 99 +++++++++++++++++++++++++++++++++++++++++++-----------
1 files changed, 79 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 55498a8..fefd061 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1861,32 +1861,28 @@ cleanup:
}

/*
- * ext4_delete_entry deletes a directory entry by merging it with the
+ * __ext4_delete_entry deletes a directory entry by merging it with the
* previous entry
*/
-static int ext4_delete_entry(handle_t *handle,
- struct inode *dir,
- struct ext4_dir_entry_2 *de_del,
- struct buffer_head *bh)
+static int __ext4_delete_entry(handle_t *handle,
+ struct inode *dir,
+ struct ext4_dir_entry_2 *de_del,
+ struct buffer_head *bh,
+ void *entry_buf,
+ int buf_size)
{
struct ext4_dir_entry_2 *de, *pde;
unsigned int blocksize = dir->i_sb->s_blocksize;
- int i, err;
+ int i;

i = 0;
pde = NULL;
- de = (struct ext4_dir_entry_2 *) bh->b_data;
- while (i < bh->b_size) {
+ de = (struct ext4_dir_entry_2 *)entry_buf;
+ while (i < buf_size) {
if (ext4_check_dir_entry(dir, NULL, de, bh,
bh->b_data, bh->b_size, i))
return -EIO;
if (de == de_del) {
- BUFFER_TRACE(bh, "get_write_access");
- err = ext4_journal_get_write_access(handle, bh);
- if (unlikely(err)) {
- ext4_std_error(dir->i_sb, err);
- return err;
- }
if (pde)
pde->rec_len = ext4_rec_len_to_disk(
ext4_rec_len_from_disk(pde->rec_len,
@@ -1897,12 +1893,6 @@ static int ext4_delete_entry(handle_t *handle,
else
de->inode = 0;
dir->i_version++;
- BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, dir, bh);
- if (unlikely(err)) {
- ext4_std_error(dir->i_sb, err);
- return err;
- }
return 0;
}
i += ext4_rec_len_from_disk(de->rec_len, blocksize);
@@ -1912,6 +1902,75 @@ static int ext4_delete_entry(handle_t *handle,
return -ENOENT;
}

+static int ext4_delete_inline_entry(handle_t *handle,
+ struct inode *dir,
+ struct ext4_dir_entry_2 *de_del,
+ struct buffer_head *bh)
+{
+ int err, inline_size;
+ struct ext4_iloc iloc;
+ void *inline_start;
+
+ err = ext4_get_inode_loc(dir, &iloc);
+ if (err)
+ return err;
+
+ inline_start = ext4_get_inline_data_pos(dir, &iloc);
+ inline_size = ext4_get_max_inline_size(dir);
+
+ err = ext4_journal_get_write_access(handle, bh);
+
+ err = __ext4_delete_entry(handle, dir, de_del, bh,
+ inline_start, inline_size);
+ if (err)
+ goto out;
+
+ BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
+ err = ext4_mark_inode_dirty(handle, dir);
+ if (unlikely(err))
+ goto out;
+
+ ext4_show_inline_dir(dir, iloc.bh, inline_start, inline_size);
+ return 0;
+out:
+ brelse(iloc.bh);
+ if (err != -ENOENT)
+ ext4_std_error(dir->i_sb, err);
+ return err;
+}
+
+static int ext4_delete_entry(handle_t *handle,
+ struct inode *dir,
+ struct ext4_dir_entry_2 *de_del,
+ struct buffer_head *bh)
+{
+ int err;
+
+ if (ext4_has_inline_data(dir))
+ return ext4_delete_inline_entry(handle, dir, de_del, bh);
+
+ BUFFER_TRACE(bh, "get_write_access");
+ err = ext4_journal_get_write_access(handle, bh);
+ if (unlikely(err))
+ goto out;
+
+ err = __ext4_delete_entry(handle, dir, de_del, bh, bh->b_data,
+ dir->i_sb->s_blocksize);
+ if (err)
+ goto out;
+
+ BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
+ err = ext4_handle_dirty_metadata(handle, dir, bh);
+ if (unlikely(err))
+ goto out;
+
+ return 0;
+out:
+ if (err != -ENOENT)
+ ext4_std_error(dir->i_sb, err);
+ return err;
+}
+
/*
* DIR_NLINK feature is set if 1) nlinks > EXT4_LINK_MAX or 2) nlinks == 2,
* since this indicates that nlinks count was previously 1.
--
1.7.0.4

2011-10-26 07:41:21

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 11/17] ext4: Let ext4_readdir handle inline data.

From: Tao Ma <[email protected]>

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/dir.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 101 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 6c2199e..8b88cfa 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -27,6 +27,7 @@
#include <linux/slab.h>
#include <linux/rbtree.h>
#include "ext4.h"
+#include "xattr.h"

static unsigned char ext4_filetype_table[] = {
DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
@@ -64,6 +65,9 @@ static unsigned char get_dtype(struct super_block *sb, int filetype)
* Return 0 if the directory entry is OK, and 1 if there is a problem
*
* Note: this is the opposite of what ext2 and ext3 historically returned...
+ *
+ * bh passed here can be an inode block or a dir data block, depending
+ * on the inode inline data flag.
*/
int __ext4_check_dir_entry(const char *function, unsigned int line,
struct inode *dir, struct file *filp,
@@ -107,6 +111,100 @@ int __ext4_check_dir_entry(const char *function, unsigned int line,
return 1;
}

+static int ext4_read_inline_dir(struct file *filp,
+ void *dirent, filldir_t filldir)
+{
+ int error = 0;
+ unsigned int offset;
+ int i, stored;
+ struct ext4_dir_entry_2 *de;
+ struct super_block *sb;
+ struct inode *inode = filp->f_path.dentry->d_inode;
+ int ret, inline_size;
+ void *inline_pos;
+ struct ext4_iloc iloc;
+
+ ret = ext4_get_inode_loc(inode, &iloc);
+ if (ret)
+ return ret;
+
+ sb = inode->i_sb;
+ stored = 0;
+ offset = filp->f_pos & (sb->s_blocksize - 1);
+ inline_pos = ext4_get_inline_data_pos(inode, &iloc);
+ inline_size = ext4_get_max_inline_size(inode);
+
+ ext4_show_inline_dir(inode, iloc.bh, inline_pos, inline_size);
+
+ while (!error && !stored && filp->f_pos < inode->i_size) {
+revalidate:
+ /* If the version has changed since the last call to
+ * readdir(2), then we might be pointing to an invalid
+ * dirent right now. Scan from the start of the block
+ * to make sure. */
+ if (filp->f_version != inode->i_version) {
+ for (i = 0; i < inode->i_size && i < offset; ) {
+ de = (struct ext4_dir_entry_2 *)
+ (inline_pos + i);
+ /* It's too expensive to do a full
+ * dirent test each time round this
+ * loop, but we do have to test at
+ * least that it is non-zero. A
+ * failure will be detected in the
+ * dirent test below. */
+ if (ext4_rec_len_from_disk(de->rec_len,
+ inline_size) < EXT4_DIR_REC_LEN(1))
+ break;
+ i += ext4_rec_len_from_disk(de->rec_len,
+ inline_size);
+ }
+ offset = i;
+ filp->f_pos = offset;
+ filp->f_version = inode->i_version;
+ }
+
+ while (!error && filp->f_pos < inode->i_size
+ && offset < inline_size) {
+ de = (struct ext4_dir_entry_2 *) (inline_pos + offset);
+ if (ext4_check_dir_entry(inode, filp, de,
+ iloc.bh, inline_pos,
+ inline_size, offset)) {
+ ret = stored;
+ goto out;
+ }
+ offset += ext4_rec_len_from_disk(de->rec_len,
+ inline_size);
+ if (le32_to_cpu(de->inode)) {
+ /* We might block in the next section
+ * if the data destination is
+ * currently swapped out. So, use a
+ * version stamp to detect whether or
+ * not the directory has been modified
+ * during the copy operation.
+ */
+ u64 version = filp->f_version;
+
+ error = filldir(dirent, de->name,
+ de->name_len,
+ filp->f_pos,
+ le32_to_cpu(de->inode),
+ get_dtype(sb, de->file_type));
+ if (error)
+ break;
+ if (version != filp->f_version)
+ goto revalidate;
+ stored++;
+ }
+ filp->f_pos += ext4_rec_len_from_disk(de->rec_len,
+ inline_size);
+ }
+ offset = 0;
+ }
+out:
+ brelse(iloc.bh);
+ return ret;
+}
+
static int ext4_readdir(struct file *filp,
void *dirent, filldir_t filldir)
{
@@ -122,6 +220,9 @@ static int ext4_readdir(struct file *filp,

sb = inode->i_sb;

+ if (ext4_has_inline_data(inode))
+ return ext4_read_inline_dir(filp, dirent, filldir);
+
if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb,
EXT4_FEATURE_COMPAT_DIR_INDEX) &&
((ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) ||
--
1.7.0.4

2011-10-26 07:34:28

by Tao Ma

[permalink] [raw]

Subject: [PATCH V1 17/17] ext4: Enable ext4 inline support.

From: Tao Ma <[email protected]>

Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/ialloc.c | 4 ++++
2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4319c95..3e5644b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1377,6 +1377,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_INCOMPAT_FLEX_BG 0x0200
#define EXT4_FEATURE_INCOMPAT_EA_INODE 0x0400 /* EA in inode */
#define EXT4_FEATURE_INCOMPAT_DIRDATA 0x1000 /* data in dirent */
+#define EXT4_FEATURE_INCOMPAT_INLINEDATA 0x2000 /* data in inode */

#define EXT2_FEATURE_COMPAT_SUPP EXT4_FEATURE_COMPAT_EXT_ATTR
#define EXT2_FEATURE_INCOMPAT_SUPP (EXT4_FEATURE_INCOMPAT_FILETYPE| \
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 9c63f27..4e9cd2f 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1033,6 +1033,10 @@ got:

ei->i_extra_isize = EXT4_SB(sb)->s_want_extra_isize;

+ ei->i_inline_off = 0;
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_INLINEDATA))
+ ext4_set_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
+
ret = inode;
dquot_initialize(inode);
err = dquot_alloc_inode(inode);
--
1.7.0.4

2011-10-26 08:17:10

by Tao Ma

[permalink] [raw]

Subject: Re: [PATCH V1 00/17] ext4: Add inline data support.

On 10/26/2011 04:05 PM, Amir Goldstein wrote:
>
>
> On Wed, Oct 26, 2011 at 9:32 AM, Tao Ma <[email protected] <mailto:[email protected]>>
> wrote:
>
> Hi Ted, Andreas and list,
> This is my 1st attempt to add inline data support to ext4
> inode. For
> more information about the background, please refer to the thread
> http://marc.info/?l=linux-ext4&m=131715205428067&w=2
> <http://marc.info/?l=linux-ext4&m=131715205428067&w=2>
> When I sent out the RFC on Sep.27, Andreas suggested that we can use the
> space of xattr to put inline data. So this is the 1st version using that
> method. It should be easy to change if we decide to use other places in
> inode(e.g the unused extent space) since all the inline data
> manipulation function is wrapped with function like ext4_*_inline_data.
>
> Currently I use all the space between i_extra_isize and inode_size if
> inode_size = 256. For inode_size > 256, half of that space is used so as
> to leave some space for other xattrs.
>
> This is only a V1 and there are still something to do(e.g. I am thinking
> of using unused extent space), but I'd like to send it out earlier so
> that it can be reviewed ASAP.
>
> Some simple tests shows that with a linux-3.0 vanilla source, the new
> dir can save 1% disk space. For my "/usr", it can save about 3.2%
> spaces. I guess for volume with future bigalloc support, it should save
> more space for us for small dir. I also run some other tests and it
> seems the code is OK for a try. I haven't found a good test cases that
> can test the small file/dir(tens of bytes in my case) performance where
> inline data should have some good number.
>
> Any comments are welcomed.
>
> git diff --stat
>
> fs/ext4/dir.c | 117 ++++++++++-
> fs/ext4/ext4.h | 15 +-
> fs/ext4/ialloc.c | 4 +
> fs/ext4/inode.c | 583
> +++++++++++++++++++++++++++++++++++++++++++++++---
>
>
> Hi Tao,
>
> One generic comment.
> how about adding a new file fs/ext4/inline.c to host inline data support
> related code?
> inode.c is too big as it is and there is an effort the size it down.
sure, I will try to separate related functions to inline.c in my V2.

Thanks
Tao

2011-10-26 08:37:07

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH V1 02/17] ext4: Add the basic function for inline data support.

On 2011-10-26, at 1:34 AM, Tao Ma wrote:
> Implement inline data with xattr. This idea is inspired by Andreas.
> So now we use "system.data" to store xattr. For inode_size = 256,
> currently we uses all the space between i_extra_isize and inode_size.
> For inode_size > 256, we use half of that space.
>
> +#define EXT4_XATTR_SYSTEM_DATA_NAME "data"

Did you check whether the "data" string takes 4 bytes or 8 in the
xattr header? I believe it is storing the NUL termination (and
the code is using "sizeof(EXT4_XATTR_SYSTEM_DATA_NAME)" = 5), so
I believe it will round up to the next 4-byte boundary. Using a
3-byte name will save a bit of space, like "system.dat".

> +void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
> + void *buffer, loff_t pos, unsigned len)
> +{
> + header = IHDR(inode, ext4_raw_inode(iloc));
> + entry = (struct ext4_xattr_entry *)((void *)ext4_raw_inode(iloc) +
> + EXT4_I(inode)->i_inline_off);
> + memcpy((void *)IFIRST(header) + le16_to_cpu(entry->e_value_offs) + pos,
> + buffer + pos, len);
> +}
> +
> +int ext4_init_inline_data(handle_t *handle, struct inode *inode,
> + struct ext4_iloc *iloc)
> +{
> + size = ext4_get_max_inline_size(inode);
> + value = kzalloc(size, GFP_NOFS);
> + if (!value)
> + return -ENOMEM;
> +
> + error = ext4_xattr_ibody_set(handle, inode, &i, &is);
> +}

Since file data is changed very rarely, instead of consuming the full
xattr space that may not be needed, wouldn't it be better to change
ext4_write_inline_data() to just to the ext4_xattr_ibody_set() to save
the exact-sized buffer into the xattr? That will allow other xattrs
to be stored in this space as well as the inline data.

Cheers, Andreas

2011-10-26 14:38:25

by Tao Ma

[permalink] [raw]

Subject: Re: [PATCH V1 02/17] ext4: Add the basic function for inline data support.

On 10/26/2011 04:36 PM, Andreas Dilger wrote:
> On 2011-10-26, at 1:34 AM, Tao Ma wrote:
>> Implement inline data with xattr. This idea is inspired by Andreas.
>> So now we use "system.data" to store xattr. For inode_size = 256,
>> currently we uses all the space between i_extra_isize and inode_size.
>> For inode_size > 256, we use half of that space.
>>
>> +#define EXT4_XATTR_SYSTEM_DATA_NAME "data"
>
> Did you check whether the "data" string takes 4 bytes or 8 in the
> xattr header? I believe it is storing the NUL termination (and
> the code is using "sizeof(EXT4_XATTR_SYSTEM_DATA_NAME)" = 5), so
> I believe it will round up to the next 4-byte boundary. Using a
> 3-byte name will save a bit of space, like "system.dat".
Actually it will takes 4 bytes instead of 8, since in
ext4_xattr_set_entry it uses memcpy to copy the name with namelen
initialized with strlen("data"). So "data" is fine here. As for the
calculation of max_inline_size, yes, it uses
sizeof(EXT4_XATTR_SYSTEM_DATA_NAME) which should be changed to strlen.
Thanks for the advice.
>
>> +void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
>> + void *buffer, loff_t pos, unsigned len)
>> +{
>> + header = IHDR(inode, ext4_raw_inode(iloc));
>> + entry = (struct ext4_xattr_entry *)((void *)ext4_raw_inode(iloc) +
>> + EXT4_I(inode)->i_inline_off);
>> + memcpy((void *)IFIRST(header) + le16_to_cpu(entry->e_value_offs) + pos,
>> + buffer + pos, len);
>> +}
>> +
>> +int ext4_init_inline_data(handle_t *handle, struct inode *inode,
>> + struct ext4_iloc *iloc)
>> +{
>> + size = ext4_get_max_inline_size(inode);
>> + value = kzalloc(size, GFP_NOFS);
>> + if (!value)
>> + return -ENOMEM;
>> +
>> + error = ext4_xattr_ibody_set(handle, inode, &i, &is);
>> +}
>
> Since file data is changed very rarely, instead of consuming the full
> xattr space that may not be needed, wouldn't it be better to change
> ext4_write_inline_data() to just to the ext4_xattr_ibody_set() to save
> the exact-sized buffer into the xattr? That will allow other xattrs
> to be stored in this space as well as the inline data.
I am just worried about the cpu usage. You know, the xattr values in
ext4 has to be packed so if we change the content of an inline file
frequently(say append), the inline xattr value will be removed and added
frequently which should consume much cpu cycles. What's more, the other
xattr values has to be moved also if they are not aligned to the end of
the inode. I am not sure whether it is good for performance or not.
Another side effect is that we have to write the whole inline data every
time as a new xattr value replace every time while the current solution
just needs to memcpy the appended bytes.

Thanks
Tao

2011-10-26 22:28:54

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH V1 02/17] ext4: Add the basic function for inline data support.

On 2011-10-26, at 8:38 AM, Tao Ma wrote:
> On 10/26/2011 04:36 PM, Andreas Dilger wrote:
>> On 2011-10-26, at 1:34 AM, Tao Ma wrote:
>>> +void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
>>> + void *buffer, loff_t pos, unsigned len)
>>> +{
>>> + header = IHDR(inode, ext4_raw_inode(iloc));
>>> + entry = (struct ext4_xattr_entry *)((void *)ext4_raw_inode(iloc) +
>>> + EXT4_I(inode)->i_inline_off);
>>> + memcpy((void *)IFIRST(header) + le16_to_cpu(entry->e_value_offs) + pos,
>>> + buffer + pos, len);
>>> +}
>>> +
>>> +int ext4_init_inline_data(handle_t *handle, struct inode *inode,
>>> + struct ext4_iloc *iloc)
>>> +{
>>> + size = ext4_get_max_inline_size(inode);
>>> + value = kzalloc(size, GFP_NOFS);
>>> + if (!value)
>>> + return -ENOMEM;
>>> +
>>> + error = ext4_xattr_ibody_set(handle, inode, &i, &is);
>>> +}
>>
>> Since file data is changed very rarely, instead of consuming the full
>> xattr space that may not be needed, wouldn't it be better to change
>> ext4_write_inline_data() to just to the ext4_xattr_ibody_set() to save
>> the exact-sized buffer into the xattr? That will allow other xattrs
>> to be stored in this space as well as the inline data.
>
> I am just worried about the cpu usage. You know, the xattr values in
> ext4 has to be packed so if we change the content of an inline file
> frequently(say append), the inline xattr value will be removed and added
> frequently which should consume much cpu cycles. What's more, the other
> xattr values has to be moved also if they are not aligned to the end of
> the inode. I am not sure whether it is good for performance or not.

I'd also guess it isn't the most CPU efficient mechanism, but the main
question is whether this extra CPU usage is even noticeable compared
to the IO time? Even with the added CPU usage, there is a dramatic
reduction in the IO (no external block to write), so it would always
be a net win to do it that way.

> Another side effect is that we have to write the whole inline data every
> time as a new xattr value replace every time while the current solution
> just needs to memcpy the appended bytes.

What about only storing a new xattr if the file size is increasing, or
when it is truncated to zero? If the write is <= existing xattr size
then it can use the same mechanism as today (inline overwrite of the
xattr buffer, and update of the xattr checksum). That avoids overhead
for the case of repeatedly writing a small same-size value into the file.
If some application is appending 1 byte at a time to a file, I think
the CPU overhead in the xattr code is the least of their worries.

The main reason I don't like to consume all of the xattr space right
away is that this will cause OTHER xattrs to immediately be pushed
into the external xattr block (e.g. selinux, security, etc) and then
we will be even worse off than before (file data in inode, xattr in
external block, and added complexity for no benefit).

Cheers, Andreas

2011-10-27 00:51:44

by Tao Ma

[permalink] [raw]

Subject: Re: [PATCH V1 02/17] ext4: Add the basic function for inline data support.

On 10/27/2011 06:28 AM, Andreas Dilger wrote:
> On 2011-10-26, at 8:38 AM, Tao Ma wrote:
>> On 10/26/2011 04:36 PM, Andreas Dilger wrote:
>>> On 2011-10-26, at 1:34 AM, Tao Ma wrote:
>>>> +void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
>>>> + void *buffer, loff_t pos, unsigned len)
>>>> +{
>>>> + header = IHDR(inode, ext4_raw_inode(iloc));
>>>> + entry = (struct ext4_xattr_entry *)((void *)ext4_raw_inode(iloc) +
>>>> + EXT4_I(inode)->i_inline_off);
>>>> + memcpy((void *)IFIRST(header) + le16_to_cpu(entry->e_value_offs) + pos,
>>>> + buffer + pos, len);
>>>> +}
>>>> +
>>>> +int ext4_init_inline_data(handle_t *handle, struct inode *inode,
>>>> + struct ext4_iloc *iloc)
>>>> +{
>>>> + size = ext4_get_max_inline_size(inode);
>>>> + value = kzalloc(size, GFP_NOFS);
>>>> + if (!value)
>>>> + return -ENOMEM;
>>>> +
>>>> + error = ext4_xattr_ibody_set(handle, inode, &i, &is);
>>>> +}
>>>
>>> Since file data is changed very rarely, instead of consuming the full
>>> xattr space that may not be needed, wouldn't it be better to change
>>> ext4_write_inline_data() to just to the ext4_xattr_ibody_set() to save
>>> the exact-sized buffer into the xattr? That will allow other xattrs
>>> to be stored in this space as well as the inline data.
>>
>> I am just worried about the cpu usage. You know, the xattr values in
>> ext4 has to be packed so if we change the content of an inline file
>> frequently(say append), the inline xattr value will be removed and added
>> frequently which should consume much cpu cycles. What's more, the other
>> xattr values has to be moved also if they are not aligned to the end of
>> the inode. I am not sure whether it is good for performance or not.
>
> I'd also guess it isn't the most CPU efficient mechanism, but the main
> question is whether this extra CPU usage is even noticeable compared
> to the IO time? Even with the added CPU usage, there is a dramatic
> reduction in the IO (no external block to write), so it would always
> be a net win to do it that way.
It seems so. anyway, I will do some tests for file appending to see how
much these 2 methods differs.
>
>> Another side effect is that we have to write the whole inline data every
>> time as a new xattr value replace every time while the current solution
>> just needs to memcpy the appended bytes.
>
> What about only storing a new xattr if the file size is increasing, or
> when it is truncated to zero? If the write is <= existing xattr size
> then it can use the same mechanism as today (inline overwrite of the
> xattr buffer, and update of the xattr checksum). That avoids overhead
> for the case of repeatedly writing a small same-size value into the file.
> If some application is appending 1 byte at a time to a file, I think
> the CPU overhead in the xattr code is the least of their worries.
>
> The main reason I don't like to consume all of the xattr space right
> away is that this will cause OTHER xattrs to immediately be pushed
> into the external xattr block (e.g. selinux, security, etc) and then
> we will be even worse off than before (file data in inode, xattr in
> external block, and added complexity for no benefit).
To be honest, with inode size = 256, we don't have much spaces left in
the inode. With current i_extra_isize 28, we have only 92 bytes left for
xattrs(4 bytes for the xattr header magic and 4 bytes for the gap
between ext4_xattr_entry and the value, 256 - 128 - 28 - 4 - 4). So
considering one ext4_xattr_entry have 16 bytes and with the miminum
namelen of 4, if we support 2 entries(one for inline data and one for a
real xattr), these will take 40 bytes. And only 52 bytes are left. I
don't think these bytes are enough for 2 xattr values. ;) So why not
take all of them(72 bytes)? As for inode size > 256, the inline data
will only takes half of the spaces left and leaves the space for other
xattrs. Does it make sense?

btw, I have no idea of what a normal acl xattr takes, but if it takes
more than 10 bytes, it will almost make the inline dir almost no use,
since we have to store dot and dotdot first and then the real file
names. Too small space isn't good but adds overhead of converting from
inline to external block.

Thanks
Tao

2011-10-27 00:51:49

by Tao Ma

[permalink] [raw]

Subject: Re: [PATCH V1 02/17] ext4: Add the basic function for inline data support.

2011-10-27 07:10:22

by Valdis Klētnieks

[permalink] [raw]

Subject: Re: [PATCH V1 00/17] ext4: Add inline data support.

On Wed, 26 Oct 2011 15:32:24 +0800, Tao Ma said:

> Currently I use all the space between i_extra_isize and inode_size if
> inode_size = 256. For inode_size > 256, half of that space is used so as
> to leave some space for other xattrs.

I didn't check the code too closely - does this code DTRT if the user tries to
then attach a moby-sized xattr (or set of xattrs - if it's got a security.selinux
tag on it, and a security.capabilities xattr, and a user xattr or two, things are
going to be getting full).

> This is only a V1 and there are still something to do(e.g. I am thinking
> of using unused extent space), but I'd like to send it out earlier so
> that it can be reviewed ASAP.

If this works out, would it make sense to investigate doing this for all
tails in a V2? So if your file was 4099 bytes long, you could save allocating
a second block. Assuming random distribution of tail sizes, this wil save
an average of (space avail for tail)/(blocksize) per file.

Attachments:

(No filename) (227.00 B)

2011-10-27 13:55:08

by Tao Ma

[permalink] [raw]

Subject: Re: [PATCH V1 00/17] ext4: Add inline data support.

On 10/27/2011 03:10 PM, [email protected] wrote:
> On Wed, 26 Oct 2011 15:32:24 +0800, Tao Ma said:
>
>> Currently I use all the space between i_extra_isize and inode_size if
>> inode_size = 256. For inode_size > 256, half of that space is used so as
>> to leave some space for other xattrs.
>
> I didn't check the code too closely - does this code DTRT if the user tries to
> then attach a moby-sized xattr (or set of xattrs - if it's got a security.selinux
> tag on it, and a security.capabilities xattr, and a user xattr or two, things are
> going to be getting full).
sure, it will work and all these stuff will be inserted to the external
xattr block since the in-inode space is full.
>
>> This is only a V1 and there are still something to do(e.g. I am thinking
>> of using unused extent space), but I'd like to send it out earlier so
>> that it can be reviewed ASAP.
>
> If this works out, would it make sense to investigate doing this for all
> tails in a V2? So if your file was 4099 bytes long, you could save allocating
> a second block. Assuming random distribution of tail sizes, this wil save
> an average of (space avail for tail)/(blocksize) per file.
uh, it seems like a good suggestion, but it will make the code a little
bit complicated(at least compared to the current version) and I am not
sure whether the maintainer like it or not ;) . So Ted and Andreas, What
do you think of this?

Thanks
Tao

2011-10-27 14:29:51

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH V1 02/17] ext4: Add the basic function for inline data support.

On 2011-10-26, at 6:51 PM, Tao Ma <[email protected]> wrote:
> On 10/27/2011 06:28 AM, Andreas Dilger wrote:
>> On 2011-10-26, at 8:38 AM, Tao Ma wrote:
>>> On 10/26/2011 04:36 PM, Andreas Dilger wrote:
>>>> Since file data is changed very rarely, instead of consuming the full
>>>> xattr space that may not be needed, wouldn't it be better to change
>>>> ext4_write_inline_data() to just to the ext4_xattr_ibody_set() to save
>>>> the exact-sized buffer into the xattr? That will allow other xattrs
>>>> to be stored in this space as well as the inline data.
>>>
>>> I am just worried about the cpu usage. You know, the xattr values in
>>> ext4 has to be packed so if we change the content of an inline file
>>> frequently (say append), the inline xattr value will be removed and added

Given the small size of the space, it seems unlikely that apps would be growing the size of the file many times before it overflowed the inode xattr space, so I don't think this is a valid concern. I think such small files will normally be written once at the proper size, or if they are written repeatedly the offset will not change.

>>> frequently which should consume much cpu cycles. What's more, the other
>>> xattr values has to be moved also if they are not aligned to the end of
>>> the inode. I am not sure whether it is good for performance or not.
>>
>> I'd also guess it isn't the most CPU efficient mechanism, but the main
>> question is whether this extra CPU usage is even noticeable compared
>> to the IO time? Even with the added CPU usage, there is a dramatic
>> reduction in the IO (no external block to write), so it would always
>> be a net win to do it that way.

> It seems so. anyway, I will do some tests for file appending to see how
> much these 2 methods differs.

Great.

>>> Another side effect is that we have to write the whole inline data every
>>> time as a new xattr value replace every time while the current solution
>>> just needs to memcpy the appended bytes.
>>
>> What about only storing a new xattr if the file size is increasing, or
>> when it is truncated to zero? If the write is <= existing xattr size
>> then it can use the same mechanism as today (inline overwrite of the
>> xattr buffer, and update of the xattr checksum). That avoids overhead
>> for the case of repeatedly writing a small same-size value into the file.
>> If some application is appending 1 byte at a time to a file, I think
>> the CPU overhead in the xattr code is the least of their worries.
>>
>> The main reason I don't like to consume all of the xattr space right
>> away is that this will cause OTHER xattrs to immediately be pushed
>> into the external xattr block (e.g. selinux, security, etc) and then
>> we will be even worse off than before (file data in inode, xattr in
>> external block, and added complexity for no benefit).

> To be honest, with inode size = 256, we don't have much spaces left in
> the inode. With current i_extra_isize 28, we have only 92 bytes left for
> xattrs(4 bytes for the xattr header magic and 4 bytes for the gap
> between ext4_xattr_entry and the value, 256 - 128 - 28 - 4 - 4). So
> considering one ext4_xattr_entry have 16 bytes and with the miminum
> namelen of 4, if we support 2 entries(one for inline data and one for a
> real xattr), these will take 40 bytes. And only 52 bytes are left. I
> don't think these bytes are enough for 2 xattr values. ;)

This is enough for an empty dir (24 bytes) and another 28 bytes for a security xattr.

> So why not
> take all of them(72 bytes)? As for inode size > 256, the inline data
> will only takes half of the spaces left and leaves the space for other
> xattrs. Does it make sense?

No, because if ANY other xattr exists it will be pushed to an external block, and then if this data xattr grows (much more likely than the other xattr changing) it won't fit into the inode, and now performance is permanently worse than before.

> btw, I have no idea of what a normal acl xattr takes, but if it takes
> more than 10 bytes, it will almost make the inline dir almost no use,
> since we have to store dot and dotdot first and then the real file
> names. Too small space isn't good but adds overhead of converting from
> inline to external block.

In our environment we use at least 512-byte inodes on the metadata server, but I still don't want half if that space wasted on this xattr if so much is not needed.

Cheers, Andreas

2011-10-27 14:53:59

by Tao Ma

[permalink] [raw]

Subject: Re: [PATCH V1 02/17] ext4: Add the basic function for inline data support.

On 10/27/2011 05:57 PM, Andreas Dilger wrote:
> On 2011-10-26, at 6:51 PM, Tao Ma <[email protected]> wrote:
>> On 10/27/2011 06:28 AM, Andreas Dilger wrote:
>>> On 2011-10-26, at 8:38 AM, Tao Ma wrote:
>>>> On 10/26/2011 04:36 PM, Andreas Dilger wrote:
>>>>> Since file data is changed very rarely, instead of consuming the full
>>>>> xattr space that may not be needed, wouldn't it be better to change
>>>>> ext4_write_inline_data() to just to the ext4_xattr_ibody_set() to save
>>>>> the exact-sized buffer into the xattr? That will allow other xattrs
>>>>> to be stored in this space as well as the inline data.
>>>>
>>>> I am just worried about the cpu usage. You know, the xattr values in
>>>> ext4 has to be packed so if we change the content of an inline file
>>>> frequently (say append), the inline xattr value will be removed and added
>
> Given the small size of the space, it seems unlikely that apps would be growing the size of the file many times before it overflowed the inode xattr space, so I don't think this is a valid concern. I think such small files will normally be written once at the proper size, or if they are written repeatedly the offset will not change.
OK.
>
>>>> frequently which should consume much cpu cycles. What's more, the other
>>>> xattr values has to be moved also if they are not aligned to the end of
>>>> the inode. I am not sure whether it is good for performance or not.
>>>
>>> I'd also guess it isn't the most CPU efficient mechanism, but the main
>>> question is whether this extra CPU usage is even noticeable compared
>>> to the IO time? Even with the added CPU usage, there is a dramatic
>>> reduction in the IO (no external block to write), so it would always
>>> be a net win to do it that way.
>
>> It seems so. anyway, I will do some tests for file appending to see how
>> much these 2 methods differs.
>
> Great.
>
>>>> Another side effect is that we have to write the whole inline data every
>>>> time as a new xattr value replace every time while the current solution
>>>> just needs to memcpy the appended bytes.
>>>
>>> What about only storing a new xattr if the file size is increasing, or
>>> when it is truncated to zero? If the write is <= existing xattr size
>>> then it can use the same mechanism as today (inline overwrite of the
>>> xattr buffer, and update of the xattr checksum). That avoids overhead
>>> for the case of repeatedly writing a small same-size value into the file.
>>> If some application is appending 1 byte at a time to a file, I think
>>> the CPU overhead in the xattr code is the least of their worries.
>>>
>>> The main reason I don't like to consume all of the xattr space right
>>> away is that this will cause OTHER xattrs to immediately be pushed
>>> into the external xattr block (e.g. selinux, security, etc) and then
>>> we will be even worse off than before (file data in inode, xattr in
>>> external block, and added complexity for no benefit).
>
>> To be honest, with inode size = 256, we don't have much spaces left in
>> the inode. With current i_extra_isize 28, we have only 92 bytes left for
>> xattrs(4 bytes for the xattr header magic and 4 bytes for the gap
>> between ext4_xattr_entry and the value, 256 - 128 - 28 - 4 - 4). So
>> considering one ext4_xattr_entry have 16 bytes and with the miminum
>> namelen of 4, if we support 2 entries(one for inline data and one for a
>> real xattr), these will take 40 bytes. And only 52 bytes are left. I
>> don't think these bytes are enough for 2 xattr values. ;)
>
> This is enough for an empty dir (24 bytes) and another 28 bytes for a security xattr.
>
>> So why not
>> take all of them(72 bytes)? As for inode size > 256, the inline data
>> will only takes half of the spaces left and leaves the space for other
>> xattrs. Does it make sense?
>
> No, because if ANY other xattr exists it will be pushed to an external block, and then if this data xattr grows (much more likely than the other xattr changing) it won't fit into the inode, and now performance is permanently worse than before.
OK, since it seems that lustre uses xattr heavily, I will try my best to
avoid the performance regression for xattr operations.
>
>> btw, I have no idea of what a normal acl xattr takes, but if it takes
>> more than 10 bytes, it will almost make the inline dir almost no use,
>> since we have to store dot and dotdot first and then the real file
>> names. Too small space isn't good but adds overhead of converting from
>> inline to external block.
>
> In our environment we use at least 512-byte inodes on the metadata server, but I still don't want half if that space wasted on this xattr if so much is not needed.
Thanks for the info.

btw, I have another idea about using the not-used extent space for
storing inline data like what we do for a symlink. So I will still use a
xattr entry to indicate whether the inode will have inline data or not.
If yes, the initialized xattr value len will be zero while the extent
space(60 bytes) will be used to store the inline data. And if the file
size is larger than 60, it will begin to insert xattr values. In such
case, we supports inline data and don't use too much space after the
i_extra_isize. What do you think of it?

Thanks
Tao

2011-11-02 21:15:54

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH V1 02/17] ext4: Add the basic function for inline data support.

On 2011-10-27, at 8:53 AM, Tao Ma wrote:
> On 10/27/2011 05:57 PM, Andreas Dilger wrote:
>> if ANY other xattr exists it will be pushed to an external block, and
>> then if this data xattr grows (much more likely than the other xattr
>> changing) it won't fit into the inode, and now performance is
>> permanently worse than before.
>
> OK, since it seems that lustre uses xattr heavily, I will try my best to
> avoid the performance regression for xattr operations.

I don't even think it is very much a Lustre problem, since it always
stores file data in a separate filesystem from the metadata, and
60-byte files are going to have terrible performance either way.

My main concern is for SELinux (enabled by default on most systems
today). If the "small file" data is stored in the xattr space, and
this pushes the SELinux xattr to an external block, we have added
code complexity and a gratuitous format change (data in inode and
metadata in block, instead of metadata in inode and data in block)
with no real benefit at all.

>> In our environment we use at least 512-byte inodes on the metadata server, but I still don't want half if that space wasted on this xattr if so much is not needed.
>
> btw, I have another idea about using the not-used extent space for
> storing inline data like what we do for a symlink. So I will still use a
> xattr entry to indicate whether the inode will have inline data or not.
> If yes, the initialized xattr value len will be zero while the extent
> space(60 bytes) will be used to store the inline data. And if the file
> size is larger than 60, it will begin to insert xattr values. In such
> case, we supports inline data and don't use too much space after the
> i_extra_isize. What do you think of it?

I think this is an interesting idea. Since only the "data" xattr could
use this space, it gives us an extra 60 bytes of space to be used in
the inode and does not consume the xattr space. The main drawback is
that this would add special case handling based on the xattr name, but
I think it is worthwhile to investigate how complex that code is and
what kind of performance improvement it gives.

Looking at my FC13 installation, it seems like a large number of
files could benefit from just 60 bytes of inline storage. So more
than 10% of all of the files on the filesystem would fit in i_blocks.
This filesystem includes a lot of source and build files, but I also
think this is pretty typical of normal Linux usage.

# find / -xdev -type f -size -61c | wc -l
35661
# find / -xdev -type f | wc -l
335515

The "fsstats" tool is useful for collecting interesting data like this:
(http://www.pdsi-scidac.org/fsstats/files/fsstats-1.4.5.tar.gz)
and it shows the same is true for directories as well:

directory size (entries): Range of entries, count of directories in range,
number of dirs in range as a % of total num of dirs, number of dirs in
this range or smaller as a % total number of dirs, total entries in range,
number of entries in range as a % of total number of entries, number of
entries in this range or smaller as a % of total number of entries.

count=33476 avg=11.38 ents
min=0.00 ents max=3848.00 ents
[ 0- 1 ents]: 11257 (33.63%) ( 33.63%) 9968.00 ents ( 2.62%) ( 2.62%)
[ 2- 3 ents]: 7080 (21.15%) ( 54.78%) 16608.00 ents ( 4.36%) ( 6.97%)
[ 4- 7 ents]: 5793 (17.30%) ( 72.08%) 30674.00 ents ( 8.05%) ( 15.02%)
[ 8- 15 ents]: 3971 (11.86%) ( 83.94%) 43315.00 ents (11.37%) ( 26.39%)
[ 16- 31 ents]: 2731 ( 8.16%) ( 92.10%) 59612.00 ents (15.64%) ( 42.04%)
[ 32- 63 ents]: 1610 ( 4.81%) ( 96.91%) 69326.00 ents (18.19%) ( 60.23%)
[ 64- 127 ents]: 705 ( 2.11%) ( 99.02%) 61633.00 ents (16.17%) ( 76.40%)
[ 128- 255 ents]: 236 ( 0.70%) ( 99.72%) 40005.00 ents (10.50%) ( 86.90%)
[ 256- 511 ents]: 66 ( 0.20%) ( 99.92%) 21923.00 ents ( 5.75%) ( 92.66%)
[ 512-1023 ents]: 19 ( 0.06%) ( 99.98%) 14249.00 ents ( 3.74%) ( 96.40%)
[1024-2047 ents]: 6 ( 0.02%) ( 99.99%) 7756.00 ents ( 2.04%) ( 98.43%)
[2048-4095 ents]: 2 ( 0.01%) (100.00%) 5979.00 ents ( 1.57%) (100.00%)

A simple test of the performance gains might be running "file" on
everything in /etc and /usr, and measuring this with blktrace to
see what kind of seek reduction is seen from not doing seeks to
read the small files from an external block.

I think it is still useful to try to store the data in the large inode
xattr space if it is larger than i_blocks, especially for larger inodes,
but if there is not enough space for all the xattrs to fit into the
inode, I think "data" should be the first one to be pushed out of the
inode since that changes the format back to a normal ext* file.

We might also consider a reiserfs-like approach where multiple small
files could be packed into the same shared xattr block, but then the
xattr name would need to change from "data" to e.g. "inode.generation"
so that it can be located within the block shared between inodes.

Tail packing is more complex, so such a change would only make sense
if real-world testing showed a benefit. There is already the concept
of shared external xattr blocks, so maybe it isn't too bad. Together
with bigalloc, it might make sense to be able to pack many small files
into one cluster if there is a binomial distribution of file sizes?

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.

2011-11-03 04:23:29

by Tao Ma

[permalink] [raw]

Subject: Re: [PATCH V1 02/17] ext4: Add the basic function for inline data support.

On 11/03/2011 05:16 AM, Andreas Dilger wrote:
> On 2011-10-27, at 8:53 AM, Tao Ma wrote:
>> On 10/27/2011 05:57 PM, Andreas Dilger wrote:
>>> if ANY other xattr exists it will be pushed to an external block, and
>>> then if this data xattr grows (much more likely than the other xattr
>>> changing) it won't fit into the inode, and now performance is
>>> permanently worse than before.
>>
>> OK, since it seems that lustre uses xattr heavily, I will try my best to
>> avoid the performance regression for xattr operations.
>
> I don't even think it is very much a Lustre problem, since it always
> stores file data in a separate filesystem from the metadata, and
> 60-byte files are going to have terrible performance either way.
>
> My main concern is for SELinux (enabled by default on most systems
> today). If the "small file" data is stored in the xattr space, and
> this pushes the SELinux xattr to an external block, we have added
> code complexity and a gratuitous format change (data in inode and
> metadata in block, instead of metadata in inode and data in block)
> with no real benefit at all.
Fair enough. Thanks for the explanation.
>
>>> In our environment we use at least 512-byte inodes on the metadata server, but I still don't want half if that space wasted on this xattr if so much is not needed.
>>
>> btw, I have another idea about using the not-used extent space for
>> storing inline data like what we do for a symlink. So I will still use a
>> xattr entry to indicate whether the inode will have inline data or not.
>> If yes, the initialized xattr value len will be zero while the extent
>> space(60 bytes) will be used to store the inline data. And if the file
>> size is larger than 60, it will begin to insert xattr values. In such
>> case, we supports inline data and don't use too much space after the
>> i_extra_isize. What do you think of it?
>
> I think this is an interesting idea. Since only the "data" xattr could
> use this space, it gives us an extra 60 bytes of space to be used in
> the inode and does not consume the xattr space. The main drawback is
> that this would add special case handling based on the xattr name, but
> I think it is worthwhile to investigate how complex that code is and
> what kind of performance improvement it gives.
>
> Looking at my FC13 installation, it seems like a large number of
> files could benefit from just 60 bytes of inline storage. So more
> than 10% of all of the files on the filesystem would fit in i_blocks.
> This filesystem includes a lot of source and build files, but I also
> think this is pretty typical of normal Linux usage.
>
> # find / -xdev -type f -size -61c | wc -l
> 35661
> # find / -xdev -type f | wc -l
> 335515
>
> The "fsstats" tool is useful for collecting interesting data like this:
> (http://www.pdsi-scidac.org/fsstats/files/fsstats-1.4.5.tar.gz)
> and it shows the same is true for directories as well:
>
>
> directory size (entries): Range of entries, count of directories in range,
> number of dirs in range as a % of total num of dirs, number of dirs in
> this range or smaller as a % total number of dirs, total entries in range,
> number of entries in range as a % of total number of entries, number of
> entries in this range or smaller as a % of total number of entries.
>
> count=33476 avg=11.38 ents
> min=0.00 ents max=3848.00 ents
> [ 0- 1 ents]: 11257 (33.63%) ( 33.63%) 9968.00 ents ( 2.62%) ( 2.62%)
> [ 2- 3 ents]: 7080 (21.15%) ( 54.78%) 16608.00 ents ( 4.36%) ( 6.97%)
> [ 4- 7 ents]: 5793 (17.30%) ( 72.08%) 30674.00 ents ( 8.05%) ( 15.02%)
> [ 8- 15 ents]: 3971 (11.86%) ( 83.94%) 43315.00 ents (11.37%) ( 26.39%)
> [ 16- 31 ents]: 2731 ( 8.16%) ( 92.10%) 59612.00 ents (15.64%) ( 42.04%)
> [ 32- 63 ents]: 1610 ( 4.81%) ( 96.91%) 69326.00 ents (18.19%) ( 60.23%)
> [ 64- 127 ents]: 705 ( 2.11%) ( 99.02%) 61633.00 ents (16.17%) ( 76.40%)
> [ 128- 255 ents]: 236 ( 0.70%) ( 99.72%) 40005.00 ents (10.50%) ( 86.90%)
> [ 256- 511 ents]: 66 ( 0.20%) ( 99.92%) 21923.00 ents ( 5.75%) ( 92.66%)
> [ 512-1023 ents]: 19 ( 0.06%) ( 99.98%) 14249.00 ents ( 3.74%) ( 96.40%)
> [1024-2047 ents]: 6 ( 0.02%) ( 99.99%) 7756.00 ents ( 2.04%) ( 98.43%)
> [2048-4095 ents]: 2 ( 0.01%) (100.00%) 5979.00 ents ( 1.57%) (100.00%)
>
> A simple test of the performance gains might be running "file" on
> everything in /etc and /usr, and measuring this with blktrace to
> see what kind of seek reduction is seen from not doing seeks to
> read the small files from an external block.
Thanks for the test tools, and yes, I will try to test it when I finish
my V2.
> I think it is still useful to try to store the data in the large inode
> xattr space if it is larger than i_blocks, especially for larger inodes,
> but if there is not enough space for all the xattrs to fit into the
> inode, I think "data" should be the first one to be pushed out of the
> inode since that changes the format back to a normal ext* file.
Oh, that does mean that we need to change the normal way we handling
xattr set. I am not sure of it. Maybe we need an option in sysfs that
can be tuned so that the user can tell us whether his inline file
content is more important or the xattr. :)
>
>
> We might also consider a reiserfs-like approach where multiple small
> files could be packed into the same shared xattr block, but then the
> xattr name would need to change from "data" to e.g. "inode.generation"
> so that it can be located within the block shared between inodes.
I'd like to defer it to a later version since it means that we still
need to do 2 I/Os for a small file, the same as a ext* file. Having said
that, it does be helpful for a sequence of small file read(file
iteration or tar) if the underlying block device do a good readahead
job. And this should help in the bigalloc case I mentioned in another
thread about changing extent lengh to cluster(the case is when we tar a
kernel source and do 'sync', it is very time-consuming for a bigalloc
volume).
>
> Tail packing is more complex, so such a change would only make sense
> if real-world testing showed a benefit. There is already the concept
> of shared external xattr blocks, so maybe it isn't too bad. Together
> with bigalloc, it might make sense to be able to pack many small files
> into one cluster if there is a binomial distribution of file sizes?
It is a bit complicate than the current code base. So let me first
implement all the suggestions above and then try to investigate whether
it really helps.

Thanks
Tao