2020-04-15 11:46:20

by Josh Triplett

[permalink] [raw]
Subject: Inline data with 128-byte inodes?

Is there a fundamental reason that ext4 *can't* or *shouldn't* support
inline data with 128-byte inodes?

As far as I can tell, the kernel ext4 implementation only allows inline
data with 256-byte or larger inodes, because it requires the system.data
xattr to exist, even if the actual data requires 60 bytes or less. (The
implementation in debugfs, on the other hand, handles inline data in
128-byte inodes just fine. And it seems like it'd be fairly
straightforward to change the kernel implementation to support it as
well.)

For filesystems that don't need to store xattrs in general, and can live
with the other limitations of 128-byte inodes, using a 128-byte inode
can save substantial space compared to a 256-byte inode (many megabytes
worth of inode tables, versus 4k for each file between 61-160 bytes),
and many small files or small directories would still fit in 60 bytes.


2020-04-22 16:01:17

by Jan Kara

[permalink] [raw]
Subject: Re: Inline data with 128-byte inodes?

On Tue 14-04-20 00:02:07, Josh Triplett wrote:
> Is there a fundamental reason that ext4 *can't* or *shouldn't* support
> inline data with 128-byte inodes?

Well, where would we put it on disk? ext4 on-disk inode fills 128-bytes
with 'osd2' union...

Or do you mean we should put inline data in an external xattr block?

Honza

> As far as I can tell, the kernel ext4 implementation only allows inline
> data with 256-byte or larger inodes, because it requires the system.data
> xattr to exist, even if the actual data requires 60 bytes or less. (The
> implementation in debugfs, on the other hand, handles inline data in
> 128-byte inodes just fine. And it seems like it'd be fairly
> straightforward to change the kernel implementation to support it as
> well.)
>
> For filesystems that don't need to store xattrs in general, and can live
> with the other limitations of 128-byte inodes, using a 128-byte inode
> can save substantial space compared to a 256-byte inode (many megabytes
> worth of inode tables, versus 4k for each file between 61-160 bytes),
> and many small files or small directories would still fit in 60 bytes.
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-04-22 20:16:09

by Andreas Dilger

[permalink] [raw]
Subject: Re: Inline data with 128-byte inodes?

On Apr 22, 2020, at 10:00 AM, Jan Kara <[email protected]> wrote:
>
> On Tue 14-04-20 00:02:07, Josh Triplett wrote:
>> Is there a fundamental reason that ext4 *can't* or *shouldn't* support
>> inline data with 128-byte inodes?
>
> Well, where would we put it on disk? ext4 on-disk inode fills 128-bytes
> with 'osd2' union...

There are 60 bytes in the "i_block" field that can be used by inline_data.

> Or do you mean we should put inline data in an external xattr block?

Using an 4KB xattr block would IMHO be worse than just using a regular
file block for such files, if they don't fit into the 60 i_block bytes.
That makes the data handling more complex (data copies each time in/out
of the xattr) and has performance impact (all writes essentially data
journal because they go via the setxattr code path.

The only time it _might_ be useful is if there are other xattrs that are
shared in the external block with inline_data. However, at that point
I think you are better off to just create larger inodes to hold the
xattrs to avoid the seeking needed to load the external block...

Given the prevalence of xattrs today (SELinux springs to mind), I'd be
surprised whether this combination shows any improvement in real life,
but I don't have an _objection_ to allowing this combination (e.g. for
ultra-compact /etc or boot filesystem images.

Maybe there is a bigger win for small directories avoiding 4KB leaf blocks?

That said, I'd be happy to see some numbers to show this is a win, and
I'm definitely not _against_ allowing this to work if there is a use for it.

Cheers, Andreas

>> As far as I can tell, the kernel ext4 implementation only allows inline
>> data with 256-byte or larger inodes, because it requires the system.data
>> xattr to exist, even if the actual data requires 60 bytes or less. (The
>> implementation in debugfs, on the other hand, handles inline data in
>> 128-byte inodes just fine. And it seems like it'd be fairly
>> straightforward to change the kernel implementation to support it as
>> well.)
>>
>> For filesystems that don't need to store xattrs in general, and can live
>> with the other limitations of 128-byte inodes, using a 128-byte inode
>> can save substantial space compared to a 256-byte inode (many megabytes
>> worth of inode tables, versus 4k for each file between 61-160 bytes),
>> and many small files or small directories would still fit in 60 bytes.
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2020-04-23 00:41:22

by Josh Triplett

[permalink] [raw]
Subject: Re: Inline data with 128-byte inodes?

On Wed, Apr 22, 2020 at 02:15:28PM -0600, Andreas Dilger wrote:
> On Apr 22, 2020, at 10:00 AM, Jan Kara <[email protected]> wrote:
> > On Tue 14-04-20 00:02:07, Josh Triplett wrote:
> >> Is there a fundamental reason that ext4 *can't* or *shouldn't* support
> >> inline data with 128-byte inodes?
> >
> > Well, where would we put it on disk? ext4 on-disk inode fills 128-bytes
> > with 'osd2' union...
>
> There are 60 bytes in the "i_block" field that can be used by inline_data.

Exactly. But the Linux ext4 implementation doesn't accept inline data
unless the system.data xattr exists, even if the file's data fits in 60
bytes (in which case system.data must exist and have 0 length).

> > Or do you mean we should put inline data in an external xattr block?

Definitely not, no. That seems much more complex to deal with.

I'm only talking about the case of files or directories <= 60 bytes
fitting in the inode with 128-byte inodes. Effectively, this would mean
accepting inodes with the inline data flag set, whether they have the
system.data xattr or not.

> Maybe there is a bigger win for small directories avoiding 4KB leaf blocks?
>
> That said, I'd be happy to see some numbers to show this is a win, and
> I'm definitely not _against_ allowing this to work if there is a use for it.

Some statistics, for ext4 with 4k blocks and 128-byte inodes, if 60-byte
inline data worked with 128-byte inodes:

A filesystem containing the source code of the Linux kernel would
save about 1508 disk blocks, or around 6032k.

A filesystem containing only my /etc directory would save about 650
blocks, or 2600k, a substantial fraction of the entire directory (which
takes up 9004k total without inline data).

- Josh Triplett

2020-04-23 12:07:24

by Jan Kara

[permalink] [raw]
Subject: Re: Inline data with 128-byte inodes?

On Wed 22-04-20 17:40:33, Josh Triplett wrote:
> On Wed, Apr 22, 2020 at 02:15:28PM -0600, Andreas Dilger wrote:
> > On Apr 22, 2020, at 10:00 AM, Jan Kara <[email protected]> wrote:
> > > On Tue 14-04-20 00:02:07, Josh Triplett wrote:
> > >> Is there a fundamental reason that ext4 *can't* or *shouldn't* support
> > >> inline data with 128-byte inodes?
> > >
> > > Well, where would we put it on disk? ext4 on-disk inode fills 128-bytes
> > > with 'osd2' union...
> >
> > There are 60 bytes in the "i_block" field that can be used by inline_data.
>
> Exactly. But the Linux ext4 implementation doesn't accept inline data
> unless the system.data xattr exists, even if the file's data fits in 60
> bytes (in which case system.data must exist and have 0 length).

I see now I understand what you meant. Thanks for explanation.

> > Maybe there is a bigger win for small directories avoiding 4KB leaf blocks?
> >
> > That said, I'd be happy to see some numbers to show this is a win, and
> > I'm definitely not _against_ allowing this to work if there is a use for it.
>
> Some statistics, for ext4 with 4k blocks and 128-byte inodes, if 60-byte
> inline data worked with 128-byte inodes:
>
> A filesystem containing the source code of the Linux kernel would
> save about 1508 disk blocks, or around 6032k.
>
> A filesystem containing only my /etc directory would save about 650
> blocks, or 2600k, a substantial fraction of the entire directory (which
> takes up 9004k total without inline data).

I guess few people care about a few megabytes these days... For really
space sensitive applications, people don't pick ext4 as a base filesystem
I'd guess (I'd expect squashfs, erofs, or ubifs if you need write access).
So the benefit is relatively small, the question about the cost is - how
complicated it gets to support inline data without xattrs?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-05-04 22:53:13

by Andreas Dilger

[permalink] [raw]
Subject: Re: Inline data with 128-byte inodes?

On Apr 22, 2020, at 6:40 PM, Josh Triplett <[email protected]> wrote:
>
> On Wed, Apr 22, 2020 at 02:15:28PM -0600, Andreas Dilger wrote:
>> On Apr 22, 2020, at 10:00 AM, Jan Kara <[email protected]> wrote:
>>> On Tue 14-04-20 00:02:07, Josh Triplett wrote:
>>>> Is there a fundamental reason that ext4 *can't* or *shouldn't* support
>>>> inline data with 128-byte inodes?
>>>
>>> Well, where would we put it on disk? ext4 on-disk inode fills 128-bytes
>>> with 'osd2' union...
>>
>> There are 60 bytes in the "i_block" field that can be used by inline_data.
>
> Exactly. But the Linux ext4 implementation doesn't accept inline data
> unless the system.data xattr exists, even if the file's data fits in 60
> bytes (in which case system.data must exist and have 0 length).
>
>>> Or do you mean we should put inline data in an external xattr block?
>
> Definitely not, no. That seems much more complex to deal with.
>
> I'm only talking about the case of files or directories <= 60 bytes
> fitting in the inode with 128-byte inodes. Effectively, this would mean
> accepting inodes with the inline data flag set, whether they have the
> system.data xattr or not.
>
>> Maybe there is a bigger win for small directories avoiding 4KB leaf blocks?
>>
>> That said, I'd be happy to see some numbers to show this is a win, and
>> I'm definitely not _against_ allowing this to work if there is a use for it.
>
> Some statistics, for ext4 with 4k blocks and 128-byte inodes, if 60-byte
> inline data worked with 128-byte inodes:
>
> A filesystem containing the source code of the Linux kernel would
> save about 1508 disk blocks, or around 6032k.
>
> A filesystem containing only my /etc directory would save about 650
> blocks, or 2600k, a substantial fraction of the entire directory (which
> takes up 9004k total without inline data).

Hi Josh,
I started looking into this, and got half-way through a patch before
getting pulled into something else. If you are interested to finish
it off, I've included it here, but it definitely isn't ready yet.

Looking at the details, it isn't just an issue of changing a single
threshold to decide whether 128-byte inodes can handle inline data
or not. The code is intermingled between "have system.data xattr"
as the defining factor and using "inline_off" (which points to the
byte offset of the system.data contents) to determine whether there
is inline data or not.

It _should_ be possible to change to using EXT4_STATE_MAY_INLINE_DATA
and EXT4_INLINE_DATA_FL to determine if there is any inline data in
the inode, and use "inline_off" and i_extra_isize only to detect if
any data is in the xattr or not. It doesn't need to check the xattrs
if the inode size is EXT4_GOOD_OLD_INODE_SIZE, since we *don't* want
"inline" data in an external xattr block for sanity/efficiency reasons.

Hopefully this patch is useful for you as a starting point - I suspect
it is about 80% finished, but I wouldn't have time to work on it for a
few weeks at least, since 128-byte inodes is not a useful case for me (we
are using 1024-byte inodes on our metadata filesystems these days). On
the flip side, I think inline_data is very interesting, and am happy for
it to gain more widespread usage. I don't think the 128-byte inode support
for inline_data increases complexity, so I wouldn't be against landing it.

Cheers, Andreas
--

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 61b37a0..ec6f51a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3260,8 +3260,7 @@ extern int ext4_inline_data_fiemap(struct inode *inode,

static inline int ext4_has_inline_data(struct inode *inode)
{
- return ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA) &&
- EXT4_I(inode)->i_inline_off;
+ return ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA);
}

/* namei.c */
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index fad82d0..ea9596a 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -20,7 +20,7 @@

static int ext4_get_inline_size(struct inode *inode)
{
- if (EXT4_I(inode)->i_inline_off)
+ if (ext4_get_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
return EXT4_I(inode)->i_inline_size;

return 0;
@@ -94,7 +94,7 @@ int ext4_get_max_inline_size(struct inode *inode)
struct ext4_iloc iloc;

if (EXT4_I(inode)->i_extra_isize == 0)
- return 0;
+ return EXT4_MIN_INLINE_DATA_SIZE;

error = ext4_get_inode_loc(inode, &iloc);
if (error) {
@@ -133,6 +133,11 @@ int ext4_find_inline_data_nolock(struct inode *inode)
};
int error;

+ if (!ext4_has_feature_inline_data(inode))
+ return 0;
+
+ ext4_set_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
+ EXT4_I(inode)->i_inline_size = EXT4_MIN_INLINE_DATA_SIZE;
if (EXT4_I(inode)->i_extra_isize == 0)
return 0;

@@ -153,9 +158,8 @@ int ext4_find_inline_data_nolock(struct inode *inode)
}
EXT4_I(inode)->i_inline_off = (u16)((void *)is.s.here -
(void *)ext4_raw_inode(&is.iloc));
- EXT4_I(inode)->i_inline_size = EXT4_MIN_INLINE_DATA_SIZE +
+ EXT4_I(inode)->i_inline_size +=
le32_to_cpu(is.s.here->e_value_size);
- ext4_set_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
}
out:
brelse(is.iloc.bh);
@@ -188,6 +192,7 @@ static int ext4_read_inline_data(struct inode *inode, void *buffer,
if (!len)
goto out;

+ BUG_ON(EXT4_I(inode)->i_extra_isize == 0);
header = IHDR(inode, raw_inode);
entry = (struct ext4_xattr_entry *)((void *)raw_inode +
EXT4_I(inode)->i_inline_off);
@@ -219,7 +224,7 @@ static void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return;

- BUG_ON(!EXT4_I(inode)->i_inline_off);
+ BUG_ON(!ext4_get_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));
BUG_ON(pos + len > EXT4_I(inode)->i_inline_size);

raw_inode = ext4_raw_inode(iloc);
@@ -238,6 +243,7 @@ static void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
if (!len)
return;

+ BUG_ON(EXT4_I(inode)->i_extra_isize == 0);
pos -= EXT4_MIN_INLINE_DATA_SIZE;
header = IHDR(inode, raw_inode);
entry = (struct ext4_xattr_entry *)((void *)raw_inode +
@@ -269,6 +275,17 @@ static int ext4_create_inline_data(handle_t *handle,
if (error)
goto out;

+ EXT4_I(inode)->i_inline_size = EXT4_MIN_INLINE_DATA_SIZE;
+ memset((void *)ext4_raw_inode(&is.iloc)->i_block,
+ 0, EXT4_MIN_INLINE_DATA_SIZE);
+ ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS);
+ ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA);
+
+ if (EXT4_I(inode)->i_extra_isize == 0) {
+ BUG_ON(len > EXT4_MIN_INLINE_DATA_SIZE);
+ goto out_dirty;
+ }
+
if (len > EXT4_MIN_INLINE_DATA_SIZE) {
value = EXT4_ZERO_XATTR_VALUE;
len -= EXT4_MIN_INLINE_DATA_SIZE;
@@ -295,14 +312,11 @@ static int ext4_create_inline_data(handle_t *handle,
goto out;
}

- memset((void *)ext4_raw_inode(&is.iloc)->i_block,
- 0, EXT4_MIN_INLINE_DATA_SIZE);
-
EXT4_I(inode)->i_inline_off = (u16)((void *)is.s.here -
(void *)ext4_raw_inode(&is.iloc));
- EXT4_I(inode)->i_inline_size = len + EXT4_MIN_INLINE_DATA_SIZE;
- ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS);
- ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA);
+ EXT4_I(inode)->i_inline_size += len;
+
+out_dirty:
get_bh(is.iloc.bh);
error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);

@@ -804,6 +818,7 @@ static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
void **fsdata)
{
int ret = 0, inline_size;
+ bool truncate = false;
struct page *page;

page = grab_cache_page_write_begin(mapping, 0, flags);
@@ -811,10 +826,8 @@ static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
return -ENOMEM;

down_read(&EXT4_I(inode)->xattr_sem);
- if (!ext4_has_inline_data(inode)) {
- ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
- goto out;
- }
+ if (!ext4_has_inline_data(inode))
+ goto out_clear;

inline_size = ext4_get_inline_size(inode);

@@ -827,23 +840,25 @@ static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
ret = __block_write_begin(page, 0, inline_size,
ext4_da_get_block_prep);
if (ret) {
- up_read(&EXT4_I(inode)->xattr_sem);
- unlock_page(page);
- put_page(page);
- ext4_truncate_failed_write(inode);
- return ret;
+ truncate = true;
+ goto out;
}

SetPageDirty(page);
SetPageUptodate(page);
- ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
*fsdata = (void *)CONVERT_INLINE_DATA;

+out_clear:
+ ext4_clear_inode_flag(inode, EXT4_INODE_INLINE_DATA);
+ ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
out:
up_read(&EXT4_I(inode)->xattr_sem);
if (page) {
+ if (read)
unlock_page(page);
put_page(page);
+ if (truncate)
+ ext4_truncate_failed_write(inode);
}
return ret;
}


Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP