LinuxLists.cc - [PATCH] ext4: add max_dir_size

2012-08-10 19:24:01

Subject: [PATCH] ext4: add max_dir_size_kb mount option

Very large directories can cause significant performance problems,
especially if jobs are running in memory-tight environment (whether it
is VM's with a small amount of memory or small memory cgroups).

So it is useful, in cloud server/data center environments, to be able
to set a filesystem-wide cap on the maximum size of a directory, to
ensure that directories never get larger than a sane size. We do this
via a new mount option, max_dir_size_kb. If there is an attempt to
grow the directory larger than max_dir_size_kb, the system call will
return ENOSPC instead.

Google-Bug-Id: 6863013

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/namei.c | 6 ++++++
fs/ext4/super.c | 7 +++++++
3 files changed, 14 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c3411d4..7c0841e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1243,6 +1243,7 @@ struct ext4_sb_info {
unsigned int s_mb_order2_reqs;
unsigned int s_mb_group_prealloc;
unsigned int s_max_writeback_mb_bump;
+ unsigned int s_max_dir_size_kb;
/* where last allocation was done - for stream allocation */
unsigned long s_mb_last_group;
unsigned long s_mb_last_start;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2a42cc0..bdde668 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -55,6 +55,12 @@ static struct buffer_head *ext4_append(handle_t *handle,
{
struct buffer_head *bh;

+ if (unlikely((inode->i_size >> 10) >=
+ EXT4_SB(inode->i_sb)->s_max_dir_size_kb)) {
+ *err = -ENOSPC;
+ return NULL;
+ }
+
*block = inode->i_size >> inode->i_sb->s_blocksize_bits;

bh = ext4_bread(handle, inode, *block, 1, err);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 56bcaec..5896dcb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1230,6 +1230,7 @@ enum {
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
+ Opt_max_dir_size_kb,
};

static const match_table_t tokens = {
@@ -1303,6 +1304,7 @@ static const match_table_t tokens = {
{Opt_init_itable, "init_itable=%u"},
{Opt_init_itable, "init_itable"},
{Opt_noinit_itable, "noinit_itable"},
+ {Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
{Opt_removed, "check=none"}, /* mount option from ext2/3 */
{Opt_removed, "nocheck"}, /* mount option from ext2/3 */
{Opt_removed, "reservation"}, /* mount option from ext2/3 */
@@ -1483,6 +1485,7 @@ static const struct mount_opts {
{Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
+ {Opt_max_dir_size_kb, 0, MOPT_GTE0},
{Opt_err, 0, 0}
};

@@ -1598,6 +1601,8 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
if (!args->from)
arg = EXT4_DEF_LI_WAIT_MULT;
sbi->s_li_wait_mult = arg;
+ } else if (token == Opt_max_dir_size_kb) {
+ sbi->s_max_dir_size_kb = arg;
} else if (token == Opt_stripe) {
sbi->s_stripe = arg;
} else if (m->flags & MOPT_DATAJ) {
@@ -1829,6 +1834,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
if (nodefs || (test_opt(sb, INIT_INODE_TABLE) &&
(sbi->s_li_wait_mult != EXT4_DEF_LI_WAIT_MULT)))
SEQ_OPTS_PRINT("init_itable=%u", sbi->s_li_wait_mult);
+ if (nodefs || sbi->s_max_dir_size_kb)
+ SEQ_OPTS_PRINT("max_dir_size_kb=%u", sbi->s_max_dir_size_kb);

ext4_show_quota_options(seq, sb);
return 0;
--
1.7.12.rc0.22.gcdd159b

2012-08-10 19:38:14

by Eric Sandeen

[permalink] [raw]

Subject: Re: [PATCH] ext4: add max_dir_size_kb mount option

On 8/10/12 2:23 PM, Theodore Ts'o wrote:
>
> So it is useful, in cloud server/data center environments, to be able
> to set a filesystem-wide cap on the maximum size of a directory, to
> ensure that directories never get larger than a sane size. We do this
> via a new mount option, max_dir_size_kb. If there is an attempt to
> grow the directory larger than max_dir_size_kb, the system call will
> return ENOSPC instead.

Can the commit message also describe more about the problem: how bad
it is, the root cause, and why it's so hard to fix properly?

Please also update Documentation/filesystems/ext4.txt so people
know for sure what the use & intent of this new knob is.

I guess it's self explanatory in the name but docs are easy & good.

-Eric

2012-08-10 19:59:10

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH] ext4: add max_dir_size_kb mount option

On Fri, Aug 10, 2012 at 02:38:10PM -0500, Eric Sandeen wrote:
>
> Can the commit message also describe more about the problem: how bad
> it is, the root cause, and why it's so hard to fix properly?

The use case is going to be fairly userspace specific, but one example
might be if the log reaper fails to run for whatever reason, and the
log directory then proceeds to grow without bound. And then when the
log repear *does* have a chance to run, if it happens to be in tight
memory cgroup, it then dies so the directory grows even larger, and
then when other processes try to access the directory, a readdir will
cause them to die because of their memory limitation, and hilarity
ensues.

You can fix this in other places in the software stack, but belt and
suspenders is good, and if there is no reason for directories to grow
larger than some set size, it's better to get a hard failure with an
ENOSPC rather than an funny failures caused by OOM's or slower and
slower performance.

> Please also update Documentation/filesystems/ext4.txt so people
> know for sure what the use & intent of this new knob is.

Yes, I'll do that in the next spin of the patches.

- Ted

2012-08-10 20:11:27

by Jeff Moyer

[permalink] [raw]

Subject: Re: [PATCH] ext4: add max_dir_size_kb mount option

"Theodore Ts'o" <[email protected]> writes:

> Very large directories can cause significant performance problems,
> especially if jobs are running in memory-tight environment (whether it
> is VM's with a small amount of memory or small memory cgroups).
>
> So it is useful, in cloud server/data center environments, to be able
> to set a filesystem-wide cap on the maximum size of a directory, to
> ensure that directories never get larger than a sane size. We do this
> via a new mount option, max_dir_size_kb. If there is an attempt to
> grow the directory larger than max_dir_size_kb, the system call will
> return ENOSPC instead.

I have no idea what a reasonable number for this would be. Can you
provide guidelines that would help admins understand what factors
influence performance degradation due to directory size?

Finally, I don't pretend to understand how your mount option parsing
routines work, but based on what I see in this patch it looks like the
default will be set to and enforced as 0. What am I missing?

Cheers,
Jeff

> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/ext4.h | 1 +
> fs/ext4/namei.c | 6 ++++++
> fs/ext4/super.c | 7 +++++++
> 3 files changed, 14 insertions(+)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index c3411d4..7c0841e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1243,6 +1243,7 @@ struct ext4_sb_info {
> unsigned int s_mb_order2_reqs;
> unsigned int s_mb_group_prealloc;
> unsigned int s_max_writeback_mb_bump;
> + unsigned int s_max_dir_size_kb;
> /* where last allocation was done - for stream allocation */
> unsigned long s_mb_last_group;
> unsigned long s_mb_last_start;
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 2a42cc0..bdde668 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -55,6 +55,12 @@ static struct buffer_head *ext4_append(handle_t *handle,
> {
> struct buffer_head *bh;
>
> + if (unlikely((inode->i_size >> 10) >=
> + EXT4_SB(inode->i_sb)->s_max_dir_size_kb)) {
> + *err = -ENOSPC;
> + return NULL;
> + }
> +
> *block = inode->i_size >> inode->i_sb->s_blocksize_bits;
>
> bh = ext4_bread(handle, inode, *block, 1, err);
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 56bcaec..5896dcb 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1230,6 +1230,7 @@ enum {
> Opt_inode_readahead_blks, Opt_journal_ioprio,
> Opt_dioread_nolock, Opt_dioread_lock,
> Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
> + Opt_max_dir_size_kb,
> };
>
> static const match_table_t tokens = {
> @@ -1303,6 +1304,7 @@ static const match_table_t tokens = {
> {Opt_init_itable, "init_itable=%u"},
> {Opt_init_itable, "init_itable"},
> {Opt_noinit_itable, "noinit_itable"},
> + {Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
> {Opt_removed, "check=none"}, /* mount option from ext2/3 */
> {Opt_removed, "nocheck"}, /* mount option from ext2/3 */
> {Opt_removed, "reservation"}, /* mount option from ext2/3 */
> @@ -1483,6 +1485,7 @@ static const struct mount_opts {
> {Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
> {Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
> {Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
> + {Opt_max_dir_size_kb, 0, MOPT_GTE0},
> {Opt_err, 0, 0}
> };
>
> @@ -1598,6 +1601,8 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> if (!args->from)
> arg = EXT4_DEF_LI_WAIT_MULT;
> sbi->s_li_wait_mult = arg;
> + } else if (token == Opt_max_dir_size_kb) {
> + sbi->s_max_dir_size_kb = arg;
> } else if (token == Opt_stripe) {
> sbi->s_stripe = arg;
> } else if (m->flags & MOPT_DATAJ) {
> @@ -1829,6 +1834,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
> if (nodefs || (test_opt(sb, INIT_INODE_TABLE) &&
> (sbi->s_li_wait_mult != EXT4_DEF_LI_WAIT_MULT)))
> SEQ_OPTS_PRINT("init_itable=%u", sbi->s_li_wait_mult);
> + if (nodefs || sbi->s_max_dir_size_kb)
> + SEQ_OPTS_PRINT("max_dir_size_kb=%u", sbi->s_max_dir_size_kb);
>
> ext4_show_quota_options(seq, sb);
> return 0;

2012-08-10 20:17:00

by Eric Sandeen

[permalink] [raw]

Subject: Re: [PATCH] ext4: add max_dir_size_kb mount option

On 8/10/12 2:59 PM, Theodore Ts'o wrote:
> On Fri, Aug 10, 2012 at 02:38:10PM -0500, Eric Sandeen wrote:
>>
>> Can the commit message also describe more about the problem: how bad
>> it is, the root cause, and why it's so hard to fix properly?
>
> The use case is going to be fairly userspace specific, but one example
> might be if the log reaper fails to run for whatever reason, and the
> log directory then proceeds to grow without bound. And then when the
> log repear *does* have a chance to run, if it happens to be in tight
> memory cgroup, it then dies so the directory grows even larger, and
> then when other processes try to access the directory, a readdir will
> cause them to die because of their memory limitation, and hilarity
> ensues.

Oh, I thought this was papering over a scaling problem in ext4. The intent
is to protect userspace from arbitrarily large readdir results?

If that's the case, this should probably be proposed as a VFS level
option, and see how it's received there...

(Can you tell I'm not a huge fan of the idea?) ;)

-Eric

> You can fix this in other places in the software stack, but belt and
> suspenders is good, and if there is no reason for directories to grow
> larger than some set size, it's better to get a hard failure with an
> ENOSPC rather than an funny failures caused by OOM's or slower and
> slower performance.

>> Please also update Documentation/filesystems/ext4.txt so people
>> know for sure what the use & intent of this new knob is.
>
> Yes, I'll do that in the next spin of the patches.
>
> - Ted
>

2012-08-10 21:58:15

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH] ext4: add max_dir_size_kb mount option

On Fri, Aug 10, 2012 at 04:11:24PM -0400, Jeff Moyer wrote:
>
> I have no idea what a reasonable number for this would be. Can you
> provide guidelines that would help admins understand what factors
> influence performance degradation due to directory size?

Well, for example, if you have a job which is a 512mb memory
container, and the directory has grown to 176mb, an attempt to readdir
said directory will cause that job to thrash badly, and perhaps get
killed by the OOM killer. If you know that no sane directory should
ever grow beyond a single megabyte, you might pick a max_dir_size_kb
of 1024.

> Finally, I don't pretend to understand how your mount option parsing
> routines work, but based on what I see in this patch it looks like the
> default will be set to and enforced as 0. What am I missing?

Sorry, I sent out the wrong version of the patch. The limit was only
supposed to be used if maximum directory size is greater than 0; that
is, the default is that the directory size is unlimited, as before.
I'll send out a revised v2 version of the patch.

I view this as a very specialized option, but if you're running in a
tightly constrained memory cgroup, or a tiny EC2 instance, or the
equivalent Cloud Open VM, it might be a very useful thing to be able
to cap.

Regards,

- Ted

2012-08-10 22:00:01

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH] ext4: add max_dir_size_kb mount option

On Fri, Aug 10, 2012 at 03:16:57PM -0500, Eric Sandeen wrote:
>
> Oh, I thought this was papering over a scaling problem in ext4. The intent
> is to protect userspace from arbitrarily large readdir results?

The problem that arose was that the directory had grown to 176mb, and
in a tight memory cgroup (say, 512mb, or some such), this could cause
significant memory thrashing, or the actual OOM killing of the task in
said tight memory cgroup when it tried to read the entire directory
via readdir().

> If that's the case, this should probably be proposed as a VFS level
> option, and see how it's received there...

It's not really a VFS level thing, since the goal is to stop the
directory size itself from growing beyond a size limit (say, 1mb).

- Ted

2012-08-10 23:14:17

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH] ext4: add max_dir_size_kb mount option

On 2012-08-10, at 3:58 PM, Theodore Ts'o wrote:
> On Fri, Aug 10, 2012 at 04:11:24PM -0400, Jeff Moyer wrote:
>>
>> I have no idea what a reasonable number for this would be. Can you
>> provide guidelines that would help admins understand what factors
>> influence performance degradation due to directory size?
>
> Well, for example, if you have a job which is a 512mb memory
> container, and the directory has grown to 176mb, an attempt to readdir
> said directory will cause that job to thrash badly, and perhaps get
> killed by the OOM killer. If you know that no sane directory should
> ever grow beyond a single megabyte, you might pick a max_dir_size_kb
> of 1024.

Actually, we've been carrying a very similar patch to this in Lustre
for a long time. I didn't think this would be of interest to others
outside of Lustre, so I don't think I ever sent it upstream.

The reason we have this is that some HPC jobs might create 10k files
every hour in the same output directory, and if the user/job don't
pay attention to clean up the old files, then they might get many
millions of files in the same directory. Simple operations like
"ls -l" in the directory will behave badly because GNU ls will read
and sort all of the entries first. Even if "-U" is given to not sort
entries, ls will try to read (and by default stat for color) all the
entries before displaying them so that column widths can be made nice.
Similarly, "rm *" or other foolish things will break on large dirs for
na?ve users (who are scientists and not sysadmins).

Operations may take many minutes on a huge directory, and users will
complain and call support when they think the filesystem has hung.
Instead, the admins limit the directory size and cause such applications
to fail early to alert the user that their application is behaving badly.

Of course, other sites want huge directories (10 billion files in one
directory is the latest number I've seen), so this has to be tunable.
We have a patch to fix the 2-level htree and 2GB directory size limits
already, in case that is of interest to anyone.

Cheers, Andreas

>> Finally, I don't pretend to understand how your mount option parsing
>> routines work, but based on what I see in this patch it looks like the
>> default will be set to and enforced as 0. What am I missing?
>
> Sorry, I sent out the wrong version of the patch. The limit was only
> supposed to be used if maximum directory size is greater than 0; that
> is, the default is that the directory size is unlimited, as before.
> I'll send out a revised v2 version of the patch.
>
> I view this as a very specialized option, but if you're running in a
> tightly constrained memory cgroup, or a tiny EC2 instance, or the
> equivalent Cloud Open VM, it might be a very useful thing to be able
> to cap.
>
> Regards,
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Cheers, Andreas

2012-08-11 01:40:38

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH -v2] ext4: add max_dir_size_kb mount option

Very large directories can cause significant performance problems, or
perhaps even invoke the OOM killer, if the process is running in a
highly constrained memory environment (whether it is VM's with a small
amount of memory or in a small memory cgroup).

So it is useful, in cloud server/data center environments, to be able
to set a filesystem-wide cap on the maximum size of a directory, to
ensure that directories never get larger than a sane size. We do this
via a new mount option, max_dir_size_kb. If there is an attempt to
grow the directory larger than max_dir_size_kb, the system call will
return ENOSPC instead.

Google-Bug-Id: 6863013

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
Documentation/filesystems/ext4.txt | 4 ++++
fs/ext4/ext4.h | 1 +
fs/ext4/namei.c | 7 +++++++
fs/ext4/super.c | 7 +++++++
4 files changed, 19 insertions(+)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 1b7f9ac..43b80b5 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -375,6 +375,10 @@ dioread_nolock locking. If the dioread_nolock option is specified
Because of the restrictions this options comprises
it is off by default (e.g. dioread_lock).

+max_dir_size_kb=n This limits the size of directories so that any
+ attempt to expand them beyond the specified
+ limit in kilobytes will cause an ENOSPC error.
+
i_version Enable 64-bit inode version support. This option is
off by default.

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c3411d4..7c0841e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1243,6 +1243,7 @@ struct ext4_sb_info {
unsigned int s_mb_order2_reqs;
unsigned int s_mb_group_prealloc;
unsigned int s_max_writeback_mb_bump;
+ unsigned int s_max_dir_size_kb;
/* where last allocation was done - for stream allocation */
unsigned long s_mb_last_group;
unsigned long s_mb_last_start;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2a42cc0..7450ff0 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -55,6 +55,13 @@ static struct buffer_head *ext4_append(handle_t *handle,
{
struct buffer_head *bh;

+ if (unlikely(EXT4_SB(inode->i_sb)->s_max_dir_size_kb &&
+ ((inode->i_size >> 10) >=
+ EXT4_SB(inode->i_sb)->s_max_dir_size_kb))) {
+ *err = -ENOSPC;
+ return NULL;
+ }
+
*block = inode->i_size >> inode->i_sb->s_blocksize_bits;

bh = ext4_bread(handle, inode, *block, 1, err);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 56bcaec..5896dcb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1230,6 +1230,7 @@ enum {
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
+ Opt_max_dir_size_kb,
};

static const match_table_t tokens = {
@@ -1303,6 +1304,7 @@ static const match_table_t tokens = {
{Opt_init_itable, "init_itable=%u"},
{Opt_init_itable, "init_itable"},
{Opt_noinit_itable, "noinit_itable"},
+ {Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
{Opt_removed, "check=none"}, /* mount option from ext2/3 */
{Opt_removed, "nocheck"}, /* mount option from ext2/3 */
{Opt_removed, "reservation"}, /* mount option from ext2/3 */
@@ -1483,6 +1485,7 @@ static const struct mount_opts {
{Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
+ {Opt_max_dir_size_kb, 0, MOPT_GTE0},
{Opt_err, 0, 0}
};

@@ -1598,6 +1601,8 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
if (!args->from)
arg = EXT4_DEF_LI_WAIT_MULT;
sbi->s_li_wait_mult = arg;
+ } else if (token == Opt_max_dir_size_kb) {
+ sbi->s_max_dir_size_kb = arg;
} else if (token == Opt_stripe) {
sbi->s_stripe = arg;
} else if (m->flags & MOPT_DATAJ) {
@@ -1829,6 +1834,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
if (nodefs || (test_opt(sb, INIT_INODE_TABLE) &&
(sbi->s_li_wait_mult != EXT4_DEF_LI_WAIT_MULT)))
SEQ_OPTS_PRINT("init_itable=%u", sbi->s_li_wait_mult);
+ if (nodefs || sbi->s_max_dir_size_kb)
+ SEQ_OPTS_PRINT("max_dir_size_kb=%u", sbi->s_max_dir_size_kb);

ext4_show_quota_options(seq, sb);
return 0;
--
1.7.12.rc0.22.gcdd159b

2012-08-11 03:22:40

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH -v2] ext4: add max_dir_size_kb mount option

On 2012-08-10, at 19:40, Theodore Ts'o <[email protected]> wrote:

> Very large directories can cause significant performance problems, or
> perhaps even invoke the OOM killer, if the process is running in a
> highly constrained memory environment (whether it is VM's with a small
> amount of memory or in a small memory cgroup).
>
> So it is useful, in cloud server/data center environments, to be able
> to set a filesystem-wide cap on the maximum size of a directory, to
> ensure that directories never get larger than a sane size. We do this
> via a new mount option, max_dir_size_kb. If there is an attempt to
> grow the directory larger than max_dir_size_kb, the system call will
> return ENOSPC instead.

In our patch, it returns EFBIG, since it isn't really a case of being out of space for blocks or inodes.

Cheers, Andreas

> Google-Bug-Id: 6863013
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> Documentation/filesystems/ext4.txt | 4 ++++
> fs/ext4/ext4.h | 1 +
> fs/ext4/namei.c | 7 +++++++
> fs/ext4/super.c | 7 +++++++
> 4 files changed, 19 insertions(+)
>
> diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
> index 1b7f9ac..43b80b5 100644
> --- a/Documentation/filesystems/ext4.txt
> +++ b/Documentation/filesystems/ext4.txt
> @@ -375,6 +375,10 @@ dioread_nolock locking. If the dioread_nolock option is specified
> Because of the restrictions this options comprises
> it is off by default (e.g. dioread_lock).
>
> +max_dir_size_kb=n This limits the size of directories so that any
> + attempt to expand them beyond the specified
> + limit in kilobytes will cause an ENOSPC error.
> +
> i_version Enable 64-bit inode version support. This option is
> off by default.
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index c3411d4..7c0841e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1243,6 +1243,7 @@ struct ext4_sb_info {
> unsigned int s_mb_order2_reqs;
> unsigned int s_mb_group_prealloc;
> unsigned int s_max_writeback_mb_bump;
> + unsigned int s_max_dir_size_kb;
> /* where last allocation was done - for stream allocation */
> unsigned long s_mb_last_group;
> unsigned long s_mb_last_start;
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 2a42cc0..7450ff0 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -55,6 +55,13 @@ static struct buffer_head *ext4_append(handle_t *handle,
> {
> struct buffer_head *bh;
>
> + if (unlikely(EXT4_SB(inode->i_sb)->s_max_dir_size_kb &&
> + ((inode->i_size >> 10) >=
> + EXT4_SB(inode->i_sb)->s_max_dir_size_kb))) {
> + *err = -ENOSPC;
> + return NULL;
> + }
> +
> *block = inode->i_size >> inode->i_sb->s_blocksize_bits;
>
> bh = ext4_bread(handle, inode, *block, 1, err);
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 56bcaec..5896dcb 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1230,6 +1230,7 @@ enum {
> Opt_inode_readahead_blks, Opt_journal_ioprio,
> Opt_dioread_nolock, Opt_dioread_lock,
> Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
> + Opt_max_dir_size_kb,
> };
>
> static const match_table_t tokens = {
> @@ -1303,6 +1304,7 @@ static const match_table_t tokens = {
> {Opt_init_itable, "init_itable=%u"},
> {Opt_init_itable, "init_itable"},
> {Opt_noinit_itable, "noinit_itable"},
> + {Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
> {Opt_removed, "check=none"}, /* mount option from ext2/3 */
> {Opt_removed, "nocheck"}, /* mount option from ext2/3 */
> {Opt_removed, "reservation"}, /* mount option from ext2/3 */
> @@ -1483,6 +1485,7 @@ static const struct mount_opts {
> {Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
> {Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
> {Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
> + {Opt_max_dir_size_kb, 0, MOPT_GTE0},
> {Opt_err, 0, 0}
> };
>
> @@ -1598,6 +1601,8 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> if (!args->from)
> arg = EXT4_DEF_LI_WAIT_MULT;
> sbi->s_li_wait_mult = arg;
> + } else if (token == Opt_max_dir_size_kb) {
> + sbi->s_max_dir_size_kb = arg;
> } else if (token == Opt_stripe) {
> sbi->s_stripe = arg;
> } else if (m->flags & MOPT_DATAJ) {
> @@ -1829,6 +1834,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
> if (nodefs || (test_opt(sb, INIT_INODE_TABLE) &&
> (sbi->s_li_wait_mult != EXT4_DEF_LI_WAIT_MULT)))
> SEQ_OPTS_PRINT("init_itable=%u", sbi->s_li_wait_mult);
> + if (nodefs || sbi->s_max_dir_size_kb)
> + SEQ_OPTS_PRINT("max_dir_size_kb=%u", sbi->s_max_dir_size_kb);
>
> ext4_show_quota_options(seq, sb);
> return 0;
> --
> 1.7.12.rc0.22.gcdd159b
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-08-11 19:26:50

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH -v2] ext4: add max_dir_size_kb mount option

On Fri, Aug 10, 2012 at 09:22:39PM -0600, Andreas Dilger wrote:
>
> In our patch, it returns EFBIG, since it isn't really a case of
> being out of space for blocks or inodes.

I agree, EFBIG seems to be a better errno.

How did you configure the directory size limit, BTW? Did you use a
mount option, if so, what did you name it?

Regards,

- Ted

2012-08-11 21:10:35

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH -v2] ext4: add max_dir_size_kb mount option

On 2012-08-11, at 1:26 PM, Theodore Ts'o wrote:
> On Fri, Aug 10, 2012 at 09:22:39PM -0600, Andreas Dilger wrote:
>>
>> In our patch, it returns EFBIG, since it isn't really a case of
>> being out of space for blocks or inodes.
>
> I agree, EFBIG seems to be a better errno.
>
> How did you configure the directory size limit, BTW? Did you use a
> mount option, if so, what did you name it?

It was configured via /sys/fs/ext4/{dev}/max_dir_size, but the units
were bytes instead of kbytes. Not many sites use this value, so if
it needs to be renamed .../max_dir_size_kb it wouldn't be fatal - we
could modify our patch to allow both, and print a deprecation message
if the old one is used. It will likely still be a few years before
distros are running 3.6 (or whatever kernel this patch is included in).
At least we wouldn't have to carry that patch in perpetuity.

Cheers, Andreas
--
Andreas Dilger Whamcloud, Inc.
Principal Lustre Engineer http://www.whamcloud.com/

2012-08-11 21:13:16

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH -v2] ext4: add max_dir_size_kb mount option

On Sat, Aug 11, 2012 at 03:26:48PM -0400, Theodore Ts'o wrote:
> On Fri, Aug 10, 2012 at 09:22:39PM -0600, Andreas Dilger wrote:
> >
> > In our patch, it returns EFBIG, since it isn't really a case of
> > being out of space for blocks or inodes.
>
> I agree, EFBIG seems to be a better errno.

Hmmm..... upon doing some more research, there was a related
discussion on this point on the IETF NFSv4 mailing list earlier this
year[1]. In it, Trond argued that EFBIG is defined by POSIX to mean:
"An attempt was made to write a file that exceeds the file size limit
of the process.", while ENOSPC is explicitly documented as an
allowable error code for rename(2):

[ENOSPC]
The directory that would contain new cannot be extended.

The same definition is there for link(2) and open(2). For open, it
reads:

[ENOSPC]
The directory or file system that would contain the new file
cannot be expanded, the file does not exist, and O_CREAT is
specified.

Hence, Trond argued that using ENOSPC was a better choice, in terms of
being a closer match with POSIX specifications, and hence what
programs might expect.

The string returned by perror/strerror is going to be a little
confusing to users in either case. EFBIG returns "File too large",
while ENOSPC returns "No space left on device". One might argue that
ENOSPC's error return is a little better, but then again there's a
grand Unix tradition in this, after all --- "Not a typewriter",
anyone? :-)

- Ted

[1] http://www.ietf.org/mail-archive/web/nfsv4/current/msg10720.html

P.S. The context for this is a feature which the NetApp filer has,
MaxDirSize, which controls the maximum size of a directory specified
in Kilobytes. The discussion was what was the proper NFS error code
that should be returned, given how it would be reflected back to a
Posix errno by most clients.

Interestingly, it appears the default MaxDirSize starting with
NetApp's Data before ONTAP 6.5 was 64k. On newer NetApps, the limit
is defaulted to be 1% of the memory size configured on the filer. The
reason given for limiting the maximum directory size was for
performance reasons. Given the issues we've seen when you have jobs
running with a 512mb memory cgroup (which would also apply if you were
running a micro EC2 instance, or some other Xen or KVM VM with a small
memory size), it's interesting that this was an issue that NetApp has
run into, and addressed the same way.

I can definitely attest to the fact that the system will not be happy
if you are limited to 512mb of memory, and you have a 176mb
directory....