LinuxLists.cc - [PATCH,STABLE 2.6.29 01/18] ext4: don't inherit inappropriate inode flags from parent

2009-06-02 12:08:10

Subject: [PATCH,STABLE 2.6.29 01/18] ext4: don't inherit inappropriate inode flags from parent

From: Duane Griffin <[email protected]>

At present INDEX and EXTENTS are the only flags that new ext4 inodes do
NOT inherit from their parent. In addition prevent the flags DIRTY,
ECOMPR, IMAGIC, TOPDIR, HUGE_FILE and EXT_MIGRATE from being inherited.
List inheritable flags explicitly to prevent future flags from
accidentally being inherited.

This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.

Signed-off-by: Duane Griffin <[email protected]>
Acked-by: Andreas Dilger <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 8fa43a81b97853fc69417bb6054182e78f95cbeb)
---
fs/ext4/ext4.h | 7 +++++++
fs/ext4/ialloc.c | 2 +-
2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 90909f9..45af699 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -248,6 +248,13 @@ struct flex_groups {
#define EXT4_FL_USER_VISIBLE 0x000BDFFF /* User visible flags */
#define EXT4_FL_USER_MODIFIABLE 0x000B80FF /* User modifiable flags */

+/* Flags that should be inherited by new inodes from their parent. */
+#define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
+ EXT4_SYNC_FL | EXT4_IMMUTABLE_FL | EXT4_APPEND_FL |\
+ EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
+ EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
+ EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL)
+
/*
* Inode dynamic state flags
*/
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 2d2b358..6f09543 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -889,7 +889,7 @@ got:
* newly created directory and file only if -o extent mount option is
* specified
*/
- ei->i_flags = EXT4_I(dir)->i_flags & ~(EXT4_INDEX_FL|EXT4_EXTENTS_FL);
+ ei->i_flags = EXT4_I(dir)->i_flags & EXT4_FL_INHERITED;
if (S_ISLNK(mode))
ei->i_flags &= ~(EXT4_IMMUTABLE_FL|EXT4_APPEND_FL);
/* dirsync only applies to directories */
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:10

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 07/18] ext4: Automatically allocate delay allocated blocks on rename

When renaming a file such that a link to another inode is overwritten,
force any delay allocated blocks that to be allocated so that if the
filesystem is mounted with data=ordered, the data blocks will be
pushed out to disk along with the journal commit. Many application
programs expect this, so we do this to avoid zero length files if the
system crashes unexpectedly.

Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 8750c6d5fcbd3342b3d908d157f81d345c5325a7)
---
fs/ext4/namei.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index f787234..63568ec 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2319,7 +2319,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
struct inode *old_inode, *new_inode;
struct buffer_head *old_bh, *new_bh, *dir_bh;
struct ext4_dir_entry_2 *old_de, *new_de;
- int retval;
+ int retval, force_da_alloc = 0;

old_bh = new_bh = dir_bh = NULL;

@@ -2457,6 +2457,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
ext4_mark_inode_dirty(handle, new_inode);
if (!new_inode->i_nlink)
ext4_orphan_add(handle, new_inode);
+ force_da_alloc = 1;
}
retval = 0;

@@ -2465,6 +2466,8 @@ end_rename:
brelse(old_bh);
brelse(new_bh);
ext4_journal_stop(handle);
+ if (retval == 0 && force_da_alloc)
+ ext4_alloc_da_blocks(old_inode);
return retval;
}

--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:10

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 04/18] ext4: Add fine print for the 32000 subdirectory limit

Some poeple are reading the ext4 feature list too literally and create
dubious test cases involving very long filenames and 1k blocksize and
then complain when they run into an htree-imposed limit. So add fine
print to the "fix 32000 subdirectory limit" ext4 feature.

Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 722bde6875bfb49a0c84e5601eb82dd7ac02d27c)
---
Documentation/filesystems/ext4.txt | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index cec829b..5c484ae 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -85,7 +85,7 @@ Note: More extensive information for getting started with ext4 can be
* extent format more robust in face of on-disk corruption due to magics,
* internal redundancy in tree
* improved file allocation (multi-block alloc)
-* fix 32000 subdirectory limit
+* lift 32000 subdirectory limit imposed by i_links_count[1]
* nsec timestamps for mtime, atime, ctime, create time
* inode version field on disk (NFSv4, Lustre)
* reduced e2fsck time via uninit_bg feature
@@ -100,6 +100,9 @@ Note: More extensive information for getting started with ext4 can be
* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force
the ordering)

+[1] Filesystems with a block size of 1k may see a limit imposed by the
+directory hash tree having a maximum depth of two.
+
2.2 Candidate features for future inclusion

* Online defrag (patches available but not well tested)
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:10

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 02/18] ext4: tighten restrictions on inode flags

From: Duane Griffin <[email protected]>

At the moment there are few restrictions on which flags may be set on
which inodes. Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links. Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.

Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.

Signed-off-by: Duane Griffin <[email protected]>
Acked-by: Andreas Dilger <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 2dc6b0d48ca0599837df21b14bb8393d0804af57)
---
fs/ext4/ext4.h | 17 +++++++++++++++++
fs/ext4/ialloc.c | 14 +++++---------
fs/ext4/ioctl.c | 3 +--
3 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 45af699..6a954de 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -255,6 +255,23 @@ struct flex_groups {
EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL)

+/* Flags that are appropriate for regular files (all but dir-specific ones). */
+#define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
+
+/* Flags that are appropriate for non-directories/regular files. */
+#define EXT4_OTHER_FLMASK (EXT4_NODUMP_FL | EXT4_NOATIME_FL)
+
+/* Mask out flags that are inappropriate for the given type of inode. */
+static inline __u32 ext4_mask_flags(umode_t mode, __u32 flags)
+{
+ if (S_ISDIR(mode))
+ return flags;
+ else if (S_ISREG(mode))
+ return flags & EXT4_REG_FLMASK;
+ else
+ return flags & EXT4_OTHER_FLMASK;
+}
+
/*
* Inode dynamic state flags
*/
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 6f09543..befd95e 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -885,16 +885,12 @@ got:
ei->i_disksize = 0;

/*
- * Don't inherit extent flag from directory. We set extent flag on
- * newly created directory and file only if -o extent mount option is
- * specified
+ * Don't inherit extent flag from directory, amongst others. We set
+ * extent flag on newly created directory and file only if -o extent
+ * mount option is specified
*/
- ei->i_flags = EXT4_I(dir)->i_flags & EXT4_FL_INHERITED;
- if (S_ISLNK(mode))
- ei->i_flags &= ~(EXT4_IMMUTABLE_FL|EXT4_APPEND_FL);
- /* dirsync only applies to directories */
- if (!S_ISDIR(mode))
- ei->i_flags &= ~EXT4_DIRSYNC_FL;
+ ei->i_flags =
+ ext4_mask_flags(mode, EXT4_I(dir)->i_flags & EXT4_FL_INHERITED);
ei->i_file_acl = 0;
ei->i_dtime = 0;
ei->i_block_group = group;
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 42dc83f..22dd29f 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -48,8 +48,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
if (err)
return err;

- if (!S_ISDIR(inode->i_mode))
- flags &= ~EXT4_DIRSYNC_FL;
+ flags = ext4_mask_flags(inode->i_mode, flags);

err = -EPERM;
mutex_lock(&inode->i_mutex);
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:10

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 13/18] ext4: Fix softlockup caused by illegal i_file_acl value in on-disk inode

If the block containing external extended attributes (which is stored
in i_file_acl and i_file_acl_high) is larger than the on-disk
filesystem, the process which tried to access the extended attributes
will endlessly issue kernel printks complaining that
"__find_get_block_slow() failed", locking up that CPU until the system
is forcibly rebooted.

So when we read in the inode, make sure the i_file_acl value is legal,
and if not, flag the filesystem as being corrupted.

Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 485c26ec70f823f2a9cf45982b724893e53a859e)
---
fs/ext4/inode.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c4f0e14..ec3457b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4351,6 +4351,18 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
(__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
}

+ if (ei->i_file_acl &&
+ ((ei->i_file_acl <
+ (le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block) +
+ EXT4_SB(sb)->s_gdb_count)) ||
+ (ei->i_file_acl >= ext4_blocks_count(EXT4_SB(sb)->s_es)))) {
+ ext4_error(sb, __func__,
+ "bad extended attribute block %llu in inode #%lu",
+ ei->i_file_acl, inode->i_ino);
+ ret = -EIO;
+ goto bad_inode;
+ }
+
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
inode->i_fop = &ext4_file_operations;
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:10

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 05/18] ext4: add EXT4_IOC_ALLOC_DA_BLKS ioctl

Add an ioctl which forces all of the delay allocated blocks to be
allocated. This also provides a function ext4_alloc_da_blocks() which
will be used by the following commits to force files to be fully
allocated to preserve application-expected ext3 behaviour.

Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit ccd2506bd43113659aa904d5bea5d1300605e2a6)
---
fs/ext4/ext4.h | 3 +++
fs/ext4/inode.c | 42 ++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ioctl.c | 14 ++++++++++++++
3 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6a954de..f5552d7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -326,7 +326,9 @@ struct ext4_new_group_data {
#define EXT4_IOC_GROUP_EXTEND _IOW('f', 7, unsigned long)
#define EXT4_IOC_GROUP_ADD _IOW('f', 8, struct ext4_new_group_input)
#define EXT4_IOC_MIGRATE _IO('f', 9)
+ /* note ioctl 10 reserved for an early version of the FIEMAP ioctl */
/* note ioctl 11 reserved for filesystem-independent FIEMAP ioctl */
+#define EXT4_IOC_ALLOC_DA_BLKS _IO('f', 12)

/*
* ioctl commands in 32 bit emulation
@@ -1115,6 +1117,7 @@ extern int ext4_can_truncate(struct inode *inode);
extern void ext4_truncate(struct inode *);
extern void ext4_set_inode_flags(struct inode *);
extern void ext4_get_inode_flags(struct ext4_inode_info *);
+extern int ext4_alloc_da_blocks(struct inode *inode);
extern void ext4_set_aops(struct inode *inode);
extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2c0439d..0c93ce0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2816,6 +2816,48 @@ out:
return;
}

+/*
+ * Force all delayed allocation blocks to be allocated for a given inode.
+ */
+int ext4_alloc_da_blocks(struct inode *inode)
+{
+ if (!EXT4_I(inode)->i_reserved_data_blocks &&
+ !EXT4_I(inode)->i_reserved_meta_blocks)
+ return 0;
+
+ /*
+ * We do something simple for now. The filemap_flush() will
+ * also start triggering a write of the data blocks, which is
+ * not strictly speaking necessary (and for users of
+ * laptop_mode, not even desirable). However, to do otherwise
+ * would require replicating code paths in:
+ *
+ * ext4_da_writepages() ->
+ * write_cache_pages() ---> (via passed in callback function)
+ * __mpage_da_writepage() -->
+ * mpage_add_bh_to_extent()
+ * mpage_da_map_blocks()
+ *
+ * The problem is that write_cache_pages(), located in
+ * mm/page-writeback.c, marks pages clean in preparation for
+ * doing I/O, which is not desirable if we're not planning on
+ * doing I/O at all.
+ *
+ * We could call write_cache_pages(), and then redirty all of
+ * the pages by calling redirty_page_for_writeback() but that
+ * would be ugly in the extreme. So instead we would need to
+ * replicate parts of the code in the above functions,
+ * simplifying them becuase we wouldn't actually intend to
+ * write out the pages, but rather only collect contiguous
+ * logical block extents, call the multi-block allocator, and
+ * then update the buffer heads with the block allocations.
+ *
+ * For now, though, we'll cheat by calling filemap_flush(),
+ * which will map the blocks, and start the I/O, but not
+ * actually wait for the I/O to complete.
+ */
+ return filemap_flush(inode->i_mapping);
+}

/*
* bmap() is special. It gets used by applications such as lilo and by
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 22dd29f..91e75f7 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -262,6 +262,20 @@ setversion_out:
return err;
}

+ case EXT4_IOC_ALLOC_DA_BLKS:
+ {
+ int err;
+ if (!is_owner_or_cap(inode))
+ return -EACCES;
+
+ err = mnt_want_write(filp->f_path.mnt);
+ if (err)
+ return err;
+ err = ext4_alloc_da_blocks(inode);
+ mnt_drop_write(filp->f_path.mnt);
+ return err;
+ }
+
default:
return -ENOTTY;
}
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:10

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 18/18] ext4: Fix race in ext4_inode_info.i_cached_extent

If two CPU's simultaneously call ext4_ext_get_blocks() at the same
time, there is nothing protecting the i_cached_extent structure from
being used and updated at the same time. This could potentially cause
the wrong location on disk to be read or written to, including
potentially causing the corruption of the block group descriptors
and/or inode table.

This bug has been in the ext4 code since almost the very beginning of
ext4's development. Fortunately once the data is stored in the page
cache cache, ext4_get_blocks() doesn't need to be called, so trying to
replicate this problem to the point where we could identify its root
cause was *extremely* difficult. Many thanks to Kevin Shanahan for
working over several months to be able to reproduce this easily so we
could finally nail down the cause of the corruption.

Signed-off-by: "Theodore Ts'o" <[email protected]>
Reviewed-by: "Aneesh Kumar K.V" <[email protected]>
(cherry picked from commit 2ec0ae3acec47f628179ee95fe2c4da01b5e9fc4)
---
fs/ext4/extents.c | 17 ++++++++++++-----
1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6af5a50..d315c97 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1740,11 +1740,13 @@ ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
{
struct ext4_ext_cache *cex;
BUG_ON(len == 0);
+ spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
cex = &EXT4_I(inode)->i_cached_extent;
cex->ec_type = type;
cex->ec_block = block;
cex->ec_len = len;
cex->ec_start = start;
+ spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
}

/*
@@ -1801,12 +1803,17 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
struct ext4_extent *ex)
{
struct ext4_ext_cache *cex;
+ int ret = EXT4_EXT_CACHE_NO;

+ /*
+ * We borrow i_block_reservation_lock to protect i_cached_extent
+ */
+ spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
cex = &EXT4_I(inode)->i_cached_extent;

/* has cache valid data? */
if (cex->ec_type == EXT4_EXT_CACHE_NO)
- return EXT4_EXT_CACHE_NO;
+ goto errout;

BUG_ON(cex->ec_type != EXT4_EXT_CACHE_GAP &&
cex->ec_type != EXT4_EXT_CACHE_EXTENT);
@@ -1817,11 +1824,11 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
ext_debug("%u cached by %u:%u:%llu\n",
block,
cex->ec_block, cex->ec_len, cex->ec_start);
- return cex->ec_type;
+ ret = cex->ec_type;
}

2009-06-02 12:08:10

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 06/18] ext4: Automatically allocate delay allocated blocks on close

When closing a file that had been previously truncated, force any
delay allocated blocks that to be allocated so that if the filesystem
is mounted with data=ordered, the data blocks will be pushed out to
disk along with the journal commit. Many application programs expect
this, so we do this to avoid zero length files if the system crashes
unexpectedly.

Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 7d8f9f7d150dded7b68e61ca6403a1f166fb4edf)
---
fs/ext4/ext4.h | 1 +
fs/ext4/file.c | 4 ++++
fs/ext4/inode.c | 3 +++
3 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f5552d7..83f685d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -279,6 +279,7 @@ static inline __u32 ext4_mask_flags(umode_t mode, __u32 flags)
#define EXT4_STATE_NEW 0x00000002 /* inode is newly created */
#define EXT4_STATE_XATTR 0x00000004 /* has in-inode xattrs */
#define EXT4_STATE_NO_EXPAND 0x00000008 /* No space for expansion */
+#define EXT4_STATE_DA_ALLOC_CLOSE 0x00000010 /* Alloc DA blks on close */

/* Used to pass group descriptor data when online resize is done */
struct ext4_new_group_input {
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index f731cb5..06df827 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -33,6 +33,10 @@
*/
static int ext4_release_file(struct inode *inode, struct file *filp)
{
+ if (EXT4_I(inode)->i_state & EXT4_STATE_DA_ALLOC_CLOSE) {
+ ext4_alloc_da_blocks(inode);
+ EXT4_I(inode)->i_state &= ~EXT4_STATE_DA_ALLOC_CLOSE;
+ }
/* if we are the last writer on the inode, drop the block reservation */
if ((filp->f_mode & FMODE_WRITE) &&
(atomic_read(&inode->i_writecount) == 1))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0c93ce0..8c7259a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3880,6 +3880,9 @@ void ext4_truncate(struct inode *inode)
if (!ext4_can_truncate(inode))
return;

+ if (inode->i_size == 0)
+ ei->i_state |= EXT4_STATE_DA_ALLOC_CLOSE;
+
if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
ext4_ext_truncate(inode);
return;
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:10

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 03/18] ext4: return -EIO not -ESTALE on directory traversal through deleted inode

From: Bryan Donlan <[email protected]>

ext4_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly. However, in ext4_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted
such that a directory entry references a deleted inode. This leads to
a misleading error message - "Stale NFS file handle" - and confusion
on the part of the admin.

The bug can be easily reproduced by creating a new filesystem, making
a link to an unused inode using debugfs, then mounting and attempting
to ls -l said link.

This patch thus changes ext4_lookup to return -EIO if it receives
-ESTALE from ext4_iget(), as ext4 does for other filesystem metadata
corruption; and also invokes the appropriate ext*_error functions when
this case is detected.

Signed-off-by: Bryan Donlan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit e6f009b0b45220c004672d41a58865e94946104d)
---
fs/ext4/namei.c | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index ba702bd..f787234 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1052,8 +1052,16 @@ static struct dentry *ext4_lookup(struct inode *dir, struct dentry *dentry, stru
return ERR_PTR(-EIO);
}
inode = ext4_iget(dir->i_sb, ino);
- if (IS_ERR(inode))
- return ERR_CAST(inode);
+ if (unlikely(IS_ERR(inode))) {
+ if (PTR_ERR(inode) == -ESTALE) {
+ ext4_error(dir->i_sb, __func__,
+ "deleted inode referenced: %u",
+ ino);
+ return ERR_PTR(-EIO);
+ } else {
+ return ERR_CAST(inode);
+ }
+ }
}
return d_splice_alias(inode, dentry);
}
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:10

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 14/18] ext4: Ignore i_file_acl_high unless EXT4_FEATURE_INCOMPAT_64BIT is present

Don't try to look at i_file_acl_high unless the INCOMPAT_64BIT feature
bit is set. The field is normally zero, but older versions of e2fsck
didn't automatically check to make sure of this, so in the spirit of
"be liberal in what you accept", don't look at i_file_acl_high unless
we are using a 64-bit filesystem.

Signed-off-by: "Theodore Ts'o" <[email protected]>

(cherry picked from commit a9e817425dc0baede8ebe5fbc9984a640257432b)
---
fs/ext4/inode.c | 4 +---
1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ec3457b..cf65a83 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4300,11 +4300,9 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
ei->i_flags = le32_to_cpu(raw_inode->i_flags);
inode->i_blocks = ext4_inode_blocks(raw_inode, ei);
ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl_lo);
- if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
- cpu_to_le32(EXT4_OS_HURD)) {
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_64BIT))
ei->i_file_acl |=
((__u64)le16_to_cpu(raw_inode->i_file_acl_high)) << 32;
- }
inode->i_size = ext4_isize(raw_inode);
ei->i_disksize = inode->i_size;
inode->i_generation = le32_to_cpu(raw_inode->i_generation);
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:10

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 09/18] ext4: Add auto_da_alloc mount option

Add a mount option which allows the user to disable automatic
allocation of blocks whose allocation by delayed allocation when the
file was originally truncated or when the file is renamed over an
existing file. This feature is intended to save users from the
effects of naive application writers, but it reduces the effectiveness
of the delayed allocation code. This mount option disables this
safety feature, which may be desirable for prodcutions systems where
the risk of unclean shutdowns or unexpected system crashes is low.

Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit afd4672dc7610b7feef5190168aa917cc2e417e4)
---
fs/ext4/ext4.h | 2 +-
fs/ext4/inode.c | 2 +-
fs/ext4/namei.c | 3 ++-
fs/ext4/super.c | 25 +++++++++++++------------
4 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 83f685d..a2bd86e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -557,7 +557,7 @@ do { \
#define EXT4_MOUNT_NO_UID32 0x02000 /* Disable 32-bit UIDs */
#define EXT4_MOUNT_XATTR_USER 0x04000 /* Extended user attributes */
#define EXT4_MOUNT_POSIX_ACL 0x08000 /* POSIX Access Control Lists */
-#define EXT4_MOUNT_RESERVATION 0x10000 /* Preallocation */
+#define EXT4_MOUNT_NO_AUTO_DA_ALLOC 0x10000 /* No auto delalloc mapping */
#define EXT4_MOUNT_BARRIER 0x20000 /* Use block barriers */
#define EXT4_MOUNT_NOBH 0x40000 /* No bufferheads */
#define EXT4_MOUNT_QUOTA 0x80000 /* Some quota option set */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3dafa6b..8ff6762 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3887,7 +3887,7 @@ void ext4_truncate(struct inode *inode)
if (!ext4_can_truncate(inode))
return;

- if (inode->i_size == 0)
+ if (inode->i_size == 0 && !test_opt(inode->i_sb, NO_AUTO_DA_ALLOC))
ei->i_state |= EXT4_STATE_DA_ALLOC_CLOSE;

if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 63568ec..8977e60 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2457,7 +2457,8 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
ext4_mark_inode_dirty(handle, new_inode);
if (!new_inode->i_nlink)
ext4_orphan_add(handle, new_inode);
- force_da_alloc = 1;
+ if (!test_opt(new_dir->i_sb, NO_AUTO_DA_ALLOC))
+ force_da_alloc = 1;
}
retval = 0;

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 39d1993..1ad3c20 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -803,8 +803,6 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
if (!test_opt(sb, POSIX_ACL) && (def_mount_opts & EXT4_DEFM_ACL))
seq_puts(seq, ",noacl");
#endif
- if (!test_opt(sb, RESERVATION))
- seq_puts(seq, ",noreservation");
if (sbi->s_commit_interval != JBD2_DEFAULT_MAX_COMMIT_AGE*HZ) {
seq_printf(seq, ",commit=%u",
(unsigned) (sbi->s_commit_interval / HZ));
@@ -855,6 +853,9 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
if (test_opt(sb, DATA_ERR_ABORT))
seq_puts(seq, ",data_err=abort");

+ if (test_opt(sb, NO_AUTO_DA_ALLOC))
+ seq_puts(seq, ",auto_da_alloc=0");
+
ext4_show_quota_options(seq, sb);
return 0;
}
@@ -1002,7 +1003,7 @@ enum {
Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro,
Opt_nouid32, Opt_debug, Opt_oldalloc, Opt_orlov,
Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
- Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
+ Opt_auto_da_alloc, Opt_noload, Opt_nobh, Opt_bh,
Opt_commit, Opt_min_batch_time, Opt_max_batch_time,
Opt_journal_update, Opt_journal_dev,
Opt_journal_checksum, Opt_journal_async_commit,
@@ -1037,8 +1038,6 @@ static const match_table_t tokens = {
{Opt_nouser_xattr, "nouser_xattr"},
{Opt_acl, "acl"},
{Opt_noacl, "noacl"},
- {Opt_reservation, "reservation"},
- {Opt_noreservation, "noreservation"},
{Opt_noload, "noload"},
{Opt_nobh, "nobh"},
{Opt_bh, "bh"},
@@ -1073,6 +1072,7 @@ static const match_table_t tokens = {
{Opt_nodelalloc, "nodelalloc"},
{Opt_inode_readahead_blks, "inode_readahead_blks=%u"},
{Opt_journal_ioprio, "journal_ioprio=%u"},
+ {Opt_auto_da_alloc, "auto_da_alloc=%u"},
{Opt_err, NULL},
};

@@ -1205,12 +1205,6 @@ static int parse_options(char *options, struct super_block *sb,
"not supported\n");
break;
#endif
- case Opt_reservation:
- set_opt(sbi->s_mount_opt, RESERVATION);
- break;
- case Opt_noreservation:
- clear_opt(sbi->s_mount_opt, RESERVATION);
- break;
case Opt_journal_update:
/* @@@ FIXME */
/* Eventually we will want to be able to create
@@ -1471,6 +1465,14 @@ set_qf_format:
*journal_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE,
option);
break;
+ case Opt_auto_da_alloc:
+ if (match_int(&args[0], &option))
+ return 0;
+ if (option)
+ clear_opt(sbi->s_mount_opt, NO_AUTO_DA_ALLOC);
+ else
+ set_opt(sbi->s_mount_opt,NO_AUTO_DA_ALLOC);
+ break;
default:
printk(KERN_ERR
"EXT4-fs: Unrecognized mount option \"%s\" "
@@ -2099,7 +2101,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;

- set_opt(sbi->s_mount_opt, RESERVATION);
set_opt(sbi->s_mount_opt, BARRIER);

/*
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:11

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 08/18] ext4: Fix discard of inode prealloc space with delayed allocation.

From: Aneesh Kumar K.V <[email protected]>

With delayed allocation we should not/cannot discard inode prealloc
space during file close. We would still have dirty pages for which we
haven't allocated blocks yet. With this fix after each get_blocks
request we check whether we have zero reserved blocks and if yes and
we don't have any writers on the file we discard inode prealloc space.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit d6014301b5599fba395c42a1e96a7fe86f7d0b2d)
---
fs/ext4/file.c | 3 ++-
fs/ext4/inode.c | 9 ++++++++-
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 06df827..588af8c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -39,7 +39,8 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
}
/* if we are the last writer on the inode, drop the block reservation */
if ((filp->f_mode & FMODE_WRITE) &&
- (atomic_read(&inode->i_writecount) == 1))
+ (atomic_read(&inode->i_writecount) == 1) &&
+ !EXT4_I(inode)->i_reserved_data_blocks)
{
down_write(&EXT4_I(inode)->i_data_sem);
ext4_discard_preallocations(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8c7259a..3dafa6b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1036,8 +1036,15 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
/* update per-inode reservations */
BUG_ON(used > EXT4_I(inode)->i_reserved_data_blocks);
EXT4_I(inode)->i_reserved_data_blocks -= used;

2009-06-02 12:08:11

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 12/18] ext4: really print the find_group_flex fallback warning only once

From: Chuck Ebbert <[email protected]>

Missing braces caused the warning to print more than once.

Signed-Off-By: Chuck Ebbert <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 6b82f3cb2d480b7714eb0ff61aee99c22160389e)
---
fs/ext4/ialloc.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index befd95e..345cba1 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -720,11 +720,12 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
ret2 = find_group_flex(sb, dir, &group);
if (ret2 == -1) {
ret2 = find_group_other(sb, dir, &group);
- if (ret2 == 0 && once)
+ if (ret2 == 0 && once) {
once = 0;
printk(KERN_NOTICE "ext4: find_group_flex "
"failed, fallback succeeded dir %lu\n",
dir->i_ino);
+ }
}
goto got_group;
}
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:10

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 10/18] ext4: Check for an valid i_mode when reading the inode from disk

Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 563bdd61fe4dbd6b58cf7eb06f8d8f14479ae1dc)
---
fs/ext4/inode.c | 10 +++++++++-
1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8ff6762..c4f0e14 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4367,7 +4367,8 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
inode->i_op = &ext4_symlink_inode_operations;
ext4_set_aops(inode);
}
- } else {
+ } else if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
+ S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
inode->i_op = &ext4_special_inode_operations;
if (raw_inode->i_block[0])
init_special_inode(inode, inode->i_mode,
@@ -4375,6 +4376,13 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
else
init_special_inode(inode, inode->i_mode,
new_decode_dev(le32_to_cpu(raw_inode->i_block[1])));
+ } else {
+ brelse(bh);
+ ret = -EIO;
+ ext4_error(inode->i_sb, __func__,
+ "bogus i_mode (%o) for inode=%lu",
+ inode->i_mode, inode->i_ino);
+ goto bad_inode;
}
brelse(iloc.bh);
ext4_set_inode_flags(inode);
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:11

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 17/18] ext4: Clear the unwritten buffer_head flag after the extent is initialized

From: Aneesh Kumar K.V <[email protected]>

The BH_Unwritten flag indicates that the buffer is allocated on disk
but has not been written; that is, the disk was part of a persistent
preallocation area. That flag should only be set when a get_blocks()
function is looking up a inode's logical to physical block mapping.

When ext4_get_blocks_wrap() is called with create=1, the uninitialized
extent is converted into an initialized one, so the BH_Unwritten flag
is no longer appropriate. Hence, we need to make sure the
BH_Unwritten is not left set, since the combination of BH_Mapped and
BH_Unwritten is not allowed; among other things, it will result ext4's
get_block() to be called over and over again during the write_begin
phase of write(2).

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 2a8964d63d50dd2d65d71d342bc7fb6ef4117614)
---
fs/ext4/inode.c | 13 +++++++++++++
1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4ed5e92..b3d7250 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1076,6 +1076,7 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
int retval;

clear_buffer_mapped(bh);
+ clear_buffer_unwritten(bh);

/*
* Try to see if we can get the block without requesting
@@ -1106,6 +1107,18 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
return retval;

/*
+ * When we call get_blocks without the create flag, the
+ * BH_Unwritten flag could have gotten set if the blocks
+ * requested were part of a uninitialized extent. We need to
+ * clear this flag now that we are committed to convert all or
+ * part of the uninitialized extent to be an initialized
+ * extent. This is because we need to avoid the combination
+ * of BH_Unwritten and BH_Mapped flags being simultaneously
+ * set on the buffer_head.
+ */
+ clear_buffer_unwritten(bh);
+
+ /*
* New blocks allocate and/or writing to uninitialized extent
* will possibly result in updating i_data, so we take
* the write lock of i_data_sem, and call get_blocks()
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:11

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 16/18] ext4: Use a fake block number for delayed new buffer_head

From: Aneesh Kumar K.V <[email protected]>

Use a very large unsigned number (~0xffff) as as the fake block number
for the delayed new buffer. The VFS should never try to write out this
number, but if it does, this will make it obvious.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 33b9817e2ae097c7b8d256e3510ac6c54fc6d9d0)
---
fs/ext4/inode.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2caeda7..4ed5e92 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2220,6 +2220,10 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
int ret = 0;
+ sector_t invalid_block = ~((sector_t) 0xffff);
+
+ if (invalid_block < ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es))
+ invalid_block = ~0;

BUG_ON(create == 0);
BUG_ON(bh_result->b_size != inode->i_sb->s_blocksize);
@@ -2241,7 +2245,7 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
/* not enough space to reserve */
return ret;

- map_bh(bh_result, inode->i_sb, 0);
+ map_bh(bh_result, inode->i_sb, invalid_block);
set_buffer_new(bh_result);
set_buffer_delay(bh_result);
} else if (ret > 0) {
--
1.6.3.1.1.g75fc.dirty

2009-06-02 12:08:13

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 11/18] jbd2: Update locking coments

From: Jan Kara <[email protected]>

Update information about locking in JBD2 revoke code. Inconsistency in
comments found by Lin Tan <[email protected]>.

CC: Lin Tan <[email protected]>.
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 86db97c87f744364d5889ca8a4134ca2048b8f83)
---
fs/jbd2/revoke.c | 24 +++++++++++++++++++-----
1 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index 257ff26..bbe6d59 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -55,6 +55,25 @@
* need do nothing.
* RevokeValid set, Revoked set:
* buffer has been revoked.
+ *
+ * Locking rules:
+ * We keep two hash tables of revoke records. One hashtable belongs to the
+ * running transaction (is pointed to by journal->j_revoke), the other one
+ * belongs to the committing transaction. Accesses to the second hash table
+ * happen only from the kjournald and no other thread touches this table. Also
+ * journal_switch_revoke_table() which switches which hashtable belongs to the
+ * running and which to the committing transaction is called only from
+ * kjournald. Therefore we need no locks when accessing the hashtable belonging
+ * to the committing transaction.
+ *
+ * All users operating on the hash table belonging to the running transaction
+ * have a handle to the transaction. Therefore they are safe from kjournald
+ * switching hash tables under them. For operations on the lists of entries in
+ * the hash table j_revoke_lock is used.
+ *
+ * Finally, also replay code uses the hash tables but at this moment noone else
+ * can touch them (filesystem isn't mounted yet) and hence no locking is
+ * needed.
*/

#ifndef __KERNEL__
@@ -401,8 +420,6 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr,
* the second time we would still have a pending revoke to cancel. So,
* do not trust the Revoked bit on buffers unless RevokeValid is also
* set.
- *
- * The caller must have the journal locked.
*/
int jbd2_journal_cancel_revoke(handle_t *handle, struct journal_head *jh)
{
@@ -480,10 +497,7 @@ void jbd2_journal_switch_revoke_table(journal_t *journal)
/*
* Write revoke records to the journal for all entries in the current
* revoke hash, deleting the entries as we go.
- *
- * Called with the journal lock held.
*/

2009-06-02 12:08:11

by Theodore Ts'o

[permalink] [raw]

Subject: [PATCH,STABLE 2.6.29 15/18] ext4: Fix sub-block zeroing for writes into preallocated extents

From: Aneesh Kumar K.V <[email protected]>

We need to mark the buffer_head mapping preallocated space as new
during write_begin. Otherwise we don't zero out the page cache content
properly for a partial write. This will cause file corruption with
preallocation.

Now that we mark the buffer_head new we also need to have a valid
buffer_head blocknr so that unmap_underlying_metadata() unmaps the
correct block.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
(cherry picked from commit 9c1ee184a30394e54165fa4c15923cabd952c106)
---
fs/ext4/extents.c | 2 ++
fs/ext4/inode.c | 7 +++++++
2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e0aa4fe..6af5a50 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2776,6 +2776,8 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
if (allocated > max_blocks)
allocated = max_blocks;
set_buffer_unwritten(bh_result);
+ bh_result->b_bdev = inode->i_sb->s_bdev;
+ bh_result->b_blocknr = newblock;
goto out2;
}

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cf65a83..2caeda7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2246,6 +2246,13 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
set_buffer_delay(bh_result);
} else if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
+ /*
+ * With sub-block writes into unwritten extents
+ * we also need to mark the buffer as new so that
+ * the unwritten parts of the buffer gets correctly zeroed.
+ */
+ if (buffer_unwritten(bh_result))
+ set_buffer_new(bh_result);
ret = 0;
}

--
1.6.3.1.1.g75fc.dirty

2009-06-03 18:14:38

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH,STABLE 2.6.29 06/18] ext4: Automatically allocate delay allocated blocks on close

On Jun 02, 2009 08:07 -0400, Theodore Ts'o wrote:
> When closing a file that had been previously truncated, force any
> delay allocated blocks that to be allocated so that if the filesystem
> is mounted with data=ordered, the data blocks will be pushed out to
> disk along with the journal commit. Many application programs expect
> this, so we do this to avoid zero length files if the system crashes
> unexpectedly.
>
> @@ -3880,6 +3880,9 @@ void ext4_truncate(struct inode *inode)
> if (!ext4_can_truncate(inode))
> return;
>
> + if (inode->i_size == 0)
> + ei->i_state |= EXT4_STATE_DA_ALLOC_CLOSE;

Since some applications open files with open(..., O_WRONLY|O_CREAT|O_TRUNC)
to avoid re-using existing files (and avoiding the need to check if the
file already exists to modify the flags), it would make sense to set
EXT4_STATE_DA_ALLOC_CLOSE only if the file previously had some data in it.

By the time we get to ext4_truncate() i_size is overwritten already, but
it might make sense to also check i_disksize != 0 before setting this flag.
Otherwise delayed allocation may be inadvertently disabled for these apps
when it should not be.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-06-03 18:16:26

by Andreas Dilger

[permalink] [raw]

Subject: Re: Fix softlockup caused by illegal i_file_acl value in on-disk inode

On Jun 02, 2009 08:07 -0400, Theodore Ts'o wrote:
> + if (ei->i_file_acl &&
> + ((ei->i_file_acl <
> + (le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block) +
> + EXT4_SB(sb)->s_gdb_count)) ||
> + (ei->i_file_acl >= ext4_blocks_count(EXT4_SB(sb)->s_es)))) {

I was just thinking it might make sense to wrap this check into a helper
like the following. We check the validity of blocks in at least half a
dozen different places. The elaborate ext4_blocktype is to allow for
future expansion of this checking mechanism to allow it to check for
blocks overlapping with e.g. the inode table and such, and possibly for
using with the jbd2 buffer checksum mechanism at some later date.

enum ext4_blocktype {
EXT4_BT_SUPERBLOCK = 1,
EXT4_BT_GDT = 2,
EXT4_BT_INODE_BITMAP = 3,
EXT4_BT_BLOCK_BITMAP = 4,
EXT4_BT_INODE_TABLE = 5,
EXT4_BT_DIRECTORY_ROOT = 10,
EXT4_BT_DIRECTORY_LEAF = 11,
EXT4_BT_DIRECTORY_HTREE = 12,
EXT4_BT_INDIRECT = 21,
EXT4_BT_DINDIRECT = 22,
EXT4_BT_TINDIRECT = 23,
EXT4_BT_EXTENT_INDEX = 25,
EXT4_BT_EXTENT_LEAF = 26,
EXT4_BT_DATA_BLOCK = 30,
EXT4_BT_ACL_BLOCK = 31,
};

bool ext4_block_valid(ext4_blk_t block, enum blocktype)
{
if (block < le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block) +
EXT4_SB(sb)->s_gdb_count)) ||
block >= ext4_blocks_count(EXT4_SB(sb)->s_es)
return 0;

return 1;
}

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-06-03 18:17:37

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH,STABLE 2.6.29 14/18] ext4: Ignore i_file_acl_high unless EXT4_FEATURE_INCOMPAT_64BIT is present

On Jun 02, 2009 08:07 -0400, Theodore Ts'o wrote:
> Don't try to look at i_file_acl_high unless the INCOMPAT_64BIT feature
> bit is set. The field is normally zero, but older versions of e2fsck
> didn't automatically check to make sure of this, so in the spirit of
> "be liberal in what you accept", don't look at i_file_acl_high unless
> we are using a 64-bit filesystem.

Should we do the same with other "_hi" fields in the inode? There are
many cases like this for EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT
in super.c. Does e2fsck check and zero those already?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-06-03 19:24:38

by Theodore Ts'o

[permalink] [raw]

Subject: Re: Fix softlockup caused by illegal i_file_acl value in on-disk inode

On Wed, Jun 03, 2009 at 12:16:08PM -0600, Andreas Dilger wrote:
> On Jun 02, 2009 08:07 -0400, Theodore Ts'o wrote:
> > + if (ei->i_file_acl &&
> > + ((ei->i_file_acl <
> > + (le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block) +
> > + EXT4_SB(sb)->s_gdb_count)) ||
> > + (ei->i_file_acl >= ext4_blocks_count(EXT4_SB(sb)->s_es)))) {
>
> I was just thinking it might make sense to wrap this check into a helper
> like the following. We check the validity of blocks in at least half a
> dozen different places. The elaborate ext4_blocktype is to allow for
> future expansion of this checking mechanism to allow it to check for
> blocks overlapping with e.g. the inode table and such, and possibly for
> using with the jbd2 buffer checksum mechanism at some later date.

We do have a helper function that is waiting to be merged in the patch
queue. See the patch "add-check-block-validity-to-ext4_get_blocks_wrap".

It doesn't have the blocktype extension, since to keep things fast and
simple, I have a single red-black tree for any blocks that shouldn't
be used for file blocks allows for a *much* more compact
representation in the red-black tree, thanks to flex_bg putting the
block and inode bitmaps and inode tables back-to-back with each other.
If I were to add blocktype information to the red-black tree that
ext4_data_block_valid() could check against, the red-black tree would
at least triple in size.

The nice thing about this patch (which will be merged for 2.6.31) is
that it's a runtime mount option. So if we have a customer that runs
into problems, we don't have to ship them a custom debugging kernel;
we just tell them to mount the filesystem with block_validity, and we
can start debugging the problem right away.

- Ted

2009-06-03 19:29:37

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH,STABLE 2.6.29 06/18] ext4: Automatically allocate delay allocated blocks on close

On Wed, Jun 03, 2009 at 12:14:19PM -0600, Andreas Dilger wrote:
>
> Since some applications open files with open(..., O_WRONLY|O_CREAT|O_TRUNC)
> to avoid re-using existing files (and avoiding the need to check if the
> file already exists to modify the flags), it would make sense to set
> EXT4_STATE_DA_ALLOC_CLOSE only if the file previously had some data in it.
>
> By the time we get to ext4_truncate() i_size is overwritten already, but
> it might make sense to also check i_disksize != 0 before setting this flag.
> Otherwise delayed allocation may be inadvertently disabled for these apps
> when it should not be.

Agreed; I'll make such a change for the ext4 patch queue. We can
propagate such a patch to the -stable kernels once it's in mainline.

- Ted

2009-06-09 09:40:17

by Greg KH

[permalink] [raw]

Subject: patch ext4-don-t-inherit-inappropriate-inode-flags-from-parent.patch added to 2.6.29-stable tree

This is a note to let you know that we have just queued up the patch titled

Subject: ext4: don't inherit inappropriate inode flags from parent

to the 2.6.29-stable tree. Its filename is

ext4-don-t-inherit-inappropriate-inode-flags-from-parent.patch

A git repo of this tree can be found at
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

>From [email protected] Tue Jun 9 02:24:00 2009
From: Duane Griffin <[email protected]>
Date: Tue, 2 Jun 2009 08:07:42 -0400
Subject: ext4: don't inherit inappropriate inode flags from parent
To: [email protected]
Cc: Andrew Morton <[email protected]>, [email protected], "Theodore Ts'o" <[email protected]>, Duane Griffin <[email protected]>
Message-ID: <[email protected]>

From: Duane Griffin <[email protected]>

(cherry picked from commit 8fa43a81b97853fc69417bb6054182e78f95cbeb)

At present INDEX and EXTENTS are the only flags that new ext4 inodes do
NOT inherit from their parent. In addition prevent the flags DIRTY,
ECOMPR, IMAGIC, TOPDIR, HUGE_FILE and EXT_MIGRATE from being inherited.
List inheritable flags explicitly to prevent future flags from
accidentally being inherited.

This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.

Signed-off-by: Duane Griffin <[email protected]>
Acked-by: Andreas Dilger <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/ext4.h | 7 +++++++
fs/ext4/ialloc.c | 2 +-
2 files changed, 8 insertions(+), 1 deletion(-)

--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -248,6 +248,13 @@ struct flex_groups {
#define EXT4_FL_USER_VISIBLE 0x000BDFFF /* User visible flags */
#define EXT4_FL_USER_MODIFIABLE 0x000B80FF /* User modifiable flags */

+/* Flags that should be inherited by new inodes from their parent. */
+#define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
+ EXT4_SYNC_FL | EXT4_IMMUTABLE_FL | EXT4_APPEND_FL |\
+ EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
+ EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
+ EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL)
+
/*
* Inode dynamic state flags
*/
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -889,7 +889,7 @@ got:
* newly created directory and file only if -o extent mount option is
* specified
*/
- ei->i_flags = EXT4_I(dir)->i_flags & ~(EXT4_INDEX_FL|EXT4_EXTENTS_FL);
+ ei->i_flags = EXT4_I(dir)->i_flags & EXT4_FL_INHERITED;
if (S_ISLNK(mode))
ei->i_flags &= ~(EXT4_IMMUTABLE_FL|EXT4_APPEND_FL);
/* dirsync only applies to directories */

Patches currently in stable-queue which might be from [email protected] are

2009-06-09 09:41:53

by Greg KH

[permalink] [raw]

Subject: patch ext4-return-eio-not-estale-on-directory-traversal-through-deleted-inode.patch added to 2.6.29-stable tree

This is a note to let you know that we have just queued up the patch titled

Subject: ext4: return -EIO not -ESTALE on directory traversal through deleted inode

to the 2.6.29-stable tree. Its filename is

ext4-return-eio-not-estale-on-directory-traversal-through-deleted-inode.patch

A git repo of this tree can be found at
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

>From [email protected] Tue Jun 9 02:25:08 2009
From: Bryan Donlan <[email protected]>
Date: Tue, 2 Jun 2009 08:07:44 -0400
Subject: ext4: return -EIO not -ESTALE on directory traversal through deleted inode
To: [email protected]
Cc: "Theodore Ts'o" <[email protected]>, Andrew Morton <[email protected]>, [email protected], Bryan Donlan <[email protected]>
Message-ID: <[email protected]>

From: Bryan Donlan <[email protected]>

(cherry picked from commit e6f009b0b45220c004672d41a58865e94946104d)

ext4_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly. However, in ext4_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted
such that a directory entry references a deleted inode. This leads to
a misleading error message - "Stale NFS file handle" - and confusion
on the part of the admin.

The bug can be easily reproduced by creating a new filesystem, making
a link to an unused inode using debugfs, then mounting and attempting
to ls -l said link.

This patch thus changes ext4_lookup to return -EIO if it receives
-ESTALE from ext4_iget(), as ext4 does for other filesystem metadata
corruption; and also invokes the appropriate ext*_error functions when
this case is detected.

Signed-off-by: Bryan Donlan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
fs/ext4/namei.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1052,8 +1052,16 @@ static struct dentry *ext4_lookup(struct
return ERR_PTR(-EIO);
}
inode = ext4_iget(dir->i_sb, ino);
- if (IS_ERR(inode))
- return ERR_CAST(inode);
+ if (unlikely(IS_ERR(inode))) {
+ if (PTR_ERR(inode) == -ESTALE) {
+ ext4_error(dir->i_sb, __func__,
+ "deleted inode referenced: %u",
+ ino);
+ return ERR_PTR(-EIO);
+ } else {
+ return ERR_CAST(inode);
+ }
+ }
}
return d_splice_alias(inode, dentry);
}

Patches currently in stable-queue which might be from [email protected] are

2009-06-09 09:42:39

by Greg KH

[permalink] [raw]

Subject: patch ext4-tighten-restrictions-on-inode-flags.patch added to 2.6.29-stable tree

This is a note to let you know that we have just queued up the patch titled

Subject: ext4: tighten restrictions on inode flags

to the 2.6.29-stable tree. Its filename is

ext4-tighten-restrictions-on-inode-flags.patch

A git repo of this tree can be found at
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

>From [email protected] Tue Jun 9 02:24:29 2009
From: Duane Griffin <[email protected]>
Date: Tue, 2 Jun 2009 08:07:43 -0400
Subject: ext4: tighten restrictions on inode flags
To: [email protected]
Cc: Andrew Morton <[email protected]>, [email protected], "Theodore Ts'o" <[email protected]>, Duane Griffin <[email protected]>
Message-ID: <[email protected]>

From: Duane Griffin <[email protected]>

(cherry picked from commit 2dc6b0d48ca0599837df21b14bb8393d0804af57)

At the moment there are few restrictions on which flags may be set on
which inodes. Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links. Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.

Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.

Signed-off-by: Duane Griffin <[email protected]>
Acked-by: Andreas Dilger <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
fs/ext4/ext4.h | 17 +++++++++++++++++
fs/ext4/ialloc.c | 14 +++++---------
fs/ext4/ioctl.c | 3 +--
3 files changed, 23 insertions(+), 11 deletions(-)

--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -255,6 +255,23 @@ struct flex_groups {
EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL)

+/* Flags that are appropriate for regular files (all but dir-specific ones). */
+#define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
+
+/* Flags that are appropriate for non-directories/regular files. */
+#define EXT4_OTHER_FLMASK (EXT4_NODUMP_FL | EXT4_NOATIME_FL)
+
+/* Mask out flags that are inappropriate for the given type of inode. */
+static inline __u32 ext4_mask_flags(umode_t mode, __u32 flags)
+{
+ if (S_ISDIR(mode))
+ return flags;
+ else if (S_ISREG(mode))
+ return flags & EXT4_REG_FLMASK;
+ else
+ return flags & EXT4_OTHER_FLMASK;
+}
+
/*
* Inode dynamic state flags
*/
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -885,16 +885,12 @@ got:
ei->i_disksize = 0;

/*
- * Don't inherit extent flag from directory. We set extent flag on
- * newly created directory and file only if -o extent mount option is
- * specified
+ * Don't inherit extent flag from directory, amongst others. We set
+ * extent flag on newly created directory and file only if -o extent
+ * mount option is specified
*/
- ei->i_flags = EXT4_I(dir)->i_flags & EXT4_FL_INHERITED;
- if (S_ISLNK(mode))
- ei->i_flags &= ~(EXT4_IMMUTABLE_FL|EXT4_APPEND_FL);
- /* dirsync only applies to directories */
- if (!S_ISDIR(mode))
- ei->i_flags &= ~EXT4_DIRSYNC_FL;
+ ei->i_flags =
+ ext4_mask_flags(mode, EXT4_I(dir)->i_flags & EXT4_FL_INHERITED);
ei->i_file_acl = 0;
ei->i_dtime = 0;
ei->i_block_group = group;
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -48,8 +48,7 @@ long ext4_ioctl(struct file *filp, unsig
if (err)
return err;

- if (!S_ISDIR(inode->i_mode))
- flags &= ~EXT4_DIRSYNC_FL;
+ flags = ext4_mask_flags(inode->i_mode, flags);

err = -EPERM;
mutex_lock(&inode->i_mutex);

Patches currently in stable-queue which might be from [email protected] are