2014-12-01 16:49:42

by Ryusuke Konishi

[permalink] [raw]
Subject: [PATCH 0/3] a few nilfs2 updates for 3.19

Hi Andrew,

Please add the following patches to the queue for the next merge
window.

Andreas's one simplifies nilfs_sync_file() function to avoid duplicate
segment construction on fsync(), and mine fixes a race issue between
nilfs_iget() and nilfs_new_inode(). Markus's one removes an
unnecessary NULL check related to iput().

Thanks,
Ryusuke Konishi
--
Andreas Rohner (1):
nilfs2: avoid duplicate segment construction for fsync()

Markus Elfring (1):
nilfs2: Deletion of an unnecessary check before the function call "iput"

Ryusuke Konishi (1):
nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races

fs/nilfs2/file.c | 21 ++++++++-------------
fs/nilfs2/inode.c | 32 ++++++++++++++++++++++++--------
fs/nilfs2/namei.c | 15 ++++++++++++---
fs/nilfs2/the_nilfs.c | 3 +--
4 files changed, 45 insertions(+), 26 deletions(-)


2014-12-01 16:49:45

by Ryusuke Konishi

[permalink] [raw]
Subject: [PATCH 3/3] nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races

same story as in commit 41080b5a240113328c607f22b849f653373db0ce
(similar ext2 fix) except that nilfs2 needs to use
insert_inode_locked4() instead of insert_inode_locked() and a bug of a
check for dead inodes needs to be fixed.

If nilfs_iget() is called from nfsd after nilfs_new_inode() calls
insert_inode_locked4(), nilfs_iget() will wait for unlock_new_inode()
at the end of nilfs_mkdir()/nilfs_create()/etc to unlock the inode.

If nilfs_iget() is called before nilfs_new_inode() calls
insert_inode_locked4(), it will create an in-core inode and read its
data from the on-disk inode. But, nilfs_iget() will find i_nlink
equals zero and fail at nilfs_read_inode_common(), which will lead it
to call iget_failed() and cleanly fail.

However, this sanity check doesn't work as expected for reused on-disk
inodes because they leave a non-zero value in i_mode field and it
hinders the test of i_nlink. This patch also fixes the issue by
removing the test on i_mode that nilfs2 doesn't need.

Signed-off-by: Ryusuke Konishi <[email protected]>
---
fs/nilfs2/inode.c | 32 ++++++++++++++++++++++++--------
fs/nilfs2/namei.c | 15 ++++++++++++---
2 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index e1fa69b..8b59695 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -49,6 +49,8 @@ struct nilfs_iget_args {
int for_gc;
};

+static int nilfs_iget_test(struct inode *inode, void *opaque);
+
void nilfs_inode_add_blocks(struct inode *inode, int n)
{
struct nilfs_root *root = NILFS_I(inode)->i_root;
@@ -348,6 +350,17 @@ const struct address_space_operations nilfs_aops = {
.is_partially_uptodate = block_is_partially_uptodate,
};

+static int nilfs_insert_inode_locked(struct inode *inode,
+ struct nilfs_root *root,
+ unsigned long ino)
+{
+ struct nilfs_iget_args args = {
+ .ino = ino, .root = root, .cno = 0, .for_gc = 0
+ };
+
+ return insert_inode_locked4(inode, ino, nilfs_iget_test, &args);
+}
+
struct inode *nilfs_new_inode(struct inode *dir, umode_t mode)
{
struct super_block *sb = dir->i_sb;
@@ -383,7 +396,7 @@ struct inode *nilfs_new_inode(struct inode *dir, umode_t mode)
if (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)) {
err = nilfs_bmap_read(ii->i_bmap, NULL);
if (err < 0)
- goto failed_bmap;
+ goto failed_after_creation;

set_bit(NILFS_I_BMAP, &ii->i_state);
/* No lock is needed; iget() ensures it. */
@@ -399,21 +412,24 @@ struct inode *nilfs_new_inode(struct inode *dir, umode_t mode)
spin_lock(&nilfs->ns_next_gen_lock);
inode->i_generation = nilfs->ns_next_generation++;
spin_unlock(&nilfs->ns_next_gen_lock);
- insert_inode_hash(inode);
+ if (nilfs_insert_inode_locked(inode, root, ino) < 0) {
+ err = -EIO;
+ goto failed_after_creation;
+ }

err = nilfs_init_acl(inode, dir);
if (unlikely(err))
- goto failed_acl; /* never occur. When supporting
+ goto failed_after_creation; /* never occur. When supporting
nilfs_init_acl(), proper cancellation of
above jobs should be considered */

return inode;

- failed_acl:
- failed_bmap:
+ failed_after_creation:
clear_nlink(inode);
+ unlock_new_inode(inode);
iput(inode); /* raw_inode will be deleted through
- generic_delete_inode() */
+ nilfs_evict_inode() */
goto failed;

failed_ifile_create_inode:
@@ -461,8 +477,8 @@ int nilfs_read_inode_common(struct inode *inode,
inode->i_atime.tv_nsec = le32_to_cpu(raw_inode->i_mtime_nsec);
inode->i_ctime.tv_nsec = le32_to_cpu(raw_inode->i_ctime_nsec);
inode->i_mtime.tv_nsec = le32_to_cpu(raw_inode->i_mtime_nsec);
- if (inode->i_nlink == 0 && inode->i_mode == 0)
- return -EINVAL; /* this inode is deleted */
+ if (inode->i_nlink == 0)
+ return -ESTALE; /* this inode is deleted */

inode->i_blocks = le64_to_cpu(raw_inode->i_blocks);
ii->i_flags = le32_to_cpu(raw_inode->i_flags);
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 9de78f0..0f84b25 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -51,9 +51,11 @@ static inline int nilfs_add_nondir(struct dentry *dentry, struct inode *inode)
int err = nilfs_add_link(dentry, inode);
if (!err) {
d_instantiate(dentry, inode);
+ unlock_new_inode(inode);
return 0;
}
inode_dec_link_count(inode);
+ unlock_new_inode(inode);
iput(inode);
return err;
}
@@ -182,6 +184,7 @@ out:
out_fail:
drop_nlink(inode);
nilfs_mark_inode_dirty(inode);
+ unlock_new_inode(inode);
iput(inode);
goto out;
}
@@ -201,11 +204,15 @@ static int nilfs_link(struct dentry *old_dentry, struct inode *dir,
inode_inc_link_count(inode);
ihold(inode);

- err = nilfs_add_nondir(dentry, inode);
- if (!err)
+ err = nilfs_add_link(dentry, inode);
+ if (!err) {
+ d_instantiate(dentry, inode);
err = nilfs_transaction_commit(dir->i_sb);
- else
+ } else {
+ inode_dec_link_count(inode);
+ iput(inode);
nilfs_transaction_abort(dir->i_sb);
+ }

return err;
}
@@ -243,6 +250,7 @@ static int nilfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)

nilfs_mark_inode_dirty(inode);
d_instantiate(dentry, inode);
+ unlock_new_inode(inode);
out:
if (!err)
err = nilfs_transaction_commit(dir->i_sb);
@@ -255,6 +263,7 @@ out_fail:
drop_nlink(inode);
drop_nlink(inode);
nilfs_mark_inode_dirty(inode);
+ unlock_new_inode(inode);
iput(inode);
out_dir:
drop_nlink(dir);
--
1.8.3.1

2014-12-01 16:49:43

by Ryusuke Konishi

[permalink] [raw]
Subject: [PATCH 2/3] nilfs2: Deletion of an unnecessary check before the function call "iput"

From: Markus Elfring <[email protected]>

The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
---
fs/nilfs2/the_nilfs.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
index 9da25fe..69bd801 100644
--- a/fs/nilfs2/the_nilfs.c
+++ b/fs/nilfs2/the_nilfs.c
@@ -808,8 +808,7 @@ void nilfs_put_root(struct nilfs_root *root)
spin_lock(&nilfs->ns_cptree_lock);
rb_erase(&root->rb_node, &nilfs->ns_cptree);
spin_unlock(&nilfs->ns_cptree_lock);
- if (root->ifile)
- iput(root->ifile);
+ iput(root->ifile);

kfree(root);
}
--
1.8.3.1

2014-12-01 16:51:32

by Ryusuke Konishi

[permalink] [raw]
Subject: [PATCH 1/3] nilfs2: avoid duplicate segment construction for fsync()

From: Andreas Rohner <[email protected]>

This patch removes filemap_write_and_wait_range() from
nilfs_sync_file(), because it triggers a data segment construction by
calling nilfs_writepages() with WB_SYNC_ALL. A data segment construction
does not remove the inode from the i_dirty list and it does not clear
the NILFS_I_DIRTY flag. Therefore nilfs_inode_dirty() still returns
true, which leads to an unnecessary duplicate segment construction in
nilfs_sync_file().

A call to filemap_write_and_wait_range() is not needed, because NILFS2
does not rely on the generic writeback mechanisms. Instead it implements
its own mechanism to collect all dirty pages and write them into
segments. It is more efficient to initiate the segment construction
directly in nilfs_sync_file() without the detour over
filemap_write_and_wait_range().

Additionally the lock of i_mutex is not needed, because all code blocks
that are protected by i_mutex are also protected by a NILFS transaction:

Function i_mutex nilfs_transaction
------------------------------------------------------
nilfs_ioctl_setflags: yes yes
nilfs_fiemap: yes no
nilfs_write_begin: yes yes
nilfs_write_end: yes yes
nilfs_lookup: yes no
nilfs_create: yes yes
nilfs_link: yes yes
nilfs_mknod: yes yes
nilfs_symlink: yes yes
nilfs_mkdir: yes yes
nilfs_unlink: yes yes
nilfs_rmdir: yes yes
nilfs_rename: yes yes
nilfs_setattr: yes yes

For nilfs_lookup() i_mutex is held for the parent directory, to protect
it from modification. The segment construction does not modify directory
inodes, so no lock is needed.

nilfs_fiemap() reads the block layout on the disk, by using
nilfs_bmap_lookup_contig(). This is already protected by bmap->b_sem.

Signed-off-by: Andreas Rohner <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
---
fs/nilfs2/file.c | 21 ++++++++-------------
1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index e9e3325..1ad6bdf 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -41,19 +41,14 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
struct inode *inode = file->f_mapping->host;
int err;

- err = filemap_write_and_wait_range(inode->i_mapping, start, end);
- if (err)
- return err;
- mutex_lock(&inode->i_mutex);
-
- if (nilfs_inode_dirty(inode)) {
- if (datasync)
- err = nilfs_construct_dsync_segment(inode->i_sb, inode,
- 0, LLONG_MAX);
- else
- err = nilfs_construct_segment(inode->i_sb);
- }
- mutex_unlock(&inode->i_mutex);
+ if (!nilfs_inode_dirty(inode))
+ return 0;
+
+ if (datasync)
+ err = nilfs_construct_dsync_segment(inode->i_sb, inode,
+ start, end);
+ else
+ err = nilfs_construct_segment(inode->i_sb);

nilfs = inode->i_sb->s_fs_info;
if (!err)
--
1.8.3.1

2014-12-01 22:09:20

by Ryusuke Konishi

[permalink] [raw]
Subject: Re: [PATCH 1/3] nilfs2: avoid duplicate segment construction for fsync()

On Tue, 2 Dec 2014 01:41:45 +0900, Ryusuke Konishi wrote:
> From: Andreas Rohner <[email protected]>
>
> This patch removes filemap_write_and_wait_range() from
> nilfs_sync_file(), because it triggers a data segment construction by
> calling nilfs_writepages() with WB_SYNC_ALL. A data segment construction
> does not remove the inode from the i_dirty list and it does not clear
> the NILFS_I_DIRTY flag. Therefore nilfs_inode_dirty() still returns
> true, which leads to an unnecessary duplicate segment construction in
> nilfs_sync_file().
<snip>
> diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
> index e9e3325..1ad6bdf 100644
> --- a/fs/nilfs2/file.c
> +++ b/fs/nilfs2/file.c
> @@ -41,19 +41,14 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> struct inode *inode = file->f_mapping->host;
> int err;
>
> - err = filemap_write_and_wait_range(inode->i_mapping, start, end);
> - if (err)
> - return err;
> - mutex_lock(&inode->i_mutex);
> -
> - if (nilfs_inode_dirty(inode)) {
> - if (datasync)
> - err = nilfs_construct_dsync_segment(inode->i_sb, inode,
> - 0, LLONG_MAX);
> - else
> - err = nilfs_construct_segment(inode->i_sb);
> - }
> - mutex_unlock(&inode->i_mutex);
> + if (!nilfs_inode_dirty(inode))
> + return 0;
> +
> + if (datasync)
> + err = nilfs_construct_dsync_segment(inode->i_sb, inode,
> + start, end);
> + else
> + err = nilfs_construct_segment(inode->i_sb);
>
> nilfs = inode->i_sb->s_fs_info;
> if (!err)

I found this patch introduces another data integrity issue. If
nilfs_inode_dirty() is not true, it returns without calling
nilfs_flush_device() and skips a disk cache flush.

Andreas made a revised patch to correct it.

Could you apply the following one instead ?

Regards,
Ryusuke Konishi
-----
From: Andreas Rohner <[email protected]>
Date: Mon, 1 Dec 2014 19:03:11 +0100
Subject: [PATCH] nilfs2: avoid duplicate segment construction for fsync()

This patch removes filemap_write_and_wait_range() from
nilfs_sync_file(), because it triggers a data segment construction by
calling nilfs_writepages() with WB_SYNC_ALL. A data segment construction
does not remove the inode from the i_dirty list and it does not clear
the NILFS_I_DIRTY flag. Therefore nilfs_inode_dirty() still returns
true, which leads to an unnecessary duplicate segment construction in
nilfs_sync_file().

A call to filemap_write_and_wait_range() is not needed, because NILFS2
does not rely on the generic writeback mechanisms. Instead it implements
its own mechanism to collect all dirty pages and write them into
segments. It is more efficient to initiate the segment construction
directly in nilfs_sync_file() without the detour over
filemap_write_and_wait_range().

Additionally the lock of i_mutex is not needed, because all code blocks
that are protected by i_mutex are also protected by a NILFS transaction:

Function i_mutex nilfs_transaction
------------------------------------------------------
nilfs_ioctl_setflags: yes yes
nilfs_fiemap: yes no
nilfs_write_begin: yes yes
nilfs_write_end: yes yes
nilfs_lookup: yes no
nilfs_create: yes yes
nilfs_link: yes yes
nilfs_mknod: yes yes
nilfs_symlink: yes yes
nilfs_mkdir: yes yes
nilfs_unlink: yes yes
nilfs_rmdir: yes yes
nilfs_rename: yes yes
nilfs_setattr: yes yes

For nilfs_lookup() i_mutex is held for the parent directory, to protect
it from modification. The segment construction does not modify directory
inodes, so no lock is needed.

nilfs_fiemap() reads the block layout on the disk, by using
nilfs_bmap_lookup_contig(). This is already protected by bmap->b_sem.

Signed-off-by: Andreas Rohner <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
---
fs/nilfs2/file.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index e9e3325..3a03e0a 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -39,21 +39,15 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
*/
struct the_nilfs *nilfs;
struct inode *inode = file->f_mapping->host;
- int err;
-
- err = filemap_write_and_wait_range(inode->i_mapping, start, end);
- if (err)
- return err;
- mutex_lock(&inode->i_mutex);
+ int err = 0;

if (nilfs_inode_dirty(inode)) {
if (datasync)
err = nilfs_construct_dsync_segment(inode->i_sb, inode,
- 0, LLONG_MAX);
+ start, end);
else
err = nilfs_construct_segment(inode->i_sb);
}
- mutex_unlock(&inode->i_mutex);

nilfs = inode->i_sb->s_fs_info;
if (!err)
--
1.8.3.1