2010-09-23 21:54:14

by Matt Helsley

[permalink] [raw]
Subject: [PATCH 2/6] [RFC] Create the .relink file_operation

Not all filesystems will necessarily be able to support relinking an
orphan inode back into the filesystem. Some offlist feedback suggested
that instead of overloading .link that relinking should be a separate
file operation for this reason.

Since .relink is a superset of .link make the VFS call .relink where
possible and .link otherwise.

The next commit will change ext3/4 to enable this operation.

Signed-off-by: Matt Helsley <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Al Viro <[email protected]>
---
fs/namei.c | 5 ++++-
include/linux/fs.h | 1 +
2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a7dce91..eb279e3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2446,7 +2446,10 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
return error;

mutex_lock(&inode->i_mutex);
- error = dir->i_op->link(old_dentry, dir, new_dentry);
+ if (dir->i_op->relink)
+ error = dir->i_op->relink(old_dentry, dir, new_dentry);
+ else
+ error = dir->i_op->link(old_dentry, dir, new_dentry);
mutex_unlock(&inode->i_mutex);
if (!error)
fsnotify_link(dir, inode, new_dentry);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ee725ff..d932eb7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1528,6 +1528,7 @@ struct inode_operations {
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
int (*link) (struct dentry *,struct inode *,struct dentry *);
+ int (*relink) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct inode *,struct dentry *,const char *);
int (*mkdir) (struct inode *,struct dentry *,int);
--
1.6.3.3



2010-09-23 21:53:29

by Matt Helsley

[permalink] [raw]
Subject: [PATCH 3/6] [RFC] ext3/4: Allow relinking to unlinked files

Orphan inodes (nlink == 0) have been forbidden from being linked back
into the ext3/4 filesystems as a means of dealing with a link/unlink
"race".

This patch effectively reverts 2988a7740dc0dd9a0cb56576e8fe1d777dff0db3

"return ENOENT from ext3_link when racing with unlink"

which was discussed in the lkml thread:

http://lkml.org/lkml/2007/1/12/200

The reverted commit was selected because disallowing relinking was just a
simpler solution -- not because removing tasks from the orphan list
was deemed difficult or incorrect.

Instead this patch utilizes the original patch proposed by Eric Sandeen.
Testing this patch with the orphan-repro.tar.bz2 code linked in that
thread seems to confirm that this patch does not reintroduce the OOPs.
Nonetheless, Amir Goldstein pointed out that if ext3_add_entry() fails
we'll also be left with a corrupted orphan list. So I've moved the orphan
removal code down to the spot where a successful return has been assured.

Eric's Original description (indented):

Remove inode from the orphan list in ext3_link() if we might have
raced with ext3_unlink(), which potentially put it on the list.
If we're on the list with nlink > 0, we'll never get cleaned up
properly and eventually may corrupt the list.

Unlike the reverted patch, this patch enables relinking an orphan inode
back into the filesystem. This relinking is extremely useful for
checkpoint/restart since it avoids incredibly costly and complex methods
of supporting checkpoint of tasks with open unlinked files. More on
how relink will be used can be found in the checkpoint/restart patches posted
with this patch.

Though relinking is a useful operation it may not be possible on all
filesystems. It appears to be possible in ext3/4 so this patch changes the
ext3/ext4 link functions to allow it and provide the .relink file operation
introduced earlier.

This assumes that in ext3/4 data written to unlinked files is not stored any
differently than data written to linked files. If that is not the case
I'd appreciate any pointers to the code showing where/how they are
handled differently. All I could find in ext3/4 was the orphan inode list.

Signed-off-by: Matt Helsley <[email protected]>
Cc: Eric Sandeen <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: [email protected]
Cc: Jan Kara <[email protected]>
Cc: [email protected]
Cc: Oren Laadan <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: [email protected]
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jamie Lokier <[email protected]>
---
fs/ext3/namei.c | 16 ++++++++--------
fs/ext4/namei.c | 14 +++++++-------
2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index ee18408..e49310c 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1985,7 +1985,7 @@ out_unlock:

/*
* ext3_orphan_del() removes an unlinked or truncated inode from the list
- * of such inodes stored on disk, because it is finally being cleaned up.
+ * of such inodes stored on disk.
*/
int ext3_orphan_del(handle_t *handle, struct inode *inode)
{
@@ -2243,13 +2243,6 @@ static int ext3_link (struct dentry * old_dentry,

dquot_initialize(dir);

- /*
- * Return -ENOENT if we've raced with unlink and i_nlink is 0. Doing
- * otherwise has the potential to corrupt the orphan inode list.
- */
- if (inode->i_nlink == 0)
- return -ENOENT;
-
retry:
handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
EXT3_INDEX_EXTRA_TRANS_BLOCKS);
@@ -2267,6 +2260,12 @@ retry:
if (!err) {
ext3_mark_inode_dirty(handle, inode);
d_instantiate(dentry, inode);
+ if (inode->i_nlink == 1)
+ /*
+ * i_nlink went from 0 to 1 thus the inode is no
+ * longer an orphan.
+ */
+ ext3_orphan_del(handle, inode);
} else {
drop_nlink(inode);
iput(inode);
@@ -2451,6 +2450,7 @@ end_rename:
const struct inode_operations ext3_dir_inode_operations = {
.create = ext3_create,
.lookup = ext3_lookup,
+ .relink = ext3_link,
.link = ext3_link,
.unlink = ext3_unlink,
.symlink = ext3_symlink,
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 0c070fa..e07c374 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2323,13 +2323,6 @@ static int ext4_link(struct dentry *old_dentry,

dquot_initialize(dir);

- /*
- * Return -ENOENT if we've raced with unlink and i_nlink is 0. Doing
- * otherwise has the potential to corrupt the orphan inode list.
- */
- if (inode->i_nlink == 0)
- return -ENOENT;
-
retry:
handle = ext4_journal_start(dir, EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +
EXT4_INDEX_EXTRA_TRANS_BLOCKS);
@@ -2347,6 +2340,12 @@ retry:
if (!err) {
ext4_mark_inode_dirty(handle, inode);
d_instantiate(dentry, inode);
+ if (inode->i_nlink == 1)
+ /*
+ * i_nlink went from 0 to 1 thus the inode is no
+ * longer an orphan.
+ */
+ ext4_orphan_del(handle, inode);
} else {
drop_nlink(inode);
iput(inode);
@@ -2535,6 +2534,7 @@ end_rename:
const struct inode_operations ext4_dir_inode_operations = {
.create = ext4_create,
.lookup = ext4_lookup,
+ .relink = ext4_link,
.link = ext4_link,
.unlink = ext4_unlink,
.symlink = ext4_symlink,
--
1.6.3.3


2010-09-26 19:08:37

by Brad Boyer

[permalink] [raw]
Subject: Re: [PATCH 2/6] [RFC] Create the .relink file_operation

On Thu, Sep 23, 2010 at 02:53:28PM -0700, Matt Helsley wrote:
> Not all filesystems will necessarily be able to support relinking an
> orphan inode back into the filesystem. Some offlist feedback suggested
> that instead of overloading .link that relinking should be a separate
> file operation for this reason.
>
> Since .relink is a superset of .link make the VFS call .relink where
> possible and .link otherwise.
>
> The next commit will change ext3/4 to enable this operation.

I may have missed something in one of these patches (patch 1 and any
original summary if there was one don't appear in my email), but
what is the point of the new operation? I didn't see any case that
treats one any different than the other. What is disallowed (and how)
for a driver which does not implement .relink but has .link?

Brad Boyer
[email protected]

2010-09-27 19:16:28

by Matt Helsley

[permalink] [raw]
Subject: Re: [PATCH 2/6] [RFC] Create the .relink file_operation

On Sun, Sep 26, 2010 at 12:08:37PM -0700, Brad Boyer wrote:
> On Thu, Sep 23, 2010 at 02:53:28PM -0700, Matt Helsley wrote:
> > Not all filesystems will necessarily be able to support relinking an
> > orphan inode back into the filesystem. Some offlist feedback suggested
> > that instead of overloading .link that relinking should be a separate
> > file operation for this reason.
> >
> > Since .relink is a superset of .link make the VFS call .relink where
> > possible and .link otherwise.
> >
> > The next commit will change ext3/4 to enable this operation.
>
> I may have missed something in one of these patches (patch 1 and any
> original summary if there was one don't appear in my email), but
> what is the point of the new operation? I didn't see any case that
> treats one any different than the other. What is disallowed (and how)
> for a driver which does not implement .relink but has .link?

Did you get patch 3? It shows how ext3/ext4 add the ability to take an
inode that has been unlinked, placed onto the orphan list, and relink it.

Cheers,
-Matt Helsley

2010-09-27 22:03:46

by Brad Boyer

[permalink] [raw]
Subject: Re: [PATCH 2/6] [RFC] Create the .relink file_operation

On Mon, Sep 27, 2010 at 12:16:28PM -0700, Matt Helsley wrote:
> On Sun, Sep 26, 2010 at 12:08:37PM -0700, Brad Boyer wrote:
> > On Thu, Sep 23, 2010 at 02:53:28PM -0700, Matt Helsley wrote:
> > > Not all filesystems will necessarily be able to support relinking an
> > > orphan inode back into the filesystem. Some offlist feedback suggested
> > > that instead of overloading .link that relinking should be a separate
> > > file operation for this reason.
> > >
> > > Since .relink is a superset of .link make the VFS call .relink where
> > > possible and .link otherwise.
> > >
> > > The next commit will change ext3/4 to enable this operation.
> >
> > I may have missed something in one of these patches (patch 1 and any
> > original summary if there was one don't appear in my email), but
> > what is the point of the new operation? I didn't see any case that
> > treats one any different than the other. What is disallowed (and how)
> > for a driver which does not implement .relink but has .link?
>
> Did you get patch 3? It shows how ext3/ext4 add the ability to take an
> inode that has been unlinked, placed onto the orphan list, and relink it.

Yes, I did get patch 3. I think you misunderstood my question. You point
both .link and .relink to the same function in ext3 and ext4. The common
code which calls them will call .relink if it is set and .link if it is
not set. If nothing acts any different based on .relink being NULL or
not-NULL, and the only implementation isn't any different from .link
what was the point of introducing a new operation?

What I expected to see was that some particular code path would check
if .relink was NULL and fail in that case. Unless there is a code path
that will only call .relink and not .link, it seems useless to me.

Brad Boyer
[email protected]


2010-09-29 20:16:10

by Oren Laadan

[permalink] [raw]
Subject: Re: [PATCH 2/6] [RFC] Create the .relink file_operation



On 09/27/2010 06:03 PM, Brad Boyer wrote:
> On Mon, Sep 27, 2010 at 12:16:28PM -0700, Matt Helsley wrote:
>> On Sun, Sep 26, 2010 at 12:08:37PM -0700, Brad Boyer wrote:
>>> On Thu, Sep 23, 2010 at 02:53:28PM -0700, Matt Helsley wrote:
>>>> Not all filesystems will necessarily be able to support relinking an
>>>> orphan inode back into the filesystem. Some offlist feedback suggested
>>>> that instead of overloading .link that relinking should be a separate
>>>> file operation for this reason.

In light of Brad's comment (below), maybe elaborate on this:

Some offlist feedback suggested that instead of overloading .link
to provide this functionality, relinking of an orphan inode back
into the filesystem should be a separate file operation.
This is because some filesystems may not be able to support this
operation. Their existing .link already assumes that the inode
isn't orphan (i_nlink != 0), but still won't explicitly test for
the condition. If this is the case, the overloading .link may
break their assumptions and a call to relink may not be handled
too gracefully.

>>>>
>>>> Since .relink is a superset of .link make the VFS call .relink where
>>>> possible and .link otherwise.
>>>>
>>>> The next commit will change ext3/4 to enable this operation.
>>>
>>> I may have missed something in one of these patches (patch 1 and any
>>> original summary if there was one don't appear in my email), but
>>> what is the point of the new operation? I didn't see any case that
>>> treats one any different than the other. What is disallowed (and how)
>>> for a driver which does not implement .relink but has .link?
>>
>> Did you get patch 3? It shows how ext3/ext4 add the ability to take an
>> inode that has been unlinked, placed onto the orphan list, and relink it.
>
> Yes, I did get patch 3. I think you misunderstood my question. You point
> both .link and .relink to the same function in ext3 and ext4. The common
> code which calls them will call .relink if it is set and .link if it is
> not set. If nothing acts any different based on .relink being NULL or
> not-NULL, and the only implementation isn't any different from .link
> what was the point of introducing a new operation?
>
> What I expected to see was that some particular code path would check
> if .relink was NULL and fail in that case. Unless there is a code path
> that will only call .relink and not .link, it seems useless to me.

Does the above help clarify this ?
The test performed in vfs_link().

Oren.

2010-09-29 20:19:31

by Oren Laadan

[permalink] [raw]
Subject: Re: [PATCH 2/6] [RFC] Create the .relink file_operation



On 09/23/2010 05:53 PM, Matt Helsley wrote:
> Not all filesystems will necessarily be able to support relinking an
> orphan inode back into the filesystem. Some offlist feedback suggested
> that instead of overloading .link that relinking should be a separate
> file operation for this reason.
>
> Since .relink is a superset of .link make the VFS call .relink where
> possible and .link otherwise.
>
> The next commit will change ext3/4 to enable this operation.
>
> Signed-off-by: Matt Helsley<[email protected]>
> Cc: Theodore Ts'o<[email protected]>
> Cc: Andreas Dilger<[email protected]>
> Cc: Jan Kara<[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Al Viro<[email protected]>
> ---
> fs/namei.c | 5 ++++-
> include/linux/fs.h | 1 +
> 2 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index a7dce91..eb279e3 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2446,7 +2446,10 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
> return error;
>
> mutex_lock(&inode->i_mutex);
> - error = dir->i_op->link(old_dentry, dir, new_dentry);
> + if (dir->i_op->relink)
> + error = dir->i_op->relink(old_dentry, dir, new_dentry);
> + else
> + error = dir->i_op->link(old_dentry, dir, new_dentry);

Can there be a scenario/filesystem in which .relink implementation
is so much more complex (and expensive) than .link ?

If the answer is "yes", then this we probably don't want to do
this, and let vfs_link() call .link, and instead add a new helper
vfs_relink().

Oren.

> mutex_unlock(&inode->i_mutex);
> if (!error)
> fsnotify_link(dir, inode, new_dentry);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ee725ff..d932eb7 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1528,6 +1528,7 @@ struct inode_operations {
> int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
> struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
> int (*link) (struct dentry *,struct inode *,struct dentry *);
> + int (*relink) (struct dentry *,struct inode *,struct dentry *);
> int (*unlink) (struct inode *,struct dentry *);
> int (*symlink) (struct inode *,struct dentry *,const char *);
> int (*mkdir) (struct inode *,struct dentry *,int);

2010-09-29 23:07:03

by Matt Helsley

[permalink] [raw]
Subject: Re: [PATCH 2/6] [RFC] Create the .relink file_operation

On Wed, Sep 29, 2010 at 04:19:31PM -0400, Oren Laadan wrote:
>
>
> On 09/23/2010 05:53 PM, Matt Helsley wrote:
> >Not all filesystems will necessarily be able to support relinking an
> >orphan inode back into the filesystem. Some offlist feedback suggested
> >that instead of overloading .link that relinking should be a separate
> >file operation for this reason.
> >
> >Since .relink is a superset of .link make the VFS call .relink where
> >possible and .link otherwise.
> >
> >The next commit will change ext3/4 to enable this operation.
> >
> >Signed-off-by: Matt Helsley<[email protected]>
> >Cc: Theodore Ts'o<[email protected]>
> >Cc: Andreas Dilger<[email protected]>
> >Cc: Jan Kara<[email protected]>
> >Cc: [email protected]
> >Cc: [email protected]
> >Cc: Al Viro<[email protected]>
> >---
> > fs/namei.c | 5 ++++-
> > include/linux/fs.h | 1 +
> > 2 files changed, 5 insertions(+), 1 deletions(-)
> >
> >diff --git a/fs/namei.c b/fs/namei.c
> >index a7dce91..eb279e3 100644
> >--- a/fs/namei.c
> >+++ b/fs/namei.c
> >@@ -2446,7 +2446,10 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
> > return error;
> >
> > mutex_lock(&inode->i_mutex);
> >- error = dir->i_op->link(old_dentry, dir, new_dentry);
> >+ if (dir->i_op->relink)
> >+ error = dir->i_op->relink(old_dentry, dir, new_dentry);
> >+ else
> >+ error = dir->i_op->link(old_dentry, dir, new_dentry);
>
> Can there be a scenario/filesystem in which .relink implementation
> is so much more complex (and expensive) than .link ?
>
> If the answer is "yes", then this we probably don't want to do
> this, and let vfs_link() call .link, and instead add a new helper
> vfs_relink().

OK, that makes some sense too. I thought the separation would
just be at the file operations layer but we can move it higher too.

I'll adjust the patches to do that and repost them.

Cheers,
-Matt