Hi,
Zoltan Sogor noticed the following VFS behavior while testing
UBIFS. He ran the following shell script:
mkdir xxx
cd xxx
rmdir ../xxx
touch tmp
cd /
The result of it was that the '->delete_inode()' operation for
the 'xxx' directory inode is not called. It is not called even
after 'cd /'. However, if we do not do the 'touch tmp' command
(which actually fails), '->delete_inode()' _is_ called for 'xxx'.
So 'touch tmp' has a side effect. You may notice this by watching
'cat /proc/slabinfo | grep dentry' while running the following
script:
while true; do
echo live
for i in `seq 1 1000`
do
mkdir $i
cd $i
rmdir ../$i
touch tmp &> /dev/null
cd - &> /dev/null
done;
done;
Dentry cache will be growing. However, the same script but
without 'touch tmp' does not make dentry cache to grow.
For UBIFS this causes the problems because its orphan area
gets overflowed (orphans are added, but not removed because
'->delete_inode()' is not called).
==============================================================
I explored this some more. What happens is:
1. Current directory is 'xxx', its 'i_nlink' is 0 (because of
'rmdir ../xxx'), but because it is current directory, 'xxx'
dentry has 'd_count' = 1, so 'iput_final()' is not called
so far. If the script would just do 'cd /', the inode
would be deleted. However, we are going to run
'touch tmp'.
2. 'touch tmp' calls 'open(tmp, O_CREAT)'.
3. we are in 'do_filp_open()' function (see fs/namei.c).
The situation is: xxx dentry 'd_count' is 1,
'xxx' inode i_count is 1 as well.
4. 'path_lookup_create()' is called, which increases 'd_count'
of 'xxx' direntry to 2 and succeeds.
5. 'do_filp_open()' calls 'lookup_hash()', which in turn calls
'd_alloc()', which allocates a dentry object for 'tmp', sets
its 'd_count' to 1, and also increases 'd_count' of 'xxx'
dentry (the parent) to 3.
So the situation is: 'xxx' dentry 'd_count' is 3
'tmp' dentry has d_count 1
6. '__lookup_hash()' calls 'inode->i_op->lookup()', which does
not find the dentry, but calls 'd_splice_alias(NULL, dentry)'
anyway, which makes 'tmp' hashed. Not sure, but AFAIU the idea
is to make it "negative" dentry anyway, to optimize lookups.
7. Return to 'do_filp_open()', and '__open_namei_create()' is
called, which in turn calls 'vfs_create()', which fails in
the 'may_create()' check on the 'IS_DEADDIR(dir)' check.
'__open_namei_create()' 'dput()'s 'xxx' direntry and returns
error.
So the situation is: 'xxx' dentry 'd_count' is 2
'tmp' dentry has d_count 1
'xxx' inode 'i_count' is still 1.
8. We do 'goto exit', which calls 'dput()' for 'tmp' dentry,
(because 'd_alloc()' set it to 1). However, this _does not_
decrease 'd_count' of the parent 'xxx' dentry (but it _did_
increase it - see step 5.
So the situation is: 'xxx' dentry 'd_count' is 2
'tmp' dentry d_count is 0
'xxx' inode 'i_count' is still 1.
9. The shell script does 'cd /', which causes 'xxx' dentry
'd_count' to be decreased. However, 'd_count' stays 1,
and 'xxx' inode 'i_count' stays 1. So the result is that
'->delete_inode()' is not called for removed 'xxx' inode
until unmount or memory pressure.
=============================================================
My understanding is that negative direntry 'tmp' keeps deleted
directory inode 'xxx'. However, I am not 100% sure I understand
the situation correctly.
The 'tmp' dentry is freed eventually because of unmount or
memory pressure, but not earlier than that. So the deleted
inodes may be kept for really long time. Is this OK?
I may just say that I fixed this in UBIFS by not calling
'd_splice_alias()' for not found dentries if the parent
directory inode has 'n_link' = 0. However, ext[23] always
call 'd_splice_alias()' for not found direntries (passing
NULL as the 'inode' parameter).
Again, I am not 100% sure this is the right fix, because
I suspect this should be "fixed" in VFS. I tried to do this
and I have a small VFS patch, but it is probably incorrect.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
On Wed, 02 Jul 2008, Artem Bityutskiy wrote:
> The result of it was that the '->delete_inode()' operation for
> the 'xxx' directory inode is not called. It is not called even
> after 'cd /'. However, if we do not do the 'touch tmp' command
> (which actually fails), '->delete_inode()' _is_ called for 'xxx'.
This certainly doesn't sound right.
<snip>
> The 'tmp' dentry is freed eventually because of unmount or
> memory pressure, but not earlier than that. So the deleted
> inodes may be kept for really long time. Is this OK?
No, deleted dentries should go away immediately, or if they are still
referenced, then they are unhashed, and go away once the reference
disappears.
In this situation however, it seems that the unhashed dentry managed
to acquire a child, which keeps it from being finally killed. The VFS
shouldn't allow this to happen.
> I may just say that I fixed this in UBIFS by not calling
> 'd_splice_alias()' for not found dentries if the parent
> directory inode has 'n_link' = 0. However, ext[23] always
> call 'd_splice_alias()' for not found direntries (passing
> NULL as the 'inode' parameter).
>
> Again, I am not 100% sure this is the right fix, because
> I suspect this should be "fixed" in VFS. I tried to do this
> and I have a small VFS patch, but it is probably incorrect.
The correct fix IMO is to make lookup return ENOENT on an IS_DEADDIR()
inode, before even trying to create the child dentry.
Untested patch attached.
Miklos
----
Subject: vfs: fix lookup on deleted directory
From: Miklos Szeredi <[email protected]>
Lookup can install a child dentry for a deleted directory. This keeps
the directory alive even after all external references have gone away.
Fix this by returning ENOENT for all lookups on a S_DEAD directory
before creating a child dentry.
Reported-by: Artem Bityutskiy <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/namei.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c 2008-07-01 22:28:42.000000000 +0200
+++ linux-2.6/fs/namei.c 2008-07-02 12:57:27.000000000 +0200
@@ -519,7 +519,14 @@ static struct dentry * real_lookup(struc
*/
result = d_lookup(parent, name);
if (!result) {
- struct dentry * dentry = d_alloc(parent, name);
+ struct dentry *dentry;
+
+ /* Don't create child dentry for a dead directory. */
+ result = ERR_PTR(-ENOENT);
+ if (IS_DEADDIR(dir))
+ goto out_unlock;
+
+ dentry = d_alloc(parent, name);
result = ERR_PTR(-ENOMEM);
if (dentry) {
result = dir->i_op->lookup(dir, dentry, nd);
@@ -528,6 +535,7 @@ static struct dentry * real_lookup(struc
else
result = dentry;
}
+out_unlock:
mutex_unlock(&dir->i_mutex);
return result;
}
@@ -1317,7 +1325,14 @@ static struct dentry *__lookup_hash(stru
dentry = cached_lookup(base, name, nd);
if (!dentry) {
- struct dentry *new = d_alloc(base, name);
+ struct dentry *new;
+
+ /* Don't create child dentry for a dead directory. */
+ dentry = ERR_PTR(-ENOENT);
+ if (IS_DEADDIR(inode))
+ goto out;
+
+ new = d_alloc(base, name);
dentry = ERR_PTR(-ENOMEM);
if (!new)
goto out;
Miklos Szeredi wrote:
>> I may just say that I fixed this in UBIFS by not calling
>> 'd_splice_alias()' for not found dentries if the parent
>> directory inode has 'n_link' = 0. However, ext[23] always
>> call 'd_splice_alias()' for not found direntries (passing
>> NULL as the 'inode' parameter).
>>
>> Again, I am not 100% sure this is the right fix, because
>> I suspect this should be "fixed" in VFS. I tried to do this
>> and I have a small VFS patch, but it is probably incorrect.
>
> The correct fix IMO is to make lookup return ENOENT on an IS_DEADDIR()
> inode, before even trying to create the child dentry.
Sounds correct, thanks for the replay.
> Untested patch attached.
Looks fine. Tried it, looks fine and solves my problem. Ran some
of my tests and they did not fail.
Hopefully this patch will make his way to mainline. Thanks.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Artem Bityutskiy wrote:
> The result of it was that the '->delete_inode()' operation for
> the 'xxx' directory inode is not called. It is not called even
> after 'cd /'. However, if we do not do the 'touch tmp' command
> (which actually fails), '->delete_inode()' _is_ called for 'xxx'.
>
> So 'touch tmp' has a side effect. You may notice this by watching
> 'cat /proc/slabinfo | grep dentry' while running the following
> script:
I tried to reproduce this on my system using an EXT3 filesystem, leaving
the script running the entire night. When I wake up I found that the
system disk was full. So, besides the memory issue, this problem also
consume some bytes on your filesystem.