2008-03-15 16:12:03

by Dmitry Monakhov

[permalink] [raw]
Subject: strange ext{3,4}_settattr logic

Hello.
I've found what ext3_setattr() code has some strange logic. I'm talking
about truncate path.

int ext3_setattr(struct dentry *dentry, struct iattr *attr)
{
...
if (S_ISREG(inode->i_mode) &&
attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
handle_t *handle;
<<< This is shrinking case, and according to function comments:
<<< "In particular, we want to make sure that when the VFS
<<< * shrinks i_size, we put the inode on the orphan list and modify
<<< * i_disksize immediately"
<<< we about to write i_disksize. But WHY do we have to do it explicitly?
<<< Later inode_setattr() will call ext3_truncate() which will do it
<<< this work for us.
handle = ext3_journal_start(inode, 3);
if (IS_ERR(handle)) {
error = PTR_ERR(handle);
goto err_out;
}

error = ext3_orphan_add(handle, inode);
EXT3_I(inode)->i_disksize = attr->ia_size;
rc = ext3_mark_inode_dirty(handle, inode);
if (!error)
error = rc;
ext3_journal_stop(handle);
}
rc = inode_setattr(inode, attr);
<<< Now the most interesting question. What we have to do now in
<<< case of error? We are in tricky situation. Truncate not happened,
<<< and blocks visible to the user, but i_disksize was already written,
<<< so later memory reclaiming/ read_inode will result in unexpected
<<< updating i_size.
/* If inode_setattr's call to ext3_truncate failed to get a
* transaction handle at all, we need to clean up the in-core
* orphan list manually. */
<<< Following code will remove inode only from in memory(because handle = NULL)
<<< orphan list. Please someone explain me what this lines suppose to do
<<< actually.
if (inode->i_nlink)
ext3_orphan_del(NULL, inode);
...
}


2008-03-15 23:06:16

by Andreas Dilger

[permalink] [raw]
Subject: Re: strange ext{3,4}_settattr logic

On Mar 15, 2008 19:07 +0300, Dmitri Monakhov wrote:
> I've found what ext3_setattr() code has some strange logic. I'm talking
> about truncate path.
>
> int ext3_setattr(struct dentry *dentry, struct iattr *attr)
> {
> ...
> if (S_ISREG(inode->i_mode) &&
> attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
> handle_t *handle;
> <<< This is shrinking case, and according to function comments:
> <<< "In particular, we want to make sure that when the VFS
> <<< * shrinks i_size, we put the inode on the orphan list and modify
> <<< * i_disksize immediately"
> <<< we about to write i_disksize. But WHY do we have to do it explicitly?
> <<< Later inode_setattr() will call ext3_truncate() which will do it
> <<< this work for us.

The reason that i_disksize is written to disk here immediately is that the
journal is stopped. Once that is done then in case of a crash the orphan
recovery code will detect the unfinished truncate and complete it before
mounting the filesystem.

Without this it is possible to get a partial truncate after a crash because
the truncate may span several transactions due to the potentially large
number of blocks that need to be modified. What is important with ext3
is that because e2fsck is not run on each boot whatever is on disk needs
to be consistent after a crash.

If there is a file being truncated or unlinked that needs to be completed
after a crash or the blocks will be leaked. To ensure this happens, there
is a singly-linked list of inodes on the disk called the "orphan list"
that keeps track of all inodes currently undergoing truncate or unlink.
After a crash the kernel or e2fsck will walk this list and finish the
truncate or unlink of the inode, freeing the blocks.

> rc = inode_setattr(inode, attr);
> <<< Now the most interesting question. What we have to do now in
> <<< case of error? We are in tricky situation. Truncate not happened,
> <<< and blocks visible to the user, but i_disksize was already written,
> <<< so later memory reclaiming/ read_inode will result in unexpected
> <<< updating i_size.

The only ways inode_setattr() can fail are:
- expanding vmtruncate hits EFBIG, but we checked that above
- shrinking vmtruncate on a swapfile returns ETXTBUSY. This was added
after the ext3_setattr() code was written.

If the ext3_truncate() or mark_inode_dirty() call fails, it does not
return an error code. For ext3 the only way this can fail is if the
journal is aborted, which means the filesystem is already in read-only
mode and nothing can be done to clean up the truncate until the next
mount, at which point the orphan recovery code discussed above will
finish the operation.

> /* If inode_setattr's call to ext3_truncate failed to get a
> * transaction handle at all, we need to clean up the in-core
> * orphan list manually. */
> <<< Following code will remove inode only from in memory(because handle = NULL)
> <<< orphan list. Please someone explain me what this lines suppose to do
> <<< actually.
> if (inode->i_nlink)
> ext3_orphan_del(NULL, inode);

This will only be important in the case of a failed operation above.
The ext3_truncate() code will normally have already removed the inode
from the orphan list when it is finished, but we aren't sure whether
that code was called so we need to do it again here (it is safe to call
even if the inode is not on the list) to ensure we don't hit a J_ASSERT()
that the orphan list is empty in the unmount code (ext3_put_super()).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-03-15 23:55:05

by Andreas Dilger

[permalink] [raw]
Subject: Re: strange ext{3,4}_settattr logic

On Mar 16, 2008 07:05 +0800, Andreas Dilger wrote:
> On Mar 15, 2008 19:07 +0300, Dmitri Monakhov wrote:
> > I've found what ext3_setattr() code has some strange logic. I'm talking
> > about truncate path.
> >
> > int ext3_setattr(struct dentry *dentry, struct iattr *attr)
> > {
> > ...
> > if (S_ISREG(inode->i_mode) &&
> > attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
> > handle_t *handle;
> > <<< This is shrinking case, and according to function comments:
> > <<< "In particular, we want to make sure that when the VFS
> > <<< * shrinks i_size, we put the inode on the orphan list and modify
> > <<< * i_disksize immediately"
> > <<< we about to write i_disksize. But WHY do we have to do it explicitly?
> > <<< Later inode_setattr() will call ext3_truncate() which will do it
> > <<< this work for us.
>
> The reason that i_disksize is written to disk here immediately is that the
> journal is stopped. Once that is done then in case of a crash the orphan
> recovery code will detect the unfinished truncate and complete it before
> mounting the filesystem.
>
> > rc = inode_setattr(inode, attr);
> > <<< Now the most interesting question. What we have to do now in
> > <<< case of error? We are in tricky situation. Truncate not happened,
> > <<< and blocks visible to the user, but i_disksize was already written,
> > <<< so later memory reclaiming/ read_inode will result in unexpected
> > <<< updating i_size.
>
> The only ways inode_setattr() can fail are:
> - expanding vmtruncate hits EFBIG, but we checked that above
> - shrinking vmtruncate on a swapfile returns ETXTBUSY. This was added
> after the ext3_setattr() code was written.

I forgot to add that we need to handle the IS_SWAPFILE() case properly.
Granted, it probably isn't a very common problem, but the IS_SWAPFILE()
check was added explicitly because of clueless users. Also, very few
users run with a swapfile in any case, most use a swap partition.

It would seem that if you have a swapfile, try to truncate it to 0 (which
will fail with -ETXTBUSY) and then unmount the filesystem the size will
be truncated to 0:

[root@lin-cli1 tests]# dd if=/dev/zero of=/tmp/foo bs=4k count=100
100+0 records in
100+0 records out
[root@lin-cli1 tests]# mkswap /tmp/foo
Setting up swapspace version 1, size = 405 kB
[root@lin-cli1 tests]# swapon /tmp/foo
[root@lin-cli1 tests]# swapon -s
Filename Type Size Used
Priority
/dev/hda5 partition 1052216 136 -1
/tmp/foo file 392 0 -2
[root@lin-cli1 tests]# dd if=/dev/zero of=/tmp/foo bs=4k count=1 seek=99
dd: advancing past 405504 bytes in output file `/tmp/foo': Text file busy
[root@lin-cli1 tests]# > /tmp/foo
bash: /tmp/foo: Text file busy
[root@lin-cli1 tests]# ls -l /tmp/foo
404 -rw-r--r-- 1 root root 409600 Mar 15 17:12 /tmp/foo
[root@lin-cli1 tests]# swapoff /tmp/foo
[root@lin-cli1 tests]# df /tmp
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 20641788 8997964 10595184 46% /
[root@lin-cli1 tests]# debugfs /dev/hda1
debugfs 1.40.2.cfs4 (12-Jul-2007)
debugfs: stat /tmp/foo
Inode: 1346639 Type: regular Mode: 0644 Flags: 0x0 Generation:
2276953
732
User: 0 Group: 0 Size: 0 <<< *** oops ***
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 808
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x47dc57fb -- Sat Mar 15 17:12:59 2008
atime: 0x47dc57fb -- Sat Mar 15 17:12:59 2008
mtime: 0x47dc57fb -- Sat Mar 15 17:12:59 2008
BLOCKS:
(0):2689410, (1-11):2689460-2689470, (IND):2689471, (12-54):2689472-2689514,
(55 -59):2689516-2689520, (60-63):2689532-2689535, (64-87):2689544-2689567,
(88-96): 2689592-2689600, (97-99):2689619-2689621
TOTAL: 101
debugfs: quit
[root@lin-cli1 tests]# e2fsck -fn /dev/hda1
e2fsck 1.40.2.cfs4 (12-Jul-2007)
Warning! /dev/hda1 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Inode 1346639, i_size is 0, should be 409600. Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/: ********** WARNING: Filesystem still has errors **********

/: 337862/2626560 files (6.9% non-contiguous), 2354290/5242880 blocks


Until e2fsck was actually run on the filesystem (which would do the right
thing and fix the file size) the swap file would have 0 length and fail
to start. It probably isn't fatal for most systems to run without swap
these days, but let's see if this would cause a boot-time failure or if it
is silently ignored (on a RHEL4 system):

[root@lin-cli1 tests]# > /tmp/foo
[root@lin-cli1 tests]# debugfs /dev/hda1
debugfs 1.40.2.cfs4 (12-Jul-2007)
debugfs: stat /tmp/foo
Inode: 1346639 Type: regular Mode: 0644 Flags: 0x0 Generation:
2276953
732
User: 0 Group: 0 Size: 0
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x47dc5a45 -- Sat Mar 15 17:22:45 2008
atime: 0x47dc57fb -- Sat Mar 15 17:12:59 2008
mtime: 0x47dc5a45 -- Sat Mar 15 17:22:45 2008
BLOCKS:

debugfs: quit
[root@lin-cli1 tests]# swapon /tmp/foo
swapon: /tmp/foo: Invalid argument
[root@lin-cli1 tests]# echo $?
255
[root@lin-cli1 tests]# grep -C 3 swapon /etc/rc.sysinit

# Start up swapping.
update_boot_stage RCswap
action $"Enabling swap space: " swapon -a -e

# Set up binfmt_misc
/bin/mount -t binfmt_misc none /proc/sys/fs/binfmt_misc > /dev/null 2>&1


So it appears the error would be ignored, and eventually (because e2fsck
will run periodically on filesystems, unless the user turns this off) it
would be repaired properly. Still, it should be fixed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-03-16 00:23:33

by Andreas Dilger

[permalink] [raw]
Subject: Re: strange ext{3,4}_settattr logic

On Mar 16, 2008 07:54 +0800, Andreas Dilger wrote:
> A call to inode_setattr() can fail by trying a shrinking vmtruncate()
> on a swapfile, which returns ETXTBUSY. This was added after the
> ext3_setattr() code was written.
>
> We need to handle the IS_SWAPFILE() case properly.
> Granted, it probably isn't a very common problem, but the IS_SWAPFILE()
> check was added explicitly because of clueless users, so it must be hit
> occasionally in real life.
>
> It would seem that if you have a swapfile, try to truncate it to 0 (which
> will fail with -ETXTBUSY) and then unmount the filesystem the size will
> be truncated to 0. It is also possible to directly write to a swapfile
> and corrupt memory, or read from a swapfile and access potentially sensitive
> information.

Dmitri, (or anyone)
can you please give this patch a try. It should be OK, but I'm hesitant
to test it on my test box because I'm out of town and if it fails to
reboot I can't fix it for a week. It _should_ still be possible to chown,
rename, ls, etc the swapfile, can you please verify that in addition
to the simple test in my previous email.

Note that the "dd" test I did previously was incorrect since "dd"
was trying to do a truncate before the write, it needs to be run with
conv=notrunc, and then the write succeeds:

[root@lin-cli1 tests]# dd if=/dev/zero of=/tmp/foo bs=4k count=1 seek=99
dd: advancing past 405504 bytes in output file `/tmp/foo': Text file busy
[root@lin-cli1 tests]# dd if=/dev/zero of=/tmp/foo conv=notrunc bs=4k count=1
seek=99
1+0 records in
1+0 records out
[root@lin-cli1 tests]# ls -l /tmp/foo
404 -rw-r--r-- 1 root root 409600 Mar 15 17:56 /tmp/foo
[root@lin-cli1 tests]# debugfs -R "stat /tmp/foo" /dev/hda1
debugfs 1.40.2.cfs4 (12-Jul-2007)
Inode: 1346639 Type: regular Mode: 0644 Flags: 0x0 Generation:
2276953
732
User: 0 Group: 0 Size: 405504
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 808
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x47dc6211 -- Sat Mar 15 17:56:01 2008
atime: 0x47dc5f06 -- Sat Mar 15 17:43:02 2008
mtime: 0x47dc6211 -- Sat Mar 15 17:56:01 2008
dtime: 0x00148c4f -- Fri Jan 16 07:03:59 1970
BLOCKS:
(0):2688759, (1-11):2688800-2688810, (IND):2688811, (12-44):2688812-2688844,
(45):2689410, (46-99):2689460-2689513
TOTAL: 101



--- linux-2.6.24/fs/namei.c.orig 2008-02-05 07:29:57.000000000 +0800
+++ linux-2.6.24/fs/namei.c 2008-03-16 08:11:41.000000000 +0800
@@ -233,6 +233,10 @@ int permission(struct inode *inode, int
if (nd)
mnt = nd->mnt;

+ /* Don't allow direct read, write, or truncate on a swapfile */
+ if (IS_SWAPFILE(inode))
+ return -ETXTBUSY;
+
if (mask & MAY_WRITE) {
umode_t mode = inode->i_mode;



Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-03-16 11:44:02

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: strange ext{3,4}_settattr logic

I've added Jens because he may be also interesting in this topic.
On 08:23 Sun 16 Mar , Andreas Dilger wrote:
> On Mar 16, 2008 07:54 +0800, Andreas Dilger wrote:
> > A call to inode_setattr() can fail by trying a shrinking vmtruncate()
> > on a swapfile, which returns ETXTBUSY. This was added after the
> > ext3_setattr() code was written.
> >
> > We need to handle the IS_SWAPFILE() case properly.
> > Granted, it probably isn't a very common problem, but the IS_SWAPFILE()
> > check was added explicitly because of clueless users, so it must be hit
> > occasionally in real life.
> >
> > It would seem that if you have a swapfile, try to truncate it to 0 (which
> > will fail with -ETXTBUSY) and then unmount the filesystem the size will
> > be truncated to 0. It is also possible to directly write to a swapfile
> > and corrupt memory, or read from a swapfile and access potentially sensitive
> > information.
In fact i've triggered this issue while working on fast_loop device
implementation which was proposed by Jens (http://lkml.org/lkml/2008/1/9/50).
In fast_loop device use swapfile approach (submitting bio-s directly to
underlying block device). I think this idea will be included in mainstream
loop device sooner or later. But approach has several issues:
One of the most important is effective control mechanism over truncates for
lower file, this issue was missed in Jens patch set.
This mechanism probably have to have following options.
#1: Shrink truncates must be denied.
#2: Expand truncates may be allowed. This is good because most of non plain
disk image formats (qcow, vmdk, and etc) are growing while adding new data
blocks.
#3: Allow exclusive owner for file, for example only one user(loop_thread in
this case) may truncate file. Provide something similar to bd_claim feature.
without this feature on-line shrinking of disk image looks like this:

mutex_lock(&inode->i_mutex);
inode->i_flags &= ~S_SWAPFILE;
mutex_unlock(&inode->i_mutex);

/* Perform shrinking truncate. This is absolutely racy operation because
* some one else also may perform truncate at this time*/
do_truncate(inode, size);

mutex_lock(&inode->i_mutex);
inode->i_flags |= S_SWAPFILE;
mutex_unlock(&inode->i_mutex);


S_SWAPFILE inode option currently equals to #1, and #2. What's why i want
use this flag for fast_loop devices.

>
> Dmitri, (or anyone)
> can you please give this patch a try. It should be OK, but I'm hesitant
> to test it on my test box because I'm out of town and if it fails to
> reboot I can't fix it for a week. It _should_ still be possible to chown,
> rename, ls, etc the swapfile, can you please verify that in addition
> to the simple test in my previous email.
>
[snip]
>
>
>
> --- linux-2.6.24/fs/namei.c.orig 2008-02-05 07:29:57.000000000 +0800
> +++ linux-2.6.24/fs/namei.c 2008-03-16 08:11:41.000000000 +0800
> @@ -233,6 +233,10 @@ int permission(struct inode *inode, int
> if (nd)
> mnt = nd->mnt;
>
> + /* Don't allow direct read, write, or truncate on a swapfile */
Your patch disallow expand truncates (#2) which is definitely not good.
In fact if file was opened before S_SWAPFILE flag was setted when it is
possible to directly read, write from file.
> + if (IS_SWAPFILE(inode))
> + return -ETXTBUSY;
> +
> if (mask & MAY_WRITE) {
> umode_t mode = inode->i_mode;

BTW it is impossible to disable swapfile with your patch
[root@ts63 tmp]# swapoff /vz/swap
swapoff: /vz/swap: Text file busy

IMHO S_SWAPFILE flag must be checked in ext3_setattr, see patch attached:


Attachments:
(No filename) (3.47 kB)
swap_flag.patch (463.00 B)
Download all attachments

2008-03-16 15:23:18

by Andreas Dilger

[permalink] [raw]
Subject: Re: strange ext{3,4}_settattr logic

On Mar 16, 2008 14:39 +0300, Dmitri Monakhov wrote:
> > --- linux-2.6.24/fs/namei.c.orig 2008-02-05 07:29:57.000000000 +0800
> > +++ linux-2.6.24/fs/namei.c 2008-03-16 08:11:41.000000000 +0800
> > @@ -233,6 +233,10 @@ int permission(struct inode *inode, int
> > if (nd)
> > mnt = nd->mnt;
> >
> > + /* Don't allow direct read, write, or truncate on a swapfile */
>
> Your patch disallow expand truncates (#2) which is definitely not good.

How often is that done though? Since this is only for a swapfile then
it isn't needed.

> In fact if file was opened before S_SWAPFILE flag was setted when it is
> possible to directly read, write from file.

I assume that is even a more rare case. I was thinking alternately to
do a "deny_write" on the swapfile during swapon() so that this would
fail in more cases.

> > + if (IS_SWAPFILE(inode))
> > + return -ETXTBUSY;
> > +
> > if (mask & MAY_WRITE) {
> > umode_t mode = inode->i_mode;
>
> BTW it is impossible to disable swapfile with your patch
> [root@ts63 tmp]# swapoff /vz/swap
> swapoff: /vz/swap: Text file busy

I thought some bug like this might appear.

> IMHO S_SWAPFILE flag must be checked in ext3_setattr, see patch attached:

No, that still means every other filesystem is broken. Since the current
filesystem code doesn't know anything about IS_SWAPFILE I'd rather keep
it that way. I think it is better to move my proposed IS_SWAPFILE() check
into "MAY_WRITE" and "MAY_EXEC" cases.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-03-16 17:53:54

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: strange ext{3,4}_settattr logic

On 23:22 Sun 16 Mar , Andreas Dilger wrote:
> On Mar 16, 2008 14:39 +0300, Dmitri Monakhov wrote:
> > > --- linux-2.6.24/fs/namei.c.orig 2008-02-05 07:29:57.000000000 +0800
> > > +++ linux-2.6.24/fs/namei.c 2008-03-16 08:11:41.000000000 +0800
> > > @@ -233,6 +233,10 @@ int permission(struct inode *inode, int
> > > if (nd)
> > > mnt = nd->mnt;
> > >
> > > + /* Don't allow direct read, write, or truncate on a swapfile */
> >
> > Your patch disallow expand truncates (#2) which is definitely not good.
>
> How often is that done though? Since this is only for a swapfile then
> it isn't needed.
I hope what this technique may be used for implementing non plain device image
formats support in loop device driver, currently all such drivers implemented
in userspace ( for example qcow:
http://brick.kernel.dk/git/?p=qemu.git;a=blob;h=730ae67ed8f7095ca4d22e80aca57678be18fb6c;hb=2fd2f67411787297687e4eecbb1db76ac7d06e45;f=block-qcow.c)
which cause bad performance and reliability.
>
> > In fact if file was opened before S_SWAPFILE flag was setted when it is
> > possible to directly read, write from file.
>
> I assume that is even a more rare case. I was thinking alternately to
> do a "deny_write" on the swapfile during swapon() so that this would
> fail in more cases.
>
> > > + if (IS_SWAPFILE(inode))
> > > + return -ETXTBUSY;
> > > +
> > > if (mask & MAY_WRITE) {
> > > umode_t mode = inode->i_mode;
> >
> > BTW it is impossible to disable swapfile with your patch
> > [root@ts63 tmp]# swapoff /vz/swap
> > swapoff: /vz/swap: Text file busy
>
> I thought some bug like this might appear.
>
> > IMHO S_SWAPFILE flag must be checked in ext3_setattr, see patch attached:
>
> No, that still means every other filesystem is broken. Since the current
> filesystem code doesn't know anything about IS_SWAPFILE I'd rather keep
> it that way. I think it is better to move my proposed IS_SWAPFILE() check
> into "MAY_WRITE" and "MAY_EXEC" cases.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-03-17 08:24:18

by Jens Axboe

[permalink] [raw]
Subject: Re: strange ext{3,4}_settattr logic

On Sun, Mar 16 2008, Dmitri Monakhov wrote:
> I've added Jens because he may be also interesting in this topic.
> On 08:23 Sun 16 Mar , Andreas Dilger wrote:
> > On Mar 16, 2008 07:54 +0800, Andreas Dilger wrote:
> > > A call to inode_setattr() can fail by trying a shrinking vmtruncate()
> > > on a swapfile, which returns ETXTBUSY. This was added after the
> > > ext3_setattr() code was written.
> > >
> > > We need to handle the IS_SWAPFILE() case properly.
> > > Granted, it probably isn't a very common problem, but the IS_SWAPFILE()
> > > check was added explicitly because of clueless users, so it must be hit
> > > occasionally in real life.
> > >
> > > It would seem that if you have a swapfile, try to truncate it to 0 (which
> > > will fail with -ETXTBUSY) and then unmount the filesystem the size will
> > > be truncated to 0. It is also possible to directly write to a swapfile
> > > and corrupt memory, or read from a swapfile and access potentially sensitive
> > > information.
> In fact i've triggered this issue while working on fast_loop device
> implementation which was proposed by Jens (http://lkml.org/lkml/2008/1/9/50).
> In fast_loop device use swapfile approach (submitting bio-s directly to
> underlying block device). I think this idea will be included in mainstream
> loop device sooner or later. But approach has several issues:
> One of the most important is effective control mechanism over truncates for
> lower file, this issue was missed in Jens patch set.
> This mechanism probably have to have following options.
> #1: Shrink truncates must be denied.
> #2: Expand truncates may be allowed. This is good because most of non plain
> disk image formats (qcow, vmdk, and etc) are growing while adding new data
> blocks.
> #3: Allow exclusive owner for file, for example only one user(loop_thread in
> this case) may truncate file. Provide something similar to bd_claim feature.
> without this feature on-line shrinking of disk image looks like this:
>
> mutex_lock(&inode->i_mutex);
> inode->i_flags &= ~S_SWAPFILE;
> mutex_unlock(&inode->i_mutex);
>
> /* Perform shrinking truncate. This is absolutely racy operation because
> * some one else also may perform truncate at this time*/
> do_truncate(inode, size);
>
> mutex_lock(&inode->i_mutex);
> inode->i_flags |= S_SWAPFILE;
> mutex_unlock(&inode->i_mutex);
>
>
> S_SWAPFILE inode option currently equals to #1, and #2. What's why i want
> use this flag for fast_loop devices.

Neat, I didn't consider that. Mainly because I had (knowingly) ignored
the exclusive owner bit so far. I have included your suggestion in the
loop-fastfs and loop-extent_map branches, thanks.

--
Jens Axboe