2007-01-09 01:39:14

by Hua Zhong

[permalink] [raw]
Subject: [PATCH] support O_DIRECT in tmpfs/ramfs

Hi,

A while ago there was a discussion about supporting direct-io on tmpfs.

Here is a simple patch that does it.

1. A new fs flag FS_RAM_BASED is added and the O_DIRECT flag is ignored
if this flag is set (suggestions on a better name?)

2. Specify FS_RAM_BASED for tmpfs and ramfs.

3. When EINVAL is returned only a fput is done. I changed it to go
through cleanup_all. But there is still a cleanup problem:

If a new file is created and then EINVAL is returned due to O_DIRECT,
the file is still left on the disk. I am not exactly sure how to fix
it other than adding another fs flag so we could check O_DIRECT
support at a much earlier stage. Comments on how to fix it?

Signed-off-by: Hua Zhong <[email protected]>

diff --git a/fs/open.c b/fs/open.c
index c989fb4..c03285f 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -708,11 +708,13 @@ static struct file *__dentry_open(struct

/* NB: we're sure to have correct a_ops only after f_op->open */
if (f->f_flags & O_DIRECT) {
- if (!f->f_mapping->a_ops ||
- ((!f->f_mapping->a_ops->direct_IO) &&
- (!f->f_mapping->a_ops->get_xip_page))) {
- fput(f);
- f = ERR_PTR(-EINVAL);
+ if (dentry->d_sb->s_type->fs_flags & FS_RAM_BASED)
+ f->f_flags &= ~O_DIRECT;
+ else if (!f->f_mapping->a_ops ||
+ ((!f->f_mapping->a_ops->direct_IO) &&
+ (!f->f_mapping->a_ops->get_xip_page))) {
+ error = -EINVAL;
+ goto cleanup_all;
}
}

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 2faf4cd..0d4bebc 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -199,11 +199,13 @@ static int rootfs_get_sb(struct file_sys

static struct file_system_type ramfs_fs_type = {
.name = "ramfs",
+ .fs_flags = FS_RAM_BASED,
.get_sb = ramfs_get_sb,
.kill_sb = kill_litter_super,
};
static struct file_system_type rootfs_fs_type = {
.name = "rootfs",
+ .fs_flags = FS_RAM_BASED,
.get_sb = rootfs_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 186da81..0d95988 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -91,6 +91,7 @@ extern int dir_notify_enable;
/* public flags for file_system_type */
#define FS_REQUIRES_DEV 1
#define FS_BINARY_MOUNTDATA 2
+#define FS_RAM_BASED 8192 /* Ignore O_DIRECT */
#define FS_REVAL_DOT 16384 /* Check the paths ".", ".." for staleness */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move()
* during rename() internally.
diff --git a/mm/shmem.c b/mm/shmem.c
index 70da7a0..5d23e8a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2413,6 +2413,7 @@ static int shmem_get_sb(struct file_syst
static struct file_system_type tmpfs_fs_type = {
.owner = THIS_MODULE,
.name = "tmpfs",
+ .fs_flags = FS_RAM_BASED,
.get_sb = shmem_get_sb,
.kill_sb = kill_litter_super,
};


2007-01-09 01:56:33

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH] support O_DIRECT in tmpfs/ramfs



On Jan 8 2007 17:43, Hua Zhong wrote:
>
>1. A new fs flag FS_RAM_BASED is added and the O_DIRECT flag is ignored
> if this flag is set (suggestions on a better name?)

FS_IGNORE_DIRECT. Somehow I think this flag is not only useful for
RAM-based filesystems, but also possibly virtual filesystems (procfs,
sysfs, etc.) Maybe even more such as fuse and unionfs.


-`J'
--

2007-01-09 15:31:05

by Peter Staubach

[permalink] [raw]
Subject: Re: [PATCH] support O_DIRECT in tmpfs/ramfs

Hua Zhong wrote:
> Hi,
>
> A while ago there was a discussion about supporting direct-io on tmpfs.
>
> Here is a simple patch that does it.
>
> 1. A new fs flag FS_RAM_BASED is added and the O_DIRECT flag is ignored
> if this flag is set (suggestions on a better name?)
>
> 2. Specify FS_RAM_BASED for tmpfs and ramfs.
>
> 3. When EINVAL is returned only a fput is done. I changed it to go
> through cleanup_all. But there is still a cleanup problem:
>
> If a new file is created and then EINVAL is returned due to O_DIRECT,
> the file is still left on the disk. I am not exactly sure how to fix
> it other than adding another fs flag so we could check O_DIRECT
> support at a much earlier stage. Comments on how to fix it?

This would seem to create two different sets of O_DIRECT semantics,
wouldn't it? I think that it would be possible to develop an application
using one of these FS_RAM_BASED file systems as the testbed, but then be
surprised when the application failed to work on other file systems such
as ext3.

ps

2007-01-09 17:17:32

by Michael Reed

[permalink] [raw]
Subject: Re: [PATCH] support O_DIRECT in tmpfs/ramfs



Peter Staubach wrote:
> Hua Zhong wrote:
>> Hi,
>>
>> A while ago there was a discussion about supporting direct-io on tmpfs.
>>
>> Here is a simple patch that does it.
>>
>> 1. A new fs flag FS_RAM_BASED is added and the O_DIRECT flag is ignored
>> if this flag is set (suggestions on a better name?)
>>
>> 2. Specify FS_RAM_BASED for tmpfs and ramfs.
>>
>> 3. When EINVAL is returned only a fput is done. I changed it to go
>> through cleanup_all. But there is still a cleanup problem:
>>
>> If a new file is created and then EINVAL is returned due to O_DIRECT,
>> the file is still left on the disk. I am not exactly sure how to fix
>> it other than adding another fs flag so we could check O_DIRECT
>> support at a much earlier stage. Comments on how to fix it?
>
> This would seem to create two different sets of O_DIRECT semantics,
> wouldn't it? I think that it would be possible to develop an application
> using one of these FS_RAM_BASED file systems as the testbed, but then be
> surprised when the application failed to work on other file systems such
> as ext3.

As I'm ignorant with regard to what is needed for "compliant"
support of O_DIRECT on tmpfs, what are the issues with actually implementing
the proper semantics, including the alignment and any transfer length
restrictions?

My $.02 is that the implementation should be fully compliant with the
current semantics or it shouldn't be implemented. And I think it should
be implemented.

Mike

>
> ps
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>

2007-01-09 20:44:31

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] support O_DIRECT in tmpfs/ramfs

On Mon, 8 Jan 2007, Hua Zhong wrote:
> A while ago there was a discussion about supporting direct-io on tmpfs.

Ah, I think I can just about remember that... ;)

>
> Here is a simple patch that does it.

Looks more likely to work than Ken's - which I didn't try,
but I couldn't see what magic prevented it from just going BUG.

But I have to say, having seen the ensuing requests for this to impose
the same constraints as other implementations of O_DIRECT (though NFS
does not), I've veered right back to my original position: this all
just seems silly to me. O_DIRECT is and always has been rather an
awkward hack (Linus described it in stronger terms!), supported by
many but not all filesystems: shall we just leave it at that?

In particular, having now looked into the code, I'm amused to be
reminded that one of its particular effects is to invalidate the
pagecache for the area directly written. Well, it's hardly going
to be worth replicating that behaviour with tmpfs or ramfs; yet
if we don't, then we stand accused of it behaving misleadingly
differently on them.

I think Michael, who started off this discussion, did just the right
thing: used a direct_IO filesystem on a loop device on a tmpfs file.

>
> 1. A new fs flag FS_RAM_BASED is added and the O_DIRECT flag is ignored
> if this flag is set (suggestions on a better name?)
>
> 2. Specify FS_RAM_BASED for tmpfs and ramfs.

If this is pursued (not my preference, but let me stand aside now),
you'd want to add in at least hugetlbfs and tiny-shmem.c. And set
your (renamed) FS_RAM_BASED flag in ext2_aops_xip: that seems to
be what they're wanting, then you can remove that strange test
for f->f_mapping->a_ops->get_xip_page from __dentry_open.

>
> 3. When EINVAL is returned only a fput is done. I changed it to go
> through cleanup_all. But there is still a cleanup problem:

Is that change correct? Are you saying that the existing code leaks
some structures? If so, please do send a patch to fix just that as
soon as you can. But are you sure?

>
> If a new file is created and then EINVAL is returned due to O_DIRECT,
> the file is still left on the disk. I am not exactly sure how to fix
> it other than adding another fs flag so we could check O_DIRECT
> support at a much earlier stage. Comments on how to fix it?

None from me, sorry. It's untidy, but not a new issue you have to fix.

Hugh

>
> Signed-off-by: Hua Zhong <[email protected]>
>
> diff --git a/fs/open.c b/fs/open.c
> index c989fb4..c03285f 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -708,11 +708,13 @@ static struct file *__dentry_open(struct
>
> /* NB: we're sure to have correct a_ops only after f_op->open */
> if (f->f_flags & O_DIRECT) {
> - if (!f->f_mapping->a_ops ||
> - ((!f->f_mapping->a_ops->direct_IO) &&
> - (!f->f_mapping->a_ops->get_xip_page))) {
> - fput(f);
> - f = ERR_PTR(-EINVAL);
> + if (dentry->d_sb->s_type->fs_flags & FS_RAM_BASED)
> + f->f_flags &= ~O_DIRECT;
> + else if (!f->f_mapping->a_ops ||
> + ((!f->f_mapping->a_ops->direct_IO) &&
> + (!f->f_mapping->a_ops->get_xip_page))) {
> + error = -EINVAL;
> + goto cleanup_all;
> }
> }
>
> diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> index 2faf4cd..0d4bebc 100644
> --- a/fs/ramfs/inode.c
> +++ b/fs/ramfs/inode.c
> @@ -199,11 +199,13 @@ static int rootfs_get_sb(struct file_sys
>
> static struct file_system_type ramfs_fs_type = {
> .name = "ramfs",
> + .fs_flags = FS_RAM_BASED,
> .get_sb = ramfs_get_sb,
> .kill_sb = kill_litter_super,
> };
> static struct file_system_type rootfs_fs_type = {
> .name = "rootfs",
> + .fs_flags = FS_RAM_BASED,
> .get_sb = rootfs_get_sb,
> .kill_sb = kill_litter_super,
> };
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 186da81..0d95988 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -91,6 +91,7 @@ extern int dir_notify_enable;
> /* public flags for file_system_type */
> #define FS_REQUIRES_DEV 1
> #define FS_BINARY_MOUNTDATA 2
> +#define FS_RAM_BASED 8192 /* Ignore O_DIRECT */
> #define FS_REVAL_DOT 16384 /* Check the paths ".", ".." for staleness */
> #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move()
> * during rename() internally.
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 70da7a0..5d23e8a 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2413,6 +2413,7 @@ static int shmem_get_sb(struct file_syst
> static struct file_system_type tmpfs_fs_type = {
> .owner = THIS_MODULE,
> .name = "tmpfs",
> + .fs_flags = FS_RAM_BASED,
> .get_sb = shmem_get_sb,
> .kill_sb = kill_litter_super,
> };

2007-01-09 21:57:26

by Hua Zhong

[permalink] [raw]
Subject: RE: [PATCH] support O_DIRECT in tmpfs/ramfs

> > Here is a simple patch that does it.
>
> Looks more likely to work than Ken's - which I didn't try,
> but I couldn't see what magic prevented it from just going BUG.
>
> But I have to say, having seen the ensuing requests for this
> to impose the same constraints as other implementations of
> O_DIRECT (though NFS does not), I've veered right back to my
> original position: this all just seems silly to me. O_DIRECT
> is and always has been rather an awkward hack (Linus
> described it in stronger terms!), supported by many but not
> all filesystems: shall we just leave it at that?

So I take your word that NFS does not impose this restraint,
which means filesystems could choose their own alignment
requirement that makes sense. So it would not be too horrible
if tmpfs chooses to be liberal.

In fact, in the O_DIRECT man page it says:

O_DIRECT
[....] Under Linux 2.4 transfer sizes, and the alignment of
user buffer and file offset must all be multiples of the logi-
cal block size of the file system. Under Linux 2.6 alignment to
512-byte boundaries suffices.

So even Linux 2.4 and 2.6 are different - 2.6 is less restrictive.

My point is that as long as we don't put more restrictions, it should
not cause real problems.

And about Linus..let's put his comment aside because O_DIRECT
is there to stay. :-) In fact, since O_DIRECT is not the most
beaufitul piece of code in the kernel, what I try to do is just to
make software developer's life easier by making filesystem
behavior as close to each other as possible with the minimal
effort.

> In particular, having now looked into the code, I'm amused to
> be reminded that one of its particular effects is to
> invalidate the pagecache for the area directly written.
> Well, it's hardly going to be worth replicating that
> behaviour with tmpfs or ramfs; yet if we don't, then we stand
> accused of it behaving misleadingly differently on them.
>
> I think Michael, who started off this discussion, did just the right
> thing: used a direct_IO filesystem on a loop device on a tmpfs file.

That's a rather heavy-weight workaround don't you think?

> > 1. A new fs flag FS_RAM_BASED is added and the O_DIRECT
> flag is ignored
> > if this flag is set (suggestions on a better name?)
> >
> > 2. Specify FS_RAM_BASED for tmpfs and ramfs.
>
> If this is pursued (not my preference, but let me stand aside
> now), you'd want to add in at least hugetlbfs and
> tiny-shmem.c. And set your (renamed) FS_RAM_BASED flag in
> ext2_aops_xip: that seems to be what they're wanting, then
> you can remove that strange test for
> f->f_mapping->a_ops->get_xip_page from __dentry_open.
>
> >
> > 3. When EINVAL is returned only a fput is done. I changed it to go
> > through cleanup_all. But there is still a cleanup problem:
>
> Is that change correct? Are you saying that the existing
> code leaks some structures? If so, please do send a patch to
> fix just that as soon as you can. But are you sure?

Having looked at the code more closely, the change is probably
not correct. fput(f) apparently does everything cleanup_all does,
and more, despite it's a single call. I guess those names are
a bit confusing at first glance. :-)

> > If a new file is created and then EINVAL is returned due to
> > O_DIRECT, the file is still left on the disk. I am not exactly
> > sure how to fix it other than adding another fs flag so we
> > could check O_DIRECT support at a much earlier stage.
> > Comments on how to fix it?
>
> None from me, sorry. It's untidy, but not a new issue you
> have to fix.

Well, looks like people are not in consensus to add the tmpfs
direct-io support, but since I've looked at the code, it would be
nice to fix this bug though.

The get_xip_page thing you mentioned makes it a bit more
complicated since XIP support is a mount option, not a
register_filesystem time option. If we ought to add a flag somewhere,
where is the right place? vfsmount?

I can cook up a patch for this bug if people think it's worth fixing.

2007-01-10 08:21:08

by Aubrey Li

[permalink] [raw]
Subject: Re: [PATCH] support O_DIRECT in tmpfs/ramfs

Hi Hua Zhong,

Maybe I misunderstand your patch, but when I tried it on my blackfin
uClinux platform, I can't change anything. See below:
================================
root:/var> mount
/dev/mtdblock0 on / type ext2 (rw)
/proc on /proc type proc (rw)
ramfs on /var type ramfs (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw)
root:/var> ./t_direct
Error open t.bin to read
================================
Error is because O_DIRECT flag was set when call open(). If I remove this flag,
the test program can work ok.

Any suggestions?

Thanks,
-Aubrey

On 1/10/07, Hua Zhong <[email protected]> wrote:
> > > Here is a simple patch that does it.
> >
> > Looks more likely to work than Ken's - which I didn't try,
> > but I couldn't see what magic prevented it from just going BUG.
> >
> > But I have to say, having seen the ensuing requests for this
> > to impose the same constraints as other implementations of
> > O_DIRECT (though NFS does not), I've veered right back to my
> > original position: this all just seems silly to me. O_DIRECT
> > is and always has been rather an awkward hack (Linus
> > described it in stronger terms!), supported by many but not
> > all filesystems: shall we just leave it at that?
>
> So I take your word that NFS does not impose this restraint,
> which means filesystems could choose their own alignment
> requirement that makes sense. So it would not be too horrible
> if tmpfs chooses to be liberal.
>
> In fact, in the O_DIRECT man page it says:
>
> O_DIRECT
> [....] Under Linux 2.4 transfer sizes, and the alignment of
> user buffer and file offset must all be multiples of the logi-
> cal block size of the file system. Under Linux 2.6 alignment to
> 512-byte boundaries suffices.
>
> So even Linux 2.4 and 2.6 are different - 2.6 is less restrictive.
>
> My point is that as long as we don't put more restrictions, it should
> not cause real problems.
>
> And about Linus..let's put his comment aside because O_DIRECT
> is there to stay. :-) In fact, since O_DIRECT is not the most
> beaufitul piece of code in the kernel, what I try to do is just to
> make software developer's life easier by making filesystem
> behavior as close to each other as possible with the minimal
> effort.
>
> > In particular, having now looked into the code, I'm amused to
> > be reminded that one of its particular effects is to
> > invalidate the pagecache for the area directly written.
> > Well, it's hardly going to be worth replicating that
> > behaviour with tmpfs or ramfs; yet if we don't, then we stand
> > accused of it behaving misleadingly differently on them.
> >
> > I think Michael, who started off this discussion, did just the right
> > thing: used a direct_IO filesystem on a loop device on a tmpfs file.
>
> That's a rather heavy-weight workaround don't you think?
>
> > > 1. A new fs flag FS_RAM_BASED is added and the O_DIRECT
> > flag is ignored
> > > if this flag is set (suggestions on a better name?)
> > >
> > > 2. Specify FS_RAM_BASED for tmpfs and ramfs.
> >
> > If this is pursued (not my preference, but let me stand aside
> > now), you'd want to add in at least hugetlbfs and
> > tiny-shmem.c. And set your (renamed) FS_RAM_BASED flag in
> > ext2_aops_xip: that seems to be what they're wanting, then
> > you can remove that strange test for
> > f->f_mapping->a_ops->get_xip_page from __dentry_open.
> >
> > >
> > > 3. When EINVAL is returned only a fput is done. I changed it to go
> > > through cleanup_all. But there is still a cleanup problem:
> >
> > Is that change correct? Are you saying that the existing
> > code leaks some structures? If so, please do send a patch to
> > fix just that as soon as you can. But are you sure?
>
> Having looked at the code more closely, the change is probably
> not correct. fput(f) apparently does everything cleanup_all does,
> and more, despite it's a single call. I guess those names are
> a bit confusing at first glance. :-)
>
> > > If a new file is created and then EINVAL is returned due to
> > > O_DIRECT, the file is still left on the disk. I am not exactly
> > > sure how to fix it other than adding another fs flag so we
> > > could check O_DIRECT support at a much earlier stage.
> > > Comments on how to fix it?
> >
> > None from me, sorry. It's untidy, but not a new issue you
> > have to fix.
>
> Well, looks like people are not in consensus to add the tmpfs
> direct-io support, but since I've looked at the code, it would be
> nice to fix this bug though.
>
> The get_xip_page thing you mentioned makes it a bit more
> complicated since XIP support is a mount option, not a
> register_filesystem time option. If we ought to add a flag somewhere,
> where is the right place? vfsmount?
>
> I can cook up a patch for this bug if people think it's worth fixing.
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>