2002-10-04 22:05:00

by Chuck Lever

[permalink] [raw]
Subject: [PATCH] direct-IO API change

hi Linus-

this patch is a pre-requisite for NFS direct I/O support (which many would
like to see in 2.5/6), and is being awaited by the raw dev guys. please
apply this to 2.5.40.

this patch adds a "struct file *" argument to direct I/O. this is needed
by NFS direct I/O to make the file's credentials available to direct I/O
subroutines. we can't remove "struct inode *" yet because the raw driver
still needs it.

andrew, trond, and the raw dev team are OK with this patch.

diff -drN -U2 05-jiffies/drivers/char/raw.c 06-odirect1/drivers/char/raw.c
--- 05-jiffies/drivers/char/raw.c Tue Oct 1 03:07:07 2002
+++ 06-odirect1/drivers/char/raw.c Fri Oct 4 15:23:13 2002
@@ -223,5 +223,5 @@
nr_segs = iov_shorten((struct iovec *)iov, nr_segs, count);
}
- ret = generic_file_direct_IO(rw, inode, iov, *offp, nr_segs);
+ ret = generic_file_direct_IO(rw, filp, inode, iov, *offp, nr_segs);

if (ret > 0)
diff -drN -U2 05-jiffies/fs/block_dev.c 06-odirect1/fs/block_dev.c
--- 05-jiffies/fs/block_dev.c Tue Oct 1 03:07:55 2002
+++ 06-odirect1/fs/block_dev.c Fri Oct 4 15:23:13 2002
@@ -117,6 +117,7 @@

static int
-blkdev_direct_IO(int rw, struct inode *inode, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs)
+blkdev_direct_IO(int rw, struct file *file, struct inode *inode,
+ const struct iovec *iov, loff_t offset,
+ unsigned long nr_segs)
{
return generic_direct_IO(rw, inode, iov, offset,
diff -drN -U2 05-jiffies/fs/direct-io.c 06-odirect1/fs/direct-io.c
--- 05-jiffies/fs/direct-io.c Tue Oct 1 03:07:04 2002
+++ 06-odirect1/fs/direct-io.c Fri Oct 4 15:23:13 2002
@@ -647,11 +647,12 @@

ssize_t
-generic_file_direct_IO(int rw, struct inode *inode, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs)
+generic_file_direct_IO(int rw, struct file *file, struct inode *inode,
+ const struct iovec *iov, loff_t offset, unsigned long nr_segs)
{
struct address_space *mapping = inode->i_mapping;
ssize_t retval;

- retval = mapping->a_ops->direct_IO(rw, inode, iov, offset, nr_segs);
+ retval = mapping->a_ops->direct_IO(rw, file, inode, iov, offset,
+ nr_segs);
if (inode->i_mapping->nrpages)
invalidate_inode_pages2(inode->i_mapping);
diff -drN -U2 05-jiffies/fs/ext2/inode.c 06-odirect1/fs/ext2/inode.c
--- 05-jiffies/fs/ext2/inode.c Tue Oct 1 03:06:16 2002
+++ 06-odirect1/fs/ext2/inode.c Fri Oct 4 15:23:13 2002
@@ -620,6 +620,7 @@

static int
-ext2_direct_IO(int rw, struct inode *inode, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs)
+ext2_direct_IO(int rw, struct file *file, struct inode *inode,
+ const struct iovec *iov, loff_t offset,
+ unsigned long nr_segs)
{
return generic_direct_IO(rw, inode, iov,
diff -drN -U2 05-jiffies/fs/ext3/inode.c 06-odirect1/fs/ext3/inode.c
--- 05-jiffies/fs/ext3/inode.c Tue Oct 1 03:07:11 2002
+++ 06-odirect1/fs/ext3/inode.c Fri Oct 4 15:23:13 2002
@@ -1400,5 +1400,5 @@
* crashes then stale disk data _may_ be exposed inside the file.
*/
-static int ext3_direct_IO(int rw, struct inode *inode,
+static int ext3_direct_IO(int rw, struct file *file, struct inode *inode,
const struct iovec *iov, loff_t offset,
unsigned long nr_segs)
diff -drN -U2 05-jiffies/fs/jfs/inode.c 06-odirect1/fs/jfs/inode.c
--- 05-jiffies/fs/jfs/inode.c Tue Oct 1 03:05:46 2002
+++ 06-odirect1/fs/jfs/inode.c Fri Oct 4 15:23:13 2002
@@ -311,6 +311,7 @@
}

-static int jfs_direct_IO(int rw, struct inode *inode, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs)
+static int jfs_direct_IO(int rw, struct file *file, struct inode *inode,
+ const struct iovec *iov, loff_t offset,
+ unsigned long nr_segs)
{
return generic_direct_IO(rw, inode, iov,
diff -drN -U2 05-jiffies/fs/xfs/linux/xfs_aops.c 06-odirect1/fs/xfs/linux/xfs_aops.c
--- 05-jiffies/fs/xfs/linux/xfs_aops.c Tue Oct 1 03:07:36 2002
+++ 06-odirect1/fs/xfs/linux/xfs_aops.c Fri Oct 4 15:23:13 2002
@@ -682,4 +682,5 @@
linvfs_direct_IO(
int rw,
+ struct file *file,
struct inode *inode,
const struct iovec *iov,
diff -drN -U2 05-jiffies/include/linux/fs.h 06-odirect1/include/linux/fs.h
--- 05-jiffies/include/linux/fs.h Tue Oct 1 03:06:25 2002
+++ 06-odirect1/include/linux/fs.h Fri Oct 4 15:23:13 2002
@@ -309,5 +309,7 @@
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
- int (*direct_IO)(int, struct inode *, const struct iovec *iov, loff_t offset, unsigned long nr_segs);
+ int (*direct_IO)(int, struct file *, struct inode *,
+ const struct iovec *iov, loff_t offset,
+ unsigned long nr_segs);
};

@@ -1248,6 +1250,7 @@
extern ssize_t generic_file_sendfile(struct file *, struct file *, loff_t *, size_t);
extern void do_generic_file_read(struct file *, loff_t *, read_descriptor_t *, read_actor_t);
-extern ssize_t generic_file_direct_IO(int rw, struct inode *inode,
- const struct iovec *iov, loff_t offset, unsigned long nr_segs);
+extern ssize_t generic_file_direct_IO(int rw, struct file *file,
+ struct inode *inode, const struct iovec *iov, loff_t offset,
+ unsigned long nr_segs);
extern int generic_direct_IO(int rw, struct inode *inode, const struct iovec
*iov, loff_t offset, unsigned long nr_segs, get_blocks_t *get_blocks);
diff -drN -U2 05-jiffies/include/linux/nfs_fs.h 06-odirect1/include/linux/nfs_fs.h
--- 05-jiffies/include/linux/nfs_fs.h Tue Oct 1 03:06:30 2002
+++ 06-odirect1/include/linux/nfs_fs.h Fri Oct 4 15:23:13 2002
@@ -15,4 +15,5 @@
#include <linux/pagemap.h>
#include <linux/wait.h>
+#include <linux/uio.h>

#include <linux/nfs_fs_sb.h>
@@ -285,4 +286,10 @@

/*
+ * linux/fs/nfs/direct.c
+ */
+extern int nfs_direct_IO(int, struct file *, struct inode *,
+ const struct iovec *, loff_t, unsigned long);
+
+/*
* linux/fs/nfs/dir.c
*/
diff -drN -U2 05-jiffies/kernel/ksyms.c 06-odirect1/kernel/ksyms.c
--- 05-jiffies/kernel/ksyms.c Tue Oct 1 03:05:49 2002
+++ 06-odirect1/kernel/ksyms.c Fri Oct 4 15:23:13 2002
@@ -193,4 +193,5 @@
EXPORT_SYMBOL(invalidate_device);
EXPORT_SYMBOL(invalidate_inode_pages);
+EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
EXPORT_SYMBOL(truncate_inode_pages);
EXPORT_SYMBOL(fsync_bdev);
diff -drN -U2 05-jiffies/mm/filemap.c 06-odirect1/mm/filemap.c
--- 05-jiffies/mm/filemap.c Wed Oct 2 13:02:19 2002
+++ 06-odirect1/mm/filemap.c Fri Oct 4 15:23:13 2002
@@ -1159,5 +1159,5 @@
nr_segs, count);
}
- retval = generic_file_direct_IO(READ, inode,
+ retval = generic_file_direct_IO(READ, filp, inode,
iov, pos, nr_segs);
if (retval > 0)
@@ -1841,5 +1841,5 @@
nr_segs = iov_shorten((struct iovec *)iov,
nr_segs, count);
- written = generic_file_direct_IO(WRITE, inode,
+ written = generic_file_direct_IO(WRITE, file, inode,
iov, pos, nr_segs);
if (written > 0) {

--

corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>


2002-10-04 22:20:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] direct-IO API change




On Fri, 4 Oct 2002, Chuck Lever wrote:
>
> this patch adds a "struct file *" argument to direct I/O. this is needed
> by NFS direct I/O to make the file's credentials available to direct I/O
> subroutines. we can't remove "struct inode *" yet because the raw driver
> still needs it.

Why isn't the raw driver changed to just use file->f_dentry->d_inode
instead?

Linus

2002-10-04 22:32:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] direct-IO API change

Linus Torvalds wrote:
>
> On Fri, 4 Oct 2002, Chuck Lever wrote:
> >
> > this patch adds a "struct file *" argument to direct I/O. this is needed
> > by NFS direct I/O to make the file's credentials available to direct I/O
> > subroutines. we can't remove "struct inode *" yet because the raw driver
> > still needs it.
>
> Why isn't the raw driver changed to just use file->f_dentry->d_inode
> instead?
>

Because the file handle which we have is for /dev/raw/raw0,
not for /dev/hda1.

The raw driver binds to major/minor, not a file*. I considered
changing that (change userspace to pass the open fd). But didn't.

An alternative would be to rewrite i_mapping for /dev/raw/raw0
to be bdev->bd_inode->i_mapping. I guess we should look at doing
that.

2002-10-04 23:16:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] direct-IO API change


On Fri, 4 Oct 2002, Andrew Morton wrote:
>
> Because the file handle which we have is for /dev/raw/raw0,
> not for /dev/hda1.
>
> The raw driver binds to major/minor, not a file*. I considered
> changing that (change userspace to pass the open fd). But didn't.

Ok. I'd really rather have a cleaner internal API and break the raw driver
for a while, than have a silly API just because the raw driver uses it.

Especially since I thought that O_DIRECT on the regular file (or block
device) performed about as well as raw does anyway these days? Or is that
just one of my LSD-induced flashbacks?

Linus

2002-10-04 23:51:37

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] direct-IO API change

>
>
> On Fri, 4 Oct 2002, Andrew Morton wrote:
> >
> > > Especially since I thought that O_DIRECT on the regular file (or block
> > > device) performed about as well as raw does anyway these days? Or is that
> > > just one of my LSD-induced flashbacks?
> > >
> >
> > Now we're not holding i_sem for O_DIRECT writes to blockdevs,
> > I don't think the raw driver offers any advantages at all. It's
> > a compatibility thing to save people from having to add "|O_DIRECT" to
> > their source and then typing `ln -s /dev/hda1 /dev/raw/raw0'.
>
> Maybe the raw driver could become a shell that just adds the O_DIRECT?
> Unless it can do something more, of course..
>
> Linus
>
>

Only issue would be the alignment restriction on blockdev versus raw device.

raw allows 512 byte alignment on userbuff, offset and length.

blockdevice might need 1024/2048 byte alignment.


If we get the alignment patch for DIO, this is not an issue.
(alignment patch - in case of defualt alignment problem,
go down to blkdev_hardsect_size() alignment.)

- Badari

2002-10-04 23:37:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] direct-IO API change


On Fri, 4 Oct 2002, Andrew Morton wrote:
>
> > Especially since I thought that O_DIRECT on the regular file (or block
> > device) performed about as well as raw does anyway these days? Or is that
> > just one of my LSD-induced flashbacks?
> >
>
> Now we're not holding i_sem for O_DIRECT writes to blockdevs,
> I don't think the raw driver offers any advantages at all. It's
> a compatibility thing to save people from having to add "|O_DIRECT" to
> their source and then typing `ln -s /dev/hda1 /dev/raw/raw0'.

Maybe the raw driver could become a shell that just adds the O_DIRECT?
Unless it can do something more, of course..

Linus

2002-10-04 23:29:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] direct-IO API change

Linus Torvalds wrote:
>
> On Fri, 4 Oct 2002, Andrew Morton wrote:
> >
> > Because the file handle which we have is for /dev/raw/raw0,
> > not for /dev/hda1.
> >
> > The raw driver binds to major/minor, not a file*. I considered
> > changing that (change userspace to pass the open fd). But didn't.
>
> Ok. I'd really rather have a cleaner internal API and break the raw driver
> for a while, than have a silly API just because the raw driver uses it.

OK - bust it.

> Especially since I thought that O_DIRECT on the regular file (or block
> device) performed about as well as raw does anyway these days? Or is that
> just one of my LSD-induced flashbacks?
>

Now we're not holding i_sem for O_DIRECT writes to blockdevs,
I don't think the raw driver offers any advantages at all. It's
a compatibility thing to save people from having to add "|O_DIRECT" to
their source and then typing `ln -s /dev/hda1 /dev/raw/raw0'.

I think we can probably delete the raw driver. But I've Cc'ed Janet
and Badari to find out why that's wrong.

2002-10-05 00:04:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] direct-IO API change


On Fri, 4 Oct 2002, Badari Pulavarty wrote:
>
> Only issue would be the alignment restriction on blockdev versus raw device.

Hmm.. We might want to revert the stuff that made the default block device
alignment be the maximal possible, and instead make the default be the
minimum possible.

The offending code is the "while"-loop in bd_set_size(). Just removing
that should make the default size be the minimal one (ie hardsect_size).

It _used_ to make sense to try to maximize the block-size, since it had a
noticeable impact on performance whether we did 8 512-byte requests or
just 1 4kB request. However, all the bio changes have likely made that a
non-issue, since Andrew's code ends up doing things directly one page at a
time _anyway_.

Of course, mounting a filesystem on the device will override the default
anyway, so this by no means will guarantee that direct access will always
have the minimal alignment restrictions, but once you've mounted a 4kB
blocksize filesystem on that device I think you might as well do 4kB
chunks even if you open it through the device interface, no?

Linus

2002-10-05 09:46:25

by Lincoln Dale

[permalink] [raw]
Subject: Re: [PATCH] direct-IO API change

At 04:23 PM 4/10/2002 -0700, Linus Torvalds wrote:
>Especially since I thought that O_DIRECT on the regular file (or block
>device) performed about as well as raw does anyway these days? Or is that
>just one of my LSD-induced flashbacks?

from my multiple 64/66 PCI bus + multiple 2gbit/s FC HBA tests, yes,
they're around the same.
(now up to 390mbyte/sec throughput on latest & greatest x86 hardware i
have; front-side-bus no longer the limiting factor, but dual 64/66 PCI).

of course, purely synthetic tests designed to stress Fibre Channel
switching infrastructure, not real-world disk i/o..


cheers,

lincoln.

2002-10-05 19:28:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] direct-IO API change

Lincoln Dale wrote:
>
> At 04:23 PM 4/10/2002 -0700, Linus Torvalds wrote:
> >Especially since I thought that O_DIRECT on the regular file (or block
> >device) performed about as well as raw does anyway these days? Or is that
> >just one of my LSD-induced flashbacks?
>
> from my multiple 64/66 PCI bus + multiple 2gbit/s FC HBA tests, yes,
> they're around the same.
> (now up to 390mbyte/sec throughput on latest & greatest x86 hardware i
> have; front-side-bus no longer the limiting factor, but dual 64/66 PCI).
>
> of course, purely synthetic tests designed to stress Fibre Channel
> switching infrastructure, not real-world disk i/o..
>

direct-io has lost its challenge ;)

I'd love to see the result of some pagecache testing on that
setup if you have time.

Nothing fancy - just:

for i in $(each ext2 mountpoint)
do
dd if=/dev/zero of=$i/foo bs-1M count=4000 &
done
vmstat 1

and

for i in $(each ext2 mountpoint)
do
cat $i/foo > /dev/null &
done
vmstat 1

Linus's current BK tree has a few warmups which will help the
writeout phase a little.

2002-10-08 13:28:38

by Terje Eggestad

[permalink] [raw]
Subject: Re: [PATCH] direct-IO API change

On l?r, 2002-10-05 at 01:23, Linus Torvalds wrote:
> On Fri, 4 Oct 2002, Andrew Morton wrote:
> >
> > Because the file handle which we have is for /dev/raw/raw0,
> > not for /dev/hda1.
> >
> > The raw driver binds to major/minor, not a file*. I considered
> > changing that (change userspace to pass the open fd). But didn't.
>
> Ok. I'd really rather have a cleaner internal API and break the raw driver
> for a while, than have a silly API just because the raw driver uses it.
>
> Especially since I thought that O_DIRECT on the regular file (or block
> device) performed about as well as raw does anyway these days? Or is that
> just one of my LSD-induced flashbacks?
>

Well, somewhat LSD induced. /dev/raw/rawN has a merit for shared disk.
Databases (Read Oracle) most notably need raw devices for parallell
databases. It would GREATLY benefit from a cluster aware fs with
O_DIRECT support, Sistina added that to their closed source GFS 5.x.

Working with blockdevices are pure evil in my view anyway, but until a
OSS FS that is cluster aware are a part of a vanilla kernel, I'm afraid
that /dev/raw/rawN are a necessary evil

> Linus
>

TJ

> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________