Linux-kernel: is this patch horribly wrong?
On modern Linux, apparently the correct way to bypass the buffer cache
when writing to a block device is to open the block device with
O_DIRECT. This enables, for example, the user to more easily force a
reallocation of a single sector of an IDE disk with a media error
(without overwriting anything but the 1k "sector pair" containing the
error). dd(1) is convenient for this purpose, but is lacking a method
to force O_DIRECT. The enclosed patch adds a "conv=direct" flag to
enable this usage.
Alas, I don't do autoconf, so I hope someone else can add the
appropriate stanza to autoconfiscate HAS_O_DIRECT.
-andy
diff -ur coreutils-5.0.91/doc/coreutils.texi coreutils-5.0.91-adi/doc/coreutils.texi
--- coreutils-5.0.91/doc/coreutils.texi 2003-09-04 16:26:51.000000000 -0500
+++ coreutils-5.0.91-adi/doc/coreutils.texi 2004-04-06 13:48:54.000000000 -0500
@@ -6373,6 +6373,11 @@
Pad every input block to size of @samp{ibs} with trailing zero bytes.
When used with @samp{block} or @samp{unblock}, pad with spaces instead of
zero bytes.
+
+@item direct
+@opindex direct
+Open the output file with O_DIRECT, avoiding (on Linux) using the buffer
+cache.
@end table
@end table
diff -ur coreutils-5.0.91/src/dd.c coreutils-5.0.91-adi/src/dd.c
--- coreutils-5.0.91/src/dd.c 2003-07-25 02:43:09.000000000 -0500
+++ coreutils-5.0.91-adi/src/dd.c 2004-04-06 13:44:09.000000000 -0500
@@ -66,6 +66,9 @@
/* Default input and output blocksize. */
#define DEFAULT_BLOCKSIZE 512
+/* XXX hack */
+#define HAVE_O_DIRECT 1
+
/* Conversions bit masks. */
#define C_ASCII 01
#define C_EBCDIC 02
@@ -78,6 +81,7 @@
#define C_NOERROR 0400
#define C_NOTRUNC 01000
#define C_SYNC 02000
+#define C_DIRECT 010000
/* Use separate input and output buffers, and combine partial input blocks. */
#define C_TWOBUFS 04000
@@ -162,6 +166,9 @@
{"noerror", C_NOERROR}, /* Ignore i/o errors. */
{"notrunc", C_NOTRUNC}, /* Do not truncate output file. */
{"sync", C_SYNC}, /* Pad input records to ibs with NULs. */
+#ifdef HAVE_O_DIRECT
+ {"direct", C_DIRECT}, /* open files with O_DIRECT */
+#endif
{NULL, 0}
};
@@ -1190,6 +1197,11 @@
= (O_CREAT
| (seek_records || (conversions_mask & C_NOTRUNC) ? 0 : O_TRUNC));
+#if HAVE_O_DIRECT
+ if (conversions_mask & C_DIRECT)
+ opts |= O_DIRECT;
+#endif
+
/* Open the output file with *read* access only if we might
need to read to satisfy a `seek=' request. If we can't read
the file, go ahead with write-only access; it might work. */
Andy Isaacson <[email protected]> wrote:
>
> On modern Linux, apparently the correct way to bypass the buffer cache
> when writing to a block device is to open the block device with
> O_DIRECT. This enables, for example, the user to more easily force a
> reallocation of a single sector of an IDE disk with a media error
> (without overwriting anything but the 1k "sector pair" containing the
> error). dd(1) is convenient for this purpose, but is lacking a method
> to force O_DIRECT. The enclosed patch adds a "conv=direct" flag to
> enable this usage.
This would be rather nice to have. You'll need to ensure that the data
is page-aligned in memory.
While you're there, please add an fsync-before-closing option.
> > On modern Linux, apparently the correct way to bypass the buffer cache
> > when writing to a block device is to open the block device with
> > O_DIRECT. This enables, for example, the user to more easily force a
> > reallocation of a single sector of an IDE disk with a media error
> > (without overwriting anything but the 1k "sector pair" containing the
> > error). dd(1) is convenient for this purpose, but is lacking a method
> > to force O_DIRECT. The enclosed patch adds a "conv=direct" flag to
> > enable this usage.
>
> This would be rather nice to have. You'll need to ensure that the data
> is page-aligned in memory.
>
> While you're there, please add an fsync-before-closing option.
Andrew, am I right that this is NOT needed for the proposed O_DIRECT
option, since open(2) says:
"The I/O is synchronous, i.e., at the completion of the read(2) or
write(2) system call, data is guaranteed to have been transferred."
so the write will block until data is physically on the disk.
Cheers,
Bruce
Bruce Allen <[email protected]> wrote:
>
> > > On modern Linux, apparently the correct way to bypass the buffer cache
> > > when writing to a block device is to open the block device with
> > > O_DIRECT. This enables, for example, the user to more easily force a
> > > reallocation of a single sector of an IDE disk with a media error
> > > (without overwriting anything but the 1k "sector pair" containing the
> > > error). dd(1) is convenient for this purpose, but is lacking a method
> > > to force O_DIRECT. The enclosed patch adds a "conv=direct" flag to
> > > enable this usage.
> >
> > This would be rather nice to have. You'll need to ensure that the data
> > is page-aligned in memory.
> >
> > While you're there, please add an fsync-before-closing option.
>
> Andrew, am I right that this is NOT needed for the proposed O_DIRECT
> option, since open(2) says:
> "The I/O is synchronous, i.e., at the completion of the read(2) or
> write(2) system call, data is guaranteed to have been transferred."
> so the write will block until data is physically on the disk.
A conv=fsync option is an irrelevant wishlist item. If you were to provide
conv=fsync then dd should still fsync the file after performing the
direct-io read or write.
You're correct that an fsync after a direct-io read or write should not
need to perform any file data I/O. But it will often need to perform file
metadata I/O - file allocation tables, inode, etc. direct-io only syncs
the file data, not the file metadata.
On Tue, Apr 06, 2004 at 05:33:26PM -0700, Andrew Morton wrote:
> Andy Isaacson <[email protected]> wrote:
> > dd(1) is convenient for this purpose, but is lacking a method
> > to force O_DIRECT. The enclosed patch adds a "conv=direct" flag to
> > enable this usage.
>
> This would be rather nice to have. You'll need to ensure that the data
> is page-aligned in memory.
So, some confusion on my part about O_DIRECT: I can't get O_DIRECT to
work on ext3, at all, on 2.4.25 -- open(O_DIRECT) succeeds, but the write
returns EINVAL. Same code works fine when writing to a block device.
If the problem is that ext3 can't support O_DIRECT, why does the open
succeed?
> While you're there, please add an fsync-before-closing option.
Easy enough. How does this look? Note that C_TWOBUFS ensures the
output buffer is getpagesize()-aligned.
The next feature to add would be OpenBSD-style "KB/s" reporting. I'm
not going there.
-andy
2004-04-07 Andy Isaacson <[email protected]>
* add conv=direct and conv=fsync options.
diff -ur coreutils-5.0.91/doc/coreutils.texi coreutils-5.0.91-adi/doc/coreutils.texi
--- coreutils-5.0.91/doc/coreutils.texi 2003-09-04 16:26:51.000000000 -0500
+++ coreutils-5.0.91-adi/doc/coreutils.texi 2004-04-07 11:26:36.000000000 -0500
@@ -6373,6 +6373,16 @@
Pad every input block to size of @samp{ibs} with trailing zero bytes.
When used with @samp{block} or @samp{unblock}, pad with spaces instead of
zero bytes.
+
+@item fsync
+@opindex fsync
+Call @samp{fsync(2)} on the output file before exiting. This ensures
+that the file data is written to permanent store.
+
+@item direct
+@opindex direct
+Open the output file with O_DIRECT, avoiding (on Linux) using the buffer
+cache.
@end table
@end table
diff -ur coreutils-5.0.91/src/dd.c coreutils-5.0.91-adi/src/dd.c
--- coreutils-5.0.91/src/dd.c 2003-07-25 02:43:09.000000000 -0500
+++ coreutils-5.0.91-adi/src/dd.c 2004-04-07 11:45:23.000000000 -0500
@@ -66,6 +66,9 @@
/* Default input and output blocksize. */
#define DEFAULT_BLOCKSIZE 512
+/* XXX hack */
+#define HAVE_O_DIRECT 1
+
/* Conversions bit masks. */
#define C_ASCII 01
#define C_EBCDIC 02
@@ -78,6 +81,8 @@
#define C_NOERROR 0400
#define C_NOTRUNC 01000
#define C_SYNC 02000
+#define C_FSYNC 010000
+#define C_DIRECT 020000
/* Use separate input and output buffers, and combine partial input blocks. */
#define C_TWOBUFS 04000
@@ -162,6 +167,10 @@
{"noerror", C_NOERROR}, /* Ignore i/o errors. */
{"notrunc", C_NOTRUNC}, /* Do not truncate output file. */
{"sync", C_SYNC}, /* Pad input records to ibs with NULs. */
+ {"fsync", C_FSYNC}, /* call fsync(2) before closing output file */
+#ifdef HAVE_O_DIRECT
+ {"direct", C_DIRECT | C_TWOBUFS}, /* open files with O_DIRECT */
+#endif
{NULL, 0}
};
@@ -1190,6 +1199,11 @@
= (O_CREAT
| (seek_records || (conversions_mask & C_NOTRUNC) ? 0 : O_TRUNC));
+#if HAVE_O_DIRECT
+ if (conversions_mask & C_DIRECT)
+ opts |= O_DIRECT;
+#endif
+
/* Open the output file with *read* access only if we might
need to read to satisfy a `seek=' request. If we can't read
the file, go ahead with write-only access; it might work. */
@@ -1240,5 +1254,11 @@
exit_status = dd_copy ();
+ if (conversions_mask & C_FSYNC)
+ {
+ if (fsync (STDOUT_FILENO) != 0)
+ error(0, errno, _("cannot fsync %s"), quote (output_file));
+ }
+
quit (exit_status);
}
Andy Isaacson <[email protected]> wrote:
>
> On Tue, Apr 06, 2004 at 05:33:26PM -0700, Andrew Morton wrote:
> > Andy Isaacson <[email protected]> wrote:
> > > dd(1) is convenient for this purpose, but is lacking a method
> > > to force O_DIRECT. The enclosed patch adds a "conv=direct" flag to
> > > enable this usage.
> >
> > This would be rather nice to have. You'll need to ensure that the data
> > is page-aligned in memory.
>
> So, some confusion on my part about O_DIRECT: I can't get O_DIRECT to
> work on ext3, at all, on 2.4.25
ext3 doesn't support O_DIRECT in 2.4 kernels. I did a patch once and I
think it's in 2.4-aa kernels.
ext3 supports O_DIRECT in 2.6 kernels. Quite a number of filesystems do.
> -- open(O_DIRECT) succeeds, but the write
> returns EINVAL.
Yup that's a bit silly. In 2.6 we do the check at open() and fcntl() time.
In 2.4 we don't fail until the actual I/O attempt.
> Same code works fine when writing to a block device.
> If the problem is that ext3 can't support O_DIRECT, why does the open
> succeed?
We have been insufficiently assiduous in merging externally-supported
patches into the mainline 2.4 tree.
> > While you're there, please add an fsync-before-closing option.
>
> Easy enough. How does this look? Note that C_TWOBUFS ensures the
> output buffer is getpagesize()-aligned.
Looks nice and simple. You'll need an ext2 filesystem to test it under 2.4.
Be aware that it's rather a challenge to actually get the O_DIRECT #define
in scope under some glibc versions. I think you need to define _GNU_SOURCE
or something like that.
In article <[email protected]>,
Andy Isaacson <[email protected]> wrote:
>The next feature to add would be OpenBSD-style "KB/s" reporting. I'm
>not going there.
>
>diff -ur coreutils-5.0.91/doc/coreutils.texi
>coreutils-5.0.91-adi/doc/coreutils.texi
Doesn't it already do that ?
$ dd if=/dev/zero of=/tmp/file bs=4K count=100
100+0 records in
100+0 records out
409600 bytes transferred in 0.005108 seconds (80189583 bytes/sec)
$ dpkg -S /bin/dd
coreutils: /bin/dd
$ dpkg -s coreutils
Package: coreutils
Version: 5.0.91-2
Mike.
On Wed, Apr 07, 2004 at 11:18:41AM -0700, Andrew Morton wrote:
> Andy Isaacson <[email protected]> wrote:
> > > While you're there, please add an fsync-before-closing option.
> >
> > Easy enough. How does this look? Note that C_TWOBUFS ensures the
> > output buffer is getpagesize()-aligned.
>
> Looks nice and simple. You'll need an ext2 filesystem to test it under 2.4.
I tested against a block device. Easier to just "swapoff /dev/hda2"
than to find space to make a new filesystem.
> Be aware that it's rather a challenge to actually get the O_DIRECT #define
> in scope under some glibc versions. I think you need to define _GNU_SOURCE
> or something like that.
Yeah, I had to -D_GNU_SOURCE to build standalone, but coreutils'
makefile seems to do OK.
Would there be any reason to allow O_DIRECT on the read side?
-andy
Andy Isaacson <[email protected]> wrote:
>
> Would there be any reason to allow O_DIRECT on the read side?
Sure. It saves CPU, avoids blowing pagecache, just as with O_DIRECT
writes.
On Wed, Apr 07, 2004 at 12:34:55PM -0700, Andrew Morton wrote:
> Andy Isaacson <[email protected]> wrote:
> > Would there be any reason to allow O_DIRECT on the read side?
>
> Sure. It saves CPU,
OK, I can see that one. But it seems like a pretty small benefit to me
-- CPU utilization is already really low.
> avoids blowing pagecache,
Um, that sounds like a bad idea to me. It seems to me it's the kernel's
responsibility to figure out "hey, looks like a streaming read - let's
not blow out the buffer cache trying to hold 20GB on a 512M system." If
you're saying that the kernel guys have given up and the established
wisdom is now "you gotta use O_DIRECT if you don't want to throw
everything else out due to streaming data", well... I'm disappointed.
> just as with O_DIRECT writes.
Wouldn't opening both if= and of= with O_DIRECT turn dd into a
synchronous copier? That would really suck in the
"dd if=/dev/hda1 of=/dev/hdc1" case. With buffer cache doing
readahead, that command can get, say, 40MB/s read and 40MB/s write;
with synch read and synch write, it would drop to 40MB/s read+write,
assuming that block sizes are big enough to amortize seek overhead.
Having O_DIRECT on just of=, I think you can get back to 40MB/s+40MB/s.
I claim that O_DIRECT on of= is important because you just plain *can
not* do the minimal-sized IDE block scrub without it. I don't yet see a
similar benefit to O_DIRECT on if= side.
-andy
Andy Isaacson <[email protected]> wrote:
>
> I claim that O_DIRECT on of= is important because you just plain *can
> not* do the minimal-sized IDE block scrub without it. I don't yet see a
> similar benefit to O_DIRECT on if= side.
If you want a block scrubber then write a block scrubber.
If you want to add O_DIRECT support to dd then it should be implemented
properly, and that means implementing it for both read and write.
In fact the user should be able to specify the read-O_DIRECT and the
write-O_DIRECT independently - if for no other reason than that the source
and dest filesytems may not both support O_DIRECT.
On Wed, Apr 07, 2004 at 07:12:21PM +0000, Miquel van Smoorenburg wrote:
> In article <[email protected]>,
> Andy Isaacson <[email protected]> wrote:
> >The next feature to add would be OpenBSD-style "KB/s" reporting. I'm
> >not going there.
> >
> >diff -ur coreutils-5.0.91/doc/coreutils.texi
> >coreutils-5.0.91-adi/doc/coreutils.texi
>
> Doesn't it already do that ?
>
> $ dd if=/dev/zero of=/tmp/file bs=4K count=100
> 100+0 records in
> 100+0 records out
> 409600 bytes transferred in 0.005108 seconds (80189583 bytes/sec)
hmmm...
debian/patches/05_dd-performance-counter.patch
Not upstream.
-andy
Andrew Morton <[email protected]> writes:
> In 2.6 we do the check at open() and fcntl() time. In 2.4 we don't
> fail until the actual I/O attempt.
This raises the issue of what "dd conv=direct" should do in 2.4
kernels. I propose that it should report an error and exit, when the
write fails, since conv=direct can't be implemented. The basic idea
is that on systems that lack direct I/O, conv=direct should fail.
Another issue with this patch: in Solaris, direct I/O is done by
invoking directio(DIRECTIO_ON); see
<http://docs.sun.com/db/doc/816-0213/6m6ne37so?q=directio&a=view>.
Is Solaris direct I/O a direct analog to Linux direct I/O, or are
there subtle differences in semantics that should be made visible to
the users of GNU "dd"?
On Wed, Apr 07, 2004 at 01:03:08PM -0700, Andrew Morton wrote:
> Andy Isaacson <[email protected]> wrote:
> > I claim that O_DIRECT on of= is important because you just plain *can
> > not* do the minimal-sized IDE block scrub without it. I don't yet see a
> > similar benefit to O_DIRECT on if= side.
>
> If you want a block scrubber then write a block scrubber.
They exist. They're a pain in the ass to find when you need one. dd is
almost there; a small patch that might have a chance of getting accepted
upstream seemed like a reasonable time investment.
> If you want to add O_DIRECT support to dd then it should be implemented
> properly, and that means implementing it for both read and write.
>
> In fact the user should be able to specify the read-O_DIRECT and the
> write-O_DIRECT independently - if for no other reason than that the source
> and dest filesytems may not both support O_DIRECT.
Of course if both directions are supported they must be independently
specifiable. I just don't see a compelling use case for the input side.
Nevertheless, here you go. Still needs some autoconfiscation.
2004-04-07 Andy Isaacson <[email protected]>
* add conv=direct, conv=idirect, and conv=fsync options.
diff -u'rx*.1' coreutils-5.0.91/doc/coreutils.texi coreutils-5.0.91-adi/doc/coreutils.texi
--- coreutils-5.0.91/doc/coreutils.texi 2003-09-04 16:26:51.000000000 -0500
+++ coreutils-5.0.91-adi/doc/coreutils.texi 2004-04-07 15:24:17.000000000 -0500
@@ -6373,6 +6373,20 @@
Pad every input block to size of @samp{ibs} with trailing zero bytes.
When used with @samp{block} or @samp{unblock}, pad with spaces instead of
zero bytes.
+
+@item fsync
+@opindex fsync
+Call @samp{fsync(2)} on the output file before exiting. This ensures
+that the file data is written to permanent store.
+
+@item direct
+@opindex direct
+Open the output file with O_DIRECT, avoiding (on Linux) using the buffer
+cache.
+
+@item idirect
+@opindex idirect
+Open the input file with O_DIRECT, avoiding (on Linux) using the buffer
@end table
@end table
diff -u'rx*.1' coreutils-5.0.91/src/dd.c coreutils-5.0.91-adi/src/dd.c
--- coreutils-5.0.91/src/dd.c 2003-07-25 02:43:09.000000000 -0500
+++ coreutils-5.0.91-adi/src/dd.c 2004-04-07 15:23:24.000000000 -0500
@@ -66,6 +66,9 @@
/* Default input and output blocksize. */
#define DEFAULT_BLOCKSIZE 512
+/* XXX hack */
+#define HAVE_O_DIRECT 1
+
/* Conversions bit masks. */
#define C_ASCII 01
#define C_EBCDIC 02
@@ -78,6 +81,9 @@
#define C_NOERROR 0400
#define C_NOTRUNC 01000
#define C_SYNC 02000
+#define C_FSYNC 010000
+#define C_DIRECT 020000
+#define C_IDIRECT 040000
/* Use separate input and output buffers, and combine partial input blocks. */
#define C_TWOBUFS 04000
@@ -162,6 +168,11 @@
{"noerror", C_NOERROR}, /* Ignore i/o errors. */
{"notrunc", C_NOTRUNC}, /* Do not truncate output file. */
{"sync", C_SYNC}, /* Pad input records to ibs with NULs. */
+ {"fsync", C_FSYNC}, /* call fsync(2) before closing output file */
+#ifdef HAVE_O_DIRECT
+ {"direct", C_DIRECT | C_TWOBUFS}, /* open output with O_DIRECT */
+ {"idirect", C_IDIRECT | C_TWOBUFS}, /* open input with O_DIRECT */
+#endif
{NULL, 0}
};
@@ -1177,7 +1188,14 @@
if (input_file != NULL)
{
- if (open_fd (STDIN_FILENO, input_file, O_RDONLY, 0) < 0)
+ int opts = O_RDONLY;
+
+#if HAVE_O_DIRECT
+ if (conversions_mask & C_IDIRECT)
+ opts |= O_DIRECT;
+#endif
+
+ if (open_fd (STDIN_FILENO, input_file, opts, 0) < 0)
error (EXIT_FAILURE, errno, _("opening %s"), quote (input_file));
}
else
@@ -1190,6 +1208,11 @@
= (O_CREAT
| (seek_records || (conversions_mask & C_NOTRUNC) ? 0 : O_TRUNC));
+#if HAVE_O_DIRECT
+ if (conversions_mask & C_DIRECT)
+ opts |= O_DIRECT;
+#endif
+
/* Open the output file with *read* access only if we might
need to read to satisfy a `seek=' request. If we can't read
the file, go ahead with write-only access; it might work. */
@@ -1240,5 +1263,11 @@
exit_status = dd_copy ();
+ if (conversions_mask & C_FSYNC)
+ {
+ if (fsync (STDOUT_FILENO) != 0)
+ error(0, errno, _("cannot fsync %s"), quote (output_file));
+ }
+
quit (exit_status);
}
On Wed, 07 Apr 2004 15:43:41 CDT, Andy Isaacson said:
> On Wed, Apr 07, 2004 at 01:03:08PM -0700, Andrew Morton wrote:
> > If you want a block scrubber then write a block scrubber.
> They exist. They're a pain in the ass to find when you need one. dd is
And finding a block scrubber that actually works *properly* is even more of a
challenge. I've seen more than a few that do stupid things like truncating the
file and then writing the appropriate number of bytes....
Paul Eggert <[email protected]> wrote:
>
> Andrew Morton <[email protected]> writes:
>
> > In 2.6 we do the check at open() and fcntl() time. In 2.4 we don't
> > fail until the actual I/O attempt.
>
> This raises the issue of what "dd conv=direct" should do in 2.4
> kernels. I propose that it should report an error and exit, when the
> write fails,
And when the read fails.
> since conv=direct can't be implemented. The basic idea
> is that on systems that lack direct I/O, conv=direct should fail.
I think that's best. It's a bit user-unfriendly, but a silent fallback is
misleading.
It might be acceptable to print a warning, then fall back.
> Another issue with this patch: in Solaris, direct I/O is done by
> invoking directio(DIRECTIO_ON); see
> <http://docs.sun.com/db/doc/816-0213/6m6ne37so?q=directio&a=view>.
> Is Solaris direct I/O a direct analog to Linux direct I/O, or are
> there subtle differences in semantics that should be made visible to
> the users of GNU "dd"?
solaris directio(DIRECTIO_ON) is a hint only. If the system cannot perform
the uncached zerocopy then it will fall back to buffered IO. Solaris will
perform direct-IO "when the application's buffer is aligned on a two-byte
(short) boundary, the offset into the file is on a device sector boundary,
and the size of the operation is a multiple of device sectors."
On Linux you set direct-io with open(O_DIRECT) or fcntl(F_SETFL, O_DIRECT).
If the filesystem doesn't support direct-io we will fail the open/fcntl
attempt up-front in 2.6 only.
If O_DIRECT was successfully set and the IO is not correctly aligned Linux
will fail the relevant I/O attempt. Alignment requirements are:
2.4: page aligned on-disk and in-memory
2.6: device sector aligned on-disk and in-memory.
I'd recomment that dd use getpagesize() alignment on-disk and in-memory.
On Wed, Apr 07, 2004 at 01:46:31PM -0700, Paul Eggert wrote:
> Andrew Morton <[email protected]> writes:
> > In 2.6 we do the check at open() and fcntl() time. In 2.4 we don't
> > fail until the actual I/O attempt.
>
> This raises the issue of what "dd conv=direct" should do in 2.4
> kernels. I propose that it should report an error and exit, when the
> write fails, since conv=direct can't be implemented. The basic idea
> is that on systems that lack direct I/O, conv=direct should fail.
I agree, this is the appropriate behavior. When "dd conv=direct" is
used on a fd that cannot support direct I/O, and the kernel doesn't tell
us until it's too late, erroring on the I/O is the only sane action.
Fortuitously, that's what happens with the patches I posted.
(Note that it's a function of the kernel and the underlying FS whether
direct I/O is going to work. So some kind of heuristic on kernel
version is a bad idea.)
> Another issue with this patch: in Solaris, direct I/O is done by
> invoking directio(DIRECTIO_ON); see
> <http://docs.sun.com/db/doc/816-0213/6m6ne37so?q=directio&a=view>.
> Is Solaris direct I/O a direct analog to Linux direct I/O, or are
> there subtle differences in semantics that should be made visible to
> the users of GNU "dd"?
The Solaris semantics are more sane than the Linux ones, to be sure --
it always does the I/O, falling back to buffered I/O if conditions are
not right for direct.
I don't believe I have a test system for the Solaris case (when was
directio(3C) added?) so someone else will have to add that support.
-andy
> > If you want to add O_DIRECT support to dd then it should be implemented
> > properly, and that means implementing it for both read and write.
> >
> > In fact the user should be able to specify the read-O_DIRECT and the
> > write-O_DIRECT independently - if for no other reason than that the source
> > and dest filesytems may not both support O_DIRECT.
>
> Of course if both directions are supported they must be independently
> specifiable. I just don't see a compelling use case for the input side.
Andy, since I was the one who suggested adding a conv=odirect flag to the
output side, so that we could use a standard tool (dd) to force
reallocation of bad sectors by writing to them, let me argue that it ALSO
makes sense to allow O_DIRECT (as an independent option) on the input
side.
At least one obvious use is to help FIND unreadable disk sectors (or to
verify that some sector is indeed unreadable). One would like to be able
to do:
dd if=/dev/hda of=/dev/null bs=512 count=1 skip=LBA conv=idirect
and see if the dd suceeds or fails.
If dd fails, then without having an O_DIRECT option (idirect) at the input
side, there is no way of telling if the failure was because of an
unreadable sector at LBA, or an unreadable sector somewhere else in the
block containing LBA. With the O_DIRECT input option, you can be sure that
if this command fails it's because the sector at LBA was unreadable, not
some nearby sector.
Cheers,
Bruce
On Tue, Apr 06, 2004 at 05:03:58PM -0500, Andy Isaacson wrote:
> Linux-kernel: is this patch horribly wrong?
...
> to force O_DIRECT. The enclosed patch adds a "conv=direct" flag to
> enable this usage.
Adding the functionality to conv= doesn't seem right to me. conv= is
for converting the data in some way. This is just changing the way data
is written. Right?
Have you looked at lmdd? It supports O_DIRECT and many other things not
in the standard dd.
--
Nate Straz [email protected]
sgi, inc http://www.sgi.com/
Linux Test Project http://ltp.sf.net/
On Wed, Apr 07, 2004 at 05:02:24PM -0500, Nathan Straz wrote:
> On Tue, Apr 06, 2004 at 05:03:58PM -0500, Andy Isaacson wrote:
> > Linux-kernel: is this patch horribly wrong?
> ...
> > to force O_DIRECT. The enclosed patch adds a "conv=direct" flag to
> > enable this usage.
>
> Adding the functionality to conv= doesn't seem right to me. conv= is
> for converting the data in some way. This is just changing the way data
> is written. Right?
There's already conv=notrunc which means "open without O_TRUNC". I
agree that it's a nonintuitive interface, but then, the entire dd(1)
command line is.
> Have you looked at lmdd? It supports O_DIRECT and many other things not
> in the standard dd.
Wasn't aware of that one. Thanks for the pointer.
-andy
Bruce Allen <[email protected]> writes:
> let me argue that it ALSO makes sense to allow O_DIRECT (as an
> independent option) on the input side.
At the end of this message is a proposed patch to implement everything
people other than myself have asked for so far, along with several
other things since I was in the neighborhood.
This patch adds the following NEWS entry, which I'll repeat here to
make it easy to see what's going on:
dd has new conversions for the conv= option:
nocreat do not create the output file
excl fail if the output file already exists
fdatasync physically write output file data before finishing
fsync likewise, but also write metadata
dd has new iflags= and oflags= options with the following flags:
append append mode (makes sense for output file only)
direct use direct I/O for data
dsync use synchronized I/O for data
sync likewise, but also for metadata
nonblock use non-blocking I/O
nofollow do not follow symlinks
noctty do not assign controlling terminal from file
This patch is relative to coreutils CVS, so your line numbers may
differ. I didn't add hooks for Solaris-style directio, because that
turns out to be a systemwide property and perhaps should be in a
different command.
2004-04-07 Paul Eggert <[email protected]>
* NEWS: New dd conv= symbols nocreat, excl, fdatasync, fsync,
and new dd options iflag= and oflag=.
* doc/coreutils.texi (dd invocation): Likewise.
* src/dd.c (usage): Likewise.
* m4/jm-macros.m4 (jm_MACROS): Check for fdatasync within
-lrt and -lposix4, so that it can be used in Solaris 2.5.1 and later.
* src/Makefile.am (dd_LDADD, shred_LDADD): Add fdatasync's lib.
* src/dd.c (fdatasync) [!HAVE_FDATASYNC]: New macro.
(C_NOCREAT, C_EXCL, C_FDATASYNC, C_FSYNC): New macros.
(input_flags, output_flags): New vars.
(LONGEST_SYMBOL): New macro.
(struct symbol_value): Renamed from struct conversion. Members
symbol and value renamed from convname and conversion. The
symbol value is now an array instead of a pointer; this saves
a bit of space and time in practice. All uses changed.
(conversions): Add nocreat, excl, fdatasync, fsync. Now const.
(flags): New constant array.
(iflag_error_msgid, oflag_error_msgid): New constants.
(parse_symbols): Renamed from parse_conversion and generalized
to handle either conversion or flag symbols.
(scanargs): Adjust uses of parse_symbols accodingly. Add
support for iflag= and oflag=. Reject attempts to use
both excl and nocreat.
(set_fd_flags): New function.
(dd_copy): Just return X rather than calling quit (X), since our
caller invokes quit with the returned value. Add support for
fdatasync and fsync.
(main): Add support for iflags=, oflags=, and new conv= symbols.
* src/system.h (O_DIRECT, O_DSYNC, O_NDELAY, O_NOFOLLOW,
O_RSYNC, O_SYNC): Define to 0 if not already defined.
* NEWS: Remove duplicate mention of BLOCKSIZE.
Index: NEWS
===================================================================
RCS file: /home/meyering/coreutils/cu/NEWS,v
retrieving revision 1.198
diff -p -u -r1.198 NEWS
--- NEWS 7 Apr 2004 10:56:00 -0000 1.198
+++ NEWS 8 Apr 2004 04:01:34 -0000
@@ -9,14 +9,24 @@ GNU coreutils NEWS
** New features
- stty now provides support (iutf8) for setting UTF-8 input mode.
+ dd has new conversions for the conv= option:
- 'df', 'du', and 'ls' now take the default block size from the
- BLOCKSIZE environment variable if the BLOCK_SIZE, DF_BLOCK_SIZE,
- DU_BLOCK_SIZE, and LS_BLOCK_SIZE environment variables are not set.
- Unlike the other variables, though, BLOCKSIZE does not affect
- values like 'ls -l' sizes that are normally displayed as bytes.
- This new behavior is for compatibility with BSD.
+ nocreat do not create the output file
+ excl fail if the output file already exists
+ fdatasync physically write output file data before finishing
+ fsync likewise, but also write metadata
+
+ dd has new iflags= and oflags= options with the following flags:
+
+ append append mode (makes sense for output file only)
+ direct use direct I/O for data
+ dsync use synchronized I/O for data
+ sync likewise, but also for metadata
+ nonblock use non-blocking I/O
+ nofollow do not follow symlinks
+ noctty do not assign controlling terminal from file
+
+ stty now provides support (iutf8) for setting UTF-8 input mode.
With stat, a specified format is no longer automatically newline terminated.
If you want a newline at the end of your output, append `\n' to the format
Index: doc/coreutils.texi
===================================================================
RCS file: /home/meyering/coreutils/cu/doc/coreutils.texi,v
retrieving revision 1.175
diff -p -u -r1.175 coreutils.texi
--- doc/coreutils.texi 7 Apr 2004 09:50:48 -0000 1.175
+++ doc/coreutils.texi 8 Apr 2004 04:55:38 -0000
@@ -6580,6 +6580,17 @@ when an odd number of bytes are read---t
@cindex read errors, ignoring
Continue after read errors.
+@item nocreat
+@opindex nocreat
+@cindex creating output file, avoiding
+Do not create the output file; the output file must already exist.
+
+@item excl
+@opindex excl
+@cindex creating output file, requiring
+Fail if the output file already exists; @command{dd} must create the
+output file itself.
+
@item notrunc
@opindex notrunc
@cindex truncating output file, avoiding
@@ -6590,7 +6601,85 @@ Do not truncate the output file.
Pad every input block to size of @samp{ibs} with trailing zero bytes.
When used with @samp{block} or @samp{unblock}, pad with spaces instead of
zero bytes.
+
+@item fdatasync
+@opindex fdatasync
+@cindex synchronized data writes, before finishing
+Synchronize output data just before finishing. This forces a physical
+write of output data.
+
+@item fsync
+@opindex fsync
+@cindex synchronized data and metadata writes, before finishing
+Synchronize output data and metadata just before finishing. This
+forces a physical write of output data and metadata.
+
+@end table
+
+@item iflag=@var{flag}[,@var{flag}]@dots{}
+@opindex iflag
+Access the input file using the flags specified by the @var{flag}
+argument(s). (No spaces around any comma(s).)
+
+@item oflag=@var{flag}[,@var{flag}]@dots{}
+@opindex oflag
+Access the output file using the flags specified by the @var{flag}
+argument(s). (No spaces around any comma(s).)
+
+Flags:
+
+@table @samp
+
+@item append
+@opindex append
+@cindex appending to the output file
+Write in append mode, so that even if some other process is writing to
+this file, every @command{dd} write will append to the current
+contents of the file. This flag makes sense only for output.
+
+@item direct
+@opindex direct
+@cindex direct I/O
+Use direct I/O for data, avoiding the buffer cache.
+
+@item dsync
+@opindex dsync
+@cindex synchronized data reads
+Use synchronized I/O for data. For the output file, this forces a
+physical write of output data on each write. For the input file,
+this flag can matter when reading from a remote file that has been
+written to synchronously by some other process. Metadata (e.g.,
+last-access and last-modified time) is not necessarily synchronized.
+
+@item sync
+@opindex sync
+@cindex synchronized data and metadata I/O
+Use synchronized I/O for both data and metadata.
+
+@item nonblock
+@opindex nonblock
+@cindex nonblocking I/O
+Use non-blocking I/O.
+
+@item nofollow
+@opindex nofollow
+@cindex symbolic links, following
+Do not follow symbolic links.
+
+@item noctty
+@opindex noctty
+@cindex controlling terminal
+Do not assign the file to be a controlling terminal for @command{dd}.
+This has no effect when the file is not a terminal.
+
@end table
+
+These flags are not supported on all systems, and @samp{dd} rejects
+attempts to use them when they are not supported. When reading from
+standard input or writing to standard output, the @samp{nofollow} and
+@samp{noctty} flags should not be specified, and the other flags
+(e.g., @samp{nonblock}) can affect how other processes behave with the
+affected file descriptors, even after @command{dd} exits.
@end table
Index: m4/jm-macros.m4
===================================================================
RCS file: /home/meyering/coreutils/cu/m4/jm-macros.m4,v
retrieving revision 1.184
diff -p -u -r1.184 jm-macros.m4
--- m4/jm-macros.m4 20 Dec 2003 17:57:30 -0000 1.184
+++ m4/jm-macros.m4 8 Apr 2004 05:30:05 -0000
@@ -81,7 +81,6 @@ AC_DEFUN([jm_MACROS],
AC_CHECK_FUNCS( \
endgrent \
endpwent \
- fdatasync \
ftruncate \
gethrtime \
hasmntopt \
@@ -110,6 +109,15 @@ AC_DEFUN([jm_MACROS],
AC_FUNC_STRTOD
AC_REQUIRE([GL_FUNC_GETCWD_PATH_MAX])
AC_REQUIRE([GL_FUNC_READDIR])
+
+ # for dd.c and shred.c
+ fetish_saved_libs=$LIBS
+ AC_SEARCH_LIBS([fdatasync], [rt posix4],
+ [test "$ac_cv_search_fdatasync" = "none required" ||
+ LIB_FDATASYNC=$ac_cv_search_fdatasync])
+ AC_SUBST([LIB_FDATASYNC])
+ AC_CHECK_FUNCS(fdatasync)
+ LIBS=$fetish_saved_libs
# See if linking `seq' requires -lm.
# It does on nearly every system. The single exception (so far) is
Index: src/Makefile.am
===================================================================
RCS file: /home/meyering/coreutils/cu/src/Makefile.am,v
retrieving revision 1.34
diff -p -u -r1.34 Makefile.am
--- src/Makefile.am 17 Mar 2004 10:14:17 -0000 1.34
+++ src/Makefile.am 8 Apr 2004 05:23:13 -0000
@@ -32,10 +32,11 @@ AM_CPPFLAGS = -I.. -I$(srcdir) -I$(top_s
# replacement functions defined in libfetish.a.
LDADD = ../lib/libfetish.a $(LIBINTL) ../lib/libfetish.a
-# for clock_gettime
+# for clock_gettime and fdatasync
+dd_LDADD = $(LDADD) $(LIB_FDATASYNC)
dir_LDADD = $(LDADD) $(LIB_CLOCK_GETTIME)
ls_LDADD = $(LDADD) $(LIB_CLOCK_GETTIME)
-shred_LDADD = $(LDADD) $(LIB_CLOCK_GETTIME)
+shred_LDADD = $(LDADD) $(LIB_CLOCK_GETTIME) $(LIB_FDATASYNC)
vdir_LDADD = $(LDADD) $(LIB_CLOCK_GETTIME)
## If necessary, add -lm to resolve use of pow in lib/strtod.c.
Index: src/dd.c
===================================================================
RCS file: /home/meyering/coreutils/cu/src/dd.c,v
retrieving revision 1.154
diff -p -u -r1.154 dd.c
--- src/dd.c 21 Jan 2004 22:53:49 -0000 1.154
+++ src/dd.c 8 Apr 2004 06:26:20 -0000
@@ -49,6 +49,10 @@
# define S_TYPEISSHM(Stat_ptr) 0
#endif
+#if ! HAVE_FDATASYNC
+# define fdatasync(fd) (errno = ENOSYS, -1)
+#endif
+
#define ROUND_UP_OFFSET(X, M) ((M) - 1 - (((X) + (M) - 1) % (M)))
#define PTR_ALIGN(Ptr, M) ((Ptr) \
+ ROUND_UP_OFFSET ((char *)(Ptr) - (char *)0, (M)))
@@ -80,6 +84,10 @@
#define C_SYNC 02000
/* Use separate input and output buffers, and combine partial input blocks. */
#define C_TWOBUFS 04000
+#define C_NOCREAT 010000
+#define C_EXCL 020000
+#define C_FDATASYNC 040000
+#define C_FSYNC 0100000
/* The name this program was run with. */
char *program_name;
@@ -111,6 +119,10 @@ static uintmax_t max_records = (uintmax_
/* Bit vector of conversions to apply. */
static int conversions_mask = 0;
+/* Open flags for the input and output files. */
+static int input_flags = 0;
+static int output_flags = 0;
+
/* If nonzero, filter characters through the translation table. */
static int translation_needed = 0;
@@ -143,13 +155,18 @@ static size_t oc = 0;
/* Index into current line, for `conv=block' and `conv=unblock'. */
static size_t col = 0;
-struct conversion
+/* A longest symbol in the struct symbol_values table below. */
+#define LONGEST_SYMBOL "fdatasync"
+
+/* A symbol and the corresponding integer value. */
+struct symbol_value
{
- char *convname;
- int conversion;
+ char symbol[sizeof LONGEST_SYMBOL];
+ int value;
};
-static struct conversion conversions[] =
+/* Conversion symbols, for conv="...". */
+static struct symbol_value const conversions[] =
{
{"ascii", C_ASCII | C_TWOBUFS}, /* EBCDIC to ASCII. */
{"ebcdic", C_EBCDIC | C_TWOBUFS}, /* ASCII to EBCDIC. */
@@ -160,9 +177,26 @@ static struct conversion conversions[] =
{"ucase", C_UCASE | C_TWOBUFS}, /* Translate lower to upper case. */
{"swab", C_SWAB | C_TWOBUFS}, /* Swap bytes of input. */
{"noerror", C_NOERROR}, /* Ignore i/o errors. */
+ {"nocreat", C_NOCREAT}, /* Do not create output file. */
+ {"excl", C_EXCL}, /* Fail if the output file already exists. */
{"notrunc", C_NOTRUNC}, /* Do not truncate output file. */
{"sync", C_SYNC}, /* Pad input records to ibs with NULs. */
- {NULL, 0}
+ {"fdatasync", C_FDATASYNC}, /* Synchronize output data before finishing. */
+ {"fsync", C_FSYNC}, /* Also synchronize output metadata. */
+ {"", 0}
+};
+
+/* Flags, for iflag="..." and oflag="...". */
+static struct symbol_value const flags[] =
+{
+ {"append", O_APPEND},
+ {"direct", O_DIRECT},
+ {"dsync", O_DSYNC},
+ {"noctty", O_NOCTTY},
+ {"nofollow", O_NOFOLLOW},
+ {"nonblock", O_NONBLOCK},
+ {"sync", O_SYNC},
+ {"", 0}
};
/* Translation table formed by applying successive transformations. */
@@ -290,14 +324,16 @@ Copy a file, converting and formatting a
\n\
bs=BYTES force ibs=BYTES and obs=BYTES\n\
cbs=BYTES convert BYTES bytes at a time\n\
- conv=KEYWORDS convert the file as per the comma separated keyword list\n\
+ conv=CONVS convert the file as per the comma separated symbol list\n\
count=BLOCKS copy only BLOCKS input blocks\n\
ibs=BYTES read BYTES bytes at a time\n\
"), stdout);
fputs (_("\
if=FILE read from FILE instead of stdin\n\
+ iflag=FLAGS read as per the comma separated symbol list\n\
obs=BYTES write BYTES bytes at a time\n\
of=FILE write to FILE instead of stdout\n\
+ oflag=FLAGS write as per the comma separated symbol list\n\
seek=BLOCKS skip BLOCKS obs-sized blocks at start of output\n\
skip=BLOCKS skip BLOCKS ibs-sized blocks at start of input\n\
"), stdout);
@@ -308,7 +344,8 @@ Copy a file, converting and formatting a
BLOCKS and BYTES may be followed by the following multiplicative suffixes:\n\
xM M, c 1, w 2, b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,\n\
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.\n\
-Each KEYWORD may be:\n\
+\n\
+Each CONV symbol may be:\n\
\n\
"), stdout);
fputs (_("\
@@ -320,15 +357,38 @@ Each KEYWORD may be:\n\
lcase change upper case to lower case\n\
"), stdout);
fputs (_("\
+ nocreat do not create the output file\n\
+ excl fail if the output file already exists\n\
notrunc do not truncate the output file\n\
ucase change lower case to upper case\n\
swab swap every pair of input bytes\n\
noerror continue after read errors\n\
sync pad every input block with NULs to ibs-size; when used\n\
with block or unblock, pad with spaces rather than NULs\n\
+ fdatasync physically write output file data before finishing\n\
+ fsync likewise, but also write metadata\n\
"), stdout);
fputs (_("\
\n\
+Each FLAG symbol may be:\n\
+\n\
+ append append mode (makes sense only for output)\n\
+"), stdout);
+ if (O_DIRECT)
+ fputs (_(" direct use direct I/O for data\n"), stdout);
+ if (O_DSYNC)
+ fputs (_(" dsync use synchronized I/O for data\n"), stdout);
+ if (O_SYNC)
+ fputs (_(" sync likewise, but also for metadata\n"), stdout);
+ if (O_NONBLOCK)
+ fputs (_(" nonblock use non-blocking I/O\n"), stdout);
+ if (O_NOFOLLOW)
+ fputs (_(" nofollow do not follow symlinks\n"), stdout);
+ if (O_NOCTTY)
+ fputs (_(" noctty do not assign controlling terminal from file\n"),
+ stdout);
+ fputs (_("\
+\n\
Note that sending a SIGUSR1 signal to a running `dd' process makes it\n\
print to standard error the number of records read and written so far,\n\
then to resume copying.\n\
@@ -486,33 +546,47 @@ write_output (void)
oc = 0;
}
-/* Interpret one "conv=..." option.
+/* Diagnostics for invalid iflag="..." and oflag="..." symbols. */
+static char const iflag_error_msgid[] = N_("invalid input flag: %s");
+static char const oflag_error_msgid[] = N_("invalid output flag: %s");
+
+/* Interpret one "conv=..." or similar option STR according to the
+ symbols in TABLE, returning the flags specified. If the option
+ cannot be parsed, use ERROR_MSGID to generate a diagnostic.
As a by product, this function replaces each `,' in STR with a NUL byte. */
-static void
-parse_conversion (char *str)
+static int
+parse_symbols (char *str, struct symbol_value const *table,
+ char const *error_msgid)
{
- char *new;
- int i;
+ int value = 0;
do
{
- new = strchr (str, ',');
+ struct symbol_value const *entry;
+ char *new = strchr (str, ',');
if (new != NULL)
*new++ = '\0';
- for (i = 0; conversions[i].convname != NULL; i++)
- if (STREQ (conversions[i].convname, str))
- {
- conversions_mask |= conversions[i].conversion;
- break;
- }
- if (conversions[i].convname == NULL)
+ for (entry = table; ; entry++)
{
- error (0, 0, _("invalid conversion: %s"), quote (str));
- usage (EXIT_FAILURE);
+ if (! entry->symbol[0])
+ {
+ error (0, 0, _(error_msgid), quote (str));
+ usage (EXIT_FAILURE);
+ }
+ if (STREQ (entry->symbol, str))
+ {
+ if (! entry->value)
+ error (EXIT_FAILURE, 0, _(error_msgid), quote (str));
+ value |= entry->value;
+ break;
+ }
}
str = new;
- } while (new != NULL);
+ }
+ while (str);
+
+ return value;
}
/* Return the value of STR, interpreted as a non-negative decimal integer,
@@ -574,7 +648,12 @@ scanargs (int argc, char **argv)
else if (STREQ (name, "of"))
output_file = val;
else if (STREQ (name, "conv"))
- parse_conversion (val);
+ conversions_mask |= parse_symbols (val, conversions,
+ N_("invalid conversion: %s"));
+ else if (STREQ (name, "iflag"))
+ input_flags |= parse_symbols (val, flags, iflag_error_msgid);
+ else if (STREQ (name, "oflag"))
+ output_flags |= parse_symbols (val, flags, oflag_error_msgid);
else
{
int invalid = 0;
@@ -638,6 +717,12 @@ scanargs (int argc, char **argv)
output_blocksize = DEFAULT_BLOCKSIZE;
if (conversion_blocksize == 0)
conversions_mask &= ~(C_BLOCK | C_UNBLOCK);
+
+ if (input_flags & (O_DSYNC | O_SYNC))
+ input_flags |= O_RSYNC;
+
+ if ((conversions_mask & (C_EXCL | C_NOCREAT)) == (C_EXCL | C_NOCREAT))
+ error (EXIT_FAILURE, 0, _("cannot combine excl and nocreat"));
}
/* Fix up translation table. */
@@ -928,6 +1013,22 @@ copy_with_unblock (char const *buf, size
}
}
+/* Set the file descriptor flags for FD that correspond to the nonzero bits
+ in FLAGS. The file's name is NAME. */
+
+static void
+set_fd_flags (int fd, int flags, char const *name)
+{
+ if (flags)
+ {
+ int old_flags = fcntl (fd, F_GETFL);
+ int new_flags = old_flags | flags;
+ if (old_flags < 0
+ || (new_flags != old_flags && fcntl (fd, F_SETFL, new_flags) == -1))
+ error (EXIT_FAILURE, errno, _("setting flags for %s"), quote (name));
+ }
+}
+
/* The main loop. */
static int
@@ -994,7 +1095,7 @@ dd_copy (void)
}
if (max_records == 0)
- quit (exit_status);
+ return exit_status;
while (1)
{
@@ -1061,7 +1162,7 @@ dd_copy (void)
if (nwritten != n_bytes_read)
{
error (0, errno, _("writing %s"), quote (output_file));
- quit (EXIT_FAILURE);
+ return EXIT_FAILURE;
}
else if (n_bytes_read == input_blocksize)
w_full++;
@@ -1122,7 +1223,7 @@ dd_copy (void)
if (nwritten != oc)
{
error (0, errno, _("writing %s"), quote (output_file));
- quit (EXIT_FAILURE);
+ return EXIT_FAILURE;
}
}
@@ -1130,6 +1231,24 @@ dd_copy (void)
if (real_obuf)
free (real_obuf);
+ if ((conversions_mask & C_FDATASYNC) && fdatasync (STDOUT_FILENO) != 0)
+ {
+ if (errno != ENOSYS && errno != EINVAL)
+ {
+ error (0, errno, "fdatasync %s", quote (output_file));
+ exit_status = EXIT_FAILURE;
+ }
+ conversions_mask |= C_FSYNC;
+ }
+
+ if (conversions_mask & C_FSYNC)
+ while (fsync (STDOUT_FILENO) != 0)
+ if (errno != EINTR)
+ {
+ error (0, errno, "fsync %s", quote (output_file));
+ return EXIT_FAILURE;
+ }
+
return exit_status;
}
@@ -1174,19 +1293,29 @@ main (int argc, char **argv)
apply_translations ();
- if (input_file != NULL)
+ if (input_file == NULL)
{
- if (open_fd (STDIN_FILENO, input_file, O_RDONLY, 0) < 0)
- error (EXIT_FAILURE, errno, _("opening %s"), quote (input_file));
+ input_file = _("standard input");
+ set_fd_flags (STDIN_FILENO, input_flags, input_file);
}
else
- input_file = _("standard input");
+ {
+ if (open_fd (STDIN_FILENO, input_file, O_RDONLY | input_flags, 0) < 0)
+ error (EXIT_FAILURE, errno, _("opening %s"), quote (input_file));
+ }
- if (output_file != NULL)
+ if (output_file == NULL)
+ {
+ output_file = _("standard output");
+ set_fd_flags (STDOUT_FILENO, output_flags, output_file);
+ }
+ else
{
mode_t perms = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH;
int opts
- = (O_CREAT
+ = (output_flags
+ | (conversions_mask & C_NOCREAT ? 0 : O_CREAT)
+ | (conversions_mask & C_EXCL ? O_EXCL : 0)
| (seek_records || (conversions_mask & C_NOTRUNC) ? 0 : O_TRUNC));
/* Open the output file with *read* access only if we might
@@ -1210,8 +1339,8 @@ main (int argc, char **argv)
quote (output_file));
/* Complain only when ftruncate fails on a regular file, a
- directory, or a shared memory object, as the 2000-08
- POSIX draft specifies ftruncate's behavior only for these
+ directory, or a shared memory object, as
+ POSIX 1003.1-2003 specifies ftruncate's behavior only for these
file types. For example, do not complain when Linux 2.4
ftruncate fails on /dev/fd0. */
if (ftruncate (STDOUT_FILENO, o) != 0
@@ -1226,10 +1355,6 @@ main (int argc, char **argv)
}
}
#endif
- }
- else
- {
- output_file = _("standard output");
}
install_handler (SIGINT, interrupt_handler);
Index: src/system.h
===================================================================
RCS file: /home/meyering/coreutils/cu/src/system.h,v
retrieving revision 1.82
diff -p -u -r1.82 system.h
--- src/system.h 6 Apr 2004 13:55:00 -0000 1.82
+++ src/system.h 8 Apr 2004 03:39:01 -0000
@@ -197,6 +197,14 @@ initialize_exit_failure (int status)
# define O_TEXT _O_TEXT
#endif
+#if !defined O_DIRECT
+# define O_DIRECT 0
+#endif
+
+#if !defined O_DSYNC
+# define O_DSYNC 0
+#endif
+
#if !defined O_NDELAY
# define O_NDELAY 0
#endif
@@ -207,6 +215,18 @@ initialize_exit_failure (int status)
#if !defined O_NOCTTY
# define O_NOCTTY 0
+#endif
+
+#if !defined O_NOFOLLOW
+# define O_NOFOLLOW 0
+#endif
+
+#if !defined O_RSYNC
+# define O_RSYNC 0
+#endif
+
+#if !defined O_SYNC
+# define O_SYNC 0
#endif
#ifdef __BEOS__
Paul Eggert <[email protected]> wrote:
> At the end of this message is a proposed patch to implement everything
> people other than myself have asked for so far, along with several
> other things since I was in the neighborhood.
Thanks for that nice patch!
I've applied it, and made these additional changes:
2004-04-08 Jim Meyering <[email protected]>
* src/dd.c (set_fd_flags): Don't OR in -1 when fcntl fails.
Rename parameter, flags, to avoid shadowing global.
Index: dd.c
===================================================================
RCS file: /fetish/cu/src/dd.c,v
retrieving revision 1.155
diff -u -p -r1.155 dd.c
--- a/dd.c 8 Apr 2004 10:22:05 -0000 1.155
+++ b/dd.c 8 Apr 2004 11:02:20 -0000
@@ -1017,12 +1017,12 @@ copy_with_unblock (char const *buf, size
in FLAGS. The file's name is NAME. */
static void
-set_fd_flags (int fd, int flags, char const *name)
+set_fd_flags (int fd, int add_flags, char const *name)
{
- if (flags)
+ if (add_flags)
{
int old_flags = fcntl (fd, F_GETFL);
- int new_flags = old_flags | flags;
+ int new_flags = old_flags < 0 ? add_flags : (old_flags | add_flags);
if (old_flags < 0
|| (new_flags != old_flags && fcntl (fd, F_SETFL, new_flags) == -1))
error (EXIT_FAILURE, errno, _("setting flags for %s"), quote (name));
In article <[email protected]>,
Andy Isaacson <[email protected]> wrote:
>On Wed, Apr 07, 2004 at 05:02:24PM -0500, Nathan Straz wrote:
>> On Tue, Apr 06, 2004 at 05:03:58PM -0500, Andy Isaacson wrote:
>> > Linux-kernel: is this patch horribly wrong?
>> ...
>> > to force O_DIRECT. The enclosed patch adds a "conv=direct" flag to
>> > enable this usage.
>>
>> Adding the functionality to conv= doesn't seem right to me. conv= is
>> for converting the data in some way. This is just changing the way data
>> is written. Right?
>
>There's already conv=notrunc which means "open without O_TRUNC". I
>agree that it's a nonintuitive interface, but then, the entire dd(1)
>command line is.
Well as you already added iflags and oflags, it makes more sense to
use those. oflags=fsync makes more sense to me than conv=fsync.
Likewise oflags=notrunc.
Mike.
Paul Eggert <[email protected]> writes:
> dd has new iflags= and oflags= options with the following flags:
>
> append append mode (makes sense for output file only)
> direct use direct I/O for data
> dsync use synchronized I/O for data
> sync likewise, but also for metadata
> nonblock use non-blocking I/O
> nofollow do not follow symlinks
> noctty do not assign controlling terminal from file
noctty definitely seems overkill... I can't see why dd would ever want
to open a file without O_NOCTTY. On systems where O_NOCTTY makes sense
(SvR4) that is.
Phil.
Jim Meyering <[email protected]> writes:
> 2004-04-08 Jim Meyering <[email protected]>
>
> * src/dd.c (set_fd_flags): Don't OR in -1 when fcntl fails.
Doesn't that fix generate worse code in the usual case, since it
causes two conditional branches instead of one?
How about this further patch? It relies on common subexpression
elimination, but that's common these days.
2004-04-08 Paul Eggert <[email protected]>
* src/dd.c (set_fd_flags): Don't test old_flags < 0 twice.
Index: src/dd.c
===================================================================
RCS file: /home/meyering/coreutils/cu/src/dd.c,v
retrieving revision 1.157
diff -p -u -r1.157 dd.c
--- src/dd.c 8 Apr 2004 15:25:39 -0000 1.157
+++ src/dd.c 8 Apr 2004 19:17:02 -0000
@@ -1014,7 +1014,7 @@ copy_with_unblock (char const *buf, size
}
/* Set the file descriptor flags for FD that correspond to the nonzero bits
- in FLAGS. The file's name is NAME. */
+ in ADD_FLAGS. The file's name is NAME. */
static void
set_fd_flags (int fd, int add_flags, char const *name)
@@ -1022,9 +1022,9 @@ set_fd_flags (int fd, int add_flags, cha
if (add_flags)
{
int old_flags = fcntl (fd, F_GETFL);
- int new_flags = old_flags < 0 ? add_flags : (old_flags | add_flags);
if (old_flags < 0
- || (new_flags != old_flags && fcntl (fd, F_SETFL, new_flags) == -1))
+ || (old_flags != (old_flags | add_flags)
+ && fcntl (fd, F_SETFL, old_flags | add_flags) == -1))
error (EXIT_FAILURE, errno, _("setting flags for %s"), quote (name));
}
}
Paul Eggert <[email protected]> wrote:
> Jim Meyering <[email protected]> writes:
>> 2004-04-08 Jim Meyering <[email protected]>
>>
>> * src/dd.c (set_fd_flags): Don't OR in -1 when fcntl fails.
>
> Doesn't that fix generate worse code in the usual case, since it
> causes two conditional branches instead of one?
I think it's unnecessary anyway - if old_flags is -1, then the
conditional jumps to error() before looking at new_flags at all.
paul
Philippe Troin <[email protected]> writes:
> noctty definitely seems overkill... I can't see why dd would ever want
> to open a file without O_NOCTTY.
Good point; it's just a confusion for the user. Here's a patch to
cause dd to always use O_NOCTTY. If someone ever needs it the other
way (not likely) I suppose we can add a "ctty" flag.
2004-04-08 Paul Eggert <[email protected]>
* NEWS: Remove noctty flag from dd. Suggested by Philippe Troin.
* doc/coreutils.texi (dd invocation): Likewise.
* src/shred.c (O_NOCTTY): Remove redundant decl.
* src/dd.c (flags, usage): Remove noctty flag.
(main): Always use O_NOCTTY when opening files.
Index: NEWS
===================================================================
RCS file: /home/meyering/coreutils/cu/NEWS,v
retrieving revision 1.199
diff -p -u -r1.199 NEWS
--- NEWS 8 Apr 2004 10:25:27 -0000 1.199
+++ NEWS 8 Apr 2004 20:04:42 -0000
@@ -24,7 +24,6 @@ GNU coreutils NEWS
sync likewise, but also for metadata
nonblock use non-blocking I/O
nofollow do not follow symlinks
- noctty do not assign controlling terminal from file
stty now provides support (iutf8) for setting UTF-8 input mode.
Index: doc/coreutils.texi
===================================================================
RCS file: /home/meyering/coreutils/cu/doc/coreutils.texi,v
retrieving revision 1.176
diff -p -u -r1.176 coreutils.texi
--- doc/coreutils.texi 8 Apr 2004 15:35:40 -0000 1.176
+++ doc/coreutils.texi 8 Apr 2004 20:05:38 -0000
@@ -6666,18 +6666,12 @@ Use non-blocking I/O.
@cindex symbolic links, following
Do not follow symbolic links.
-@item noctty
-@opindex noctty
-@cindex controlling terminal
-Do not assign the file to be a controlling terminal for @command{dd}.
-This has no effect when the file is not a terminal.
-
@end table
These flags are not supported on all systems, and @samp{dd} rejects
attempts to use them when they are not supported. When reading from
-standard input or writing to standard output, the @samp{nofollow} and
-@samp{noctty} flags should not be specified, and the other flags
+standard input or writing to standard output, the @samp{nofollow} flag
+should not be specified, and the other flags
(e.g., @samp{nonblock}) can affect how other processes behave with the
affected file descriptors, even after @command{dd} exits.
Index: src/shred.c
===================================================================
RCS file: /home/meyering/coreutils/cu/src/shred.c,v
retrieving revision 1.84
diff -p -u -r1.84 shred.c
--- src/shred.c 21 Jan 2004 23:39:34 -0000 1.84
+++ src/shred.c 8 Apr 2004 20:04:45 -0000
@@ -110,10 +110,6 @@
#include "quotearg.h" /* For quotearg_colon */
#include "quote.h" /* For quotearg_colon */
-#ifndef O_NOCTTY
-# define O_NOCTTY 0 /* This is a very optional frill */
-#endif
-
#define DEFAULT_PASSES 25 /* Default */
/* How many seconds to wait before checking whether to output another
--- src/dd.c-bak Thu Apr 8 12:17:02 2004
+++ src/dd.c Thu Apr 8 13:18:30 2004
@@ -192,7 +192,6 @@ static struct symbol_value const flags[]
{"append", O_APPEND},
{"direct", O_DIRECT},
{"dsync", O_DSYNC},
- {"noctty", O_NOCTTY},
{"nofollow", O_NOFOLLOW},
{"nonblock", O_NONBLOCK},
{"sync", O_SYNC},
@@ -384,9 +383,6 @@ Each FLAG symbol may be:\n\
fputs (_(" nonblock use non-blocking I/O\n"), stdout);
if (O_NOFOLLOW)
fputs (_(" nofollow do not follow symlinks\n"), stdout);
- if (O_NOCTTY)
- fputs (_(" noctty do not assign controlling terminal from file\n"),
- stdout);
fputs (_("\
\n\
Note that sending a SIGUSR1 signal to a running `dd' process makes it\n\
@@ -1300,7 +1296,8 @@ main (int argc, char **argv)
}
else
{
- if (open_fd (STDIN_FILENO, input_file, O_RDONLY | input_flags, 0) < 0)
+ int opts = input_flags | O_NOCTTY;
+ if (open_fd (STDIN_FILENO, input_file, O_RDONLY | opts, 0) < 0)
error (EXIT_FAILURE, errno, _("opening %s"), quote (input_file));
}
@@ -1313,7 +1310,7 @@ main (int argc, char **argv)
{
mode_t perms = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH;
int opts
- = (output_flags
+ = (output_flags | O_NOCTTY
| (conversions_mask & C_NOCREAT ? 0 : O_CREAT)
| (conversions_mask & C_EXCL ? O_EXCL : 0)
| (seek_records || (conversions_mask & C_NOTRUNC) ? 0 : O_TRUNC));
Paul Eggert <[email protected]> wrote:
> Jim Meyering <[email protected]> writes:
>
>> 2004-04-08 Jim Meyering <[email protected]>
>>
>> * src/dd.c (set_fd_flags): Don't OR in -1 when fcntl fails.
>
> Doesn't that fix generate worse code in the usual case, since it
> causes two conditional branches instead of one?
>
> How about this further patch? It relies on common subexpression
> elimination, but that's common these days.
>
> 2004-04-08 Paul Eggert <[email protected]>
>
> * src/dd.c (set_fd_flags): Don't test old_flags < 0 twice.
I see your point. Thanks for the comment fix.
I've gone ahead and reverted my change,
since I think your original code is a little more
readable than the more-efficient-when-fcntl-fails code
that you're proposing.
(set_fd_flags): Undo part of today's change: it's a little
cleaner -- and more efficient in the common case -- to go
ahead and OR in the -1 when fcntl fails.
Index: src/dd.c
===================================================================
RCS file: /fetish/cu/src/dd.c,v
retrieving revision 1.158
retrieving revision 1.159
diff -u -p -u -r1.158 -r1.159
--- a/src/dd.c 8 Apr 2004 21:26:28 -0000 1.158
+++ b/src/dd.c 8 Apr 2004 21:30:18 -0000 1.159
@@ -1014,7 +1014,7 @@ copy_with_unblock (char const *buf, size
}
/* Set the file descriptor flags for FD that correspond to the nonzero bits
- in FLAGS. The file's name is NAME. */
+ in ADD_FLAGS. The file's name is NAME. */
static void
set_fd_flags (int fd, int add_flags, char const *name)
@@ -1022,7 +1022,7 @@ set_fd_flags (int fd, int add_flags, cha
if (add_flags)
{
int old_flags = fcntl (fd, F_GETFL);
- int new_flags = old_flags < 0 ? add_flags : (old_flags | add_flags);
+ int new_flags = old_flags | add_flags;
if (old_flags < 0
|| (new_flags != old_flags && fcntl (fd, F_SETFL, new_flags) == -1))
error (EXIT_FAILURE, errno, _("setting flags for %s"), quote (name));
Paul Eggert <[email protected]> wrote:
> Philippe Troin <[email protected]> writes:
>
>> noctty definitely seems overkill... I can't see why dd would ever want
>> to open a file without O_NOCTTY.
>
> Good point; it's just a confusion for the user. Here's a patch to
> cause dd to always use O_NOCTTY. If someone ever needs it the other
> way (not likely) I suppose we can add a "ctty" flag.
>
> 2004-04-08 Paul Eggert <[email protected]>
>
> * NEWS: Remove noctty flag from dd. Suggested by Philippe Troin.
> * doc/coreutils.texi (dd invocation): Likewise.
> * src/shred.c (O_NOCTTY): Remove redundant decl.
> * src/dd.c (flags, usage): Remove noctty flag.
> (main): Always use O_NOCTTY when opening files.
Applied. Thanks!
Hi,
> OK, I can see that one. But it seems like a pretty small benefit to me
> -- CPU utilization is already really low.
Maybe not to you but it does make a big difference on our 500 disk setup.
At the moment we use dd to do an initial sniff, then ext3 utils to do
O_DIRECT reads/writes. With O_DIRECT read/write in dd we could use it
instead. (We are basically interested in IO performance that a database
would see).
> Um, that sounds like a bad idea to me. It seems to me it's the kernel's
> responsibility to figure out "hey, looks like a streaming read - let's
> not blow out the buffer cache trying to hold 20GB on a 512M system." If
> you're saying that the kernel guys have given up and the established
> wisdom is now "you gotta use O_DIRECT if you don't want to throw
> everything else out due to streaming data", well... I'm disappointed.
When you start hitting memory bandwidth limits, O_DIRECT will help you.
Sure it wont be an issue for your dd copy scenario, but I wanted to point
out there are other valid uses for it.
Anton
philip copeland did a whole set of patches for coreutils to allow
directio for both read write and mixed sizes even
the rpm is at
http://oss.oracle.com/projects/ocfs/files/source/RHAT/RHAS3/coreutils-4.5.3-33.src.rpm,
I think he took it up with the maintainers but so far had no luck
it makes sense to have this stuff, we use it a lot.
Wim
On Fri, Apr 09, 2004 at 10:37:37AM +1000, Anton Blanchard wrote:
>
> Hi,
>
> > OK, I can see that one. But it seems like a pretty small benefit to me
> > -- CPU utilization is already really low.
>
> Maybe not to you but it does make a big difference on our 500 disk setup.
> At the moment we use dd to do an initial sniff, then ext3 utils to do
> O_DIRECT reads/writes. With O_DIRECT read/write in dd we could use it
> instead. (We are basically interested in IO performance that a database
> would see).
>
> > Um, that sounds like a bad idea to me. It seems to me it's the kernel's
> > responsibility to figure out "hey, looks like a streaming read - let's
> > not blow out the buffer cache trying to hold 20GB on a 512M system." If
> > you're saying that the kernel guys have given up and the established
> > wisdom is now "you gotta use O_DIRECT if you don't want to throw
> > everything else out due to streaming data", well... I'm disappointed.
>
> When you start hitting memory bandwidth limits, O_DIRECT will help you.
> Sure it wont be an issue for your dd copy scenario, but I wanted to point
> out there are other valid uses for it.
>
> Anton
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
Wim Coekaerts <[email protected]> wrote:
> philip copeland did a whole set of patches for coreutils to allow
> directio for both read write and mixed sizes even
> the rpm is at
> http://oss.oracle.com/projects/ocfs/files/source/RHAT/RHAS3/coreutils-4.5.3-33.src.rpm,
Thanks for the pointer.
FYI, that URL didn't work for me. This one did:
http://oss.oracle.com/projects/ocfs/dist/files/source/RHAT/RHAS3/coreutils-4.5.3-33.src.rpm
Whoever maintains that code should consider merging their changes with
something newer. There are over 500 lines of NEWS entries alone for the
coreutils releases that have been made since 4.5.3.
Besides, those patches have portability and robustness problems,
and they aren't even based on coreutils-4.5.3. Maybe they're
based on some vendor's version of coreutils?
> I think he took it up with the maintainers but so far had no luck
As far as I know, no one ever submitted such O_DIRECT-based patches
to me or any of the bug-*@gnu.org mailing lists.
The only pending O_DIRECT-based patch is one for shred.
Clean, complete, and well-justified patches usually go in pretty quickly.