2001-12-08 03:46:36

by Quinn Harris

Subject: File copy system call proposal

I would like to propose implementing a file copy system call.
I expect the initial reaction to such a proposal would be "feature
bloat" but I believe some substantial benefits can be seen possibly
making it worthwhile, primarily the following:

Copy on write:
From my experience most files that are copied on the same partition are
copied from a source code directory (e.g. /usr/src/{src dir}) to somewhere
else in /usr. These copied files are seldom modified but usually
truncated (when copied over again).
Instead of actually copying the file in these circumstances something
similar to a hard link could be created. But unlike a hard link, when
data is written to the file a real duplicate of the file (or possibly
part of the file) will be created. This is basically identical to the
way a process's memory space is duplicated on a fork. To create an
illusion of an actual copied file the file system will need to
explicitly support this feature. This can also eliminate duplication in
the buffer cache when a file is copied.

This feature would drastically reduce the time taken to install a
program from a compiled source tarball. I also expect on my system this
feature would save about 1/6 of my hard drive space. Of course this
wouldn't affect performance if the source and destination files are on
different partitions.

All kernel copy:
Commands like cp and install open the source and destination file using
the open sys call. The data from the source is copied to the
destination by repeatedly calling the read then write sys calls. This
process involves copying the data in the file from kernel memory space
to the user memory space and back again. Note that all this copying is
done by the kernel upon calling read or write. I would expect if this
can be moved completely into the kernel no memory copy operations would
be performed by the processor by using hardware DMA.
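
To make the two copies concrete, here is a minimal sketch of the read/write
loop described above. The function name copy_read_write and the 64 KB buffer
size are assumptions chosen for illustration; they are not taken from cp or
install.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: copy src to dst with the classic read()/write() loop.
 * Returns 0 on success, -1 on error. */
int copy_read_write(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    char buf[65536];            /* user-space bounce buffer; the size is arbitrary */
    int ok = 0;
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    for (;;) {
        /* Copy #1: kernel memory -> buf */
        ssize_t n = read(in, buf, sizeof buf);
        if (n == 0) {           /* end of file */
            ok = 1;
            break;
        }
        if (n < 0) {
            if (errno == EINTR)
                continue;
            break;
        }
        /* Copy #2: buf -> kernel memory; write() may be partial */
        ssize_t done = 0;
        while (done < n) {
            ssize_t w = write(out, buf + done, (size_t)(n - done));
            if (w < 0 && errno == EINTR)
                continue;
            if (w <= 0)
                break;
            done += w;
        }
        if (done < n)
            break;
    }

    close(in);
    if (close(out) < 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Each pass through the loop moves the same bytes across the kernel/user
boundary twice, which is the cost the proposal is trying to avoid.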

On my system a copy takes about 1s of CPU time per 20MB copied (PII
300 MHz), much of which I expect is spent just copying memory. This
figure seems a bit high for copying memory, so someone please correct me
if I am wrong.


Implementing these features, especially copy on write, will not, I expect,
be trivial. In addition, code that copies files, like cp, must be
modified to take advantage of these features.

Will many other users benefit from these features? Will implementing
them (especially copy on write) cause an excessive addition to the code
of the kernel?

Quinn Harris ([email protected])


2001-12-08 04:00:41

by H. Peter Anvin

Subject: Re: File copy system call proposal

Followup to: <[email protected]>
By author: Quinn Harris <[email protected]>
In newsgroup: linux.dev.kernel
>
> All kernel copy:
> Commands like cp and install open the source and destination file using
> the open sys call. The data from the source is copied to the
> destination by repeatedly calling the read then write sys calls. This
> process involves copying the data in the file from kernel memory space
> to the user memory space and back again. Note that all this copying is
> done by the kernel upon calling read or write. I would expect if this
> can be moved completely into the kernel no memory copy operations would
> be performed by the processor by using hardware DMA.
>

mmap(source file);
write(target file, mmap region);
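
Spelled out, the two lines above might look like the following sketch. The
function name copy_mmap_write is made up for illustration, error handling is
minimal, and the whole file is mapped in one piece, which assumes it fits in
the address space.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map the source file and hand the mapping to write().
 * Returns 0 on success, -1 on error. */
int copy_mmap_write(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    int rc = -1;
    struct stat st;
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    if (fstat(in, &st) == 0) {
        size_t len = (size_t)st.st_size;
        if (len == 0) {
            rc = 0;             /* mmap() rejects a zero length; nothing to copy */
        } else {
            /* No read(): the file's pages are mapped into our address space. */
            void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, in, 0);
            if (map != MAP_FAILED) {
                const char *p = map;
                size_t left = len;
                while (left > 0) {
                    /* write() still copies from the mapping into the kernel. */
                    ssize_t w = write(out, p, left);
                    if (w <= 0)
                        break;
                    p += w;
                    left -= (size_t)w;
                }
                if (left == 0)
                    rc = 0;
                munmap(map, len);
            }
        }
    }

    close(in);
    if (close(out) < 0)
        rc = -1;
    return rc;
}
```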

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-12-08 04:25:59

by Christian Lavoie

Subject: Re: File copy system call proposal

On Friday 07 December 2001 22:42, Quinn Harris wrote:

You might be interested in reading about the MIT exokernel, which proposes much
the same thing you are...

http://www.pdos.lcs.mit.edu/exo.html

... and ends up saying that it's a _huge_ performance boost in several
typical operations.

--
Christian Lavoie
[email protected]

2001-12-08 06:07:44

by Quinn Harris

Subject: Re: File copy system call proposal

On Fri, 2001-12-07 at 21:00, H. Peter Anvin wrote:
> Followup to: <[email protected]>
> By author: Quinn Harris <[email protected]>
> In newsgroup: linux.dev.kernel
> >
> > All kernel copy:
> > Commands like cp and install open the source and destination file using
> > the open sys call. The data from the source is copied to the
> > destination by repeatedly calling the read then write sys calls. This
> > process involves copying the data in the file from kernel memory space
> > to the user memory space and back again. Note that all this copying is
> > done by the kernel upon calling read or write. I would expect if this
> > can be moved completely into the kernel no memory copy operations would
> > be performed by the processor by using hardware DMA.
> >
>
> mmap(source file);
> write(target file, mmap region);
>
> -hpa

mmap will indeed get a file into user mem space without any memcopy
operation. But as far as I can tell from examining generic_file_write
(in mm/filemap.c) used by ext2 and I assume many others, a write will
copy the memory even if it was mapped via mmap. Am I missing
something? This isn't true if kiobuf is used but as I understand it,
this bypasses the buffer cache.

I would like to see a zero-memcopy file copy. A file copy that would
read the file into the buffer cache if it's not already there similar to
a normal read then write it back to disk from the cache without
duplicating the pages. This would probably lead to modifying the buffer
cache to allow multiple buffer_heads to refer to the same data which
might not be worth the overhead.

One might implement such a thing by attempting in the generic_file_read
to determine if the memory range is an actual page or pages that can be
eventually written to the disk without a memcopy. This could
conceivably keep duplicated data in different files from taking up
duplicate pages. But what is the chance that there is any sizeable
number of identical pages in the buffer cache due to anything but a file
copy?


2001-12-08 13:54:57

by Daniel Phillips

Subject: Re: File copy system call proposal

On December 8, 2001 07:03 am, Quinn Harris wrote:
> On Fri, 2001-12-07 at 21:00, H. Peter Anvin wrote:
> > > Commands like cp and install open the source and destination file using
> > > the open sys call. The data from the source is copied to the
> > > destination by repeatedly calling the read then write sys calls. This
> > > process involves copying the data in the file from kernel memory space
> > > to the user memory space and back again. Note that all this copying is
> > > done by the kernel upon calling read or write. I would expect if this
> > > can be moved completely into the kernel no memory copy operations would
> > > be performed by the processor by using hardware DMA.
> >
> > mmap(source file);
> > write(target file, mmap region);
>
> mmap will indeed get a file into user mem space without any memcopy
> operation. But as far as I can tell from examining generic_file_write
> (in mm/filemap.c) used by ext2 and I assume many others, a write will
> copy the memory even if it was mapped via mmap. Am I missing
> something? This isn't true if kiobuf is used but as I understand it,
> this bypasses the buffer cache.
                    ^^^^^^---page

> I would like to see a zero-memcopy file copy. A file copy that would
> read the file into the buffer cache if it's not already there similar to
> a normal read then write it back to disk from the cache without
> duplicating the pages. This would probably lead to modifying the buffer
> cache to allow multiple buffer_heads to refer to the same data which
                          ^^^^^^^^^^^^---struct page
> might not be worth the overhead.
>
> One might implement such a thing by attempting in the generic_file_read
                                   generic_file_write---^^^^^^^^^^^^^^^^^
> to determine if the memory range is an actual page or pages that can be
> eventually written to the disk without a memcopy.

There's some merit to this idea. As Peter pointed out, an in-kernel cp isn't
needed: mmap+write does the job. The question is, how to avoid the
copy_from_user and double caching of data?

Generic_file_write would have to determine that the source range is an mmap,
then do the required xxx_get_block's (somehow) to determine the physical
destination of the write, then finally use kio to dma the in-memory data to
the destination. (Never mind that get_block isn't a vfs method, that's a
detail ;-)

The obvious problem is that, should somebody subsequently read the file, it's
not in cache. Oops, so much for the performance gain. So some mechanism
would have to be devised to get hold of the original page data by way of the
destination inode, and to keep that consistent through the various
combinations of events that can occur to the two files involved. This is
where things start to diverge quite a lot from the current page cache design,
so if you're interested in pursuing this idea, a good way to start would be
to find out why.

Before you put a lot of energy into it, you might consider measuring the
actual cost of the memory copy versus the two disk transfers.
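
One rough way to put a number on the memory-copy side is sketched below. It
times memcpy() alone with clock(), so it says nothing about disk transfers
or system call overhead. The function name memcpy_cpu_seconds is
hypothetical, and the empty asm statement is a GCC/Clang extension used only
to keep the compiler from optimizing the copies away.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: CPU seconds spent copying `total` bytes through
 * buffers of `chunk` bytes. Returns -1.0 if the buffers cannot be allocated. */
double memcpy_cpu_seconds(size_t chunk, size_t total)
{
    assert(chunk > 0);

    char *src = malloc(chunk);
    char *dst = malloc(chunk);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    /* Touch both buffers so their pages are really allocated before timing. */
    memset(src, 0xA5, chunk);
    memset(dst, 0, chunk);

    clock_t start = clock();
    for (size_t done = 0; done < total; done += chunk) {
        memcpy(dst, src, chunk);
        /* Compiler barrier (GCC/Clang): treat dst as read after each copy. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_t end = clock();

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling memcpy_cpu_seconds(4u << 20, 20u << 20), with a chunk well beyond
the CPU caches, gives a figure for 20MB that can be set beside the CPU time
reported by time for cp on the same machine.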

--
Daniel

2001-12-08 19:27:38

by Quinn Harris

Subject: Re: File copy system call proposal

On Sat, 2001-12-08 at 10:39, Thomas Cataldo wrote:
> On Sat, 2001-12-08 at 04:42, Quinn Harris wrote:
> > I would like to propose implementing a file copy system call.
> > I expect the initial reaction to such a proposal would be "feature
> > bloat" but I believe some substantial benefits can be seen possibly
> > making it worthwhile, primarily the following:
>
> I think
>
> ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
>
> is what you are looking for, isn't it ?
>
> >
> > Copy on write:
> > From my experience most files that are copied on the same partition are
> > copied from a source code directory (e.g. /usr/src/{src dir}) to somewhere
> > else in /usr. These copied files are seldom modified but usually
> > truncated (when copied over again).
> > Instead of actually copying the file in these circumstances something
> > similar to a hard link could be created. But unlike a hard link, when
> > data is written to the file a real duplicate of the file (or possibly
> > part of the file) will be created. This is basically identical to the
> > way a processes memory space is duplicated on a fork. To create an
> > illusion of an actual copied file the file system will need to
> > explicitly support this feature. This can also eliminate duplication in
> > the buffer cache when a file is copied.
> >
> > This feature would drastically reduce the time taken to install a
> > program from a compiled source tarball. I also expect on my system this
> > feature would save about 1/6 of my hard drive space. Of course this
> > wouldn't affect performance if the source and destination files are on
> > different partitions.
> >
> > All kernel copy:
> > Commands like cp and install open the source and destination file using
> > the open sys call. The data from the source is copied to the
> > destination by repeatedly calling the read then write sys calls. This
> > process involves copying the data in the file from kernel memory space
> > to the user memory space and back again. Note that all this copying is
> > done by the kernel upon calling read or write. I would expect if this
> > can be moved completely into the kernel no memory copy operations would
> > be performed by the processor by using hardware DMA.
> >
> > On my system a copy takes about 1s of the CPU time per 20MB copied (PII
> > 300Mhz) much of which I expect is spent just copying memory. This
> > figure seems a bit high to copy memory so someone please correct me if I
> > am wrong.
> >
> >
> > Implementing these features especially the copy on write I expect will
> > not be trivial. In addition code that copies files like cp must be
> > modified to take advantage of these features.
> >
> > Will many other users benefit from these features? Will implementing
> > them (especially copy on write) cause an excessive addition to the code
> > of the kernel?
> >
> > Quinn Harris ([email protected])
> >
>
>

I wasn't aware of the sendfile system call. But it appears that, just
like the mmap/write method suggested by H. Peter Anvin, a memory copy is
still performed when copying files from disk, to duplicate the data for
the buffer cache. This would undoubtedly be faster than repeatedly
calling read and write, as it avoids one memory copy. Yet GNU
fileutils-4.1, which cp and install are part of, uses the read/write
method. I expect this is primarily because of portability issues, but I
wouldn't think the use of mmap would cause portability issues.

In fact a patch for fileutils that does this is available at
http://mail.gnu.org/pipermail/bug-fileutils/2001-May/001700.html and,
using time on cp, an mmap copy appears to require nearly half the CPU time
of the normal read/write copy. This would suggest to me that eliminating
the memcopy on the call to write would allow even the largest file copies
to be performed with only nominal support from the processor.



2001-12-08 23:12:44

by linux-kernel

Subject: Re: File copy system call proposal

In article <[email protected]>,
Quinn Harris <[email protected]> writes:
> I wasn't aware of the sendfile system call. But it appears that just
> like the mmap, write method suggested by H. Peter Anvin a memory copy is
> still performed when copying files from discs to duplicate the data for
> the buffer cache. This would undoubtedly be faster than repeatedly
> calling read and write as it avoids one mem copy. Yet GNU
> fileutils-4.1, that cp and install are part of, uses the read/write
> method. I expect this is primarily because of portability issues but I
> wouldn't think the use of mmap would cause portability issues.

It does in fact. On some systems locks and mmaps are mutually exclusive.

2001-12-09 00:20:05

by H. Peter Anvin

Subject: Re: File copy system call proposal

Followup to: <[email protected]>
By author: Daniel Phillips <[email protected]>
In newsgroup: linux.dev.kernel
>
> There's some merit to this idea. As Peter pointed out, an in-kernel cp isn't
> needed: mmap+write does the job. The question is, how to avoid the
> copy_from_user and double caching of data?
>

One thing that one could do for an in-kernel copy is to extend
sendfile() to support any kind of file descriptor. That'd be a very
clean way to do it.
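
For comparison, the sendfile(2) man page on current Linux states that since
kernel 2.6.33 out_fd may be any file, whereas earlier kernels required a
socket. Under that assumption, a file-to-file copy could be sketched as
follows; the function name copy_sendfile is made up for illustration, and
the sketch will not work on kernels that still require a socket.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: copy src to dst without a user-space buffer.
 * Assumes a kernel where sendfile() accepts a regular file as out_fd.
 * Returns 0 on success, -1 on error. */
int copy_sendfile(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    int rc = -1;
    struct stat st;
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    if (fstat(in, &st) == 0) {
        off_t left = st.st_size;
        while (left > 0) {
            /* NULL offset: read from in's file position and advance it.
             * One call may transfer fewer bytes than requested. */
            ssize_t n = sendfile(out, in, NULL, (size_t)left);
            if (n <= 0)
                break;
            left -= n;
        }
        if (left == 0)
            rc = 0;
    }

    close(in);
    if (close(out) < 0)
        rc = -1;
    return rc;
}
```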

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-12-09 05:00:15

by Quinn Harris

Subject: Re: File copy system call proposal

On Sat, 2001-12-08 at 17:19, H. Peter Anvin wrote:
>
> One thing that one could do for an in-kernel copy is to extend
> sendfile() to support any kind of file descriptor. That'd be a very
> clean way to do it.
>

I think the best thing would be to extend generic_file_write
(mm/filemap.c) to recognize when it is writing a complete page to disk, in
which case it would not duplicate that page. (Issues with getting the
buffer cache to support this remain.) This should make either the
mmap/write or the sendfile approach zero-copy. I am given the impression
that this is just what the TCP/IP version of write does to make it
zero-copy.

2001-12-10 05:45:14

by Albert D. Cahalan

Subject: Re: File copy system call proposal

Daniel Phillips writes:

> There's some merit to this idea. As Peter pointed out,
> an in-kernel cp isn't needed: mmap+write does the job.
> The question is, how to avoid the copy_from_user and
> double caching of data?

No, mmap+write does not do the job. SMB file servers have
a remote copy operation. There shouldn't be any need to
pull data over the network only to push it back again!

The user-space copy operation is also highly likely to
lose stuff that the kernel would know about:

extended attributes (IRIX, OS/2, NT)
forks / extra streams (MacOS, NT)
creation time stamp (Microsoft: not ctime or mtime)
author (GNU HURD: person who created the file)
file type (MacOS)
creator app (MacOS)
unique ID (Win2K)
mandatory access control data (Trusted Foo)
non-UNIX permission bits (every other OS)
ACLs (NFSv4, NT, Solaris...)
translator (HURD)
trustees (NetWare)

2001-12-10 08:26:56

by Hans Reiser

Subject: Re: File copy system call proposal

We'll have functionality resembling this in reiser4(). It is a little
too early to get into the details on it though. Quinn/Albert are right.

Hans

2001-12-10 10:37:40

by Pavel Machek

Subject: Re: File copy system call proposal

Hi!

> I would like to propose implementing a file copy system call.
> I expect the initial reaction to such a proposal would be "feature
> bloat" but I believe some substantial benefits can be seen possibly
> making it worthwhile, primarily the following:
>
> Copy on write:

You want cowlink() syscall, not copy() syscall. If they are on different
partitions, let userspace do the job.

> Will many other users benefit from these features? Will implementing
> them (especially copy on write) cause an excessive addition to the code
> of the kernel?

Hmm, I have almost 20 different copies of kernel on my systems.... Yep it
would save me a *lot* of space.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2001-12-10 11:21:55

by Denis Vlasenko

Subject: Re: File copy system call proposal

On Sunday 09 December 2001 13:35, Pavel Machek wrote:
> > I would like to propose implementing a file copy system call.
> > I expect the initial reaction to such a proposal would be "feature
> > bloat" but I believe some substantial benefits can be seen possibly
> > making it worthwhile, primarily the following:
> >
> > Copy on write:
>
> You want cowlink() syscall, not copy() syscall. If they are on different
> partitions, let userspace do the job.

A filesystem with support of COW files would be *extremely* useful,
especially when writes trigger COW on block level, not file-by-file.
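
As a toy illustration of COW at block level, the sketch below models files
as arrays of indices into a shared, reference-counted block store. It is an
in-memory model only; every name in it (store, cow_file, cow_link, cow_write
and so on) is invented for the example and corresponds to no kernel or
filesystem interface.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE  16
#define MAX_BLOCKS  64
#define FILE_BLOCKS 4

struct block    { int refs; char data[BLOCK_SIZE]; };
struct cow_file { int blk[FILE_BLOCKS]; };      /* indices into store[] */

static struct block store[MAX_BLOCKS];          /* shared block store */

/* Find a free block, zero it and give it one reference; -1 if full. */
static int alloc_block(void)
{
    for (int i = 0; i < MAX_BLOCKS; i++) {
        if (store[i].refs == 0) {
            store[i].refs = 1;
            memset(store[i].data, 0, BLOCK_SIZE);
            return i;
        }
    }
    return -1;
}

int cow_create(struct cow_file *f)
{
    for (int i = 0; i < FILE_BLOCKS; i++) {
        f->blk[i] = alloc_block();
        if (f->blk[i] < 0)
            return -1;
    }
    return 0;
}

/* "Copy" src to dst by sharing every block: no data moves. */
void cow_link(struct cow_file *dst, const struct cow_file *src)
{
    for (int i = 0; i < FILE_BLOCKS; i++) {
        dst->blk[i] = src->blk[i];
        store[src->blk[i]].refs++;
    }
}

/* Write into one block; only that block is duplicated, and only if shared. */
int cow_write(struct cow_file *f, int idx, const char *data, size_t len)
{
    assert(idx >= 0 && idx < FILE_BLOCKS && len <= BLOCK_SIZE);
    int b = f->blk[idx];
    if (store[b].refs > 1) {
        int nb = alloc_block();
        if (nb < 0)
            return -1;
        memcpy(store[nb].data, store[b].data, BLOCK_SIZE);
        store[b].refs--;
        f->blk[idx] = b = nb;
    }
    memcpy(store[b].data, data, len);
    return 0;
}

const char *cow_read(const struct cow_file *f, int idx)
{
    return store[f->blk[idx]].data;
}

int blocks_in_use(void)
{
    int n = 0;
    for (int i = 0; i < MAX_BLOCKS; i++)
        n += store[i].refs > 0;
    return n;
}
```

After cow_link() the two files occupy the space of one; a later cow_write()
to a single block costs one extra block rather than a second copy of the
whole file.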

And it will definitely need in-kernel copyfile()/cowlink()/whatever name you
want...

> > Will many other users benefit from these features? Will implementing
> > them (especially copy on write) cause an excessive addition to the code
> > of the kernel?

> Hmm, I have almost 20 different copies of kernel on my systems.... Yep it
> would save me a *lot* of space.

Me too
--
vda

2001-12-10 11:51:26

by Albert D. Cahalan

Subject: Re: File copy system call proposal

>> I would like to propose implementing a file copy system call.
>> I expect the initial reaction to such a proposal would be "feature
>> bloat" but I believe some substantial benefits can be seen possibly
>> making it worthwhile, primarily the following:
>>
>> Copy on write:
>
> You want cowlink() syscall, not copy() syscall. If they are on different
> partitions, let userspace do the job.

That looks like a knee-jerk reaction to stuff going in the kernel.
I want maximum survival of non-UNIX metadata and maximum performance
for this common operation. Let's say you are telecommuting, and...

You have mounted an SMB share from a Windows XP server.
You need to copy a file that has NTFS security data.
The file is 99 GB in size, on the far side of a 33.6 kb/s modem link.
Now copy this file!
Better yet, maybe you have two mount points or mounted two shares.

????

Filesystem-specific user tools are abominations BTW. We don't
have reiser-mv, reiser-cp, reiser-gmc, reiser-rm, etc.

2001-12-10 12:14:21

by Pavel Machek

Subject: Re: File copy system call proposal

Hi!

> >> I would like to propose implementing a file copy system call.
> >> I expect the initial reaction to such a proposal would be "feature
> >> bloat" but I believe some substantial benefits can be seen possibly
> >> making it worthwhile, primarily the following:
> >>
> >> Copy on write:
> >
> > You want cowlink() syscall, not copy() syscall. If they are on different
> > partitions, let userspace do the job.
>
> That looks like a knee-jerk reaction to stuff going in the kernel.
> I want maximum survival of non-UNIX metadata and maximum performance
> for this common operation. Let's say you are telecommuting, and...

It would be very ugly if cp -a started behaving differently after you
upgrade it to use copyfile(). Better preserve only metadata you
"know".
Pavel
--
Casualities in World Trade Center: 6453 dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

2001-12-10 14:51:27

by Hans Reiser

Subject: Re: File copy system call proposal

Albert D. Cahalan wrote:

>>>I would like to propose implementing a file copy system call.
>>>I expect the initial reaction to such a proposal would be "feature
>>>bloat" but I believe some substantial benefits can be seen possibly
>>>making it worthwhile, primarily the following:
>>>
>>>Copy on write:
>>>
>>You want cowlink() syscall, not copy() syscall. If they are on different
>>partitions, let userspace do the job.
>>
>
>That looks like a knee-jerk reaction to stuff going in the kernel.
>I want maximum survival of non-UNIX metadata and maximum performance
>for this common operation. Let's say you are telecommuting, and...
>
>You have mounted an SMB share from a Windows XP server.
>You need to copy a file that has NTFS security data.
>The file is 99 GB in size, on the far side of a 33.6 kb/s modem link.
>Now copy this file!
>Better yet, maybe you have two mount points or mounted two shares.
>
>????
>
>Filesystem-specific user tools are abominations BTW. We don't
>have reiser-mv, reiser-cp, reiser-gmc, reiser-rm, etc.
>
I think that it is legitimate to first implement a piece of
functionality in one filesystem, and only after it has that real design
stability that comes from real code that users have critiqued,
proselytize to other filesystems. The disadvantage to the approach
though is that it advantages one filesystem, and can cause you to lose
adherents in the other filesystem camps. For instance, the journaling
code of ext3, we just weren't willing to wait, and conversely I think
that ext3/XFS aren't willing to wait for how we do extended attributes.
However, I suspect that how we want to do extended attributes (that is,
not to do them, but instead to do a better file API) is going to be a
tough sell until it is working code. I don't think I could have
convinced the ext2 camp of the advantages of trees and tail combining
without shipping code that did it (ok, I didn't really try, but somehow
I think this.....). Now they seem to think it is good stuff, and are
responding with a rather interesting htree implementation that, if I
understand right, does fewer memory copies due to packing less tightly
(which we may have to contemplate doing as a performance tweak for
reiser4.1).

So in sum, I think that the right approach is to say: "Hey, I think this
is a nice feature, any other filesystems interested?", and if there is
no enthusiasm then go and implement it and let the users convince them
it belongs in VFS.

So, for the record, I think that sendfile for all file types, and
cowlink, are both different features and worthwhile.

Reiser4 needs a plugin interconnect, and I think this plugin
interconnect should translate from one filetype to another, and if you
say reiser4("fileA<-fileB") it should accomplish copy() functionality.
reiser4() will also accomplish subfile copying of ranges, etc., if you
specify them, but that is going beyond this thread.

I think I don't have the slightest chance of getting the ext2 crowd
interested in these features before it works in reiserfs based on their
remarks at the linux kernel summit. There are some folks like me who
think filesystems are behind other namespaces such as search engines and
databases and should catch up, and others who think putting keyword
search or transactions into the kernel is just nutty. (Though it is
actually a lot easier than putting balanced trees into the kernel, but....)

I'd like to thank Albert for persisting here with reminding people of
his example of where Windows does it better, API design wise (it isn't
his first email on the topic).

Hans

2001-12-10 15:17:19

by Daniel Phillips

Subject: Re: File copy system call proposal

On December 10, 2001 06:44 am, Albert D. Cahalan wrote:
> Daniel Phillips writes:
>
> > There's some merit to this idea. As Peter pointed out,
> > an in-kernel cp isn't needed: mmap+write does the job.
> > The question is, how to avoid the copy_from_user and
> > double caching of data?
>
> No, mmap+write does not do the job. SMB file servers have
> a remote copy operation. There shouldn't be any need to
> pull data over the network only to push it back again!

Hi Albert,

I don't get it, you're saying that this zero-copy optimization, which happens
entirely within the vfs, shouldn't be done because smb can't do it over a
network?

> The user-space copy operation is also highly likely to
> lose stuff that the kernel would know about:
>
> extended attributes (IRIX, OS/2, NT)
> forks / extra streams (MacOS, NT)
> creation time stamp (Microsoft: not ctime or mtime)
> author (GNU HURD: person who created the file)
> file type (MacOS)
> creator app (MacOS)
> unique ID (Win2K)
> mandatory access control data (Trusted Foo)
> non-UNIX permission bits (every other OS)
> ACLs (NFSv4, NT, Solaris...)
> translator (HURD)
> trustees (NetWare)

I'd think the mmap-based copy would only use the technique on the data
portion of a file.

Note that I'm not seriously proposing to do this, there are about 1,000 more
important things. I'm suggesting the original poster go take a look at the
issues involved in making it happen.

--
Daniel

2001-12-10 17:45:47

by Petr Vandrovec

Subject: Re: File copy system call proposal

On 10 Dec 01 at 16:19, Daniel Phillips wrote:
> On December 10, 2001 06:44 am, Albert D. Cahalan wrote:
> >
> > No, mmap+write does not do the job. SMB file servers have
> > a remote copy operation. There shouldn't be any need to
> > pull data over the network only to push it back again!
>
> Hi Albert,
>
> I don't get it, you're saying that this zero-copy optimization, which happens
> entirely within the vfs, shouldn't be done because smb can't do it over a
> network?

VFS can do this optimization (but why...), but having an FS-specific
sendfile would be nice too - the FS can verify whether both source &
destination are on the same filesystem, and if they are, it can perform a
server-side file copy (if the server's file copy implementation can copy an
arbitrarily long chunk at an arbitrary offset into another file at some
other offset, like Netware's NWFileServerFileCopy() does).

> > trustees (NetWare)
>
> I'd think the mmap-based copy would only use the technique on the data
> portion of a file.

At least from my experience with Netware I can say that a copy which copies
file trustees happens in at most 1% of all copies (and on Netware
trustees flow down through the tree, so you usually have no trustees
assigned to leaf files) - and this 1% is when you do backup and restore.
In all other cases it would be a great surprise if all users who had
rights to the old copy had these rights to the new copy too.
Best regards,
Petr Vandrovec
[email protected]

2001-12-13 10:02:41

by Andreas Dilger

Subject: Re: File copy system call proposal

On Dec 10, 2001 16:19 +0100, Daniel Phillips wrote:
> On December 10, 2001 06:44 am, Albert D. Cahalan wrote:
> > No, mmap+write does not do the job. SMB file servers have
> > a remote copy operation. There shouldn't be any need to
> > pull data over the network only to push it back again!
>
> I don't get it, you're saying that this zero-copy optimization, which happens
> entirely within the vfs, shouldn't be done because smb can't do it over a
> network?

No, I think he means just the opposite - that having a "copy(2)" syscall
would greatly _help_ SMB in that the copy could be done entirely at the
server side, rather than having to pull _all_ of the data to the client
and then sending it back again.

When I was working on another network storage system (formerly called
Lustre, don't know what it is called now) we had a "copy" primitive in
the VFS interface, and there were lots of useful things you could do
with it.

Consider the _very_ common case (that nobody has mentioned yet) where you
are editing a large file. When you write to the file, the editor copies
the file to a backup, then immediately truncates the original file and
writes the new data there. What would be _far_ preferable is to just
"copy" the file to the new location within the kernel (zero work), and
then the new data will be the only I/O going to disk. This requires
some smarts on the part of the filesystem (essentially COW semantics),
but it is well worth it on network storage. Even for "dumb" filesystems,
we could save the two (or one, with mmap) userspace copies, and optimize
to boot (because we know the full size of the file in advance).

What about "link" you say? Well, emacs at least does a full copy instead
of a link, so that things like "cp -al linux-2.4.17 linux-2.4.17-new" will
work properly when you edit files in one of those trees. Not that I'm
an emacs user...

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-12-19 20:15:20

by Pavel Machek

Subject: Re: File copy system call proposal

Hi!

> No, I think he means just the opposite - that having a "copy(2)" syscall
> would greatly _help_ SMB in that the copy could be done entirely at the
> server side, rather than having to pull _all_ of the data to the client
> and then sending it back again.
>
> When I was working on another network storage system (formerly called
> Lustre, don't know what it is called now) we had a "copy" primitive in
> the VFS interface, and there were lots of useful things you could do
> with it.
>
> Consider the _very_ common case (that nobody has mentioned yet) where you
> are editing a large file. When you write to the file, the editor copies
> the file to a backup, then immediately truncates the original file and
> writes the new data there. What would be _far_ preferable is to
> just

Are you sure? I think editor just _moves_ original to backup.
Pavel

--
"I do not steal MS software. It is not worth it."
-- Pavel Kankovsky

2001-12-19 20:25:42

by Daniel Phillips

Subject: Re: File copy system call proposal

On December 13, 2001 10:17 pm, Pavel Machek wrote:
> [Andreas Dilger <[email protected]> wrote:]
> > No, I think he means just the opposite - that having a "copy(2)" syscall
> > would greatly _help_ SMB in that the copy could be done entirely at the
> > server side, rather than having to pull _all_ of the data to the client
> > and then sending it back again.
> >
> > When I was working on another network storage system (formerly called
> > Lustre, don't know what it is called now) we had a "copy" primitive in
> > the VFS interface, and there were lots of useful things you could do
> > with it.
> >
> > Consider the _very_ common case (that nobody has mentioned yet) where you
> > are editing a large file. When you write to the file, the editor copies
> > the file to a backup, then immediately truncates the original file and
> > writes the new data there. What would be _far_ preferable is to
> > just
>
> Are you sure? I think editor just _moves_ original to backup.

Hi,

It would be so nice if all editors did that, but most don't according to the
tests I've done, especially the newer ones like kedit, gnome-edit etc. I
think this is largely due to developers not knowing why it's good to do it
this way.

--
Daniel

2001-12-20 10:10:12

by Pavel Machek

Subject: Re: File copy system call proposal

Hi!

> > > Consider the _very_ common case (that nobody has mentioned yet) where you
> > > are editing a large file. When you write to the file, the editor copies
> > > the file to a backup, then immediately truncates the original file and
> > > writes the new data there. What would be _far_ preferable is to
> > > just
> >
> > Are you sure? I think editor just _moves_ original to backup.
>
> Hi,
>
> It would be so nice if all editors did that, but most don't according to the
> tests I've done, especially the newer ones like kedit, gnome-edit etc. I
> think this is largely due to developers not knowing why it's good to do it
> this way.

They need to get a clue. No need to work around their bugs in the kernel.

Anyway, a copyfile syscall would be nice for other reasons. (Doing cp -a on
a kernel tree and then applying a patch, without waiting for the physical
copy to be done, would be handy.)
Pavel
--
Casualties in World Trade Center: 6453 dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

2001-12-20 13:39:14

by Svein Ove Aas

Subject: Re: File copy system call proposal

On Thursday 20. December 2001 11:09, Pavel Machek wrote:
>
> They need to get a clue. No need to work around their bugs in kernel.
>
> Anyway copyfile syscall would be nice for other reasons. (cp -a kernel
> tree then apply patch without waiting for physical copy to be done
> would be handy).
> Pavel

Never mind that it might save a great deal of space...
I often work with three or more different kernel trees, but the differences
are often trivial.
If the VFS created a COW node when I use cp -a, I would obviously save a
great deal of space; this goes for numerous other source trees too.

Now there's a real world example for you.

- Svein Ove Aas

2001-12-20 13:53:47

by Jakob Oestergaard

Subject: Re: File copy system call proposal

On Thu, Dec 20, 2001 at 02:38:50PM +0100, Svein Ove Aas wrote:
> On Thursday 20. December 2001 11:09, Pavel Machek wrote:
> >
> > They need to get a clue. No need to work around their bugs in kernel.
> >
> > Anyway copyfile syscall would be nice for other reasons. (cp -a kernel
> > tree then apply patch without waiting for physical copy to be done
> > would be handy).
> > Pavel
>
> Never mind that it might save a great deal of space...
> I often operate with three/more different kernel trees, but the differences
> are often trivial.
> If the VFS created a COW node when I use cp -a I would, obviously, save a
> great deal of space; this goes for numerous other source trees too.
>
> Now there's a real world example for you.

No graphical file manager would use it - how would you show progress
information to the user when copying a single huge file?

So, someone might hack up a 'cp' that used it, and in a few years when
everyone is at 2.4.x (where x >= version with copyfile()) maybe some
distribution would ship it.

Take a look at Win32, then have it. Then, look further, and you'll see
that they have system calls for just about everything else. It's
a slippery slope, leading to horrors like CreateProcess() which takes
TEN arguments, where about half of them are pointers to STRUCTURES.

I'm not saying that adding copyfile() would take us there immediately,
but we'd be heading in that direction, when you can get nearly all of the
speedup with mmap()+write() or the like anyway.
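
As a rough illustration of the mmap()+write() approach mentioned above, here is a minimal Python sketch. The function name `copy_with_mmap` and the 1 MiB chunk size are assumptions made for the example, not anything from this thread. It maps the source file read-only and hands slices of the mapping straight to write(), so the data is not first read into a separate user-space buffer; it makes no claim about how much speedup this yields on any given kernel.

```python
import mmap
import os


def copy_with_mmap(src_path, dst_path, chunk=1 << 20):
    """Copy src_path to dst_path by mmap()-ing the source and write()-ing from the map.

    Returns the number of bytes written. The name and chunk size are illustrative.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        if size == 0:
            return 0  # an empty file cannot be mapped; the empty destination already exists
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            done = 0
            with memoryview(mapped) as view:
                while done < size:
                    # os.write() may write fewer bytes than requested, so advance
                    # by the count it reports rather than by the chunk size.
                    done += os.write(dst.fileno(), view[done:done + chunk])
                    # A caller could report progress here after every chunk.
            return done
```

Because the loop runs in user space one chunk at a time, a file manager could update a progress bar between chunks, which is the point raised above about a single opaque copy call.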

Just my 0.02 Euro

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2001-12-20 14:01:00

by Jakob Oestergaard

Subject: Re: File copy system call proposal

On Thu, Dec 20, 2001 at 02:53:28PM +0100, Jakob Østergaard wrote:
...
> > Now there's a real world example for you.
>
> No graphical file manager would use it - how would you show progress
> information to the user when copying a single huge file?

Sorry for replying to my own mail - I shouldn't send mail while talking
to people at the same time...

The progress stuff is of course relevant only when you cannot do COW.

>
> So, someone might hack up a 'cp' that used it, and in a few years when
> everyone is at 2.4.x (where x >= version with copyfile()) maybe some
> distribution would ship it.
>
> Take a look at Win32, then have it. Then, look further, and you'll see
> that they have system calls for just about everything else. It's
> a slippery slope, leading to horrors like CreateProcess() which takes
> TEN arguments, where about half of them are pointers to STRUCTURES.

s/then have it/they have it/

Sorry,

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2001-12-20 14:32:25

by David Woodhouse

Subject: Re: File copy system call proposal


Jakob Oestergaard said:
> Take a look at Win32, they have it.

Which is partly why when you want to copy a large file on an SMB-exported
file system, the client host doesn't have to actually read it all and write
it back across the network - it can just issue a copyfile request.

--
dwmw2


2001-12-20 15:07:10

by George Greer

Subject: Re: File copy system call proposal

On Thu, 20 Dec 2001, David Woodhouse wrote:

>Jakob Oestergaard said:
>> Take a look at Win32, they have it.
>
>Which is partly why when you want to copy a large file on an SMB-exported
>file system, the client host doesn't have to actually read it all and write
>it back across the network - it can just issue a copyfile request.

What does a copy system call have to do with the file server program being
smart enough to do a copy locally? You can't do it with FTP (or at least
the ftpd I have) but it's certainly not because read()+write() are
insufficient.

-George Greer

2001-12-20 15:08:10

by David Woodhouse

Subject: Re: File copy system call proposal


George Greer said:
> What does a copy system call have to do with the file server program
> being smart enough to do a copy locally? You can't do it with FTP (or
> at least the ftpd I have) but it's certainly not because
> read()+write() are insufficient.

Think smbfs.

--
dwmw2


2001-12-20 21:37:37

by Jamie Lokier

Subject: Re: File copy system call proposal

Daniel Phillips wrote:
> > > Consider the _very_ common case (that nobody has mentioned yet) where you
> > > are editing a large file. When you write to the file, the editor copies
> > > the file to a backup, then immediately truncates the original file and
> > > writes the new data there. What would be _far_ preferable is to
> > > just
> >
> > Are you sure? I think editor just _moves_ original to backup.
>
> It would be so nice if all editors did that, but most don't according to the
> tests I've done, especially the newer ones like kedit, gnome-edit etc. I
> think this is largely due to developers not knowing why it's good to do it
> this way.

Moving the original to make a backup is a _bad_ thing if the original
might be hard-linked and you'd like all instances to be written to.
OTOH it's a good thing if you're using hard links for poor man's version
control (`cp -rl'). Hey :)

Emacs does this perfectly with `backup-by-copying-when-linked' (an
option you can change, but I like it on).
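
The difference between the two backup strategies can be seen with a small Python sketch. All function names here are made up for the illustration, and the last one is only a rough analogue of the Emacs option, not its actual implementation: it copies when the file has more than one link and renames otherwise.

```python
import os
import shutil
import tempfile


def save_by_renaming(path, new_text):
    """Back up by moving the original aside, then write a fresh file."""
    os.rename(path, path + "~")      # the old inode now lives under path~
    with open(path, "w") as f:       # a brand-new inode takes over the name
        f.write(new_text)


def save_by_copying(path, new_text):
    """Back up by copying, then rewrite the original inode in place."""
    shutil.copy2(path, path + "~")   # the backup gets its own inode
    with open(path, "w") as f:       # truncates and reuses the original inode
        f.write(new_text)


def save_copying_when_linked(path, new_text):
    """Copy only when other hard links exist; otherwise the cheap rename is safe."""
    if os.stat(path).st_nlink > 1:
        save_by_copying(path, new_text)
    else:
        save_by_renaming(path, new_text)


def demo(save):
    """Return what a second hard link sees after saving through the first name."""
    with tempfile.TemporaryDirectory() as d:
        a, b = os.path.join(d, "a.txt"), os.path.join(d, "b.txt")
        with open(a, "w") as f:
            f.write("old")
        os.link(a, b)                # b is a second name for the same inode
        save(a, "new")
        with open(b) as f:
            return f.read()
```

Here `demo(save_by_renaming)` returns "old" because the second link still names the inode that was moved to the backup, which is exactly what you want for `cp -rl` style version control. `demo(save_by_copying)` and `demo(save_copying_when_linked)` both return "new" because the shared inode is rewritten in place.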

-- Jamie

2001-12-23 23:06:35

by Pavel Machek

Subject: Re: File copy system call proposal

Hi!

> > Now there's a real world example for you.
>
> No graphical file manager would use it - how would you show progress
> information to the user when copying a single huge file?

They can't do that today (think writeback)...

> So, someone might hack up a 'cp' that used it, and in a few years when
> everyone is at 2.4.x (where x >= version with copyfile()) maybe some
> distribution would ship it.
>
> Take a look at Win32, then have it. Then, look further, and you'll see
> that they have system calls for just about everything else. It's

Windows is stupid. But copyfile is different from read+write -- it
allows an on-server copy and allows COW.
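
To make the contrast with read+write concrete, here is a hedged Python sketch. It uses `os.copy_file_range`, a later Linux interface that is not part of this thread, which is available in Python 3.8+ on Linux and is named here only as one way such a call can look from user space. Whether the kernel actually shares blocks or hands the copy to a server depends on the filesystem; the sketch only shows that the data need not pass through the process, and it falls back to a plain read/write loop when the call is missing or fails. The function name `copy_file` is an assumption for the example.

```python
import os


def copy_file(src_path, dst_path, chunk=1 << 20):
    """Copy a file, preferring an in-kernel copy and falling back to read+write.

    Returns the number of bytes copied. Illustrative only.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src_fd, dst_fd = src.fileno(), dst.fileno()
        remaining = os.fstat(src_fd).st_size
        in_kernel = hasattr(os, "copy_file_range")  # Linux with Python 3.8+ only
        copied = 0
        while remaining > 0:
            want = min(chunk, remaining)
            if in_kernel:
                try:
                    # The kernel moves the data; nothing enters this process.
                    n = os.copy_file_range(src_fd, dst_fd, want)
                except OSError:
                    in_kernel = False  # unsupported here: retry with read+write
                    continue
            else:
                buf = os.read(src_fd, want)  # kernel -> user space
                view = memoryview(buf)
                while view:                  # user space -> kernel, handling short writes
                    view = view[os.write(dst_fd, view):]
                n = len(buf)
            if n == 0:
                break  # the source shrank while we were copying
            copied += n
            remaining -= n
        return copied
```

Both paths use the descriptors' own file offsets, so switching to the fallback partway through continues from where the in-kernel copy stopped.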
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.