It may be horribly RTFM, but while writing a simple framework I realized
there is no libc/POSIX/whoknows
copy(const char* dest_file_name, const char* src_file_name)
What is the technical reason?
I understand that there may be little room for kernel-side
optimizations in this area, but I'm still surprised that I have to write
< the bits to clone the metadata of src_file_name on opening
dest_file_name >
    const int BUFSIZE = 1<<12;
    char buffer[BUFSIZE];
    ssize_t nrb, ret;
    /* read() returns 0 at end of file and -1 on error */
    while ((nrb = read(infd, buffer, BUFSIZE)) > 0) {
        ret = write(outfd, buffer, nrb);
        if (ret != nrb) {...}
    }
instead of something similar to:
    sys_fscopy(...)
regards
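A fuller version of the loop in the message above has to cope with short
writes and interrupted calls. The following is only a sketch under those
assumptions: the name copy_fd is made up for illustration, and it copies
data only, not metadata.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy everything readable from infd to outfd.
 * Returns 0 on success, -1 on error (errno is set by read/write). */
int copy_fd(int infd, int outfd)
{
    char buffer[1 << 12];
    ssize_t nrb;

    /* read() returns 0 at end of file, so stop there */
    while ((nrb = read(infd, buffer, sizeof buffer)) != 0) {
        if (nrb < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before any data: retry */
            return -1;
        }
        /* write() may accept fewer bytes than requested; loop until done */
        for (ssize_t off = 0; off < nrb; ) {
            ssize_t nwb = write(outfd, buffer + off, (size_t)(nrb - off));
            if (nwb < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += nwb;
        }
    }
    return 0;
}
```

Opening the two descriptors, and cloning ownership, mode and timestamps,
is left to the caller, which is the part the message says is missing.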
Mr. Rossetti,
It is horribly RTFM.
man 2 sendfile is what you're after.
Brad
=====
Brad Chapman
Permanent e-mail: [email protected]
sendfile(2) - ?
--
Ihar 'Philips' Filipau / with best regards from Saarbruecken.
-- _ _ _
"... and for $64000 question, could you get yourself |_|*|_|
vaguely familiar with the notion of on-topic posting?" |_|_|*|
-- Al Viro @ LKML |*|*|*|
On Mon, 10 Nov 2003, Bradley Chapman wrote:
> Mr. Rossetti,
>
> It is horribly RTFM.
>
> man 2 sendfile is what you're after.
Hmm.
Can sendfile() copy extended attributes and ACLs?
(I don't think copy is the right candidate for a syscall.)
MOJE
--
Konir Tomas
Czech Republic
Brno
ICQ 25849167
On Monday 10 November 2003 06:08, Ihar 'Philips' Filipau wrote:
> sendfile(2) - ?
I don't think that is what he was referring to. The sample
code is strictly user-mode file->file copying.
> Davide Rossetti wrote:
> > it may be orribly RTFM... but writing a simple framework I realized
> > there is no libc/POSIX/whoknows
> > copy(const char* dest_file_name, const char* src_file_name)
> >
> > What is the technical reason???
It isn't an application for the kernel.
> > I understand that there may be little space for kernel side
> > optimizations in this area but anyway I'm surprised I have to write
> >
> > < the bits to clone the metadata of src_file_name on opening
> > dest_file_name >
> > const int BUFSIZE = 1<<12;
> > char buffer[BUFSIZE];
> > int nrb;
> > while((nrb = read(infd, buffer, BUFSIZE)) > 0) {
> > ret = write(outfd, buffer, nrb);
> > if(ret != nrb) {...}
> > }
> >
> > instead of something similar to:
> > sys_fscopy(...)
It is too simple to implement in user mode.
There are some other issues too:
The security context of the output depends on the user process.
If it is a privileged process (i.e., one that may change the context of the
result), then the user process has to set up that context before
the file is copied.
There are also some issues with mandatory security controls. If the file
is copied in kernel mode, then the previous labels could be automatically
carried over to the resulting file... But that may not be what you
want (and frequently, it isn't).
Now back to the copy. You don't have to use a read/write loop; mmap
is faster. And this is the other reason for not doing it in kernel mode:
buffer management of this type is much easier in user space, since the
copy procedure doesn't have to deal with memory limitations, cache flushes,
or page faulting in processes that are unrelated to the copy but are
affected by the cache pressure it causes.
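The mmap approach mentioned above can be sketched as follows. This is only
an illustration, not code from the thread: the name mmap_copy and the 1 MiB
window size are assumptions. Mapping one window at a time keeps the address
space use bounded, which matters for large files on 32-bit machines.
Whether this beats a read/write loop is system dependent.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed window size: 1 MiB, a multiple of the page size on common systems */
#define COPY_WINDOW ((off_t)1 << 20)

/* Copy the regular file open on infd to outfd (which must be open for
 * reading and writing) by mapping both files one window at a time.
 * Returns 0 on success, -1 on error. */
int mmap_copy(int infd, int outfd)
{
    struct stat st;

    if (fstat(infd, &st) < 0)
        return -1;
    /* The destination needs its final size before it can be mapped */
    if (ftruncate(outfd, st.st_size) < 0)
        return -1;

    for (off_t off = 0; off < st.st_size; off += COPY_WINDOW) {
        off_t left = st.st_size - off;
        size_t len = (size_t)(left < COPY_WINDOW ? left : COPY_WINDOW);

        void *src = mmap(NULL, len, PROT_READ, MAP_PRIVATE, infd, off);
        if (src == MAP_FAILED)
            return -1;
        void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         outfd, off);
        if (dst == MAP_FAILED) {
            munmap(src, len);
            return -1;
        }
        memcpy(dst, src, len);      /* the page faults do the actual I/O */
        munmap(src, len);
        munmap(dst, len);
    }
    return 0;
}
```

One caveat of this style: an I/O problem while touching the mapping (a full
disk, for example) may show up as a signal rather than as a return value,
so the error handling is less direct than with write().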
On Mon, Nov 10, 2003 at 07:29:15AM -0600, Jesse Pollard wrote:
> Now back to the copy.. You don't have to use a read/write loop- mmap
> is faster. And this is the other reason for not doing it in Kernel mode.
Actually, last I checked, read/write was faster. Linus
explained why a month or two ago.
--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer
On Mon, 2003-11-10 at 07:29 -0600, Jesse Pollard wrote:
> > > sys_fscopy(...)
>
> It is too simple to implement in user mode.
Is it? Please explain the simple steps which cp(1) should take in order
to observe that it is being asked to duplicate a file on a file system
such as CIFS (or NFSv4?) which allows the client to issue a 'copy file'
command over the network without actually transferring the data twice,
and to invoke such a command.
--
dwmw2
On Monday 10 November 2003 09:19, David Woodhouse wrote:
> On Mon, 2003-11-10 at 07:29 -0600, Jesse Pollard wrote:
> > > > sys_fscopy(...)
> >
> > It is too simple to implement in user mode.
>
> Is it? Please explain the simple steps which cp(1) should take in order
> to observe that it is being asked to duplicate a file on a file system
> such as CIFS (or NFSv4?) which allows the client to issue a 'copy file'
> command over the network without actually transferring the data twice,
> and to invoke such a command.
Ah. That is an optimization question, not a question of kernel/user mode.
Since the error checking for source and destination both include doing
a stat and statfs, the device information (and FS info) can both be retrieved.
And mmap doesn't require data transfer "twice" (local copy), since that copy
only page-faults. (Read/write may be faster for some files, though: I thought
that was true for small files that fit in cache, with large files being
faster via mmap, depending on the page size; the tradeoff would be system
dependent.)
And since both source and destination may be remote, you do get to decide
based on the source and destination devices: if they are the same, and one
is on a remote node, then BOTH will be on the remote node, and you get to
use the CIFS/NFS file copy (check the docs on stat/statfs for additional info).
I don't believe it works when source and destination are on DIFFERENT remote
nodes, though.
That is strictly up to the implementation of cp/mv.
You will lose portability of cp/mv, though. (Of course, you also lose
it with a syscall for file copy, on top of the MUCH more complicated
implementation and security checks.)
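The stat/statfs check described above might look like the sketch below. The
helper names same_filesystem and filesystem_type are made up for
illustration, and statfs() with its f_type field is used as found on Linux.
A matching st_dev only says that both paths are on the same mounted
filesystem; how cp would then ask a CIFS or NFS server to do the copy is
left open here.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/vfs.h>

/* Returns 1 if src_path and dst_dir are on the same mounted filesystem,
 * 0 if they are not, and -1 if either stat() call fails.
 * The destination directory is used because the new file may not exist yet. */
int same_filesystem(const char *src_path, const char *dst_dir)
{
    struct stat s, d;

    if (stat(src_path, &s) < 0 || stat(dst_dir, &d) < 0)
        return -1;
    return s.st_dev == d.st_dev;
}

/* Stores the filesystem type magic number of path in *type_out.
 * Returns 0 on success, -1 if statfs() fails. (Linux-specific.) */
int filesystem_type(const char *path, long *type_out)
{
    struct statfs sfs;

    if (statfs(path, &sfs) < 0)
        return -1;
    *type_out = (long)sfs.f_type;
    return 0;
}
```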
On Mon, 10 Nov 2003, Bradley Chapman wrote:
> Mr. Rossetti,
>
> It is horribly RTFM.
>
> man 2 sendfile is what you're after.
I'm afraid it's not horribly RTFM at all.
sendfile won't do what he needs in 2.6.x.
> It is too simple to implement in user mode.
That works for a plain byte-stream on a
local UNIX-style filesystem. (though it
likely isn't the fastest)
It doesn't work for Macintosh files.
It's too slow for CIFS over a modem.
It doesn't work for Windows security data.
It doesn't allow copy-on-write files.
It eats CPU time on compressed filesystems.
> The security context of the output depends
> on the user process. If it is a privileged
> process (ie, may change the context of the
> result) then the user process has to setup
> that context before the file is copied.
So open the file, change context, and then:
long copy_fd_to_file(int fd, const char *name, ...)
(if you can no longer read from the OPEN fd,
either we override that or we just don't care
about such mostly-fictional cases)
> There are also some issues with mandatory
> security controls. If it is copied in kernel
> mode, then the previous labels could be
> automatically carried over to the resulting
> file... But that may not be what you want
> (and frequently, it isn't).
If it matters:
// security as if a new file were created
#define CF_REPLACE_SECURITY 0x00000001
// if unable to replicate, up or down?
#define CF_ROUND_SECURITY_UP 0x00000002
#define CF_ROUND_SECURITY_DOWN 0x00000004
// fail if security can't be replicated
#define CF_SECURITY_EXACT 0x00000008
> Now back to the copy.. You don't have to
> use a read/write loop- mmap is faster.
It's slower. (this is Linux, not SunOS)
Use a 4 kB or 8 kB read/write loop.
> And this is the other reason for not doing
> it in Kernel mode. Buffer management of
> this type is much easier in user space
> since the copy procedure doesn't have to
> deal with memory limitations, cache flushes
> page faulting of processes unrelated to the
> copy, but is related to cache pressure.
Buffer management is very much a kernel thing.
>> Is it? Please explain the simple steps which
>> cp(1) should take in order to observe that it
>> is being asked to duplicate a file on a file
>> system such as CIFS (or NFSv4?) which allows
>> the client to issue a 'copy file' command
>> over the network without actually transferring
>> the data twice, and to invoke such a command.
>
> Ah. That is an optimization question, not a
> question of kernel/user mode.
Note that /bin/cp isn't always going to have
the necessary passwords and such. You're headed
down a path toward setuid /bin/cp.
> Since the error checking for source and
> destination both include doing a stat and
> statfs, the device information (and FS info)
> can both be retrieved.
>
> And mmap doesn't require data transfer "twice"
> (local copy).
Huh? Over the network from server to client
counts as once. Then /bin/cp gets the data.
Then it goes back over the network from the
client to the server. That's "twice". That's
horribly painful for a multi-gigabyte file
and a DSL or cable-modem connection, never
mind a dial-up connection.
> Since that copy only pagefaults (though
> read/write may be faster for some files
> - I thought that was true for small files
> that fit in cache, and large files faster
> via mmap and depends on the page size;
> and the tradeoff would be system dependant).
Keep the read/write loop small for speed.
> And since both source and destination may
> be remote you do get to decide based on
> source and destination devices: if they
> are the same, and one on a remote node,
> then BOTH will be on the remote, then you
> get to use the CIFS/NFS file copy. (check
> the doc on "stat/statfs" for additional info).
>
> I don't believe it works when source and
> destination are on DIFFERENT remote nodes,
> though.
>
> Strictly up to the implementation of cp/mv.
>
> Though you will loose portability of cp/mv.
> (Of course, you also loose it with a syscall
> for file copy too; as well as the MUCH more
> complicated implementation/security checks).
Doing that in cp/mv is just insane. For one,
it bypasses any local security control over
access to the filesystem. There's not even a
way to be sure you're dealing with the server
you think you're dealing with.
On Nov 10, 2003 20:05 -0500, Albert Cahalan wrote:
> > It is too simple to implement in user mode.
>
> That works for a plain byte-stream on a
> local UNIX-style filesystem. (though it
> likely isn't the fastest)
>
> It doesn't work for Macintosh files.
> It's too slow for CIFS over a modem.
> It doesn't work for Windows security data.
> It doesn't allow copy-on-write files.
> It eats CPU time on compressed filesystems.
Having a sys_copy() syscall would be incredibly useful for Lustre
(distributed Linux fs). We could start a copy from one storage node
to another (or more likely many to many for a file striped over many
storage nodes) at num_stripes * uni-directional bandwidth with no
impact to the client node. Instead, we have to copy files at best at a
single client's bidirectional network bandwidth.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
On Mon, 2003-11-10 at 22:50, Andreas Dilger wrote:
> Having a sys_copy() syscall would be incredibly useful for Lustre
> (distributed Linux fs). We could start a copy from one storage node
> to another (or more likely many to many for a file striped over many
> storage nodes) at num_stripes * uni-directional bandwidth with no
> impact to the client node. Instead, we have to copy files at best a
> single client's bi-directional network_bandwidth.
Plus a sys_copy() syscall could be used as a generic way for filesystems
to set up Copy-on-Write. Right now, you'd need to have userspace call
sys-reiser4 or something like that.
--
Daniel Gryniewicz <[email protected]>
On Mon, 10 Nov 2003 23:03:26 EST, Daniel Gryniewicz said:
> Plus a sys_copy() syscall could be used as a generic way for filesystems
> to set up Copy-on-Write. Right now, you'd need to have userspace call
> sys-reiser4 or something like that.
This is fast turning into a creeping horror of aggregation. I defy anybody
to create an API to cover all the options mentioned so far and *not* have it
look like the process_clone horror we so roundly derided a few weeks ago.
On Nov 10, 2003 23:14 -0500, [email protected] wrote:
> On Mon, 10 Nov 2003 23:03:26 EST, Daniel Gryniewicz said:
> > Plus a sys_copy() syscall could be used as a generic way for filesystems
> > to set up Copy-on-Write. Right now, you'd need to have userspace call
> > sys-reiser4 or something like that.
>
> This is fast turning into a creeping horror of aggregation. I defy anybody
> to create an API to cover all the options mentioned so far and *not* have it
> look like the process_clone horror we so roundly derided a few weeks ago.
int sys_copy(int fd_src, int fd_dst)
It is up to the filesystem to decide if both files are on the same device
and can be copied with a copy RPC (or whatever). If the filesystem returns
-EOPNOTSUPP then the VFS goes into a simple readpages/writepages loop to do
the copy instead, maybe also copying ACLs or other things the VFS understands.
All of the "extra functionality" is being handled in the filesystem itself
and not the VFS or the API. Copy-on-write is an fs-internal issue depending
on whether fs supports it, how it was mounted, etc. Remote copy is also an
fs-internal issue depending on whether inodes are in same filesystem, support,
etc. You might get into fun things like doing zero-copy.
Telling the filesystem we are doing a copy vs. a bunch of reads mixed
with a bunch of writes is just semantically something that the filesystem
should know about.
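The fallback logic described above can be modelled in user space as
follows. Everything here is a sketch: the hook type fs_copy_hook, the stub
hook_unsupported and the function copy_with_fallback are hypothetical
names, and a plain read/write loop stands in for the readpages/writepages
loop the VFS would use.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-filesystem hook: 0 on success, negative errno on failure */
typedef int (*fs_copy_hook)(int fd_src, int fd_dst);

/* Stub for a filesystem that has no optimized copy */
static int hook_unsupported(int fd_src, int fd_dst)
{
    (void)fd_src;
    (void)fd_dst;
    return -EOPNOTSUPP;
}

/* Generic fallback: returns 0 or a negative errno.
 * EINTR handling is omitted to keep the sketch short. */
static int generic_copy(int fd_src, int fd_dst)
{
    char buf[4096];
    ssize_t n;

    while ((n = read(fd_src, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(fd_dst, buf + off, (size_t)(n - off));
            if (w < 0)
                return -errno;
            off += w;
        }
    }
    return n < 0 ? -errno : 0;
}

/* Try the filesystem's own copy first; fall back only on -EOPNOTSUPP */
int copy_with_fallback(fs_copy_hook hook, int fd_src, int fd_dst)
{
    if (hook != NULL) {
        int ret = hook(fd_src, fd_dst);
        if (ret != -EOPNOTSUPP)
            return ret;             /* success, or a real error */
    }
    return generic_copy(fd_src, fd_dst);
}
```

The caller sees one entry point either way, which is the point being made:
the choice of mechanism stays inside the filesystem.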
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
On Mon, Nov 10, 2003 at 08:50:12PM -0700, Andreas Dilger wrote:
> On Nov 10, 2003 20:05 -0500, Albert Cahalan wrote:
> > > It is too simple to implement in user mode.
> >
> > That works for a plain byte-stream on a
> > local UNIX-style filesystem. (though it
> > likely isn't the fastest)
Would it be something similar to sendfile()?
- Gábor (larta'H)
Andreas Dilger wrote:
> > This is fast turning into a creeping horror of aggregation. I defy anybody
> > to create an API to cover all the options mentioned so far and *not* have it
> > look like the process_clone horror we so roundly derided a few weeks ago.
>
> int sys_copy(int fd_src, int fd_dst)
Doesn't work. You have to set the security attributes while you open
fd_dst.
Florian Weimer wrote:
> Andreas Dilger wrote:
>
>
>>>This is fast turning into a creeping horror of aggregation. I defy anybody
>>>to create an API to cover all the options mentioned so far and *not* have it
>>>look like the process_clone horror we so roundly derided a few weeks ago.
>>
>> int sys_copy(int fd_src, int fd_dst)
>
>
> Doesn't work. You have to set the security attributes while you open
> fd_dst.
int new_fd = sys_copy(src_fd); /* cloned copy, out of any fs */
fchmod(new_fd, XXX_WHAT_EVER); /* do the job. */
...
flink(new_fd, "/some/path/some/file/name"); /* commit to fs */
close(new_fd); /* bye-bye */
I believe this could be more useful, and not only in naive attempts to
replace cp(1) with the kernel ;-)
--
Ihar 'Philips' Filipau / with best regards from Saarbruecken.
On Tue, Nov 11, 2003 at 09:58:06AM +0100, Florian Weimer wrote:
> Andreas Dilger wrote:
>
> > > This is fast turning into a creeping horror of aggregation. I defy anybody
> > > to create an API to cover all the options mentioned so far and *not* have it
> > > look like the process_clone horror we so roundly derided a few weeks ago.
> >
> > int sys_copy(int fd_src, int fd_dst)
That sounds a lot like a sendfile with a file as the
destination. Useful but still happening on the local system.
My understanding was that this was to be sent to a remote
system where the file descriptors might not be open.
>
> Doesn't work. You have to set the security attributes while you open
> fd_dst.
That would have been done with open().
To operate on paths so it could be sent to a fileserver it
would need the same arguments as open() with the addition of
the newpath.
int sys_copy(const char *oldpath, const char *newpath,
             int flags, mode_t mode);
O_TRUNC              replace an existing file.
O_EXCL               prevent replacing an existing file.
O_APPEND             concatenate (useful feature creep).
O_NDELAY/O_NONBLOCK  return and ignore ENOSPC condition, ick!
O_SYNC               if O_SYNC supported for open
O_NOFOLLOW           don't follow symlink (no need for a lcopy())
EXDEV (see link(2)) seems a better error code for cases
where the source and destination are on different servers.
Otherwise the error codes would conform to open(2).
I've long thought a file copy syscall was missing from unix
but until you start networking it isn't an issue.
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]
Remember Cernan and Schmitt
On Tue, Nov 11, 2003 at 10:51:10AM +0100, Ihar 'Philips' Filipau wrote:
> Florian Weimer wrote:
> >Andreas Dilger wrote:
> >
> >
> >>>This is fast turning into a creeping horror of aggregation. I defy
> >>>anybody
> >>>to create an API to cover all the options mentioned so far and *not*
> >>>have it
> >>>look like the process_clone horror we so roundly derided a few weeks ago.
> >>
> >> int sys_copy(int fd_src, int fd_dst)
> >
> >
> >Doesn't work. You have to set the security attributes while you open
> >fd_dst.
>
> int new_fd = sys_copy( int src_fd ); /* cloned copy, out of any fs */
> fchmod( new_fd, XXX_WHAT_EVER ); /* do the job. */
> ...
> flink(new_fd, "/some/path/some/file/name"); /* commit to fs */
A system call that associates an open file descriptor with a
new path (flink here) has already been rejected for solid
security reasons.
> close(new_fd); /* bye-bye */
>
> I beleive this can be more useful. Not only in naive tries to replace
> cp(1) with kernel ;-)
Eliminating the flink and using file descriptors you wind up
with something like:
struct stat statbuf;
in_fd = open(oldpath, O_RDONLY);
fstat(in_fd, &statbuf);
out_fd = open(newpath, O_WRONLY|flags, statbuf.st_mode);
sendfile(out_fd, in_fd, NULL, statbuf.st_size);
close(out_fd);
close(in_fd);
So if you can do it with open file descriptors why do you
need a new system call?
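With error checking added, the sequence above might look like the sketch
below. The function name sendfile_copy is made up for illustration.
Whether sendfile() accepts a regular file as the destination depends on
the kernel version (see the 2.6.x remark earlier in the thread), so a
caller should be prepared for it to fail with EINVAL or ENOSYS and fall
back to a read/write loop.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy oldpath to newpath using sendfile(2). `flags` is OR-ed into the
 * destination's open flags (for example O_CREAT | O_TRUNC).
 * Returns 0 on success, -1 on error. */
int sendfile_copy(const char *oldpath, const char *newpath, int flags)
{
    struct stat st;
    off_t remaining;
    int in_fd, out_fd = -1, ret = -1;

    in_fd = open(oldpath, O_RDONLY);
    if (in_fd < 0)
        return -1;
    if (fstat(in_fd, &st) < 0)
        goto out;
    out_fd = open(newpath, O_WRONLY | flags, st.st_mode & 07777);
    if (out_fd < 0)
        goto out;

    remaining = st.st_size;
    while (remaining > 0) {
        /* NULL offset: read from, and advance, in_fd's own file position */
        ssize_t n = sendfile(out_fd, in_fd, NULL, (size_t)remaining);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto out;
        }
        if (n == 0)
            break;                  /* source is shorter than fstat() said */
        remaining -= n;
    }
    if (remaining == 0)
        ret = 0;
out:
    if (out_fd >= 0 && close(out_fd) < 0)
        ret = -1;
    close(in_fd);
    return ret;
}
```

Only the permission bits are carried over here; ownership, timestamps,
ACLs and extended attributes would still need separate calls.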
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]
Remember Cernan and Schmitt
On Mon, 10 Nov 2003, Jesse Pollard wrote:
> On Monday 10 November 2003 06:08, Ihar 'Philips' Filipau wrote:
> > sendfile(2) - ?
> I don't think that is what he was referring to.. The sample
> code is strictly user mode file->file copying.
> > Davide Rossetti wrote:
> > > it may be orribly RTFM... but writing a simple framework I realized
> > > there is no libc/POSIX/whoknows
> > > copy(const char* dest_file_name, const char* src_file_name)
> > >
> > > What is the technical reason???
>
> It isn't an application for the kernel.
Maybe I was misunderstood... I'm asking why the libc/ISO/ANSI/POSIX
engineers did not add to the spec a user-mode API to copy one file to another.
If there were such a standard _user_ API, we could talk about user/kernel
implementation issues... but my question is somehow more "primitive" :)
> > > I understand that there may be little space for kernel side
> > > optimizations in this area but anyway I'm surprised I have to write
> > >
> > > < the bits to clone the metadata of src_file_name on opening
> > > dest_file_name >
> > > const int BUFSIZE = 1<<12;
> > > char buffer[BUFSIZE];
> > > int nrb;
> > > while((nrb = read(infd, buffer, BUFSIZE)) > 0) {
> > > ret = write(outfd, buffer, nrb);
> > > if(ret != nrb) {...}
> > > }
> > >
> > > instead of something similar to:
> > > sys_fscopy(...)
>
> It is too simple to implement in user mode.
>
> There are some other issues too:
>
> The security context of the output depends on the user process.
> If it is a privileged process (ie, may change the context of the
> result) then the user process has to setup that context before
> the file is copied.
>
> There are also some issues with mandatory security controls. If it
> is copied in kernel mode, then the previous labels could be automatically
> carried over to the resulting file... But that may not be what you
> want (and frequently, it isn't).
>
> Now back to the copy.. You don't have to use a read/write loop- mmap
> is faster. And this is the other reason for not doing it in Kernel mode.
> Buffer management of this type is much easier in user space since the
> copy procedure doesn't have to deal with memory limitations, cache flushes
> page faulting of processes unrelated to the copy, but is related to cache
> pressure.
OK... so I have to code a framework routine which auto-benchmarks (at
either runtime or configure time) and uses at least 2 implementations, one
using read/write and another using mmap(), as I know for sure that they
perform differently on different Unices... ah, and the day we add
sys_sendfile(fd,fd) (if it is not there yet) I have to add yet another
implementation... and doing file copies of gigabyte-sized files with
mmap() on 32-bit archs isn't so trivial; you have to do windowing, I
guess...
Seems scary, at least ;)
<joke>
It seems similar to saying that we do not need a rename() POSIX/XOpen/etc.
API, as we can do:
    rename(from, to) {
        link(from, to);  // make hardlink
        unlink(from);    // remove original
    }
</joke>
regards
--
______/ Rossetti Davide INFN - Roma I - APE group \______________
pho +390649914507/412 web: http://apegate.roma1.infn.it/~rossetti
fax +390649914423 email: [email protected]
"davide.rossetti" <[email protected]> writes:
> Maybe I was misunderstood... I'm asking why the libc/iso/ansi/posix
> engineer did not add the spec a user-mode API to do copy file to file ???
Because there was no prior art.
Andreas.
--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
On Tue, 11 Nov 2003, Andreas Schwab wrote:
> "davide.rossetti" <[email protected]> writes:
>
> > Maybe I was misunderstood... I'm asking why the libc/iso/ansi/posix
> > engineer did not add the spec a user-mode API to do copy file to file ???
>
> Because there was no prior art.
:) But the latest revisions of the specs are really recent!!!
Folks are talking about implementing all sorts of stuff (web servers,
parallel filesystems, ...) (partly) in kernel mode, and no one cares about
(maybe accelerated) fs copies???
--
______/ Rossetti Davide INFN - Roma I - APE group \______________
pho +390649914507/412 web: http://apegate.roma1.infn.it/~rossetti
fax +390649914423 email: [email protected]
jw schultz wrote:
> On Tue, Nov 11, 2003 at 10:51:10AM +0100, Ihar 'Philips' Filipau wrote:
>
>>Florian Weimer wrote:
>>
>>>Andreas Dilger wrote:
>>>
>>>
>>>
>>>>>This is fast turning into a creeping horror of aggregation. I defy
>>>>>anybody
>>>>>to create an API to cover all the options mentioned so far and *not*
>>>>>have it
>>>>>look like the process_clone horror we so roundly derided a few weeks ago.
>>>>
>>>> int sys_copy(int fd_src, int fd_dst)
>>>
>>>
>>>Doesn't work. You have to set the security attributes while you open
>>>fd_dst.
>>
>> int new_fd = sys_copy( int src_fd ); /* cloned copy, out of any fs */
>> fchmod( new_fd, XXX_WHAT_EVER ); /* do the job. */
>> ...
>> flink(new_fd, "/some/path/some/file/name"); /* commit to fs */
>
>
> The associate open file descriptor with a new path system
> call (flink here) has already been rejected for solid
> security reasons.
>
That was my point: without flink(), IMHO, it makes no sense.
Just try to imagine any application for sys_copy(char*, char*).
_I_ can imagine none.
"int new_fd = sys_copy(old_fd);" makes sense to me, but you need
its counterpart, "flink();", to commit it to the file system.
You really do not need a copy of a file just for the sake of copying a file.
That's what hard links are for.
My way, vim/emacs could do:
fd = open("originalfile", O_RDONLY);
new_fildes = copy(fd);
close(fd);
... [do the editing] ...
flink(new_fildes, "newfile"); /* if user decides to save this job */
close(new_fildes);
This makes sense, and it is the way we usually process
information. Mimicking cp is a really bad example.
I have re-read the thread. I don't see flink() as a security hole, but its
use should be managed in some way.
The original thread about flink() (everything is doable):
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=20030406190025%241ec6%40gated-at.bofh.it&rnum=50&prev=/&frame=on
And there was no real security issue given whatsoever.
Only design considerations ;-)
>
> So if you can do it with open file descriptors why do you
> need a new system call?
>
The point is that different filesystems can optimize this as they wish.
This would be a really nice thing to have in our networked age.
Sshing in just to copy a huge file is a little bit annoying ;-)
P.S. Actually, my mind keeps spinning the idea of cut()/paste(). So a file
descriptor without an associated file path can be useful.
Say:
-----------
fd_part_1 = open("some file", O_RDWR);
lseek(fd_part_1, 100, SEEK_SET);
fd_part_2 = cut( fd_part_1 ); /* XXX */
/* here eof(fd_part_1) == 1 && "some file" is truncated to 100 bytes. */
flink(fd_part_2, "second part"); /* create file
                                    with rest of "some file" */
-----------
fd_part_1 = open("some file", O_RDWR);
fd_part_2 = open("second part", O_RDONLY);
paste(fd_part_1, fd_part_2); /* XXX */
/* fd_part_2 is auto close()d
   and "second part" file unlinked */
close(fd_part_1);
/* here "some file" will be the same as in the beginning */
-----------
This should help video/audio editing a lot.
P.P.S. Not relevant, but anyway: the SUSv3 docs for fattach()
http://www.opengroup.org/onlinepubs/007904975/functions/fattach.html
--
Ihar 'Philips' Filipau / with best regards from Saarbruecken.
On Mon, Nov 10, 2003 at 08:05:11PM -0500, Albert Cahalan wrote:
> So open the file, change context, and then:
>
> long copy_fd_to_file(int fd, const char *name, ...)
>
> (if you can no longer read from the OPEN fd,
> either we override that or we just don't care
> about such mostly-fictional cases)
Actually, I think we should have a:
long copy_fd_to_fd (int src, int dst, int len)
type of systemcall.
It should do something like:
    while ((nbytes = read (src, buf, BUFSIZE)) > 0) {
        if (write (dst, buf, nbytes) < 0)
            return totbytes;
        totbytes += nbytes;
    }
but it allows kernel-space to optimize this whenever possible. The kernel
then becomes responsible for detecting and handling the optimizable
cases.
The kernel code then becomes something like
    if (islocalfile (src) && issocket (dst))
        /* Call the old sendfile */
        return sendfile (....);
    if (isCIFS (src) && isCIFS (dst))
        /* Tell remote host to copy the file. */
        return CIFS_copy_file (....);
    ...
and then the default implementation. This is nice and expandable, and
provides a default for the case that cannot be optimized.
And if you don't want the extra code, we could enclose the different
optimizations with ifdefs.
But alas, last time Linus didn't agree with me and decided we should
do something like "sendfile", which is IMHO just a special case of
this one.
If we implement this in kernel (at first just the copy_fd_fd and the
default implementation), then we can get "cp" to use this, and then
suddenly whenever we upgrade the kernel, cp can use the newly
optimized copying mechanism. (e.g. whenever we manage to specify a
socket as the destination, cp would suddenly start to use
"sendfile"!!)
(It might be better to include a "buffer" argument in the interface,
freeing the implementation of allocating a buffer when an optimization
is not possible).
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
**** "Linux is like a wigwam - no windows, no gates, apache inside!" ****
On Tue, 2003-11-11 at 08:53 -0500, Jakub Jelinek wrote:
> But e.g. the CIFS copy can be done as sendfile hook.
Can it? I thought it took filenames.
--
dwmw2
On Tue, Nov 11, 2003 at 02:38:59PM +0100, Rogier Wolff wrote:
> On Mon, Nov 10, 2003 at 08:05:11PM -0500, Albert Cahalan wrote:
> > So open the file, change context, and then:
> >
> > long copy_fd_to_file(int fd, const char *name, ...)
> >
> > (if you can no longer read from the OPEN fd,
> > either we override that or we just don't care
> > about such mostly-fictional cases)
>
>
> Actually, I think we should have a:
>
> long copy_fd_to_fd (int src, int dst, int len)
>
> type of systemcall.
We have one, sendfile(2).
> It should do something like:
>
> while ((nbytes = read (src, buf, BUFSIZE)) >= 0) {
> if (write (dst, buf, nbytes) < 0)
> return totbytes;
> totbytes += nbytes;
> }
>
> but it allows kernel-space to optimize this whenever possible. Kernel
> then becomes responsible for detecting and handling the optimizable
> cases.
>
> The kernel then becomes something
>
> if (islocalfile (src) && issocket (dst))
> /* Call the old sendfile */
> return sendfile (....);
>
> if (isCIFS (src), isCIFS(dst))
> /* Tell remote host to copy the file. */
> return CIFS_copy_file (....);
>
> ...
Can you explain why this cannot be in sys_sendfile?
It doesn't make much sense to provide any default in the kernel,
that's something the userland can handle equally well.
But e.g. the CIFS copy can be done as sendfile hook.
Jakub
Rogier Wolff wrote:
> On Mon, Nov 10, 2003 at 08:05:11PM -0500, Albert Cahalan wrote:
>
> long copy_fd_to_fd (int src, int dst, int len)
>
> The kernel then becomes something
>
> if (islocalfile (src) && issocket (dst))
> /* Call the old sendfile */
> return sendfile (....);
>
> if (isCIFS (src), isCIFS(dst))
> /* Tell remote host to copy the file. */
> return CIFS_copy_file (....);
>
B.S.
>
> But alas, last time Linus didn't agree with me and decided we should
> do something like "sendfile", which is IMHO just a special case of
> this one.
>
I will reply on behalf of Linus: "Send patch!"
I believe you are not a developer, so you cannot even estimate what
you are proposing.
This kind of patch will never be accepted.
Just try to imagine: 20 file systems, so 20*20 == 400 ifs?
So I believe you will get more positive responses if you start
improving the VFS, e.g. by adding generic routines for optimized moves of
files from one file system to another, with an API that extrapolates
nicely to networked file systems.
Since right now there is no way to pass a file from one fs to another,
this thread is basically already, well, over ;-)
>
> If we implement this in kernel (at first just the copy_fd_fd and the
> default implementation), then we can get "cp" to use this, and then
> suddenly whenever we upgrade the kernel, cp can use the newly
> optimized copying mechanism. (e.g. whenever we manage to specify a
> socket as the destination, cp would suddenly start to use
> "sendfile"!!)
>
Silly. cp is the application I use least frequently.
And cvs, I believe, already uses sendfile().
So all your /arguments/ go directly into /dev/null, since if a file is
not in cvs, you know, it just doesn't exist ;-)))
>
> Roger.
>
--
Ihar 'Philips' Filipau / with best regards from Saarbruecken.
On Tue, 2003-11-11 at 08:38, Rogier Wolff wrote:
> On Mon, Nov 10, 2003 at 08:05:11PM -0500, Albert Cahalan wrote:
> > So open the file, change context, and then:
> >
> > long copy_fd_to_file(int fd, const char *name, ...)
> >
> > (if you can no longer read from the OPEN fd,
> > either we override that or we just don't care
> > about such mostly-fictional cases)
>
>
> Actually, I think we should have a:
>
> long copy_fd_to_fd (int src, int dst, int len)
>
> type of systemcall.
I don't think that works. To have a destination
file descriptor, you have to already have created
the destination file. Having done so, it may now
be impossible to transfer the security data. This
is especially the case with network filesystems.
I can well imagine providing a file descriptor for
the destination directory and making the filename
optional. This helps pin things down if there's
worry about an attacker moving directories, and it
neatly allows for fully anonymous temporary files
if a file descriptor is returned.
On Tue, Nov 11, 2003 at 03:11:26PM +0100, Ihar 'Philips' Filipau wrote:
> Rogier Wolff wrote:
> >But alas, last time Linus didn't agree with me and decided we should
> >do something like "sendfile", which is IMHO just a special case of
> >this one.
> >
>
> I will reply on behalf of Linus: "Send patch!"
>
> I believe you are not a developer, so you cannot even estimate what
> you are proposing.
Wrong.
> This kind of patch will never be accepted.
Yes. As I said: Linus doesn't agree with me. I don't sleep less from
knowing that. Feel free to disagree with me as well.
> Just try to imagine: 20 file systems, so 20*20 == 400 ifs?
Right! And: Wrong!
The idea is that the default will make sure that the kernel handles
the call. It's just as efficient as the userspace implementation.
But currently we have decided that the extra efficiency of
"local file -> socket"
matters enough to us that we want to optimize that case. Fine. So now
we have "sendfile". This is currently implemented as a special
systemcall. I.e. one of those 400 cases you mentioned.
But I expect that only a few cases will be important enough
that we care to optimize their implementation.
If we end up with 400 ifs, because we CAN optimize each and every case
by itself, and we find that important enough to actually implement,
then of course the "string of ifs" is a nice candidate to optimize
again.
> So I believe you will get more positive responses if you
> start improving the VFS, e.g. by adding generic routines for optimized moves of
> files from one file system to another, with an API which allows it to
> extrapolate nicely to networked file systems.
Once my proposed "copy_fd_to_fd" is in place, the road is open towards
just leaving the current special case that detects: "src uses pagecache
dst is a socket" and then calls the current sendfile implementation.
> Silly. cp is the application I use least frequently.
Yeah. So? You reject a general idea just because you don't use the
application that I used as an example in my proposal.
> And cvs, I believe, already uses sendfile().
Fine. For compatibility we'll leave "sendfile" in place. But if somehow
someone builds a filesystem which cannot use the pagecache, then
"sendfile" will fail. Or if somehow we manage to get the socket hooked
up to something else (*). Either CVS needs to handle that case
internally, or it will fail. In the first case, that causes extra code
in lots of applications that want to continue to work, in the latter
case, it's bad.
Roger.
(*) Suppose I manage to stop and restart an application. The "restart"
program might need to "sit between" the original application and its
filedescriptors. So now, what used to be a socket suddenly becomes a
pipe. It'd be nice if things would continue to work. Everything is a
file, remember?
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
**** "Linux is like a wigwam - no windows, no gates, apache inside!" ****
Rogier Wolff wrote:
>
> Fine. For compatibility we'll leave "sendfile" in place. But if somehow
> someone builds a filesystem which cannot use the pagecache, then
> "sendfile" will fail. Or if somehow we manage to get the socket hooked
> up to something else (*). Either CVS needs to handle that case
> internally, or it will fail. In the first case, that causes extra code
> in lots of applications that want to continue to work, in the latter
> case, it's bad.
>
I believe that if you really want something like this, you need
to go to e.g. the nfs/coda/smbfs developers and talk with them about how it can
be implemented in these situations.
Implement it with ioctl() first, to really see whether it makes sense or just
complicates things enormously. Actually, a given networked file system
could simply NOT be capable of this kind of operation at all.
Insisting on a new syscall is silly: a syscall is an interface; it has
nothing to do with functionality. Occasionally syscalls are used to
access functionality ;-) So start from the functionality first. A syscall (or
whatever interface fits better) can be implemented in 15 minutes any
time after the functionality is in place.
--
Ihar 'Philips' Filipau / with best regards from Saarbruecken.
On Tue, Nov 11, 2003 at 02:27:42AM -0800, jw schultz wrote:
> On Tue, Nov 11, 2003 at 09:58:06AM +0100, Florian Weimer wrote:
> > Andreas Dilger wrote:
> >
> > > > This is fast turning into a creeping horror of aggregation. I defy anybody
> > > > to create an API to cover all the options mentioned so far and *not* have it
> > > > look like the process_clone horror we so roundly derided a few weeks ago.
> > >
> > > int sys_copy(int fd_src, int fd_dst)
>
> That sounds a lot like a sendfile with a file as the
> destination. Useful but still happening on the local system.
> My understanding was that this was to be sent to a remote
> system where the file descriptors might not be open.
It probably should be sendfile, where the destination fd is a local file
instead of a socket. We really do not want to pass pathnames down into
the filesystem layer. As far as I know, no existing VFS operation does
that and it probably isn't a good idea to start doing it now.
Somehow the filesystem that 'hosts' the src_fd object should get a
chance to see/intercept the sendfile syscall, and it can then decide
based on the dst_fd object what to do. If the destination happens to be
in the same filesystem it could possibly use a special internal copyfile
rpc call or CoW implementation.
The userspace/libc code could provide a copyfile(char* src, char* dst,
int flags, int mode) wrapper, which can also handle falling back to a
simple read/write loop when sendfile fails.
So we clearly don't need a new system call, sendfile would do fine and
interestingly the manual page I'm reading now mentions that the source
has to be a mmap-able object, but lists no such restrictions on the
destination fd. Maybe sendfile already works and we just need to give the
filesystems a chance to override it.
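
A userspace wrapper along the lines described above might look like the following. This is a minimal sketch under stated assumptions, not an existing libc function: the name copyfile_sketch, the meaning of `flags` (extra open(2) flags for the destination) and the 1 MiB request size are all hypothetical. It copies file data only, not ownership, ACLs or extended attributes, and it falls back to a read/write loop when sendfile() reports EINVAL or ENOSYS.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback: read/write loop that retries on EINTR and finishes short writes. */
static int copy_rw(int in, int out)
{
    char buf[8192];
    ssize_t n;

    while ((n = read(in, buf, sizeof buf)) != 0) {
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (ssize_t off = 0; off < n;) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
    }
    return 0;
}

/* Call sendfile() until EOF; on EINVAL/ENOSYS continue with copy_rw(). */
static int copy_fd(int in, int out)
{
    for (;;) {
        /* A NULL offset makes sendfile() use and advance in's file offset,
         * so the fallback resumes exactly where sendfile() stopped. */
        ssize_t n = sendfile(out, in, NULL, 1 << 20);
        if (n == 0)
            return 0;                   /* end of file */
        if (n > 0 || errno == EINTR)
            continue;                   /* short count or interrupted: go on */
        if (errno == EINVAL || errno == ENOSYS)
            return copy_rw(in, out);
        return -1;
    }
}

/* Hypothetical wrapper; `flags` are extra open(2) flags for the destination. */
int copyfile_sketch(const char *src, const char *dst, int flags, mode_t mode)
{
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;

    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC | flags, mode);
    if (out < 0) {
        int saved = errno;
        close(in);
        errno = saved;
        return -1;
    }

    int rc = copy_fd(in, out);
    int saved = errno;
    close(in);
    if (close(out) < 0 && rc == 0) {    /* a failed close can hide a write error */
        rc = -1;
        saved = errno;
    }
    errno = saved;
    return rc;
}
```

For example, copyfile_sketch("a.txt", "b.txt", O_EXCL, 0644) refuses to overwrite an existing b.txt and returns -1 with errno set to EEXIST.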
Jan
On Tue, Nov 11, 2003 at 04:02:56PM +0100, Rogier Wolff wrote:
> Fine. For compatibilty we'll leave "sendfile" in place. But if somehow
> someone builds a filesystem which cannot use the pagecache, then
> "sendfile" will fail. Or if somehow we manage to get the socket hooked
...
> (*) Suppose I manage to stop and restart an application. The "restart"
> program might need to "sit between" the original application and its
> filedescriptors. So now, what used to be a socket suddenly becomes a
> pipe. It'd be nice if things would continue to work. Everything is a
> file remember?
man sendfile(2)
NOTES
...
Applications may wish to fall back to read/write in the case
where sendfile() fails with EINVAL or ENOSYS.
So we get something in a userspace library (libc?) that does
copyfile(whatever, whereever) and uses a few kernel primitives like
open/close/sendfile and the appropriate fallback code to a read/write
loop whenever the sendfile doesn't work.
It works now, and it will work better when sendfile becomes more
versatile, and the sky is the limit once the underlying filesystem can
provide its own optimized implementation, for instance when both fds
refer to objects within the same (remote) filesystem.
Similarly, we might at some point be able to optimize sendfile between
two sockets by pushing the connection off to a router somewhere in the
network completely bypassing the local NIC.
Jan
On Tue, 11 Nov 2003 15:22:09 EST, Jan Harkes <[email protected]> said:
> Similarly, we might at some point be able to optimize sendfile between
> two sockets by pushing the connection off to a router somewhere in the
> network completely bypassing the local NIC.
Security can of worms there.. :)
On Mon, Nov 10, 2003 at 09:22:22AM -0500, Daniel Jacobowitz wrote:
> On Mon, Nov 10, 2003 at 07:29:15AM -0600, Jesse Pollard wrote:
> > Now back to the copy.. You don't have to use a read/write loop- mmap
> > is faster. And this is the other reason for not doing it in Kernel mode.
>
> Actually, last I checked, read/write was actually faster. Linus
> explained why a month or two ago.
It would also not break on large files...
--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:
On Monday 10 November 2003 19:05, Albert Cahalan wrote:
> > It is too simple to implement in user mode.
>
> That works for a plain byte-stream on a
> local UNIX-style filesystem. (though it
> likely isn't the fastest)
Yes - this was the local copy
> It doesn't work for Macintosh files.
> It's too slow for CIFS over a modem.
> It doesn't work for Windows security data.
> It doesn't allow copy-on-write files.
> It eats CPU time on compressed filesystems.
>
> > The security context of the output depends
> > on the user process. If it is a privileged
> > process (ie, may change the context of the
> > result) then the user process has to setup
> > that context before the file is copied.
>
> So open the file, change context, and then:
>
> long copy_fd_to_file(int fd, const char *name, ...)
Easy to do in user mode.
>
> (if you can no longer read from the OPEN fd,
> either we override that or we just don't care
> about such mostly-fictional cases)
correct - If you can't read, fail.
> > There are also some issues with mandatory
> > security controls. If it is copied in kernel
> > mode, then the previous labels could be
> > automatically carried over to the resulting
> > file... But that may not be what you want
> > (and frequently, it isn't).
>
> If it matters:
>
> // security as if a new file were created
> #define CF_REPLACE_SECURITY 0x00000001
> // if unable to replicate, up or down?
> #define CF_ROUND_SECURITY_UP 0x00000002
> #define CF_ROUND_SECURITY_DOWN 0x00000004
> // fail if security can't be replicated
> #define CF_SECURITY_EXACT 0x00000008
>
> > Now back to the copy.. You don't have to
> > use a read/write loop- mmap is faster.
>
> It's slower. (this is Linux, not SunOS)
> Use a 4 kB or 8 kB read/write loop.
yup local.
> > And this is the other reason for not doing
> > it in Kernel mode. Buffer management of
> > this type is much easier in user space
> > since the copy procedure doesn't have to
> > deal with memory limitations, cache flushes
> > page faulting of processes unrelated to the
> > copy, but is related to cache pressure.
>
> Buffer management is very much a kernel thing.
Yes it is, but do you want to push process-dependent
buffer management into the page management? It's just
easier to do this in user mode, and allow the kernel
to handle global page management.
> >> Is it? Please explain the simple steps which
> >> cp(1) should take in order to observe that it
> >> is being asked to duplicate a file on a file
> >> system such as CIFS (or NFSv4?) which allows
> >> the client to issue a 'copy file' command
> >> over the network without actually transferring
> >> the data twice, and to invoke such a command.
> >
> > Ah. That is an optimization question, not a
> > question of kernel/user mode.
>
> Note that /bin/cp isn't always going to have
> the necessary passwords and such. You're headed
> down a path toward setuid /bin/cp.
If cp doesn't have access to the proper security credentials,
then the file should not be copied.
> > Since the error checking for source and
> > destination both include doing a stat and
> > statfs, the device information (and FS info)
> > can both be retrieved.
> >
> > And mmap doesn't require data transfer "twice"
> > (local copy).
>
> Huh? Over the network from server to client
> counts as once. Then /bin/cp gets the data.
> Then it goes back over the network from the
> client to the server. That's "twice". That's
> horribly painful for a multi-gigabyte file
> and a DSL or cable-modem connection, never
> mind a dial-up connection.
True for all networked file systems. I had meant
to say (local filesystem copy).
> > Since that copy only pagefaults (though
> > read/write may be faster for some files
> > - I thought that was true for small files
> > that fit in cache, and large files faster
> > via mmap and depends on the page size;
> > and the tradeoff would be system dependant).
>
> Keep the read/write loop small for speed.
yes.
> > And since both source and destination may
> > be remote you do get to decide based on
> > source and destination devices: if they
> > are the same, and one on a remote node,
> > then BOTH will be on the remote, then you
> > get to use the CIFS/NFS file copy. (check
> > the doc on "stat/statfs" for additional info).
> >
> > I don't believe it works when source and
> > destination are on DIFFERENT remote nodes,
> > though.
> >
> > Strictly up to the implementation of cp/mv.
> >
> > Though you will lose portability of cp/mv.
> > (Of course, you also lose it with a syscall
> > for file copy too; as well as the MUCH more
> > complicated implementation/security checks).
>
> Doing that in cp/mv is just insane. For one,
> it bypasses any local security control over
> access to the filesystem. There's not even a
> way to be sure you're dealing with the server
> you think you're dealing with.
It shouldn't matter - first the source file must be opened
for read AND the destination file opened for write.
This should give the proper local security evaluation and
context for the copy. Once this has been approved,
the remote copy request can be made (provided they are
on the same "networked" device). Just making
the request still doesn't mean that it will succeed -
after all, the final security decisions are made by
the remote server implementing the file copy.
Though if the copy is valid locally, then the use of
the filesystem supported copy should work. It is an
equivalent operation, it just all takes place on the server.
Identity of the server is irrelevant, as long as it is
the same server (or farm) for both source and destination.
If the remote file copy is defined, then it should work
even when the actual source and destination are different
physical machines - the remote filesystem CLAIMS it will
work (identical is determined from the "device" mounted,
one mount, one device as far as network filesystems go).
And if they are not identical then you fall back to using
a local copy.
All bets are off if the local pathnames are required by
the remote server. That is silly. How would a networked
client even know what the pathname would be? The parameters
should be the two file handles passed to the remote filesystem.
Personally, I don't think any changes should be made.
It's just that this level of transfer is what the original
poster was talking about. It just shouldn't be done in
kernel mode.
On Tuesday 11 November 2003 02:58, Florian Weimer wrote:
> Andreas Dilger wrote:
> > > This is fast turning into a creeping horror of aggregation. I defy
> > > anybody to create an API to cover all the options mentioned so far and
> > > *not* have it look like the process_clone horror we so roundly derided
> > > a few weeks ago.
> >
> > int sys_copy(int fd_src, int fd_dst)
>
> Doesn't work. You have to set the security attributes while you open
> fd_dst.
Why? The open for fd_src should have the security attributes (both locally
and in the file server if networked). Opening fd_dst should SET the security
attributes desired (again, locally and in the target fileserver).
Then the sys_copy(fd_src,fd_dst) can take place in the FS code. And of course
it is necessary that fd_src and fd_dst reside on the same device. If they
don't, then the sys_copy should fail.
If the sys_copy is on a remote filesystem, then fd_src and fd_dst must be
replaced by the remote file handles and these passed to the remote server.
Any additional checks may then be made from the evaluation of the file handles
locally on the file server, using the security credentials belonging to the
file handles.
Followup to: <[email protected]>
By author: Jakub Jelinek <[email protected]>
In newsgroup: linux.dev.kernel
> >
> > Actually, I think we should have a:
> >
> > long copy_fd_to_fd (int src, int dst, int len)
> >
> > type of systemcall.
>
> We have one, sendfile(2).
>
It would be very nice if we could (a) expand the uses of sendfile(2),
and (b) have the libc do the fallback to read/write/mmap as needed.
-hpa
--
<[email protected]> at work, <[email protected]> in private!
If you send me mail in HTML format I will assume it's spam.
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64
On Thu, Nov 13, 2003 at 12:22:14PM -0800, H. Peter Anvin wrote:
> Followup to: <[email protected]>
> By author: Jakub Jelinek <[email protected]>
> In newsgroup: linux.dev.kernel
> > >
> > > Actually, I think we should have a:
> > >
> > > long copy_fd_to_fd (int src, int dst, int len)
> > >
> > > type of systemcall.
> >
> > We have one, sendfile(2).
> >
>
> It would be very nice if we could (a) expand the uses of sendfile(2),
> and (b) have the libc do the fallback to read/write/mmap as needed.
I actually hacked cp to do this for a while, and it improved cp some point percent
on normal machines.
See ftp://ftp.suse.com/pub/people/andrea/cp-sendfile/
The main downside, and the reason it wasn't applied IIRC, is that sendfile
cannot be interrupted: for a huge file it would take a
while before ^C has any effect. The kernel isn't interrupting the
syscall. This is no different from a huge read or write syscall (but
read/write are never huge, or the buffer would need to be huge too, which is not
the case for sendfile, since it works zerocopy), so in theory we could
work around it by entering/exiting the kernel multiple times just to allow
the signal to be handled, as in the read/write case.
On Fri, Nov 14, 2003 at 12:39:15AM +0100, Andrea Arcangeli wrote:
> On Thu, Nov 13, 2003 at 12:22:14PM -0800, H. Peter Anvin wrote:
> > Followup to: <[email protected]>
> > By author: Jakub Jelinek <[email protected]>
> > In newsgroup: linux.dev.kernel
> > > >
> > > > Actually, I think we should have a:
> > > >
> > > > long copy_fd_to_fd (int src, int dst, int len)
> > > >
> > > > type of systemcall.
> > >
> > > We have one, sendfile(2).
> > >
> >
> > It would be very nice if we could (a) expand the uses of sendfile(2),
> > and (b) have the libc do the fallback to read/write/mmap as needed.
>
> I actually hacked cp for a while and it improved cp some point percent
> on normal machines.
>
> See ftp://ftp.suse.com/pub/people/andrea/cp-sendfile/
>
> the main downside and the reason it wasn't applied IIRC is the lack of
> interruption of sendfile, basically for a huge file it would take a
> while before ^C has any effect. The kernel isn't interrupting the
> syscall. This is no different from a huge read or write syscall (but
> read/write are never huge or the buffer would need to be huge too, not
> the case for sendfile that works zerocopy), so in theory we could
> work around it by entering/exiting kernel multiple times just to allow
> the signal to be handled like in the read/write case.
Until interrupt and restart handling (as has been discussed
here for other syscalls) is improved, there could be
a sanity check returning E2BIG or something if the size is
insane. I dislike the thought of sendfile sitting in D
state on a multi-gigabyte file.
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]
Remember Cernan and Schmitt
Andrea Arcangeli wrote:
>
> I actually hacked cp for a while and it improved cp some point percent
> on normal machines.
>
> See ftp://ftp.suse.com/pub/people/andrea/cp-sendfile/
>
> the main downside and the reason it wasn't applied IIRC is the lack of
> interruption of sendfile, basically for a huge file it would take a
> while before ^C has any effect. The kernel isn't interrupting the
> syscall. This is no different from a huge read or write syscall (but
> read/write are never huge or the buffer would need to be huge too, not
> the case for sendfile that works zerocopy), so in theory we could
> work around it by entering/exiting kernel multiple times just to allow
> the signal to be handled like in the read/write case.
... or we could put in checks into the kernel for signal pending, and
return EINTR.
-hpa
On Thu, Nov 13, 2003 at 04:36:26PM -0800, H. Peter Anvin wrote:
> ... or we could put in checks into the kernel for signal pending, and
> return EINTR.
that would be even better indeed.
Andrea Arcangeli wrote:
> On Thu, Nov 13, 2003 at 04:36:26PM -0800, H. Peter Anvin wrote:
>
>>... or we could put in checks into the kernel for signal pending, and
>>return EINTR.
>
> that would be even better indeed.
>
s/EINTR/short count/, of course :)
-hpa
On Wed, 2003-11-12 at 10:19, Jesse Pollard wrote:
> On Monday 10 November 2003 19:05, Albert Cahalan wrote:
> > > The security context of the output depends
> > > on the user process. If it is a privileged
> > > process (ie, may change the context of the
> > > result) then the user process has to setup
> > > that context before the file is copied.
> >
> > So open the file, change context, and then:
> >
> > long copy_fd_to_file(int fd, const char *name, ...)
>
> Easy to do in user mode.
It isn't, because the user-mode code would
need to have a full understanding of whatever
fancy (SELinux, RSBAC, LOMAC...) security
mechanism the kernel is using. It's not enough
to just know about switching to some named
context via a common API.
> > >> Is it? Please explain the simple steps which
> > >> cp(1) should take in order to observe that it
> > >> is being asked to duplicate a file on a file
> > >> system such as CIFS (or NFSv4?) which allows
> > >> the client to issue a 'copy file' command
> > >> over the network without actually transferring
> > >> the data twice, and to invoke such a command.
> > >
> > > Ah. That is an optimization question, not a
> > > question of kernel/user mode.
> >
> > Note that /bin/cp isn't always going to have
> > the necessary passwords and such. You're headed
> > down a path toward setuid /bin/cp.
>
> If cp doesn't have access to the proper security credentials,
> then the file should not be copied.
You have proper credentials for access through
the mounted filesystem. That filesystem was
mounted by root, using some secret key that is
specific to the local machine. You could try
to directly contact the server over the network,
but you won't have the keys.
You're allowed to indirectly use the keys by
going through the mounted filesystem. For example,
you can call rmdir() to remove a directory but
you can not cause the same effect by sending a
message over the network directly to the server.
You have no ability to bypass the local kernel.
So you can copy that file, but you have to use
the file-oriented system calls to do it. You'll
need kernel support to invoke a remote-copy
operation. (or a setuid-root /bin/cp that looks
up the keys, determines the correct server, makes
a network connection, etc.)
> > > And since both source and destination may
> > > be remote you do get to decide based on
> > > source and destination devices: if they
> > > are the same, and one on a remote node,
> > > then BOTH will be on the remote, then you
> > > get to use the CIFS/NFS file copy. (check
> > > the doc on "stat/statfs" for additional info).
> > >
> > > I don't believe it works when source and
> > > destination are on DIFFERENT remote nodes,
> > > though.
> > >
> > > Strictly up to the implementation of cp/mv.
> > >
> > > Though you will lose portability of cp/mv.
> > > (Of course, you also lose it with a syscall
> > > for file copy too; as well as the MUCH more
> > > complicated implementation/security checks).
> >
> > Doing that in cp/mv is just insane. For one,
> > it bypasses any local security control over
> > access to the filesystem. There's not even a
> > way to be sure you're dealing with the server
> > you think you're dealing with.
>
> It shouldn't matter - first the source file must be opened
> for read AND the destination file opened for write.
> This should give the proper local security evaluation and
> context for the copy. Once this has been approved,
> the remote copy request can be made (provided they are
> on the same "networked" device). Just making
> the request still doesn't mean that it will succeed -
> after all, the final security decisions are made by
> the remote server implementing the file copy.
>
> Though if the copy is valid locally, then the use of
> the filesystem supported copy should work. It is an
> equivalent operation, it just all takes place on the server.
>
> Identity of the server is irrelevant, as long as it is
> the same server (or farm) for both source and destination.
> If the remote file copy is defined, then it should work
> even when the actual source and destination are different
> physical machines - the remote filesystem CLAIMS it will
> work (identical is determined from the "device" mounted,
> one mount, one device as far as network filesystems go).
> And if they are not identical then you fall back to using
> a local copy.
>
> All bets are off if the local pathnames are required by
> the remote server. That is silly. How would a networked
> client even know what the pathname would be? The parameters
> should be the two file handles passed to the remote filesystem.
You may need a filename relative to the root
of the exported part of the tree.
Remote side:
J:\groups\rteng\John Smith\tests\a.out
(with rteng exported as \\RTENG)
Local side:
/home/john/tests/a.out
(the mount point is "/home/john")
Path needed:
\\RTENG\John Smith\tests\a.out
You have that, since the kernel knows that a
"\\\\RTENG\\John Smith" directory was mounted
on /home/john and you're trying to deal with
a tests/a.out file.
> Personally, I don't think any changes should be made.
> It's just that this level of transfer is what the original
> poster was talking about. It just shouldn't be done in
> kernel mode.
Anywhere else would be buggy and most likely setuid.
"H. Peter Anvin" <[email protected]> writes:
> Andrea Arcangeli wrote:
> > On Thu, Nov 13, 2003 at 04:36:26PM -0800, H. Peter Anvin wrote:
> >
> >>... or we could put in checks into the kernel for signal pending, and
> >>return EINTR.
> >
> > that would be even better indeed.
> >
>
> s/EINTR/short count/, of course :)
That would be buggy because existing users of sendfile don't know
about this and would silently only copy part of the file when a signal
happens.
-Andi
Andi Kleen wrote:
> > s/EINTR/short count/, of course :)
>
> That would be buggy because existing users of sendfile don't know
> about this and would silently only copy part of the file when a signal
> happens.
That doesn't make sense. There aren't any existing users of sendfile
to copy files.
-- Jamie
On Tue, 18 Nov 2003 15:49:21 +0000
Jamie Lokier <[email protected]> wrote:
> Andi Kleen wrote:
> > > s/EINTR/short count/, of course :)
> >
> > That would be buggy because existing users of sendfile don't know
> > about this and would silently only copy part of the file when a signal
> > happens.
>
> That doesn't make sense. There aren't any existing users of sendfile
> to copy files.
[ignore the mail, it was a stuck mail queue]
But note that arbitrary changes in the signal handling would affect all users of sendfile, not just
those that attempt to copy files or do other things that should be done in user space.
-Andi
>>>>> " " == Andi Kleen <[email protected]> writes:
>> > That would be buggy because existing users of sendfile don't
>> > know about this and would silently only copy part of the file
>> > when a signal happens.
>>
>> That doesn't make sense. There aren't any existing users of
>> sendfile to copy files.
> [ignore the mail, it was a stuck mail queue]
> But note that arbitrary changes in the signal handling would
> affect all users of sendfile, not just those that attempt to
> copy files or do other things that should be done in user
> space.
That 'change' is already in effect for people who mount their NFS
partitions with the "intr" or "soft" flags.
See the return value of generic_file_sendfile(): it already has the
read()/write()-like semantics of returning the number of bytes written if
non-zero, or the value of desc.error if not.
Cheers,
Trond
Andi Kleen wrote:
>
> That would be buggy because existing users of sendfile don't know
> about this and would silently only copy part of the file when a signal
> happens.
>
It would be consistent with the documented semantics for other file
operations. Obviously, return zero only on EOF.
-hpa
On 14 Nov 2003, Andi Kleen wrote:
>
> That would be buggy because existing users of sendfile don't know
> about this and would silently only copy part of the file when a signal
> happens.
Don't be silly.
Existing sendfile users had _better_ accept short writes.
They happen all the time. If the destination is the network, it _will_ be
interruptible.
Linus
Once upon a time, Andi Kleen <[email protected]> wrote:
>"H. Peter Anvin" <[email protected]> writes:
>> s/EINTR/short count/, of course :)
>That would be buggy because existing users of sendfile don't know
>about this and would silently only copy part of the file when a signal
>happens.
Tru64 5.1B sendfile(2) page includes:
[EINTR]
A signal interrupted sendfile before any data was
transmitted. If some data was transmitted, the function
returns the number of bytes sent before the
interrupt and does not set errno to [EINTR].
There are quite a few more documented return values under Tru64,
although TCP sockets are the only supported destination. See
http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51B_HTML/MAN/MAN2/0024____.HTM
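
The rule in the Tru64 excerpt, which matches the "short count" correction made earlier in the thread, can be written down as a tiny model. This is not kernel code: interrupted_result is a hypothetical function that only shows what an interrupted transfer would report to its caller.

```c
#include <errno.h>
#include <sys/types.h>

/*
 * Model of the return value after a signal interrupts a transfer:
 * - some bytes already sent: return that short count, leave errno alone;
 * - nothing sent yet: return -1 with errno set to EINTR.
 */
ssize_t interrupted_result(ssize_t bytes_done)
{
    if (bytes_done > 0)
        return bytes_done;
    errno = EINTR;
    return -1;
}
```

Under this rule a caller that loops until the requested count is reached, as with read() and write(), never loses data silently.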
--
Chris Adams <[email protected]>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.
On Tuesday 18 November 2003 09:49, Jamie Lokier wrote:
> Andi Kleen wrote:
> > > s/EINTR/short count/, of course :)
> >
> > That would be buggy because existing users of sendfile don't know
> > about this and would silently only copy part of the file when a signal
> > happens.
>
> That doesn't make sense. There aren't any existing users of sendfile
> to copy files.
True. It also doesn't address the issue of what to do when the file copy is
being done on a remote server and not by something local. Synchronizing
a remote interrupt could really be nasty.
Jesse Pollard wrote:
> > > int sys_copy(int fd_src, int fd_dst)
> >
> > Doesn't work. You have to set the security attributes while you open
> > fd_dst.
>
> Why? the open for fd_src should have the security attributes (both locally
> and in the file server if networked). Opening fd_dst should SET the security
> attributes desired (again, locally and in the target fileserver).
The default attributes in the new location might be less strict than the
attributes of the source file.
If sys_copy() is just an API to introduce a new copy-on-write hard link,
these problems disappear. They are only relevant if sys_copy() is
intended to be a generic "copy that file" interface.
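A minimal sketch of the "set the attributes while you open fd_dst" point, assuming only classic permission bits matter. The helper name open_dst_like_src is hypothetical, and ACLs or extended attributes are not handled at all.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create dst_path with permission bits no looser than those of fd_src.
 * O_EXCL makes the call fail if dst_path already exists, so the copy
 * never inherits the (possibly looser) mode of a pre-existing file.
 * Returns the new descriptor, or -1 with errno set. */
int open_dst_like_src(int fd_src, const char *dst_path)
{
    struct stat st;

    if (fstat(fd_src, &st) == -1)
        return -1;
    /* The umask can only clear further bits, never add any. */
    return open(dst_path, O_WRONLY | O_CREAT | O_EXCL, st.st_mode & 0777);
}
```

With this ordering there is no window in which the destination exists with the default (less strict) attributes of its new location.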
On Thursday 20 November 2003 11:21, Florian Weimer wrote:
> Jesse Pollard wrote:
> > > > int sys_copy(int fd_src, int fd_dst)
> > >
> > > Doesn't work. You have to set the security attributes while you open
> > > fd_dst.
> >
> > Why? the open for fd_src should have the security attributes (both
> > locally and in the file server if networked). Opening fd_dst should SET
> > the security attributes desired (again, locally and in the target
> > fileserver).
>
> The default attributes in the new location might be less strict than the
> attributes of the source file.
So what? The user was authorized to open the input file. The user was
authorized to open the output file. A file copy should be possible remotely
since the equivalent implementation of a local read/write loop would
accomplish the same thing.
> If sys_copy() is just an API to introduce a new copy-on-write hard link,
> these problems disappear. They are only relevant if sys_copy() is
> intended to be a generic "copy that file" interface.
Now if you wanted the remote server to deny the network copy... could
be done - after all the credentials for both input and output files
are present on the server. If the server decides NOT to copy, then fine.
It would just cause the user to make the copy with a read/write loop.
I was only thinking of it as a way to gain access to any filesystem
support that may be available for copying files. If none is available,
then do it in user mode.
Personally, I'm not sure it is a good idea, partly because the semantics
of a file copy operation are not well defined (some of the following IS
known).
1. what happens if the copy is aborted?
2. what happens if the network drops while the remote server continues?
3. what about buffer synchronization?
4. what errors should be reported ?
5. what happens when the syscall is interrupted? Especially if the remote
copy may take a while (I've seen some require an hour or more - worst
case: days due to a media error (completed after the disk was replaced)).
6. what about a client opening the copy before it is finished copying?
Jesse Pollard wrote:
> > > > > int sys_copy(int fd_src, int fd_dst)
> > The default attributes in the new location might be less strict than the
> > attributes of the source file.
>
> So what? The user was authorized to open the input file. The user was
> authorized to open the output file. A file copy should be possible remotely
> since the equivalent implementation of a local read/write loop would
> accomplish the same thing.
The potential for race conditions worries me. However, the questions
you gave are more fundamental and may be enough to kill this idea (if it
wasn't already dead)...
On Thu, 2003-11-20 at 19:08, Jesse Pollard wrote:
> Now if you wanted the remote server to deny the network copy... could
> be done - after all the credentials for both input and output files
> are present on the server. If the server decides NOT to copy, then fine.
> It would just cause the user to make the copy with a read/write loop.
>
> I was only thinking of it as a way to gain access to any filesystem
> support that may be available for copying files. If none is available,
> then do it in user mode.
>
> Personally, I'm not sure it is a good idea, partly because the semantics
> of a file copy operation are not well defined (some of the following IS
> known).
>
> 1. what happens if the copy is aborted?
> 2. what happens if the network drops while the remote server continues?
> 3. what about buffer synchronization?
> 4. what errors should be reported ?
> 5. what happens when the syscall is interrupted? Especially if the remote
> copy may take a while (I've seen some require an hour or more - worst
> case: days due to a media error (completed after the disk was replaced)).
> 6. what about a client opening the copy before it is finished copying?
If you really want a filesystem that supports efficient copying you
probably want it to have the equivalent of COW blocks, so that a copy
just sets up a few pointers, and the copy only happens when the original
or copied files are changed.
But basically you won't get a syscall until you have a filesystem with
semantics that map only onto this sort of operation.
Justin
Justin Cormack wrote:
> On Thu, 2003-11-20 at 19:08, Jesse Pollard wrote:
> If you really want a filesystem that supports efficient copying you
> probably want it to have the equivalent of COW blocks, so that a copy
> just sets up a few pointers, and the copy only happens when the original
> or copied files are changed.
>
> But basically you won't get a syscall until you have a filesystem with
> semantics that map only onto this sort of operation.
This could be a problem if COW causes you to run out of space when
writing to the file.
This could also be a benefit if, for whatever reason, you have lots of
copies of the same file that you never change. But that sounds somewhat
pointless to me.
On Nov 20, 2003 15:44 -0500, Timothy Miller wrote:
> This could be a problem if COW causes you to run out of space when
> writing to the file.
Not much different than running out of space copying a file.
> This could also be a benefit if, for whatever reason, you have lots of
> copies of the same file that you never change. But that sounds somewhat
> pointless to me.
Umm, snapshots-in-time of your /home, /usr/src, etc? Copies of the kernel?
Lots of reasons to have mostly-identical versions of files. Almost like
hard links, except you aren't at the mercy of your editor/patch to do the
right thing when modifying one of those copies.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
Andreas Dilger wrote:
> On Nov 20, 2003 15:44 -0500, Timothy Miller wrote:
>
>>This could be a problem if COW causes you to run out of space when
>>writing to the file.
>
>
> Not much different than running out of space copying a file.
It is, though. If you run out of space copying a file, you know it when
you're copying. Applications don't usually expect to get out-of-space
errors while overwriting something in the middle of a file.
In effect, your free space and your used space add up to greater than
the capacity of the disk. An application that checks for free space
before doing something would be fooled into thinking there is more free
space than there really is. How can an application find out in advance
that a file that it's about to modify (without appending anything to the
end) is going to need more disk space?
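To make the concern concrete, here is a toy in-memory model, not any real filesystem: blocks are shared by reference count on copy, and the first overwrite of a shared block needs a fresh block. Every name in it (cow_file, cow_copy, cow_write, the 4-block "disk") is invented for illustration.

```c
#include <errno.h>
#include <string.h>

#define NBLOCKS   4   /* capacity of the toy "disk" */
#define BLOCKSIZE 8
#define FILEBLKS  2   /* every toy file is exactly two blocks long */

static char disk[NBLOCKS][BLOCKSIZE];
static int  refcount[NBLOCKS];           /* 0 means the block is free */

struct cow_file { int blk[FILEBLKS]; };

static int alloc_block(void)
{
    for (int i = 0; i < NBLOCKS; i++)
        if (refcount[i] == 0) { refcount[i] = 1; return i; }
    errno = ENOSPC;
    return -1;
}

/* Create a file by allocating fresh, zeroed blocks. */
int cow_create(struct cow_file *f)
{
    for (int i = 0; i < FILEBLKS; i++) {
        int b = alloc_block();
        if (b == -1) {
            while (i-- > 0)
                refcount[f->blk[i]]--;   /* undo partial allocation */
            return -1;
        }
        memset(disk[b], 0, BLOCKSIZE);
        f->blk[i] = b;
    }
    return 0;
}

/* "Copy" by sharing blocks: no data moves and no space is consumed. */
void cow_copy(struct cow_file *dst, const struct cow_file *src)
{
    for (int i = 0; i < FILEBLKS; i++) {
        dst->blk[i] = src->blk[i];
        refcount[src->blk[i]]++;
    }
}

/* Overwrite one whole block. A shared block must be split first, and
 * that split is where ENOSPC surfaces in the middle of a file. */
int cow_write(struct cow_file *f, int i, const char *data)
{
    int b = f->blk[i];

    if (refcount[b] > 1) {
        int nb = alloc_block();
        if (nb == -1)
            return -1;                   /* errno == ENOSPC */
        refcount[b]--;
        f->blk[i] = b = nb;
    }
    memcpy(disk[b], data, BLOCKSIZE);
    return 0;
}
```

In this model, creating one file, copying it, and then filling the remaining two blocks with a third file leaves the disk full; an overwrite of the copy then fails with ENOSPC even though the file does not grow, while an overwrite of the unshared third file still succeeds.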
> It is, though. If you run out of space copying a file, you know it when
> you're copying. Applications don't usually expect to get out-of-space
> errors while overwriting something in the middle of a file.
What about sparse files?
> In effect, your free space and your used space add up to greater than
> the capacity of the disk. An application that checks for free space
> before doing something would be fooled into thinking there is more free
> space than there really is. How can an application find out in advance
> that a file that it's about to modify (without appending anything to the
> end) is going to need more disk space?
I don't think it can do that already now with sparse files, can it?
Cheers,
MaZe.
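A small sketch of how an application can at least see that a file is sparse, assuming a POSIX struct stat whose st_blocks is counted in 512-byte units (true on Linux; other systems may differ). The helper name file_usage is made up.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Report the apparent size of fd and the space actually allocated to it.
 * Assumption: st_blocks is in 512-byte units, as on Linux. */
int file_usage(int fd, off_t *apparent, off_t *allocated)
{
    struct stat st;

    if (fstat(fd, &st) == -1)
        return -1;
    *apparent  = st.st_size;
    *allocated = (off_t)st.st_blocks * 512;
    return 0;
}
```

For a file extended with ftruncate() and never written, allocated is typically far below apparent. Writing into the hole then needs new blocks and can fail with ENOSPC, without the file getting any longer.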
Maciej Zenczykowski wrote:
>>It is, though. If you run out of space copying a file, you know it when
>>you're copying. Applications don't usually expect to get out-of-space
>>errors while overwriting something in the middle of a file.
>
>
> What about sparse files?
Ah, good point. Never mind. :)
> Andreas Dilger wrote:
> > On Nov 20, 2003 15:44 -0500, Timothy Miller wrote:
> >
> >>This could be a problem if COW causes you to run out of space when
> >>writing to the file.
> >
> >
> > Not much different than running out of space copying a file.
>
> It is, though. If you run out of space copying a file, you
> know it when you're copying. Applications don't usually expect to get
> out-of-space errors while overwriting something in the middle of a
> file.
It already could on a journaling filesystem.
It's not in any spec that writing to the middle of a file would not
cause ENOSPC, is it?
> In effect, your free space and your used space add up to greater than
> the capacity of the disk. An application that checks for free space
> before doing something would be fooled into thinking there is
> more free space than there really is. How can an application find out
> in advance that a file that it's about to modify (without appending
> anything to the end) is going to need more disk space?
Assume 'fast'copy(int fd_in, int fd_out) where fd_in and fd_out reference
files. fd_in is opened for read and fd_out is opened for write. Ignore
filepos locations in both fd's. fd_out must reference an empty/truncated
file (if not then fail). Usually you'd call copy on fd_out straight out
of a creat call (and thus this would be a non-issue).
> 1. what happens if the copy is aborted?
I'd say the copy operation should be 'atomic': either it succeeds (full
copy) or fails (no changes to filesystems except for the truncate). An
abort would obviously usually result in a failure (thus a possible revert,
which is rather easy since it's likely just a truncate of whatever has
already been copied) or, if we've just finished, in a successful
result.
> 2. what happens if the network drops while the remote server continues?
If the remote server has enough data to perform the operation then it does
complete it; otherwise there ain't enough info anyway (after all, the
entire idea of this is to fit the entire copy into a single copy
instruction, thus a single packet/command/whatever; no extra data is
passed)...
> 3. what about buffer synchronization?
If this is happening remotely then I don't see what requires sync???
> 4. what errors should be reported ?
This is tougher:
Tests are first performed locally (if they can be), then the request is
forwarded to the remote end and tests are performed remotely, returning
either an error or ACCEPTED, at which point the local end tells it to go
ahead (at this point the operation is effectively performed, unless an
abort is signalled, regardless of network connectivity). On completion the
remote end will return info on completion or an error code.
a) operation not supported by kernel :) - ENOSYS
b) fd_in/fd_out invalid file descriptor - EBADF
c) fd_in/fd_out is directory - EISDIR
d) can't read/write from/to fd_in/fd_out - EINVAL
e) an error if fd_out ain't empty - ENOTEMPTY
f) operation not supported by this combination of devices - EOPNOTSUPP
[so you need to do it via usual loop]
g) input file bigger than output file can be - EFBIG
[ie copy of 5GB file from remote filesystem which supports it to
another filesystem on the same server with 2GB max file size]
h) low-level IO error - EIO - serious problems (i.e. HDD read/write error)
i) out of disk space during copy - ENOSPC
j) out of memory during copy - ENOMEM (unlikely, needed?)
k) lost network connection - ENETRESET (unknown whether succeeded)
or ENOLINK ?
l) operation was aborted - EINTR [probably should be some other error
code, not sure]
m) success - either return 0 or the number of bytes copied
[probably best to return the # of bytes copied, even if (for now?) we
only accept full copies]
Did I miss anything? What about non-blocking call? Basically as above but
return INPROGRESS as soon as we tell remote end to go ahead... or perhaps
don't support non-blocking call?
> 5. what happens when the syscall is interrupted? Especially if the remote
> copy may take a while (I've seen some require an hour or more - worst
> case: days due to a media error (completed after the disk was replaced)).
Well, if it's interrupted by a SIGINT or the like then return EINTR and
the copy was not performed (i.e. we backed the copy out, unless a net failure
is detected during the abort, then ENOLINK/ENETRESET).
If it's a more normal signal then it should behave like any normal kernel
restartable syscall (i.e. via ERESTARTNOHAND or something like that).
> 6. what about a client opening the copy before it is finished copying?
The file copy is atomic and thus the file doesn't per se exist until the
copy operation completes (or the file exists with zero size and is locked
and can't be opened).
Perhaps in the future we could support partial copies and restarting an
interrupted copy, but let's first agree (or not) on the above.
I think a copy syscall would be very useful. What I'd really like to see
is some sort of block-hashed-space-compression with copy-on-write
semantics file system for linux (for my 500 CD collection which probably
has a 10-12 data duplicity factor).
Comments?
Cheers,
MaZe.
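A rough userspace sketch of how a caller might use such a call, covering only cases a), e) and f) from the list. fast_copy() is a HYPOTHETICAL stand-in for the proposed syscall and always fails with ENOSYS here; copy_fd and loop_copy are names made up for this sketch. As a simplification, the fallback loop starts at the current file offsets rather than ignoring them as the proposal suggests.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* HYPOTHETICAL stand-in for the proposed call. No such syscall is
 * assumed to exist, so it always reports "not supported" (case a). */
static ssize_t fast_copy(int fd_in, int fd_out)
{
    (void)fd_in; (void)fd_out;
    errno = ENOSYS;
    return -1;
}

/* The usual read/write loop, coping with short writes and EINTR. */
static ssize_t loop_copy(int fd_in, int fd_out)
{
    char buf[4096];
    ssize_t total = 0, n;

    while ((n = read(fd_in, buf, sizeof buf)) != 0) {
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(fd_out, buf + off, (size_t)(n - off));
            if (w == -1) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
    return total;
}

/* Returns the number of bytes copied, or -1 with errno set. */
ssize_t copy_fd(int fd_in, int fd_out)
{
    struct stat st;
    ssize_t n;

    if (fstat(fd_out, &st) == -1)
        return -1;
    if (st.st_size != 0) {               /* case e: fd_out must be empty */
        errno = ENOTEMPTY;
        return -1;
    }
    n = fast_copy(fd_in, fd_out);
    if (n == -1 && (errno == ENOSYS || errno == EOPNOTSUPP))
        n = loop_copy(fd_in, fd_out);    /* cases a and f: usual loop */
    return n;
}
```

Returning the byte count rather than 0 on success, as suggested in m), keeps the door open for partial copies later without changing the interface.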
On Thu, 20/11/2003 at 20:08, Jesse Pollard wrote:
> 1. what happens if the copy is aborted?
> 2. what happens if the network drops while the remote server continues?
> 3. what about buffer synchronization?
> 4. what errors should be reported ?
> 5. what happens when the syscall is interrupted? Especially if the remote
> copy may take a while (I've seen some require an hour or more - worst
> case: days due to a media error (completed after the disk was replaced)).
> 6. what about a client opening the copy before it is finished copying?
7. How to report progress with your average file manager ?
On Nov 20, 2003 23:31 +0100, Xavier Bestel wrote:
> On Thu, 20/11/2003 at 20:08, Jesse Pollard wrote:
> > 1. what happens if the copy is aborted?
Same as now with "cp" - partial copy.
> > 2. what happens if the network drops while the remote server continues?
Irrelevant, since you can't access the file at that point (i.e. if the server
continues then great, but if it doesn't it's no different than the server
disconnecting/crashing in the middle of a regular copy).
> > 3. what about buffer synchronization?
Sync the file locally before starting, and no buffers are created on the client.
If you write to the file while it is being copied, how is that different
from two writers for the same file now (i.e. usually broken)? If the network
filesystem doesn't support locking, that's the filesystem's problem and
this API doesn't change it.
> > 4. what errors should be reported ?
Covered pretty well elsewhere. Of course EINTR should be reserved for
"interrupted, please continue if you want" as opposed to a hard error.
> > 5. what happens when the syscall is interrupted? Especially if the remote
> > copy may take a while (I've seen some require an hour or more - worst
> > case: days due to a media error (completed after the disk was replaced)).
Partial copy, no different than now.
> > 6. what about a client opening the copy before it is finished copying?
Reads partial file, no different than now.
> 7. How to report progress with your average file manager ?
Support signals and restart the copy where it left off. Interrupting
once a second or whatever isn't onerous if needed and you can restart.
You could even support some sort of "SIGUSR1" like dd does to get status
back without actually killing things. Alternately, just stat the target
file as it is being copied to watch progress.
Cheers, Andreas
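A minimal sketch of "restart the copy where it left off", done in userspace. It assumes the source has not changed since the interruption and that whatever already sits in the destination is a valid prefix; resume_copy is a made-up name.

```c
#include <sys/types.h>
#include <unistd.h>

/* Continue an interrupted copy. The current length of fd_dst says how
 * much already arrived, so both descriptors are positioned there first.
 * Returns the number of bytes copied by this call, or -1 on error
 * (including EINTR, after which the caller may simply call again). */
ssize_t resume_copy(int fd_src, int fd_dst)
{
    char buf[4096];
    ssize_t total = 0, n;
    off_t done = lseek(fd_dst, 0, SEEK_END);   /* progress so far */

    if (done == (off_t)-1 || lseek(fd_src, done, SEEK_SET) == (off_t)-1)
        return -1;

    while ((n = read(fd_src, buf, sizeof buf)) > 0) {
        ssize_t off = 0;

        while (off < n) {
            ssize_t w = write(fd_dst, buf + off, (size_t)(n - off));
            if (w == -1)
                return -1;
            off += w;
        }
        total += n;
    }
    return n == -1 ? -1 : total;
}
```

The same length check doubles as the progress report: a file manager can stat the target and compare st_size with the size of the source.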
On Thursday 20 November 2003 13:44, Justin Cormack wrote:
> On Thu, 2003-11-20 at 19:08, Jesse Pollard wrote:
[snip]
>
> If you really want a filesystem that supports efficient copying you
> probably want it to have the equivalent of COW blocks, so that a copy
> just sets up a few pointers, and the copy only happens when the original
> or copied files are changed.
Ummmm... I REALLY don't like COW on a disk. Much too big a chance that the
filesystem will deadlock, and with no recovery method. (oversubscribed, then
crash, corrupting the homeblock, repair (committing journal?) requires
space... no space... therefore mostly dead. You'd have to be able to mount
without the journal or the homeblock, then delete something, then commit the
journal, dismount, recover the rest -- though this might be overboard, the
homeblock might not even be damaged).
> But basically you wont get a syscall until you have a filesystem with
> semantics that only maps onto this sort of operation.
I believe NFSv3/4 has a file copy request included. And I understand that
the SAMBA server does too.
On Thursday 20 November 2003 15:48, Maciej Zenczykowski wrote:
> Assume 'fast'copy(int fd_in, int fd_out) where fd_in and fd_out reference
> files. fd_in is opened for read and fd_out is opened for write. Ignore
> filepos locations in both fd's. fd_out must reference an empty/truncated
> file (if not then fail). Usually you'd call copy on fd_out straight out
> of a creat call (and thus this would be a non-issue).
>
> > 1. what happens if the copy is aborted?
>
> I'd say the copy operation should be 'atomic', either it succeeds (full
> copy) or fails (no changes to filesystems except for the truncate). An
> abort would obviously usually result in a failure (thus a possible revert,
> which is rather easy since it's likely just a truncate of whatever has
> already been copied) or, if we've just finished, in a successful
> result.
Really? What happens if the abort is local to the system making the request?
What happens if the abort is on the remote server?
> > 2. what happens if the network drops while the remote server continues?
>
> If the remote server has enough data to perform the operation then it does
> complete it; otherwise there ain't enough info anyway (after all, the
> entire idea of this is to fit the entire copy into a single copy
> instruction thus a single packet/command whatever, no extra data is
> passed)...
And back to aborts?
> > 3. what about buffer synchronization?
>
> If this is happening remotely then I don't see what requires sync???
Multiple hosts remote to the server that have a file open. Though this
already happens with NFS.
> > 4. what errors should be reported ?
>
> This is tougher:
>
> Tests first performed locally (if they can be) then request forwarded to
> remote end and tests performed remotely - return either error or
> ACCEPTED, at which point local end tells it to go ahead, (at this
> point the operation is effectively performed (unless an abort is
> signalled) regardless of network connectivity). On completion remote end
> will return info on completion or error code.
>
> a) operation not supported by kernel :) - ENOSYS
> b) fd_in/fd_out invalid file descriptor - EBADF
> c) fd_in/fd_out is directory - EISDIR
> d) can't read/write from/to fd_in/fd_out - EINVAL
> e) an error if fd_out ain't empty - ENOTEMPTY
> f) operation not supported by this combination of devices - EOPNOTSUPP
> [so you need to do it via usual loop]
> g) input file bigger than output file can be - EFBIG
> [ie copy of 5GB file from remote filesystem which supports it to
> another filesystem on the same server with 2GB max file size]
> h) low-level IO error - EIO - serious problems (i.e. HDD read/write error)
> i) out of disk space during copy - ENOSPC
> j) out of memory during copy - ENOMEM (unlikely, needed?)
> k) lost network connection - ENETRESET (unknown whether succeeded)
> or ENOLINK ?
> l) operation was aborted - EINTR [probably should be some other error
> code, not sure]
> m) success - either return 0 or the number of bytes copied
> [probably best to return the # of bytes copied, even if (for now?) we
> only accept full copies]
>
> Did I miss anything? What about non-blocking call? Basically as above but
> return INPROGRESS as soon as we tell remote end to go ahead... or perhaps
> don't support non-blocking call?
>
> > 5. what happens when the syscall is interrupted? Especially if the remote
> > copy may take a while (I've seen some require an hour or more - worst
> > case: days due to a media error (completed after the disk was
> > replaced)).
>
> Well, if it's interrupted by a SIGINT or the like then return EINTR and
> the copy was not performed (ie we backed the copy out, unless net failure
> detected during abort then ENOLINK/ENETRESET).
Ooop - the copy is being done on the remote server.
> If it's a more normal signal then it should behave like any normal kernel
> restartable syscall (i.e. via ERESTARTNOHAND or something like that).
Again, the copy may be being made on the remote server.
> > 6. what about a client opening the copy before it is finished copying?
>
> The file copy is atomic and thus the file doesn't per se exist until the
> copy operation completes (or the file exists with zero size and is locked
> and can't be opened).
It does under all other methods of copying.
> Perhaps in the future we could support partial copies and restarting an
> interrupted copy, but let's first agree (or not) on the above.
>
> I think a copy syscall would be very useful. What I'd really like to see
> is some sort of block-hashed-space-compression with copy-on-write
> semantics file system for linux (for my 500 CD collection which probably
> has a 10-12 data duplicity factor).
It could be useful. What you describe now is a migrating filesystem on a
server. And note that your COW is going from two different filesystems (hmm
or maybe a custom union mount?)...
Which is where the migrating filesystem comes in. The served filesystem should
already know how to transfer a file from the archive.
Hi!
> >>This could be a problem if COW causes you to run out of space when
> >>writing to the file.
> >
> >
> >Not much different than running out of space copying a file.
>
> It is, though. If you run out of space copying a file, you know it when
> you're copying. Applications don't usually expect to get out-of-space
> errors while overwriting something in the middle of a file.
Same can happen on compressed filesystem...
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]
Pavel Machek wrote:
> > It is, though. If you run out of space copying a file, you know it when
> > you're copying. Applications don't usually expect to get out-of-space
> > errors while overwriting something in the middle of a file.
>
> Same can happen on compressed filesystem...
Or a filesystem with snapshots, e.g. using LVM.
-- Jamie
Jamie Lokier writes:
> Pavel Machek wrote:
>> > It is, though. If you run out of space copying a file, you know it when
>> > you're copying. Applications don't usually expect to get out-of-space
>> > errors while overwriting something in the middle of a file.
>>
>> Same can happen on compressed filesystem...
>
> Or a filesystem with snapshots, e.g. using LVM.
Or writing to a sparse file.
Andreas.
--
Andreas Schwab, SuSE Labs
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
(Among the other N objections, add things like the lack of any sort of
control or option parameters)
...
N += 1: Sparse Copying (e.g. seeking past blocks of zeros)
N += 1: Unlink or overwrite or what?
N += 1: In-Kernel locking and resolution for pages that are mandatory
lock(ed)
N += 1: No fine-grained control for concurrency issues (multiple writers)
Start with doing a cp --help and move on from there for an unbounded list of
issues that sys_copy(int fd1, int fd2) does not even come close to
addressing.
Robert White wrote:
>(Among the other N objections, add things like the lack of any sort of
>control or option parameters)
>...
>N += 1: Sparse Copying (e.g. seeking past blocks of zeros)
>N += 1: Unlink or overwrite or what?
>N += 1: In-Kernel locking and resolution for pages that are mandatory
>lock(ed)
>N += 1: No fine-grained control for concurrency issues (multiple writers)
>
>Start with doing a cp --help and move on from there for an unbounded list of
>issues that sys_copy(int fd1, int fd2) does not even come close to
>addressing.
>
>
To be fair, sys_copy is never intended to replace cp or try to be
very smart. I don't think it is semantically supposed to do much more
than replace a read, write loop (of course, the syscall also has an
offset and count).
sparse copying would be implementation dependent. If cp wanted to do
something special it would not use one big copy call. I think unlink
/ overwrite is irrelevant if it's semantically a read write loop.
On Thu, 27 Nov 2003, Nick Piggin wrote:
> Robert White wrote:
>
> >(Among the other N objections, add things like the lack of any sort of
> >control or option parameters)
> >...
> >N += 1: Sparse Copying (e.g. seeking past blocks of zeros)
> >N += 1: Unlink or overwrite or what?
> >N += 1: In-Kernel locking and resolution for pages that are mandatory
> >lock(ed)
> >N += 1: No fine-grained control for concurrency issues (multiple writers)
> >
> >Start with doing a cp --help and move on from there for an unbounded list of
> >issues that sys_copy(int fd1, int fd2) does not even come close to
> >addressing.
> >
> >
>
> To be fair, sys_copy is never intended to replace cp or try to be
> very smart. I don't think it is semantically supposed to do much more
> than replace a read, write loop (of course, the syscall also has an
> offset and count).
>
> sparse copying would be implementation dependent. If cp wanted to do
> something special it would not use one big copy call. I think unlink
> / overwrite is irrelevant if it's semantically a read write loop.
>
actually if this syscall is allowed to do a COW at the filesystem level
(which I think is one of the better reasons for implementing this) then
sparse files would produce sparse copies.
if the destination exists it would need to be unlinked (overwrite doesn't
make sense in the COW context)
I don't understand the in-kernel page locking issues referred to above
the concurrency issues are a good question, but I would suggest that the
syscall fully set up the copy and then create the link to it. this would
make the final creation an atomic operation (or as close to it as a
particular filesystem allows) and if you have multiple writers doing a
copy to the same destination then the last one wins, the earlier copies
get unlinked and deleted
I definitely don't see it being worth it to make a syscall to just
implement the read/write loop, but a copy syscall designed from the outset
to do a COW copy that falls back to a read/write loop for filesystems that
don't do COW has some real benefits
David Lang
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
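The "fully set up the copy, then create the link" idea has a well-known userspace analogue: write to a temporary name and rename(2) it into place. The sketch below assumes the temporary file lives in the same directory as the destination, so the rename stays on one filesystem; copy_then_rename is an invented name, and metadata is not copied (mkstemp creates the file with mode 0600).

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy src_path to dst_path so that dst_path only ever names a complete
 * file: data goes to a temporary name first, then rename(2) swaps it in.
 * Returns 0 on success, -1 on error. */
int copy_then_rename(const char *src_path, const char *dst_path)
{
    char tmp[4096], buf[4096];
    ssize_t n;
    int in, out;

    if (snprintf(tmp, sizeof tmp, "%s.XXXXXX", dst_path) >= (int)sizeof tmp)
        return -1;
    if ((in = open(src_path, O_RDONLY)) == -1)
        return -1;
    if ((out = mkstemp(tmp)) == -1) {
        close(in);
        return -1;
    }
    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (write(out, buf, (size_t)n) != n) {  /* treat short write as failure */
            n = -1;
            break;
        }
    }
    close(in);
    if (n == 0 && fsync(out) == -1)
        n = -1;
    if (close(out) == -1)
        n = -1;
    if (n == -1 || rename(tmp, dst_path) == -1) {
        unlink(tmp);                            /* never leave a partial copy */
        return -1;
    }
    return 0;
}
```

With several writers targeting the same destination, each rename replaces the previous file as a whole, so the last one wins and readers never observe a half-written copy.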
David Lang wrote:
>On Thu, 27 Nov 2003, Nick Piggin wrote:
>
>
>>Robert White wrote:
>>
>>
>>>(Among the other N objections, add things like the lack of any sort of
>>>control or option parameters)
>>>...
>>>N += 1: Sparse Copying (e.g. seeking past blocks of zeros)
>>>N += 1: Unlink or overwrite or what?
>>>N += 1: In-Kernel locking and resolution for pages that are mandatory
>>>lock(ed)
>>>N += 1: No fine-grained control for concurrency issues (multiple writers)
>>>
>>>Start with doing a cp --help and move on from there for an unbounded list of
>>>issues that sys_copy(int fd1, int fd2) does not even come close to
>>>addressing.
>>>
>>>
>>>
>>To be fair, sys_copy is never intended to replace cp or try to be
>>very smart. I don't think it is semantically supposed to do much more
>>than replace a read, write loop (of course, the syscall also has an
>>offset and count).
>>
>>sparse copying would be implementation dependent. If cp wanted to do
>>something special it would not use one big copy call. I think unlink
>>/ overwrite is irrelevant if it's semantically a read write loop.
>>
>>
>
>actually if this syscall is allowed to do a COW at the filesystem level
>(which I think is one of the better reasons for implementing this) then
>sparse files would produce sparse copies.
>
Sure, I just mean the semantics should be equivalent to a read write
loop. Another example is zero copy copy for a remote fs that supports
it.
>
>if the destination exists it would need to be unlinked (overwrite doesn't
>make sense in the COW context)
>
Well it would be implementation specific. Presumably it should keep
the semantics of an overwrite.
>
>I don't understand the in-kernel page locking issues referred to above
>
>the concurrency issues are a good question, but I would suggest that the
>syscall fully setup the copy and then create the link to it. this would
>make the final creation an atomic operation (or as close to it as a
>particular filesystem allows) and if you have multiple writers doing a
>copy to the same destination then the last one wins, the earlier copies
>get unlinked and deleted
>
I don't think it should do any linking / unlinking; it should just work
with file descriptors. Concurrent writes to a file don't have many
guarantees. sys_copy shouldn't have to be any stronger (read weaker).
>
>I definitely don't see it being worth it to make a syscall to just
>implement the read/write loop, but a copy syscall designed from the outset
>to do a COW copy that falls back to a read/write loop for filesystems that
>don't do COW has some real benefits
>
No I just mean the semantics.
On Thu, 27 Nov 2003, Nick Piggin wrote:
> >
> >if the destination exists it would need to be unlinked (overwrite doesn't
> >make sense in the COW context)
> >
>
> Well it would be implementation specific. Presumably it should keep
> the semantics of an overwrite.
>
> >
> >I don't understand the in-kernel page locking issues referred to above
> >
> >the concurrency issues are a good question, but I would suggest that the
> >syscall fully setup the copy and then create the link to it. this would
> >make the final creation an atomic operation (or as close to it as a
> >particular filesystem allows) and if you have multiple writers doing a
> >copy to the same destination then the last one wins, the earlier copies
> >get unlinked and deleted
> >
>
> I don't think it should do any linking / unlinking it should just work
> with file descriptors. Concurrent writes to a file don't have many
> guarantees. sys_copy shouldn't have to be any stronger (read weaker).
I'm thinking that it may actually be easier to do this via file paths
instead of file descriptors. with file paths something like COW or
zero-copy copy can be done trivially (and the kernel knows the user
credentials of the program issuing the command and can pass them on to the
filesystem to see if it's allowed). I don't see how this can be done with
file descriptors (if all you have is a file descriptor you can truncate
and write a file, but you don't know all the links to that file so you
can't reposition that first inode for example).
> >
> >I definitely don't see it being worth it to make a syscall to just
> >implement the read/write loop, but a copy syscall designed from the outset
> >to do a COW copy that falls back to a read/write loop for filesystems that
> >don't do COW has some real benefits
> >
>
> No I just mean the semantics.
>
>
>
On Thu, 27 November 2003 01:50:46 -0800, David Lang wrote:
> >
> > I don't think it should do any linking / unlinking it should just work
> > with file descriptors. Concurrent writes to a file don't have many
> > guarantees. sys_copy shouldn't have to be any stronger (read weaker).
>
> I'm thinking that it may actually be easier to do this via file paths
> instead of file descriptors. with file paths something like COW or
> zero-copy copy can be done trivially (and the kernel knows the user
> credentials of the program issuing the command and can pass them on to the
> filesystem to see if it's allowed). I don't see how this can be done with
> file descriptors (if all you have is a file descriptor you can truncate
> and write a file, but you don't know all the links to that file so you
> can't reposition that first inode for example).
And how is userspace supposed to protect itself from race conditions?
Just compare:
	fd1 = open(path1);
	if (fstat(fd1) looks fishy)
		abort();
	fd2 = open(path2);
	if (fstat(fd2) looks fishy)
		abort();
	copy(fd1, fd2);

and:

	fd1 = open(path1);
	if (fstat(fd1) looks fishy)
		abort();
	fd2 = open(path2);
	if (fstat(fd2) looks fishy)
		abort();
	copy(path1, path2);
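
For illustration, here is a minimal userspace sketch of the descriptor-based
variant in plain C. The names looks_fishy(), copy_fd() and checked_copy() are
hypothetical, and "fishy" is assumed here to mean "not a regular file"; a real
policy would likely check more. The point is that the checks and the copy act
on the same open descriptors, so renaming or replacing the path after the
check cannot change what gets copied. With copy(path1, path2) the names would
be resolved a second time, after the checks.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical policy check: treat anything that is not a regular file
 * (or that cannot be examined at all) as fishy. */
static int looks_fishy(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == -1)
        return 1;
    return !S_ISREG(st.st_mode);
}

/* Plain read/write loop between two open descriptors.
 * Handles short writes and EINTR. Returns 0 on success, -1 on error. */
static int copy_fd(int infd, int outfd)
{
    char buf[4096];
    ssize_t nr;

    while ((nr = read(infd, buf, sizeof buf)) != 0) {
        char *p = buf;

        if (nr == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        while (nr > 0) {
            ssize_t nw = write(outfd, p, (size_t)nr);

            if (nw == -1) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            p += nw;
            nr -= nw;
        }
    }
    return 0;
}

/* Open both files, check the descriptors, then copy through the very same
 * descriptors that were checked. Returns 0 on success, -1 on error. */
static int checked_copy(const char *src, const char *dst)
{
    int in, out, rc;

    in = open(src, O_RDONLY);
    if (in == -1)
        return -1;
    if (looks_fishy(in)) {
        close(in);
        return -1;
    }
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out == -1) {
        close(in);
        return -1;
    }
    rc = looks_fishy(out) ? -1 : copy_fd(in, out);
    close(in);
    if (close(out) == -1)
        rc = -1;
    return rc;
}
```

A caller would use checked_copy("a.txt", "b.txt") and treat -1 as failure.
Metadata, ACLs and extended attributes are not copied by this sketch.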
Jörn
--
Don't worry about people stealing your ideas. If your ideas are any good,
you'll have to ram them down people's throats.
-- Howard Aiken quoted by Ken Iverson quoted by Jim Horning quoted by
Raph Levien, 1979
On Thu, 27 Nov 2003, Jörn Engel wrote:
> On Thu, 27 November 2003 01:50:46 -0800, David Lang wrote:
> > >
> > > I don't think it should do any linking / unlinking it should just work
> > > with file descriptors. Concurrent writes to a file don't have many
> > > guarantees. sys_copy shouldn't have to be any stronger (read weaker).
> >
> > I'm thinking that it may actually be easier to do this via file paths
> > instead of file descriptors. With file paths something like COW or
> > zero-copy copy can be done trivially (and the kernel knows the user
> > credentials of the program issuing the command and can pass them on to the
> > filesystem to see if it's allowed). I don't see how this can be done with
> > file descriptors (if all you have is a file descriptor you can truncate
> > and write a file, but you don't know all the links to that file so you
> > can't reposition that first inode for example).
>
> And how is userspace supposed to protect itself from race conditions?
> Just compare:
>
> 	fd1 = open(path1);
> 	if (fstat(fd1) looks fishy)
> 		abort();
> 	fd2 = open(path2);
> 	if (fstat(fd2) looks fishy)
> 		abort();
> 	copy(fd1, fd2);
>
> and:
>
> 	fd1 = open(path1);
> 	if (fstat(fd1) looks fishy)
> 		abort();
> 	fd2 = open(path2);
> 	if (fstat(fd2) looks fishy)
> 		abort();
> 	copy(path1, path2);
>
> Jörn
>
OK, good point. My first reaction is to make copy refuse to function
unless the target doesn't exist (protect the output), but that doesn't
solve the problem of protecting the input or preventing someone else from
tampering with the output (unless you have copy return the FD to use to
access the output).
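
As a rough userspace analogue of "refuse unless the target doesn't exist, and
hand back the FD", open(2) with O_CREAT | O_EXCL already behaves this way. The
sketch below only illustrates the idea and is not a proposed syscall
interface; create_target_exclusive() is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create the copy target only if no file of that name exists yet.
 * Returns a descriptor for the freshly created file, or -1 with errno
 * set (EEXIST if the name is already taken). The caller keeps using the
 * returned descriptor, so later writes go to the file that was created
 * here even if the name is re-pointed afterwards. */
static int create_target_exclusive(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```

The first call for a given name succeeds; a second call for the same name
fails with EEXIST until the file is removed. This protects the output name at
creation time only and does nothing for the input side.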
Actually, thinking about it a bit more: did I make a stupid mistake and
think that the FD points at the beginning of the file when it really
points at the inode? If it points at the inode then the problems I was
referring to don't exist.
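
A small check of that point, as a sketch: a descriptor refers to the open
file (and through it the inode), not to the name it was opened by, so it keeps
pointing at the same inode after the path is renamed. The function name
fd_survives_rename() is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>      /* rename() */
#include <sys/stat.h>
#include <unistd.h>

/* Open (creating if needed) `path`, rename it to `new_path` while the
 * descriptor is still open, and compare what fstat() reports before and
 * after. Returns 1 if the descriptor still refers to the same inode,
 * 0 if it does not, -1 on error. */
static int fd_survives_rename(const char *path, const char *new_path)
{
    struct stat before, after;
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd == -1)
        return -1;
    if (fstat(fd, &before) == -1 ||
        rename(path, new_path) == -1 ||
        fstat(fd, &after) == -1) {
        close(fd);
        return -1;
    }
    close(fd);
    return before.st_dev == after.st_dev && before.st_ino == after.st_ino;
}
```

On a POSIX filesystem this is expected to return 1: the name changed, the
inode behind the descriptor did not.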
David Lang