2004-09-07 12:13:18

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 1/4] copyfile: generic_sendpage

Set Linus on Cc:, since you quoted him.

On Sat, 4 September 2004 15:39:02 -0700, Andrew Morton wrote:
>
> I discussed file->file sendfile with Linus a while back and he said
>
> > I think it was about doing a 2GB file-file sendfile, and see your system
> > grind to a halt without being able to kill it.
> >
> > That said, we have some of the same problems with just regular read/write
> > too. sendfile just makes it easier.
> >
> > We should probably make read/write be interruptible by _fatal_ signals.
> > It would require a new task state, though (TASK_KILLABLE or something, and
> > it would show up as a normal 'D' state).
>
> I don't know how much of a problem this is in practice - there are all
> sorts of nasty things which unprivileged apps can do to the system by
> overloading filesystems. Although most of them can be killed off by the
> sysadmin.

My solution is simple. Run sendfile() in a loop of 4k chunks and
check for reschedule and signal after each chunk. Not sure if the
reschedule is needed but it shouldn't hurt much either.

The other three patches were updated accorting to Christophs comments.

J?rn

--
But this is not to say that the main benefit of Linux and other GPL
software is lower-cost. Control is the main benefit--cost is secondary.
-- Bruce Perens


o Add sendfile() support for file targets to normal mm/filemap.c.
o Have ext[23] use that support.

Signed-off-by: J?rn Engel <[email protected]>
---

fs/ext2/file.c | 1
fs/ext3/file.c | 1
include/linux/fs.h | 1
mm/filemap.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 219 insertions(+)


--- linux-2.6.8cow/include/linux/fs.h~generic_sendpage 2004-09-04 22:59:13.000000000 +0200
+++ linux-2.6.8cow/include/linux/fs.h 2004-09-05 11:58:21.000000000 +0200
@@ -1417,6 +1417,7 @@
ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
unsigned long nr_segs, loff_t *ppos);
extern ssize_t generic_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *);
+extern ssize_t generic_file_sendpage(struct file *, struct page *, int, size_t, loff_t *, int);
extern void do_generic_mapping_read(struct address_space *mapping,
struct file_ra_state *, struct file *,
loff_t *, read_descriptor_t *, read_actor_t);
--- linux-2.6.8cow/mm/filemap.c~generic_sendpage 2004-09-04 22:59:14.000000000 +0200
+++ linux-2.6.8cow/mm/filemap.c 2004-09-04 23:04:38.000000000 +0200
@@ -985,6 +985,31 @@
return written;
}

+/* FIXME: It would be as simple as this, if we had a (void __user*) to write.
+ * We already have a kernel buffer, so it should be even simpler, right? ;)
+ *
+ * Yes, sorta. After duplicating the complete path of generic_file_write(),
+ * at least some special cases could be removed, so the copy is simpler than
+ * the original. But it remains a copy, so overall complexity increases.
+ */
+static ssize_t
+generic_kernel_file_write(struct file *, const char *, size_t, loff_t *);
+
+ssize_t generic_file_sendpage(struct file *file, struct page *page,
+ int offset, size_t size, loff_t *ppos, int more)
+{
+ ssize_t ret;
+ char *kaddr;
+
+ kaddr = kmap(page);
+ ret = generic_kernel_file_write(file, kaddr + offset, size, ppos);
+ kunmap(page);
+
+ return ret;
+}
+
+EXPORT_SYMBOL(generic_file_sendpage);
+
ssize_t generic_file_sendfile(struct file *in_file, loff_t *ppos,
size_t count, read_actor_t actor, void *target)
{
@@ -1642,6 +1667,19 @@
return bytes - left;
}

+static inline size_t
+filemap_copy_from_kernel(struct page *page, unsigned long offset,
+ const char *buf, unsigned bytes)
+{
+ char *kaddr;
+
+ kaddr = kmap(page);
+ memcpy(kaddr + offset, buf, bytes);
+ kunmap(page);
+
+ return bytes;
+}
+
static size_t
__filemap_copy_from_user_iovec(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
@@ -2016,6 +2054,155 @@

EXPORT_SYMBOL(generic_file_aio_write_nolock);

+/*
+ * TODO:
+ * This largely tries to copy generic_file_aio_write_nolock(), although it
+ * doesn't have to be nearly as generic. A real cleanup should either
+ * merge this into generic_file_aio_write_nolock() as well or keep it special
+ * and remove as much code as possible.
+ */
+static ssize_t
+generic_kernel_file_aio_write_nolock(struct kiocb *iocb, const struct iovec*iov,
+ unsigned long nr_segs, loff_t *ppos)
+{
+ struct file *file = iocb->ki_filp;
+ struct address_space * mapping = file->f_mapping;
+ struct address_space_operations *a_ops = mapping->a_ops;
+ size_t ocount; /* original count */
+ size_t count; /* after file limit checks */
+ struct inode *inode = mapping->host;
+ long status = 0;
+ loff_t pos;
+ struct page *page;
+ struct page *cached_page = NULL;
+ const int isblk = S_ISBLK(inode->i_mode);
+ ssize_t written;
+ ssize_t err;
+ size_t bytes;
+ struct pagevec lru_pvec;
+ const struct iovec *cur_iov = iov; /* current iovec */
+ size_t iov_base = 0; /* offset in the current iovec */
+ unsigned long seg;
+ char *buf;
+
+ ocount = 0;
+ for (seg = 0; seg < nr_segs; seg++) {
+ const struct iovec *iv = &iov[seg];
+
+ /*
+ * If any segment has a negative length, or the cumulative
+ * length ever wraps negative then return -EINVAL.
+ */
+ ocount += iv->iov_len;
+ if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
+ return -EINVAL;
+ }
+
+ count = ocount;
+ pos = *ppos;
+ pagevec_init(&lru_pvec, 0);
+
+ /* We can write back this queue in page reclaim */
+ current->backing_dev_info = mapping->backing_dev_info;
+ written = 0;
+
+ err = generic_write_checks(file, &pos, &count, isblk);
+ if (err)
+ goto out;
+
+
+ if (count == 0)
+ goto out;
+
+ remove_suid(file->f_dentry);
+ inode_update_time(inode, 1);
+
+ /* There is no sane reason to use O_DIRECT */
+ BUG_ON(file->f_flags & O_DIRECT);
+
+ buf = (char *)iov->iov_base;
+ do {
+ unsigned long index;
+ unsigned long offset;
+ size_t copied;
+
+ offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+ index = pos >> PAGE_CACHE_SHIFT;
+ bytes = PAGE_CACHE_SIZE - offset;
+ if (bytes > count)
+ bytes = count;
+
+ page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+ if (!page) {
+ status = -ENOMEM;
+ break;
+ }
+
+ status = a_ops->prepare_write(file, page, offset, offset+bytes);
+ if (unlikely(status)) {
+ loff_t isize = i_size_read(inode);
+ /*
+ * prepare_write() may have instantiated a few blocks
+ * outside i_size. Trim these off again.
+ */
+ unlock_page(page);
+ page_cache_release(page);
+ if (pos + bytes > isize)
+ vmtruncate(inode, isize);
+ break;
+ }
+
+ BUG_ON(nr_segs != 1);
+ copied = filemap_copy_from_kernel(page, offset, buf, bytes);
+
+ flush_dcache_page(page);
+ status = a_ops->commit_write(file, page, offset, offset+bytes);
+ if (likely(copied > 0)) {
+ if (!status)
+ status = copied;
+
+ if (status >= 0) {
+ written += status;
+ count -= status;
+ pos += status;
+ buf += status;
+ if (unlikely(nr_segs > 1))
+ filemap_set_next_iovec(&cur_iov,
+ &iov_base, status);
+ }
+ }
+ if (unlikely(copied != bytes))
+ if (status >= 0)
+ status = -EFAULT;
+ unlock_page(page);
+ mark_page_accessed(page);
+ page_cache_release(page);
+ if (status < 0)
+ break;
+ balance_dirty_pages_ratelimited(mapping);
+ cond_resched();
+ } while (count);
+ *ppos = pos;
+
+ if (cached_page)
+ page_cache_release(cached_page);
+
+ /*
+ * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
+ */
+ if (status >= 0) {
+ if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
+ status = generic_osync_inode(inode, mapping,
+ OSYNC_METADATA|OSYNC_DATA);
+ }
+
+ err = written ? written : status;
+out:
+ pagevec_lru_add(&lru_pvec);
+ current->backing_dev_info = 0;
+ return err;
+}
+
ssize_t
generic_file_write_nolock(struct file *file, const struct iovec *iov,
unsigned long nr_segs, loff_t *ppos)
@@ -2030,6 +2217,20 @@
return ret;
}

+static ssize_t
+generic_kernel_file_write_nolock(struct file *file, const struct iovec *iov,
+ unsigned long nr_segs, loff_t *ppos)
+{
+ struct kiocb kiocb;
+ ssize_t ret;
+
+ init_sync_kiocb(&kiocb, file);
+ ret = generic_kernel_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos);
+ if (ret == -EIOCBQUEUED)
+ ret = wait_on_sync_kiocb(&kiocb);
+ return ret;
+}
+
EXPORT_SYMBOL(generic_file_write_nolock);

ssize_t generic_file_aio_write(struct kiocb *iocb, const char __user *buf,
@@ -2068,6 +2269,21 @@

EXPORT_SYMBOL(generic_file_write);

+static ssize_t generic_kernel_file_write(struct file *file, const char *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file->f_mapping->host;
+ ssize_t err;
+ struct iovec local_iov = {.iov_base = (void __user *)buf,
+ .iov_len = count };
+
+ down(&inode->i_sem);
+ err = generic_kernel_file_write_nolock(file, &local_iov, 1, ppos);
+ up(&inode->i_sem);
+
+ return err;
+}
+
ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
unsigned long nr_segs, loff_t *ppos)
{
--- linux-2.6.8cow/fs/ext2/file.c~generic_sendpage 2004-06-16 07:19:13.000000000 +0200
+++ linux-2.6.8cow/fs/ext2/file.c 2004-09-04 23:04:38.000000000 +0200
@@ -53,6 +53,7 @@
.readv = generic_file_readv,
.writev = generic_file_writev,
.sendfile = generic_file_sendfile,
+ .sendpage = generic_file_sendpage,
};

struct inode_operations ext2_file_inode_operations = {
--- linux-2.6.8cow/fs/ext3/file.c~generic_sendpage 2004-06-16 07:20:04.000000000 +0200
+++ linux-2.6.8cow/fs/ext3/file.c 2004-09-04 23:04:38.000000000 +0200
@@ -127,6 +127,7 @@
.release = ext3_release_file,
.fsync = ext3_sync_file,
.sendfile = generic_file_sendfile,
+ .sendpage = generic_file_sendpage,
};

struct inode_operations ext3_file_inode_operations = {


2004-09-07 12:13:56

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 3/4] copyfile: sendfile_loop

The new loop patch. Tested on my machine, appears to work.

J?rn

--
But this is not to say that the main benefit of Linux and other GPL
software is lower-cost. Control is the main benefit--cost is secondary.
-- Bruce Perens

Linus and Andrew are rightfully concerned about local DoS via a large
file->file sendfile(). This patch turns large sendfile() calls into a
loop of 4k chunks. After each chunk, it adds a cond_resched for
interactivity and a signal check to allow aborts etc. after the user
found out what a bad idea this may be.

Signed-off-by: J?rn Engel <[email protected]>
---

read_write.c | 31 ++++++++++++++++++++++++++++++-
1 files changed, 30 insertions(+), 1 deletion(-)


--- linux-2.6.8cow/fs/read_write.c~sendfile_loop 2004-09-05 12:06:39.000000000 +0200
+++ linux-2.6.8cow/fs/read_write.c 2004-09-07 13:27:30.000000000 +0200
@@ -561,6 +561,35 @@
return ret;
}

+/**
+ * sendfile() of a 2GB file over usb1-attached hard drives can take a moment.
+ * This little loop is supposed to stop now and then to check for signals,
+ * reschedule and generally play nice with others.
+ */
+ssize_t inline __vfs_sendfile(struct file *in_file, loff_t *ppos, size_t count,
+ read_actor_t actor, struct file *out_file)
+{
+ ssize_t done = 0, ret;
+ while (count) {
+ size_t n = min(count, (size_t)4096);
+ ret = in_file->f_op->sendfile(in_file, ppos, n, actor,out_file);
+ if (ret < 0) {
+ if (done)
+ return done;
+ else
+ return ret;
+ }
+
+ done += ret;
+ count -= ret;
+
+ if ((ret == 0) || signal_pending(current))
+ break;
+ cond_resched();
+ }
+ return done;
+}
+
ssize_t vfs_sendfile(struct file *out_file, struct file *in_file, loff_t *ppos,
size_t count, loff_t max)
{
@@ -608,7 +637,7 @@
count = max - pos;
}

- ret = in_file->f_op->sendfile(in_file, ppos, count, file_send_actor, out_file);
+ ret = __vfs_sendfile(in_file, ppos, count, file_send_actor, out_file);

if (*ppos > max)
return -EOVERFLOW;

2004-09-07 12:17:09

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 2/4] copyfile: sendfile


J?rn

--
Geld macht nicht gl?cklich.
Gl?ck macht nicht satt.

Creates vfs_sendfile(), which can be called from other places within
the kernel. Such other places include copyfile() and cowlinks.

In principle, this just moves code from do_sendfile() to
vfs_sendfile(). On top of that, it removes the "if (!in_inode)"
check. The same never existed for out_inode and Christoph Hellweg
confirmed that it is superfluous.

Signed-off-by: J?rn Engel <[email protected]>
---

fs/read_write.c | 103 +++++++++++++++++++++++++++++------------------------
include/linux/fs.h | 1
2 files changed, 58 insertions(+), 46 deletions(-)


--- linux-2.6.8cow/fs/read_write.c~sendfile 2004-09-05 12:02:46.000000000 +0200
+++ linux-2.6.8cow/fs/read_write.c 2004-09-05 12:06:39.000000000 +0200
@@ -561,12 +561,66 @@
return ret;
}

+ssize_t vfs_sendfile(struct file *out_file, struct file *in_file, loff_t *ppos,
+ size_t count, loff_t max)
+{
+ struct inode * in_inode, * out_inode;
+ loff_t pos;
+ ssize_t ret;
+
+ /* verify in_file */
+ in_inode = in_file->f_dentry->d_inode;
+ if (!in_file->f_op || !in_file->f_op->sendfile)
+ return -EINVAL;
+
+ if (!ppos)
+ ppos = &in_file->f_pos;
+ else
+ if (!(in_file->f_mode & FMODE_PREAD))
+ return -ESPIPE;
+
+ ret = locks_verify_area(FLOCK_VERIFY_READ, in_inode, in_file, *ppos, count);
+ if (ret)
+ return ret;
+
+ /* verify out_file */
+ out_inode = out_file->f_dentry->d_inode;
+ if (!out_file->f_op || !out_file->f_op->sendpage)
+ return -EINVAL;
+
+ ret = locks_verify_area(FLOCK_VERIFY_WRITE, out_inode, out_file, out_file->f_pos, count);
+ if (ret)
+ return ret;
+
+ ret = security_file_permission (out_file, MAY_WRITE);
+ if (ret)
+ return ret;
+
+ if (!max)
+ max = min(in_inode->i_sb->s_maxbytes, out_inode->i_sb->s_maxbytes);
+
+ pos = *ppos;
+ if (unlikely(pos < 0))
+ return -EINVAL;
+ if (unlikely(pos + count > max)) {
+ if (pos >= max)
+ return -EOVERFLOW;
+ count = max - pos;
+ }
+
+ ret = in_file->f_op->sendfile(in_file, ppos, count, file_send_actor, out_file);
+
+ if (*ppos > max)
+ return -EOVERFLOW;
+ return ret;
+}
+
+EXPORT_SYMBOL(vfs_sendfile);
+
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
size_t count, loff_t max)
{
struct file * in_file, * out_file;
- struct inode * in_inode, * out_inode;
- loff_t pos;
ssize_t retval;
int fput_needed_in, fput_needed_out;

@@ -579,21 +633,6 @@
goto out;
if (!(in_file->f_mode & FMODE_READ))
goto fput_in;
- retval = -EINVAL;
- in_inode = in_file->f_dentry->d_inode;
- if (!in_inode)
- goto fput_in;
- if (!in_file->f_op || !in_file->f_op->sendfile)
- goto fput_in;
- retval = -ESPIPE;
- if (!ppos)
- ppos = &in_file->f_pos;
- else
- if (!(in_file->f_mode & FMODE_PREAD))
- goto fput_in;
- retval = locks_verify_area(FLOCK_VERIFY_READ, in_inode, in_file, *ppos, count);
- if (retval)
- goto fput_in;

retval = security_file_permission (in_file, MAY_READ);
if (retval)
@@ -608,36 +647,8 @@
goto fput_in;
if (!(out_file->f_mode & FMODE_WRITE))
goto fput_out;
- retval = -EINVAL;
- if (!out_file->f_op || !out_file->f_op->sendpage)
- goto fput_out;
- out_inode = out_file->f_dentry->d_inode;
- retval = locks_verify_area(FLOCK_VERIFY_WRITE, out_inode, out_file, out_file->f_pos, count);
- if (retval)
- goto fput_out;

- retval = security_file_permission (out_file, MAY_WRITE);
- if (retval)
- goto fput_out;
-
- if (!max)
- max = min(in_inode->i_sb->s_maxbytes, out_inode->i_sb->s_maxbytes);
-
- pos = *ppos;
- retval = -EINVAL;
- if (unlikely(pos < 0))
- goto fput_out;
- if (unlikely(pos + count > max)) {
- retval = -EOVERFLOW;
- if (pos >= max)
- goto fput_out;
- count = max - pos;
- }
-
- retval = in_file->f_op->sendfile(in_file, ppos, count, file_send_actor, out_file);
-
- if (*ppos > max)
- retval = -EOVERFLOW;
+ retval = vfs_sendfile(out_file, in_file, ppos, count, max);

fput_out:
fput_light(out_file, fput_needed_out);
--- linux-2.6.8cow/include/linux/fs.h~sendfile 2004-09-05 12:02:46.000000000 +0200
+++ linux-2.6.8cow/include/linux/fs.h 2004-09-05 12:06:51.000000000 +0200
@@ -928,6 +928,7 @@
unsigned long, loff_t *);
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
+ssize_t vfs_sendfile(struct file *, struct file *, loff_t *, size_t, loff_t);

/*
* NOTE: write_inode, delete_inode, clear_inode, put_inode can be called

2004-09-07 12:19:02

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 4/4] copyfile: copyfile

Again, the syscall itself may be a stupid idea, but Steve indicated
interest for cifs. I'll hide behind his back and let him fight for
it. ;)

J?rn

--
Those who come seeking peace without a treaty are plotting.
-- Sun Tzu

Adds a new syscall, copyfile() which does as the name sais.

This syscall is a preparation for real cowlinks, but it might make sense on
it's own as well. For cowlinks, the real copy is not done until the first
write (or similar operation) occurs, but at that point, both operations
should be the same.

Signed-off-by: J?rn Engel <[email protected]>
---

arch/i386/kernel/entry.S | 1
fs/namei.c | 142 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 2
3 files changed, 145 insertions(+)


--- linux-2.6.8cow/arch/i386/kernel/entry.S~copyfile 2004-09-05 11:57:15.000000000 +0200
+++ linux-2.6.8cow/arch/i386/kernel/entry.S 2004-09-05 11:58:00.000000000 +0200
@@ -886,5 +886,6 @@
.long sys_mq_notify
.long sys_mq_getsetattr
.long sys_ni_syscall /* reserved for kexec */
+ .long sys_copyfile

syscall_table_size=(.-sys_call_table)
--- linux-2.6.8cow/fs/namei.c~copyfile 2004-09-05 11:57:15.000000000 +0200
+++ linux-2.6.8cow/fs/namei.c 2004-09-05 11:58:15.000000000 +0200
@@ -2220,6 +2220,148 @@
return error;
}

+/* note how we open() two struct file *, just to do the sendfile(). this
+ * should not be necessary, only nfs, smbfs and cifs depend on the struct
+ * file *, while afs and all local filesystems can do without.
+ * but changing all interfaces and fixing up three filesystems is a little
+ * too much work for now.
+ */
+#include <linux/file.h> /* for fput */
+static int copy_data(struct dentry *old_dentry, struct vfsmount *old_mnt,
+ struct dentry *new_dentry, struct vfsmount *new_mnt)
+{
+ int ret;
+ loff_t size;
+ ssize_t n;
+ struct file *old_file;
+ struct file *new_file;
+
+ dget(old_dentry);
+ mntget(old_mnt);
+ old_file = dentry_open(old_dentry, old_mnt, O_RDONLY);
+ if (IS_ERR(old_file))
+ return PTR_ERR(old_file);
+
+ dget(new_dentry);
+ mntget(new_mnt);
+ new_file = dentry_open(new_dentry, new_mnt, O_WRONLY);
+ ret = PTR_ERR(new_file);
+ if (IS_ERR(new_file))
+ goto out1;
+
+ size = i_size_read(old_file->f_dentry->d_inode);
+ if (((size_t)size != size) || ((ssize_t)size != size)) {
+ ret = -EFBIG;
+ goto out2;
+ }
+ n = vfs_sendfile(new_file, old_file, NULL, size, 0);
+ if (n == size)
+ ret = 0;
+ else if (n < 0)
+ ret = n;
+ else
+ ret = -ENOSPC;
+
+out2:
+ fput(new_file);
+out1:
+ fput(old_file);
+ return ret;
+}
+
+int vfs_copyfile(struct nameidata *old_nd, struct nameidata *new_nd,
+ struct dentry *new_dentry, int mode)
+{
+ int ret;
+
+ if (!old_nd->dentry->d_inode)
+ return -ENOENT;
+ if (!S_ISREG(old_nd->dentry->d_inode->i_mode))
+ return -EINVAL;
+ /* FIXME: replace with proper permission check */
+ if (new_dentry->d_inode)
+ return -EEXIST;
+
+ if (mode == -1)
+ mode = old_nd->dentry->d_inode->i_mode;
+
+ ret = vfs_create(new_nd->dentry->d_inode, new_dentry, mode, new_nd);
+ if (ret)
+ return ret;
+
+ ret = copy_data(old_nd->dentry, old_nd->mnt, new_dentry, new_nd->mnt);
+
+ if (ret) {
+ int error = vfs_unlink(new_nd->dentry->d_inode, new_dentry);
+ BUG_ON(error);
+ /* FIXME: not sure if there are return value we should not BUG()
+ * on */
+ }
+ return ret;
+}
+
+int do_copyfile(const char *from, const char *to, int mode)
+{
+ int ret;
+ struct dentry *new_dentry;
+ struct nameidata old_nd, new_nd;
+
+ ret = path_lookup(from, 0, &old_nd);
+ if (ret)
+ return ret;
+
+ ret = path_lookup(to, LOOKUP_PARENT|LOOKUP_OPEN|LOOKUP_CREATE, &new_nd);
+ if (ret)
+ goto out1;
+
+ new_dentry = lookup_create(&new_nd, 0);
+ ret = PTR_ERR(new_dentry);
+ if (IS_ERR(new_dentry))
+ goto out2;
+
+ ret = vfs_copyfile(&old_nd, &new_nd, new_dentry, mode);
+
+ dput(new_dentry);
+ up(&new_nd.dentry->d_inode->i_sem);
+out2:
+ path_release(&new_nd);
+out1:
+ path_release(&old_nd);
+ return ret;
+}
+
+/* copyfile() - create a new file and copy the old one's contents into it, all
+ * inside the kernel
+ *
+ * TODO:
+ * quota
+ * notification
+ * security
+ * unlock directory during data copy
+ * loop for data copy
+ */
+int sys_copyfile(const char __user *ufrom, const char __user *uto, umode_t mode)
+{
+ char *from, *to;
+ int ret;
+
+ from = getname(ufrom);
+ if (IS_ERR(from))
+ return PTR_ERR(from);
+
+ to = getname(uto);
+ ret = PTR_ERR(to);
+ if (IS_ERR(to))
+ goto err;
+
+ ret = do_copyfile(from, to, mode);
+
+ putname(to);
+err:
+ putname(from);
+ return ret;
+}
+
int vfs_readlink(struct dentry *dentry, char __user *buffer, int buflen, const char *link)
{
int len;
--- linux-2.6.8cow/include/linux/syscalls.h~copyfile 2004-09-05 11:57:15.000000000 +0200
+++ linux-2.6.8cow/include/linux/syscalls.h 2004-09-05 11:58:00.000000000 +0200
@@ -283,6 +283,8 @@
asmlinkage long sys_unlink(const char __user *pathname);
asmlinkage long sys_rename(const char __user *oldname,
const char __user *newname);
+asmlinkage long sys_copyfile(const char __user *from, const char __user *to,
+ umode_t mode);
asmlinkage long sys_chmod(const char __user *filename, mode_t mode);
asmlinkage long sys_fchmod(unsigned int fd, mode_t mode);

2004-09-07 14:06:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 4/4] copyfile: copyfile



On Tue, 7 Sep 2004, J?rn Engel wrote:
>
> Again, the syscall itself may be a stupid idea, but Steve indicated
> interest for cifs. I'll hide behind his back and let him fight for
> it. ;)

Well, this isn't useful for cifs.

For cifs to be able to use it, the "copyfile()" interface needs to
basically just be a pathname operation (ie a "dir->i_op->copy()"), not a
"struct file" operation. It's more like the VFS "->rename()" or "->link"
operations, in other words. And it should return -EXDEV the same way
rename returns EXDEV if the files aren't on the same filesystem.

Then you could (and should) make a "generic_file_copy()" function that
takes that pathname format, and then uses sendfile() to do the copy for
regular disk-based filesystems.

I think you should be able to copy the "sys_link()" code for almost all of
the top-level stuff. The only real difference being

- error = dir->i_op->link(old_dentry, dir, new_dentry);
+ error = dir->i_op->copy(old_dentry, dir, new_dentry);

or something.

And no, I don't know how to handle interruptability. I think the right
answer may be that filesystems that don't support this as a "native op"
and can't do it quickly should just return an error, and then users can
copy their multi-gigabyte files by hand, like they used to.

So if we do this, we do this _right_. We also make sure that we error out
"too much" rather than "too little", so that people don't start depending
on behaviour that we don't want them to depend on.

Linus

2004-09-07 14:19:50

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [PATCH 4/4] copyfile: copyfile

On Tue, 2004-09-07 at 15:06, Linus Torvalds wrote:
> On Tue, 7 Sep 2004, Jörn Engel wrote:
> >
> > Again, the syscall itself may be a stupid idea, but Steve indicated
> > interest for cifs. I'll hide behind his back and let him fight for
> > it. ;)
>
> Well, this isn't useful for cifs.
>
> For cifs to be able to use it, the "copyfile()" interface needs to
> basically just be a pathname operation (ie a "dir->i_op->copy()"), not a
> "struct file" operation. It's more like the VFS "->rename()" or "->link"
> operations, in other words. And it should return -EXDEV the same way
> rename returns EXDEV if the files aren't on the same filesystem.

Indeed. A pathname based operation would be useful for any fs with
"added features". For example, on NTFS it could be used to preserve
named streams and extended attributes which would otherwise be lost.
Mind you, the current NTFS driver cannot create files yet so it will be
a while until such a "copyfile()" is useful there...

> Then you could (and should) make a "generic_file_copy()" function that
> takes that pathname format, and then uses sendfile() to do the copy for
> regular disk-based filesystems.
>
> I think you should be able to copy the "sys_link()" code for almost all of
> the top-level stuff. The only real difference being
>
> - error = dir->i_op->link(old_dentry, dir, new_dentry);
> + error = dir->i_op->copy(old_dentry, dir, new_dentry);
>
> or something.
>
> And no, I don't know how to handle interruptability. I think the right
> answer may be that filesystems that don't support this as a "native op"
> and can't do it quickly should just return an error, and then users can
> copy their multi-gigabyte files by hand, like they used to.
>
> So if we do this, we do this _right_. We also make sure that we error out
> "too much" rather than "too little", so that people don't start depending
> on behaviour that we don't want them to depend on.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/, http://www-stu.christs.cam.ac.uk/~aia21/

2004-09-07 15:02:58

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 4/4] copyfile: copyfile

On Tue, 7 September 2004 07:06:00 -0700, Linus Torvalds wrote:
>
> Then you could (and should) make a "generic_file_copy()" function that
> takes that pathname format, and then uses sendfile() to do the copy for
> regular disk-based filesystems.

Does that mean that you're ok with the first three patches?

> I think you should be able to copy the "sys_link()" code for almost all of
> the top-level stuff. The only real difference being
>
> - error = dir->i_op->link(old_dentry, dir, new_dentry);
> + error = dir->i_op->copy(old_dentry, dir, new_dentry);
>
> or something.

ok.

> And no, I don't know how to handle interruptability.

How about the loop in 3/4?

> I think the right
> answer may be that filesystems that don't support this as a "native op"
> and can't do it quickly should just return an error, and then users can
> copy their multi-gigabyte files by hand, like they used to.
>
> So if we do this, we do this _right_. We also make sure that we error out
> "too much" rather than "too little", so that people don't start depending
> on behaviour that we don't want them to depend on.

Makes sense, as long as any disk filesystem can decide to add a
one-liner like
.copy = generic_file_copy;
to their operations and explicitly allow this. Cowlinks remain my pet
project, even if I get distracted by Real Work sometimes. Without
some form of in-kernel copy, they'd be dead in the water.

J?rn

--
Fantasy is more important than knowledge. Knowledge is limited,
while fantasy embraces the whole world.
-- Albert Einstein

2004-09-07 15:26:46

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 4/4] copyfile: copyfile

On Tue, 7 September 2004 07:59:11 -0700, Linus Torvalds wrote:
> On Tue, 7 Sep 2004, J?rn Engel wrote:
> >
> > Does that mean that you're ok with the first three patches?
>
> No, it means that they weren't fundamentally flawed..

It's a start...

> Actually, the 4kB batching one was - if you only max out to using 4kB at a
> time, sendfile() is kind of pointless, because then it will never do
> multi-page copies in the first place, and all the complexity at a lower
> level is worthless..

Give me a better number. 16k? 1M? Or would it not be fundamentally
flawed if the unit was seconds, instead of bytes? That makes a lot
more sense, since a floppy and a Ultra320 RAID array differ slightly
in speed and it's response time the users actually care about.

J?rn

--
/* Keep these two variables together */
int bar;

2004-09-07 15:32:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 4/4] copyfile: copyfile



On Tue, 7 Sep 2004, J?rn Engel wrote:
>
> > Actually, the 4kB batching one was - if you only max out to using 4kB at a
> > time, sendfile() is kind of pointless, because then it will never do
> > multi-page copies in the first place, and all the complexity at a lower
> > level is worthless..
>
> Give me a better number. 16k? 1M? Or would it not be fundamentally
> flawed if the unit was seconds, instead of bytes? That makes a lot
> more sense, since a floppy and a Ultra320 RAID array differ slightly
> in speed and it's response time the users actually care about.

Well, you can't do it by seconds. What you _can_ do is to just make the
fundamental page-cache sendfile thing check for interruptible, and just do
it at a lower level. Maybe.

Linus

2004-09-07 15:41:33

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 1/4] copyfile: generic_sendpage


> On Sat, 4 September 2004 15:39:02 -0700, Andrew Morton wrote:
> >
> > I discussed file->file sendfile with Linus a while back and he said
> >
> > > We should probably make read/write be interruptible by _fatal_ signals.
> > > It would require a new task state, though (TASK_KILLABLE or something, and
> > > it would show up as a normal 'D' state).

But a function can be called from many places... how will it know which
of TASK_{INTERRUPTIBLE,UNINTERRUPTIBLE,KILLABLE} to use? It's bad enough
already, with functions which are _sometimes_ called from an
uninterruptible call path using TASK_UNINTERRUPTIBLE even when they are
called from a different call path which _could_ handle being
interrupted.

I wonder if we should handle interruptibility (and killability) in the
same was as we handle preemptability -- that is, keep a count in the
task structure. Increase it by one when you can't be
{interrupted,killed,preempted} and it gets observed by all the functions
you call until the count is decremented all the way to zero.

So upon signal delivery you look at task->uninterruptible_count and
queue the signal if it's non-zero. Or if it's a fatal signal you look at
task->unkillable_count.

--
dwmw2

2004-09-07 16:07:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 4/4] copyfile: copyfile



On Tue, 7 Sep 2004, J?rn Engel wrote:

> On Tue, 7 September 2004 07:06:00 -0700, Linus Torvalds wrote:
> >
> > Then you could (and should) make a "generic_file_copy()" function that
> > takes that pathname format, and then uses sendfile() to do the copy for
> > regular disk-based filesystems.
>
> Does that mean that you're ok with the first three patches?

No, it means that they weren't fundamentally flawed..

Actually, the 4kB batching one was - if you only max out to using 4kB at a
time, sendfile() is kind of pointless, because then it will never do
multi-page copies in the first place, and all the complexity at a lower
level is worthless..

Linus

2004-09-08 17:13:47

by Steve French (smfltc)

[permalink] [raw]
Subject: Re: [PATCH 4/4] copyfile: copyfile

On Tue, 2004-09-07 at 09:06, Linus Torvalds wrote:
> For cifs to be able to use it, the "copyfile()" interface needs to
> basically just be a pathname operation (ie a "dir->i_op->copy()"), not a
> "struct file" operation. It's more like the VFS "->rename()" or "->link"
> operations, in other words. And it should return -EXDEV the same way
> rename returns EXDEV if the files aren't on the same filesystem.
>
> I think you should be able to copy the "sys_link()" code for almost all of
> the top-level stuff. The only real difference being
>
> - error = dir->i_op->link(old_dentry, dir, new_dentry);
> + error = dir->i_op->copy(old_dentry, dir, new_dentry);
>
> answer may be that filesystems that don't support this as a "native op"
> and can't do it quickly should just return an error, and then users can
> copy their multi-gigabyte files by hand, like they used to.

Yes - the CIFS protocol operation for copy requires syntax similar to
the link or rename examples. The protocol operation for CIFS copy
requires:
1) a source filename (and a tree identifier - which identifies which
server export ie share the path is relative to)
2) a target filename (and target tree identifier)
3) source and target on the same server - The client filesystem can
enforce the last requirement that the source and target be on the same
server (although some servers also will reject it if the source and
target are not on the same filesystem on the same server) by simply
returning EXDEV to the VFS if the source and target are on different
servers.

Using a similar i_op to the link i_op contains sufficient information
for me to turn on the CIFSSMBCopy function in the fs/cifs code.

There are a couple of obvious copy options which probably could be
mapped onto the cifs network protocol but are not required.
4)Something like a "FAIL_IF_EXISTS" flag (do not copy the file if the
target file exists)
5) a "copy subtree" flag indicating whether the copy is for the
file/directory or for the whole subtree.

As in link, returning an error when the filesystem can not perform the
operation should be expected by the layers above.

Defining a handle (file struct) based copy operation is not as useful.

I believe that as a result of the copy request that most CIFS servers
typically would copy file attributes (such as in the case of cifs
servers the dos attributes) and extended attributes (xattrs, and NTFS
alternate data streams if the server supports them) but ACLs for the
existing file would not be copied to the new file. That seems like
reasonable default common sense semantics. Allowing the client to
interrupt a network file copy that has been submitted to the server will
not always be possible, but it is at least easier to preserve atomicity
in the case of a single copy operation going over the network than it
would be if the copy is broken apart by the client and as individual
open/read/close open/write/close



2004-09-09 15:04:36

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [PATCH 4/4] copyfile: copyfile

On Tue, Sep 07, 2004 at 05:21:18PM +0200, J?rn Engel wrote:
> On Tue, 7 September 2004 07:59:11 -0700, Linus Torvalds wrote:
> > On Tue, 7 Sep 2004, J?rn Engel wrote:
> > >
> > > Does that mean that you're ok with the first three patches?
> >
> > No, it means that they weren't fundamentally flawed..
>
> It's a start...
>
> > Actually, the 4kB batching one was - if you only max out to using 4kB at a
> > time, sendfile() is kind of pointless, because then it will never do
> > multi-page copies in the first place, and all the complexity at a lower
> > level is worthless..
>
> Give me a better number. 16k? 1M? Or would it not be fundamentally
> flawed if the unit was seconds, instead of bytes? That makes a lot
> more sense, since a floppy and a Ultra320 RAID array differ slightly
> in speed and it's response time the users actually care about.

maybe the 'block/buffer' size could be passed with the
syscall (maybe with a kernel default), allowing dumb tools
to use a fixed value and smart ones, exactly what the user
desires ...

best,
Herbert

> J?rn
>
> --
> /* Keep these two variables together */
> int bar;
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/