2004-06-08 16:12:23

by Andrea Arcangeli

[permalink] [raw]
Subject: downgrade_write replacement in remap_file_pages

Apparently downgrade_write deadlocks the kernel in the mmap_sem
under load. I suspect some rwsem bug. Anyways it's matter of time before
in my tree I replace the rwsem implementation with my spinlock based
common code implementation again that I can understand trivially (unlike
the current code). I don't mind a microoptimization when the code is so
complicated, so I don't mind too much to fix whatever current bug in
downgrade_write.

In the meantime to workaround the deadlock (and to verify if this make
the deadlock go away, which returned a positive result) I implemented
this patch: this doesn't fix downgrade_wite, but I'm posting it here
because I believe it's much better code regardless of the
downgrade_write issue. With this patch we'll never run down_write again
in production, the down_write will be taken only once when the db or the
simulator startup (during the very first page fault) and never again, in
turn providing (at least in theory) better runtime scalability.

Index: linux-2.5/mm/fremap.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/fremap.c,v
retrieving revision 1.24
diff -u -p -r1.24 fremap.c
--- linux-2.5/mm/fremap.c 23 May 2004 05:07:26 -0000 1.24
+++ linux-2.5/mm/fremap.c 8 Jun 2004 15:38:21 -0000
@@ -161,6 +161,7 @@ asmlinkage long sys_remap_file_pages(uns
unsigned long end = start + size;
struct vm_area_struct *vma;
int err = -EINVAL;
+ int has_write_lock = 0;

if (__prot)
return err;
@@ -181,7 +182,8 @@ asmlinkage long sys_remap_file_pages(uns
#endif

/* We need down_write() to change vma->vm_flags. */
- down_write(&mm->mmap_sem);
+ down_read(&mm->mmap_sem);
+ retry:
vma = find_vma(mm, start);

/*
@@ -198,8 +200,13 @@ asmlinkage long sys_remap_file_pages(uns
end <= vma->vm_end) {

/* Must set VM_NONLINEAR before any pages are populated. */
- if (pgoff != linear_page_index(vma, start) &&
- !(vma->vm_flags & VM_NONLINEAR)) {
+ if (unlikely(pgoff != linear_pgoff && !(vma->vm_flags & VM_NONLINEAR))) {
+ if (!has_write_lock) {
+ up_read(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
+ has_write_lock = 1;
+ goto retry;
+ }
mapping = vma->vm_file->f_mapping;
spin_lock(&mapping->i_mmap_lock);
flush_dcache_mmap_lock(mapping);
@@ -212,8 +219,6 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}

- /* ->populate can take a long time, so downgrade the lock. */
- downgrade_write(&mm->mmap_sem);
err = vma->vm_ops->populate(vma, start, size,
vma->vm_page_prot,
pgoff, flags & MAP_NONBLOCK);
@@ -223,10 +228,11 @@ asmlinkage long sys_remap_file_pages(uns
* it after ->populate completes, and that would prevent
* downgrading the lock. (Locks can't be upgraded).
*/
+ }
+ if (likely(!has_write_lock))
up_read(&mm->mmap_sem);
- } else {
+ else
up_write(&mm->mmap_sem);
- }

return err;
}


2004-06-08 16:32:06

by Andrew Morton

[permalink] [raw]
Subject: Re: downgrade_write replacement in remap_file_pages

Andrea Arcangeli <[email protected]> wrote:
>
> Apparently downgrade_write deadlocks the kernel in the mmap_sem
> under load. I suspect some rwsem bug. Anyways it's matter of time before
> in my tree I replace the rwsem implementation with my spinlock based
> common code implementation again that I can understand trivially (unlike
> the current code). I don't mind a microoptimization when the code is so
> complicated, so I don't mind too much to fix whatever current bug in
> downgrade_write.

I must say I agree with the sentiments - the current implementation doesn't
have a very attractive complexity/benefit ratio. But I wrote a
spinlock-based version three years ago too, so I'm biased ;) Certainly it
is bog-simple and fixes up the overflow-at-32768-waiters bug.

I think a spinlock-based implementation would be OK if it was x86-specific,
because x86 spin_unlock is cheap. But some architectures do atomic ops in
spin_unlock and won't like it. Plus those architectures which can
implement atomic_add_return() can implement nice versions of rwsems such as
the ppc64 code. Although ppc64 still seems to have an overflow bug.

So ho-hum. As a first step, David, could you please take a look into
what's up with downgrade_write()?

(Then again, we need to have a serious think about the overflow bug. It's
fatal. Should we fix it? If so, the current rwsem implementation is
probably unsalvageable).

2004-06-08 16:39:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: downgrade_write replacement in remap_file_pages



On Tue, 8 Jun 2004, Andrew Morton wrote:
>
> So ho-hum. As a first step, David, could you please take a look into
> what's up with downgrade_write()?

One reason why I like Andrea's approach (regardless of downgrade_write())
is that it seems to avoid taking the heavy lock (write) by default.

Linus

2004-06-08 17:05:31

by David Howells

[permalink] [raw]
Subject: Re: downgrade_write replacement in remap_file_pages


> Apparently downgrade_write deadlocks the kernel in the mmap_sem
> under load.

Which implementation of rwsems is your kernel using? The spinlock-based one or
the XADD based one? Have you tried the other version?

Have you more than 32767 processes?

Do you have any stack traces?

David

2004-06-08 19:05:32

by David Howells

[permalink] [raw]
Subject: Re: downgrade_write replacement in remap_file_pages


Signed-Off-By: David Howells <[email protected]>
D: Stop downgrade_write() from under-adjusting the rwsem counter in optimised
D: rw-semaphores.

diff -urp linux-2.6.6/lib/rwsem.c linux-2.6.6-keys/lib/rwsem.c
--- linux-2.6.6/lib/rwsem.c 2004-05-11 11:28:57.000000000 +0100
+++ linux-2.6.6-keys/lib/rwsem.c 2004-06-08 19:31:35.550653817 +0100
@@ -29,15 +29,15 @@ void rwsemtrace(struct rw_semaphore *sem

/*
* handle the lock being released whilst there are processes blocked on it that can now run
- * - if we come here, then:
- * - the 'active part' of the count (&0x0000ffff) reached zero but has been re-incremented
+ * - if we come here from up_xxxx(), then:
+ * - the 'active part' of the count (&0x0000ffff) had reached zero (but may have changed)
* - the 'waiting part' of the count (&0xffff0000) is negative (and will still be so)
* - there must be someone on the queue
* - the spinlock must be held by the caller
* - woken process blocks are discarded from the list after having task zeroed
- * - writers are only woken if wakewrite is non-zero
+ * - writers are only woken if downgrading is false
*/
-static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
+static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
{
struct rwsem_waiter *waiter;
struct task_struct *tsk;
@@ -46,10 +46,12 @@ static inline struct rw_semaphore *__rws

rwsemtrace(sem,"Entering __rwsem_do_wake");

- if (!wakewrite)
+ if (downgrading)
goto dont_wake_writers;

- /* only wake someone up if we can transition the active part of the count from 0 -> 1 */
+ /* if we came through an up_xxxx() call, we only only wake someone up
+ * if we can transition the active part of the count from 0 -> 1
+ */
try_again:
oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS,sem) - RWSEM_ACTIVE_BIAS;
if (oldcount & RWSEM_ACTIVE_MASK)
@@ -78,9 +80,10 @@ static inline struct rw_semaphore *__rws
if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
goto out;

- /* grant an infinite number of read locks to the readers at the front of the queue
- * - note we increment the 'active part' of the count by the number of readers (less one
- * for the activity decrement we've already done) before waking any processes up
+ /* grant an infinite number of read locks to the readers at the front
+ * of the queue
+ * - note we increment the 'active part' of the count by the number of
+ * readers before waking any processes up
*/
readers_only:
woken = 0;
@@ -95,8 +98,10 @@ static inline struct rw_semaphore *__rws
} while (waiter->flags & RWSEM_WAITING_FOR_READ);

loop = woken;
- woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
- woken -= RWSEM_ACTIVE_BIAS;
+ woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+ if (!downgrading)
+ woken -= RWSEM_ACTIVE_BIAS; /* we'd already done one increment
+ * earlier */
rwsem_atomic_add(woken,sem);

next = sem->wait_list.next;
@@ -150,7 +155,7 @@ static inline struct rw_semaphore *rwsem
* - it might even be this process, since the waker takes a more active part
*/
if (!(count & RWSEM_ACTIVE_MASK))
- sem = __rwsem_do_wake(sem,1);
+ sem = __rwsem_do_wake(sem, 0);

spin_unlock(&sem->wait_lock);

@@ -201,7 +206,7 @@ struct rw_semaphore fastcall __sched *rw

/*
* handle waking up a waiter on the semaphore
- * - up_read has decremented the active part of the count if we come here
+ * - up_read/up_write has decremented the active part of the count if we come here
*/
struct rw_semaphore fastcall *rwsem_wake(struct rw_semaphore *sem)
{
@@ -211,7 +216,7 @@ struct rw_semaphore fastcall *rwsem_wake

/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem,1);
+ sem = __rwsem_do_wake(sem, 0);

spin_unlock(&sem->wait_lock);

@@ -233,7 +238,7 @@ struct rw_semaphore fastcall *rwsem_down

/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem,0);
+ sem = __rwsem_do_wake(sem, 1);

spin_unlock(&sem->wait_lock);

2004-06-08 19:36:45

by William Lee Irwin III

[permalink] [raw]
Subject: Re: downgrade_write replacement in remap_file_pages

On Tue, Jun 08, 2004 at 05:44:38PM +0200, Andrea Arcangeli wrote:
> Apparently downgrade_write deadlocks the kernel in the mmap_sem
> under load. I suspect some rwsem bug. Anyways it's matter of time before
> in my tree I replace the rwsem implementation with my spinlock based
> common code implementation again that I can understand trivially (unlike
> the current code). I don't mind a microoptimization when the code is so
> complicated, so I don't mind too much to fix whatever current bug in
> downgrade_write.
> In the meantime to workaround the deadlock (and to verify if this make
> the deadlock go away, which returned a positive result) I implemented
> this patch: this doesn't fix downgrade_wite, but I'm posting it here
> because I believe it's much better code regardless of the
> downgrade_write issue. With this patch we'll never run down_write again
> in production, the down_write will be taken only once when the db or the
> simulator startup (during the very first page fault) and never again, in
> turn providing (at least in theory) better runtime scalability.

I've been using something similar since about May 20. However, it was
unclear that it was a deadlock as opposed to semaphore contention from
the reports I got.


On Tue, Jun 08, 2004 at 05:44:38PM +0200, Andrea Arcangeli wrote:
> - if (pgoff != linear_page_index(vma, start) &&
> - !(vma->vm_flags & VM_NONLINEAR)) {
> + if (unlikely(pgoff != linear_pgoff && !(vma->vm_flags & VM_NONLINEAR))) {

There is no linear_pgoff variable...


-- wli

2004-06-08 22:33:39

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: downgrade_write replacement in remap_file_pages

On Tue, Jun 08, 2004 at 06:05:08PM +0100, David Howells wrote:
>
> > Apparently downgrade_write deadlocks the kernel in the mmap_sem
> > under load.
>
> Which implementation of rwsems is your kernel using? The spinlock-based one or
> the XADD based one? Have you tried the other version?

stock 2.6 rwsem implementation compiled for PII (I still have tons of
patches to forward port from 2.4-aa, if I would port my rwsem to 2.6 I
would have never noticed this race)

> Have you more than 32767 processes?

no, there should be around 10k processes, sure not more than 20k.

> Do you have any stack traces?

yes:

strace:

open("/proc/6022/stat", O_RDONLY) = 6
read(6,

SYSRQ+T

[<c01fa365>] rwsem_down_read_failed+0x85/0x121
[<c01a0da9>] .text.lock.array+0x49/0xd0
[<c01e4412>] avc_has_perm+0x62/0x78
[<c01e5ba3>] inode_has_perm+0x53/0x90
[<c019d3b1>] proc_info_read+0x51/0x150
[<c016ada1>] vfs_read+0xe1/0x130
[<c016b001>] sys_read+0x91/0xf0

it's the down_read in proc_pid_stat. the workload running at the same
time is heavy remap_file_pages. They're processes so the only race
happens against the /proc filesystem and that's why it hangs there.
Somehow downgrade_write in remap_file_pages races with down_read in
/proc. My patch workarounds the deadlock by not calling downgrade_write,
but I posted it to l-k because my code is better anyways since there's
no good reason to ever call down_write in the fast path (and if we don't
start down_write we don't need downgrade_write anymore). The only thing
bitten by downgrade_write left is xfs.

You can imagine which is the critical apps that triggers this deadlock,
not many apps are using remap_file_pages in production yet, and very few
are going to call it in a flood.

I agree with Andrew the limit of 32k processes needs fixing, but nobody
noticed yet with any real app, so it's a low prio matter.

2004-06-08 22:52:11

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: downgrade_write replacement in remap_file_pages

On Tue, Jun 08, 2004 at 12:36:21PM -0700, William Lee Irwin III wrote:
> On Tue, Jun 08, 2004 at 05:44:38PM +0200, Andrea Arcangeli wrote:
> > - if (pgoff != linear_page_index(vma, start) &&
> > - !(vma->vm_flags & VM_NONLINEAR)) {
> > + if (unlikely(pgoff != linear_pgoff && !(vma->vm_flags & VM_NONLINEAR))) {
>
> There is no linear_pgoff variable...

I tested it on a different codebase and I didn't notice this issue while
fixing the rejects, fixing it up is easy, replace linear_pgoff with
linear_page_index(vma, start).

2004-06-09 12:19:26

by Alexander Nyberg

[permalink] [raw]
Subject: [PATCH] A generic_file_sendpage()

The sendfile() for all file systems remain unusable as it is right now,
only works for sending data to socket. But that should be as much performance
enhancing as this yes?

Please hit me with cluebat for what I'm missing.
(yes, rebooted between all copying)

-----------------------------------
Normal read/write with 16K buffers:

comp1 with 4 scsi disks sw raid0:
kernie:/mnt/data/playground# time ./copyf tref x1
size: 2097152000

real 2m9.680s
user 0m0.075s
sys 0m21.019s

comp2 with single ide disk:
om3:/home/alex# time ./copyf haha c1
size: 1048576000

real 1m25.104s
user 0m0.042s
sys 0m14.880s


-----------------------------------
Normal read/write with 64K buffers:

comp1 with 4 scsi disks sw raid0:
kernie:/mnt/data/playground# time ./copyf tref x3
size: 2097152000

real 2m11.160s
user 0m0.035s
sys 0m20.745s


comp2 with single ide disk:
om3:/home/alex# time ./copyf haha c3
size: 1048576000

real 1m25.651s
user 0m0.052s
sys 0m14.020s


-----------------------------------
Using sendfile() to copy entire files:

comp1 with 4 scsi disks sw raid0:
kernie:/mnt/data/playground# time ./sendf tref x2
size: 2097152000

real 2m9.645s
user 0m0.001s
sys 0m19.961s

and again:

real 2m9.675s
user 0m0.001s
sys 0m19.271s


comp2 with single ide disk:
om3:/home/alex# time ./sendf haha c2
size: 1048576000

real 1m24.395s
user 0m0.002s
sys 0m13.151s

and again:

real 1m23.781s
user 0m0.001s
sys 0m12.967s



--- include/linux/fs_orig.h 2004-06-09 00:37:29.000000000 +0200
+++ include/linux/fs.h 2004-06-07 18:13:54.000000000 +0200
@@ -1405,6 +1405,7 @@ extern ssize_t do_sync_write(struct file
ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
unsigned long nr_segs, loff_t *ppos);
extern ssize_t generic_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void __user *);
+extern ssize_t generic_file_sendpage(struct file *, struct page *, int, size_t, loff_t *, int);
extern void do_generic_mapping_read(struct address_space *mapping,
struct file_ra_state *, struct file *,
loff_t *, read_descriptor_t *, read_actor_t);
--- mm/filemap_orig.c 2004-06-09 00:37:45.000000000 +0200
+++ mm/filemap.c 2004-06-08 22:19:48.000000000 +0200
@@ -961,7 +961,32 @@ generic_file_read(struct file *filp, cha

EXPORT_SYMBOL(generic_file_read);

-int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
+ssize_t generic_file_sendpage(struct file *out_file, struct page *page,
+ int offset, size_t size, loff_t *pos, int more)
+{
+ void *addr;
+ int ret;
+ mm_segment_t old_fs;
+
+ old_fs = get_fs();
+ set_fs(KERNEL_DS);
+
+ addr = kmap(page);
+ if (!addr) {
+ set_fs(old_fs);
+ return -ENOMEM;
+ }
+
+ ret = out_file->f_op->write(out_file, addr + offset, size, pos);
+
+ kunmap(addr);
+
+ set_fs(old_fs);
+ return ret;
+}
+
+int file_send_actor(read_descriptor_t * desc, struct page *page,
+ unsigned long offset, unsigned long size)
{
ssize_t written;
unsigned long count = desc->count;
--- fs/ext3/file_orig.c 2004-06-09 00:42:50.000000000 +0200
+++ fs/ext3/file.c 2004-06-07 18:12:19.000000000 +0200
@@ -129,6 +129,7 @@ struct file_operations ext3_file_operati
.release = ext3_release_file,
.fsync = ext3_sync_file,
.sendfile = generic_file_sendfile,
+ .sendpage = generic_file_sendpage,
};

struct inode_operations ext3_file_inode_operations = {
@@ -140,4 +141,3 @@ struct inode_operations ext3_file_inode_
.removexattr = ext3_removexattr,
.permission = ext3_permission,
};
-


2004-06-10 19:51:25

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] A generic_file_sendpage()

Hi!

> The sendfile() for all file systems remain unusable as it is right now,
> only works for sending data to socket. But that should be as much performance
> enhancing as this yes?
>
> Please hit me with cluebat for what I'm missing.
> (yes, rebooted between all copying)
>

...this will also allow COW filesystems and copy-on-server...
It would be nice if it could go in soon, so userspace can
start using it and COW/copy-on-server can be usefull in 2.8...


Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-06-25 19:20:28

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] A generic_file_sendpage()

On Wed, 9 June 2004 14:19:19 +0200, Alexander Nyberg wrote:
>
> The sendfile() for all file systems remain unusable as it is right now,
> only works for sending data to socket. But that should be as much performance
> enhancing as this yes?
>
> Please hit me with cluebat for what I'm missing.
> (yes, rebooted between all copying)

A similar patch I've once written gave about 10% performance boost
with warm caches. Your measurements are with could caches but still
give a noticeable boost. Nice.

> --- include/linux/fs_orig.h 2004-06-09 00:37:29.000000000 +0200
> +++ include/linux/fs.h 2004-06-07 18:13:54.000000000 +0200
> @@ -1405,6 +1405,7 @@ extern ssize_t do_sync_write(struct file
> ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
> unsigned long nr_segs, loff_t *ppos);
> extern ssize_t generic_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void __user *);
> +extern ssize_t generic_file_sendpage(struct file *, struct page *, int, size_t, loff_t *, int);
> extern void do_generic_mapping_read(struct address_space *mapping,
> struct file_ra_state *, struct file *,
> loff_t *, read_descriptor_t *, read_actor_t);
> --- mm/filemap_orig.c 2004-06-09 00:37:45.000000000 +0200
> +++ mm/filemap.c 2004-06-08 22:19:48.000000000 +0200
> @@ -961,7 +961,32 @@ generic_file_read(struct file *filp, cha
>
> EXPORT_SYMBOL(generic_file_read);
>
> +ssize_t generic_file_sendpage(struct file *out_file, struct page *page,
> + int offset, size_t size, loff_t *pos, int more)
> +{
> + void *addr;
> + int ret;
> + mm_segment_t old_fs;
> +
> + old_fs = get_fs();
> + set_fs(KERNEL_DS);
> +
> + addr = kmap(page);
> + if (!addr) {
> + set_fs(old_fs);
> + return -ENOMEM;
> + }
> +
> + ret = out_file->f_op->write(out_file, addr + offset, size, pos);
> +
> + kunmap(addr);
> +
> + set_fs(old_fs);
> + return ret;
> +}

Your patch is *much* smaller than mine. Looks lean and mean. But you
depend on the struct file* passed to generic_file_sendpage().

One of my goals for 2.7 is to get rid of all users of struct file* in
the various read-, write- and send-functions. Currently, there are
four of them, you would introduce number five.

Is is possible get around using out_file without making the patch much
bigger?

If you need a motivation for this, think cowlinks. Copying inodes
would be trivial, if you didn't need a fscking file for every copy
operation. If you actually open(), sendfile() and close(), you end up
with tons of possible errors and a nightmare of cleanup code. Not
fun.

> --- fs/ext3/file_orig.c 2004-06-09 00:42:50.000000000 +0200
> +++ fs/ext3/file.c 2004-06-07 18:12:19.000000000 +0200
> @@ -129,6 +129,7 @@ struct file_operations ext3_file_operati
> + .sendpage = generic_file_sendpage,

Obviously also works for ext2.

J?rn

--
They laughed at Galileo. They laughed at Copernicus. They laughed at
Columbus. But remember, they also laughed at Bozo the Clown.
-- unknown

2004-06-25 19:46:16

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] A generic_file_sendpage()

On Fri, Jun 25, 2004 at 09:19:24PM +0200, J?rn Engel wrote:
> > + old_fs = get_fs();
> > + set_fs(KERNEL_DS);

Eeek... Don't do that, please - set_fs() is not a good thing to add for
no good reason.

> Your patch is *much* smaller than mine. Looks lean and mean. But you
> depend on the struct file* passed to generic_file_sendpage().
>
> One of my goals for 2.7 is to get rid of all users of struct file* in
> the various read-, write- and send-functions. Currently, there are
> four of them, you would introduce number five.

And how, pray tell, are you going to do that on filesystems that keep
part of context in file->private_data?

2004-06-25 20:03:52

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] A generic_file_sendpage()

On Fri, 25 June 2004 20:46:11 +0100, [email protected] wrote:
> >
> > One of my goals for 2.7 is to get rid of all users of struct file* in
> > the various read-, write- and send-functions. Currently, there are
> > four of them, you would introduce number five.
>
> And how, pray tell, are you going to do that on filesystems that keep
> part of context in file->private_data?

Not sure. NFSv3 appears to be fixable, the only context is the UID,
which happens to be stored in the inode as well. NFSv4 and cifs could
be worse, I didn't look closely yet. smbfs accesses the dentry, which
has similar effects, but should be fixable as well.

Do you know of any impossible cases?

J?rn

--
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra

2004-06-25 20:06:01

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] A generic_file_sendpage()

On Jun 25, 2004 21:19 +0200, J?rn Engel wrote:
> On Wed, 9 June 2004 14:19:19 +0200, Alexander Nyberg wrote:
> > The sendfile() for all file systems remain unusable as it is right now,
> > only works for sending data to socket. But that should be as much
> > performance enhancing as this yes?
>
> A similar patch I've once written gave about 10% performance boost
> with warm caches. Your measurements are with could caches but still
> give a noticeable boost. Nice.

Yes, it would be nice to get this into the kernel, then eventually have
"cp" test if sendfile() works between two files. That would allow things
like remote file copy for network filesystems, COW, etc much easier.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/


Attachments:
(No filename) (844.00 B)
(No filename) (189.00 B)
Download all attachments

2004-06-25 20:09:53

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] A generic_file_sendpage()

On Fri, 25 June 2004 14:05:33 -0600, Andreas Dilger wrote:
>
> Yes, it would be nice to get this into the kernel, then eventually have
> "cp" test if sendfile() works between two files. That would allow things
> like remote file copy for network filesystems, COW, etc much easier.

But as long as the patches are as ugly as mine, this is 2.7 work. In
case anyone missed it:

http://wohnheim.fh-wedel.de/~joern/cowlink/

J?rn

--
To my face you have the audacity to advise me to become a thief - the worst
kind of thief that is conceivable, a thief of spiritual things, a thief of
ideas! It is insufferable, intolerable!
-- M. Binet in Scarabouche

2004-06-26 01:31:13

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] A generic_file_sendpage()

P? fr , 25/06/2004 klokka 16:03, skreiv J?rn Engel:
> Not sure. NFSv3 appears to be fixable, the only context is the UID,

Huh???? WTF happened to the actual credential?

> which happens to be stored in the inode as well. NFSv4 and cifs could
> be worse, I didn't look closely yet. smbfs accesses the dentry, which
> has similar effects, but should be fixable as well.
>
> Do you know of any impossible cases?

NFS, CIFS, all other networked filesystems that need private context
information beyond what is contained in the struct file. Why?

Trond

2004-06-28 11:41:36

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] A generic_file_sendpage()

On Fri, 25 June 2004 20:53:34 -0400, Trond Myklebust wrote:
> P? fr , 25/06/2004 klokka 16:03, skreiv J?rn Engel:
> > Not sure. NFSv3 appears to be fixable, the only context is the UID,
>
> Huh???? WTF happened to the actual credential?

No idea, I couldn't find it in the source.

> > which happens to be stored in the inode as well. NFSv4 and cifs could
> > be worse, I didn't look closely yet. smbfs accesses the dentry, which
> > has similar effects, but should be fixable as well.
> >
> > Do you know of any impossible cases?
>
> NFS, CIFS, all other networked filesystems that need private context
> information beyond what is contained in the struct file. Why?

Darn! I need to copy an inode. Currently, I'm opening a two files,
copy between them and close them again. Open and close are completely
pointless and only complicate things.

The fun part is that cifs would actually benifit from the same things
it complicates. Oh well.

J?rn

--
Geld macht nicht gl?cklich.
Gl?ck macht nicht satt.