2002-04-10 10:05:49

by Hirokazu Takahashi

Subject: [PATCH] zerocopy NFS updated

Hi

I've added a new patch for zerocopy NFS.
va03-knfsd-zerocopy-sendpage-2.5.7-test1.patch makes knfsd skip
csum_partial_copy_generic(), which copies data into a sk_buff.
At the moment this feature only works when you use NFS over TCP.
I'd like to implement sendpage for UDP, but it doesn't work yet.

But I wonder about sendpage. I guess the HW IP checksum for outgoing
pages might be miscalculated, as the VFS can update the pages at any time.
A new feature like a COW page cache should perhaps be added to the VM,
so that such pages get duplicated in this case.

Is there anyone who could advise me about this?


The following patches are against Linux 2.5.7:

ftp://ftp.valinux.co.jp/pub/people/taka/tune/2.5.7/va01-knfsd-zerocopy-vfsread-2.5.7.patch
ftp://ftp.valinux.co.jp/pub/people/taka/tune/2.5.7/va02-kmap-multplepages-2.5.7.patch

ftp://ftp.valinux.co.jp/pub/people/taka/tune/2.5.7/va03-knfsd-zerocopy-sendpage-2.5.7-test1.patch


Andrew, could you try it again?


Regards,
Hirokazu Takahashi


2002-04-10 19:32:30

by Andi Kleen

Subject: Re: [PATCH] zerocopy NFS updated

Hirokazu Takahashi <[email protected]> writes:

> But I wonder about sendpage. I guess HW IP checksum for outgoing
> pages might be miscalculated as VFS can update them anytime.
> New feature like COW pagecache should be added to VM and they
> should be duplicated in this case.

For hw checksums it should not be a problem. NICs usually load
the packet into their packet fifo and compute the checksum on the fly
and then patch it into the header in the fifo before sending it out. A
NIC that would do slow PCI bus mastering twice just to compute the checksum
would be very dumb and I doubt they exist (if yes I bet it would be
faster to do software checksumming on them). When the NIC only
accesses the memory once there is no race window.

-Andi

2002-04-11 02:38:02

by David Miller

Subject: Re: [PATCH] zerocopy NFS updated

From: Andi Kleen <[email protected]>
Date: 10 Apr 2002 21:32:22 +0200

> For hw checksums it should not be a problem. NICs usually load
> the packet into their packet fifo and compute the checksum on the fly
> and then patch it into the header in the fifo before sending it out. A
> NIC that would do slow PCI bus mastering twice just to compute the checksum
> would be very dumb and I doubt they exist (if yes I bet it would be
> faster to do software checksumming on them). When the NIC only
> accesses the memory once there is no race window.

Aha, but in the NFS case what if the page in the page cache gets
truncated from the file before the SKB is given to the card?
It would be quite easy to add such a test case to connectathon :-)

See, we hold a reference to the page in the SKB, but this only
guarantees that it cannot be freed up and reused for another purpose.
It does not prevent the page contents from being sent out long
after it is no longer a part of that file.

Samba has similar issues, which is why they only use sendfile()
when the client holds an OP lock on the file. (Although the Samba
issue is that in the same packet they mention the length of the file
plus the contents).

I'm still not 100% convinced this behavior would be illegal in the
NFS case; it needs deeper thought than I can provide right now.

2002-04-11 06:46:55

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi

Thank you for the replies.

ak> From: Andi Kleen <[email protected]>
ak> Date: 10 Apr 2002 21:32:22 +0200
ak>
ak> For hw checksums it should not be a problem. NICs usually load
ak> the packet into their packet fifo and compute the checksum on the fly
ak> and then patch it into the header in the fifo before sending it out. A
ak> NIC that would do slow PCI bus mastering twice just to compute the checksum
ak> would be very dumb and I doubt they exist (if yes I bet it would be
ak> faster to do software checksumming on them). When the NIC only
ak> accesses the memory once there is no race window.

davem> Aha, but in the NFS case what if the page in the page cache gets
davem> truncated from the file before the SKB is given to the card?
davem> It would be quite easy to add such a test case to connectathon :-)
davem>
davem> See, we hold a reference to the page in the SKB, but this only
davem> guarentees that it cannot be freed up reused for another purpose.
davem> It does not prevent the page contents from being sent out long
davem> after it is no longer a part of that file.

I believe it's probably OK. We reasoned as follows.

First, consider a knfsd sending the data of File A with sendmsg():
1. The knfsd copies the data of File A into an sk_buff.
2. File A may be truncated after step 1.
3. The NFS client receives packets of File A, which has already been truncated.

Next, consider the knfsd sending the data of File A with sendpage():
1. The knfsd grabs the pages of File A (page_cache_get).
2. File A may be truncated after step 1.
3. The knfsd sends the pages.
4. The NFS client receives packets of File A, which has already been truncated.

Is there any difference between the two?
This behavior is invisible to NFS clients, I think.

davem> Samba has similar issues, which is why they only use sendfile()
davem> when the client holds an OP lock on the file. (Although the Samba
davem> issue is that in the same packet they mention the length of the file
davem> plus the contents).

I think NFSD is part of the kernel -- not a usermode process -- so NFSD can
arrange to avoid this kind of situation.

The new zerocopy knfsd grabs the pages of a file and its attributes in the
same operation, so I think no discrepancy between them would occur.

And yes, I know the pages being sent might be overwritten by another process,
but that can also happen on local filesystems: file data can be
updated while another process is reading the same file.

davem> I'm still not %100 convinced this behavior would be illegal in the
davem> NFS case, it needs more deep thought than I can provide right now.

I'm happy to talk to you.

Thank you,
Hirokazu Takahashi

2002-04-11 06:55:38

by David Miller

Subject: Re: [PATCH] zerocopy NFS updated

From: Hirokazu Takahashi <[email protected]>
Date: Thu, 11 Apr 2002 15:46:51 +0900 (JST)

> First, consider a knfsd sending the data of File A with sendmsg():
> 1. The knfsd copies the data of File A into an sk_buff.
> 2. File A may be truncated after step 1.
> 3. The NFS client receives packets of File A, which has already been truncated.
>
> Next, consider the knfsd sending the data of File A with sendpage():
> 1. The knfsd grabs the pages of File A (page_cache_get).
> 2. File A may be truncated after step 1.
> 3. The knfsd sends the pages.
> 4. The NFS client receives packets of File A, which has already been truncated.
>
> Is there any difference between the two?
> This behavior is invisible to NFS clients, I think.

Consider truncate() to 1 byte left in that page. To handle mmap()'s
of this file the kernel will memset() rest of the page to zero.

Now, in the sendfile() case the NFS client sees some page filled
mostly of zeros instead of file contents.

In the sendmsg() knfsd case, the client sees something reasonable. He will
see something that was actually in the file at some point in time.
The sendfile() case sees pure garbage, contents that never were in
the file at any point in time.

We could make knfsd take the write semaphore on the inode until client
is known to get the packet but that is the kind of overhead we'd like
to avoid.

2002-04-11 07:41:39

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi,

davem> Consider truncate() to 1 byte left in that page. To handle mmap()'s
davem> of this file the kernel will memset() rest of the page to zero.
davem> Now, in the sendfile() case the NFS client sees some page filled
davem> mostly of zeros instead of file contents.

Hmmm... Now I see it clearly.

davem> In sendmsg() knfd case, client sees something reasonable. He will
davem> see something that was actually in the file at some point in time.
davem> The sendfile() case sees pure garbage, contents that never were in
davem> the file at any point in time.
davem>
davem> We could make knfsd take the write semaphore on the inode until client
davem> is known to get the packet but that is the kind of overhead we'd like
davem> to avoid.

Yes, the write semaphore would be a good solution if the TCP/IP stack never
got stuck.

Now I wonder if we could put these pages into a COW mode.
When some process tries to update the pages, they would be duplicated.
It's easy to implement this in write(), truncate() and so on,
but mmap() is a little bit difficult if there is no reverse mapping from page to PTE.

What do you think about this idea?

Regards,
Hirokazu Takahashi

2002-04-11 07:59:32

by David Miller

Subject: Re: [PATCH] zerocopy NFS updated

From: Hirokazu Takahashi <[email protected]>
Date: Thu, 11 Apr 2002 16:41:34 +0900 (JST)

> Now I wonder if we could put these pages into a COW mode.
> When some process tries to update the pages, they would be duplicated.
> It's easy to implement this in write(), truncate() and so on,
> but mmap() is a little bit difficult if there is no reverse mapping from page to PTE.
>
> What do you think about this idea?

I think this idea has such high overhead that it is not even worth
considering; consider SMP.

2002-04-11 11:38:29

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi, David

davem> Now I wonder if we could make these pages COW mode.
davem> When some process try to update the pages, they should be duplicated.
davem> I's easy to implement it in write(), truncate() and so on.
davem> But mmap() is little bit difficult if there no reverse mapping page to PTE.
davem>
davem> How do you think about this idea?
davem>
davem> I think this idea has such high overhead that it is even not for
davem> consideration, consider SMP.

Hmmm... If I were to implement it...
How about the following code?

nfsd read()
{
        :
        page_cache_get(page);
        if (page is mapped to anywhere)
                page = duplicate_and_rehash(page);
        else {
                page_lock(page);
                page->flags |= COW;
                page_unlock(page);
        }
        sendpage(page);
        page_cache_release(page);
}

generic_file_write()
{
        page = _grab_cache_page();
        lock_page(page);
        if (page->flags & COW)
                page = duplicate_and_rehash(page);
        prepare_write();
        commit_write();
        UnlockPage(page);
        page_cache_release(page);
}

truncate_list_page() <-- truncate() calls
{
        page_cache_get();
        lock_page(page);
        if (page->flags & COW)
                page = duplicate_and_rehash(page);
        truncate_partial_page();
        UnlockPage(page);
        page_cache_release(page);
}

2002-04-11 11:43:33

by David Miller

Subject: Re: [PATCH] zerocopy NFS updated

From: Hirokazu Takahashi <[email protected]>
Date: Thu, 11 Apr 2002 20:38:23 +0900 (JST)

> Hmmm... If I'd implement them.....
> How about following codes ?
>
> nfsd read()
> {
>         :
>         page_cache_get(page);
>         if (page is mapped to anywhere)
>                 page = duplicate_and_rehash(page);
>         else {
>                 page_lock(page);
>                 page->flags |= COW;
>                 page_unlock(page);
>         }
>         sendpage(page);
>         page_cache_release(page);
> }

What if a process mmap's the page between duplicate_and_rehash and the
card actually getting the data?

This is hopeless. The whole COW idea is 1) expensive 2) complex to
implement.

This is why we don't implement sendfile with anything other than a
simple page reference. Otherwise the overhead and complexity is
unacceptable.

No, you must block truncate operations on the file until the client
ACK's the nfsd read request if you wish to use sendfile() with
nfsd.

2002-04-11 13:00:39

by Denis Vlasenko

Subject: Re: [PATCH] zerocopy NFS updated

On 11 April 2002 09:36, David S. Miller wrote:
> No, you must block truncate operations on the file until the client
> ACK's the nfsd read request if you wish to use sendfile() with
> nfsd.

Which shouldn't be a big performance problem unless I am unaware
of some real-life applications doing heavy truncates.
--
vda

2002-04-11 13:16:22

by Andi Kleen

Subject: Re: [PATCH] zerocopy NFS updated

On Thu, Apr 11, 2002 at 04:00:37PM -0200, Denis Vlasenko wrote:
> On 11 April 2002 09:36, David S. Miller wrote:
> > No, you must block truncate operations on the file until the client
> > ACK's the nfsd read request if you wish to use sendfile() with
> > nfsd.
>
> Which shouldn't be a big performance problem unless I am unaware
> of some real-life applications doing heavy truncates.

Every unlink does a truncate. There are applications that delete files
a lot.

-Andi

2002-04-11 17:33:10

by Benjamin LaHaise

Subject: Re: [PATCH] zerocopy NFS updated

On Thu, Apr 11, 2002 at 12:52:16AM -0700, David S. Miller wrote:
> I think this idea has such high overhead that it is even not for
> consideration, consider SMP.

One possibility is to make the inode semaphore a rwsem, and to have NFS
take that for read until the sendpage is complete. The idea of splitting
the inode semaphore up into two (one rw against truncate) has been bounced
around for a few other reasons (like allowing multiple concurrent reads +
writes to a file). Perhaps it's time to bite the bullet and do it.
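
As a rough kernel-side sketch of that split (purely illustrative, not from
any real patch; in practice the semaphore would hang off the inode rather
than being a single global, and the function names are invented):

#include <linux/rwsem.h>

/* hypothetical: in reality one of these would live in struct inode */
static DECLARE_RWSEM(trunc_sem);

static void nfsd_send_pages(void)
{
        /* block truncation, but not other readers or writers */
        down_read(&trunc_sem);
        /* ... sendpage() the page-cache pages to the client ... */
        up_read(&trunc_sem);
}

static void truncate_file(void)
{
        /* wait for in-flight zerocopy sends to drain */
        down_write(&trunc_sem);
        /* ... do the vmtruncate()-style work ... */
        up_write(&trunc_sem);
}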

-ben
--
"A man with a bass just walked in,
and he's putting it down
on the floor."

2002-04-11 17:36:12

by Benjamin LaHaise

Subject: Re: [PATCH] zerocopy NFS updated

On Thu, Apr 11, 2002 at 03:16:16PM +0200, Andi Kleen wrote:
> On Thu, Apr 11, 2002 at 04:00:37PM -0200, Denis Vlasenko wrote:
> > On 11 April 2002 09:36, David S. Miller wrote:
> > > No, you must block truncate operations on the file until the client
> > > ACK's the nfsd read request if you wish to use sendfile() with
> > > nfsd.
> >
> > Which shouldn't be a big performance problem unless I am unaware
> > of some real-life applications doing heavy truncates.
>
> Every unlink does a truncate. There are applications that delete files
> a lot.

Not quite. The implicit truncate only happens when the link count falls
to 0 and the last user of the inode releases their reference to the inode.

-ben
--
"A man with a bass just walked in,
and he's putting it down
on the floor."

2002-04-12 08:10:27

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi,

Thank you for your suggestion.

bcrl> One possibility is to make the inode semaphore a rwsem, and to have NFS
bcrl> take that for read until the sendpage is complete. The idea of splitting
bcrl> the inode semaphore up into two (one rw against truncate) has been bounced
bcrl> around for a few other reasons (like allowing multiple concurrent reads +
bcrl> writes to a file). Perhaps its time to bite the bullet and do it.

That doesn't sound so bad.
Partial truncation would rarely happen, so it might be enough.
I'll give it a try.

Regards,
Hirokazu Takahashi

2002-04-12 12:30:24

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi,

I wondered if regular truncate() and read() might have the same
problem, so I tested again and again.
And I realized it will occur on any local filesystem:
sometimes I could get partly zero-filled data instead of file contents.

Analyzing this situation: the read system call doesn't lock anything
-- no page lock, no semaphore -- while someone truncates
a file partially.
It will often happen when there is a page fault in copy_user() while
copying file data to user space.

I guess that, if needed, it should be fixed in the VFS.

davem> Consider truncate() to 1 byte left in that page. To handle mmap()'s
davem> of this file the kernel will memset() rest of the page to zero.
davem>
davem> Now, in the sendfile() case the NFS client sees some page filled
davem> mostly of zeros instead of file contents.
davem>
davem> In sendmsg() knfd case, client sees something reasonable. He will
davem> see something that was actually in the file at some point in time.
davem> The sendfile() case sees pure garbage, contents that never were in
davem> the file at any point in time.
davem>
davem> We could make knfsd take the write semaphore on the inode until client
davem> is known to get the packet but that is the kind of overhead we'd like
davem> to avoid.

Thank you,
Hirokazu Takahashi.

2002-04-12 12:36:02

by Andi Kleen

Subject: Re: [PATCH] zerocopy NFS updated

On Fri, Apr 12, 2002 at 09:30:11PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> I wondered if regular truncate() and read() might have the same
> problem, so I tested again and again.
> And I realized it will occur on any local filesystems.
> Sometime I could get partly zero filled data instead of file contents.
>
> I analysis this situation, read systemcall doesn't lock anything
> -- no page lock, no semaphore lock -- while someone truncates
> files partially.
> It will often happens in case of pagefault in copy_user() to
> copy file data to user space.
>
> I guess if needed, it should be fixed in VFS.

I don't see it as a big problem and would just leave it as it is (for NFS
and local).
Adding more locking would slow down read() a lot and there should be
a good reason to take such a performance hit. Linux did this forever
and I don't think anybody ever reported it as a bug, so we can probably
safely assume that this behaviour (non atomic truncate) is not a problem for
users in practice.

-Andi

2002-04-12 21:23:28

by Jamie Lokier

Subject: Re: [PATCH] zerocopy NFS updated

Andi Kleen wrote:
> > I wondered if regular truncate() and read() might have the same
> > problem, so I tested again and again.
> > And I realized it will occur on any local filesystems.
> > Sometime I could get partly zero filled data instead of file contents.
> >
> I don't see it as a big problem and would just leave it as it is (for NFS
> and local)
> Adding more locking would slow down read() a lot and there should be
> a good reason to take such a performance hit. Linux did this forever
> and I don't think anybody ever reported it as a bug, so we can probably
> safely assume that this behaviour (non atomic truncate) is not a problem for
> users in practice.

Ouch! I have a program which can output incorrect results if this is
the case. It may seem to use an esoteric locking strategy, but I had no
idea it was acceptable for read to return data that truncate is in the
middle of zeroing.

The program keeps a cache on disk of generated files. For each cached
object, there is a metadata file. Metadata files are text files,
written in such a way that the first line looks similar to "=12296.0"
and so does the last line, but none of the intermediate lines have that
form.

Multiple programs can access the disk cache at the same time, and must
be able to check the metadata files being written by other programs,
blocking if necessary while a cached object is being generated.

When a metadata file is being written, first it is created, then the
cached object is created and written to a related file, and finally the
metadata including the first and last marker line is written to the
metadata file.

When a program reads a metadata file, it reads as much as it can and
then checks the first and last lines are identical. If they are, the
middle of the file is valid otherwise it isn't -- perhaps the process
generating that file died, or has the metadata file locked.

This strategy is used so that there's no need to lock the file, in the
case of a cache hit with no complications. If there's a complication
we lock with LOCK_SH and try again.
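
As a minimal sketch of that validity check (the helper name is invented,
the marker format is the one described above, and the LOCK_SH fallback is
left out):

#include <stdio.h>
#include <string.h>

static int metadata_valid(const char *path)
{
        char first[128] = "", last[128] = "", line[128];
        FILE *f = fopen(path, "r");

        if (!f)
                return 0;
        if (fgets(first, sizeof(first), f))
                strcpy(last, first);
        while (fgets(line, sizeof(line), f))
                strcpy(last, line);             /* remember the final line */
        fclose(f);

        /* valid only if the file is bracketed by identical "=..." markers */
        return first[0] == '=' && strcmp(first, last) == 0;
}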

From time to time, when a cache object is invalid, it's appropriate to
truncate the metadata file. If that were atomic, end of story.
Unfortunately I've just now heard that a read() can successfully
interleave some zeros from a parallel truncate.

If the timing is right, that means it's possible for the reading
process to see a hole of zeros in the middle of the file. The first and
last lines would be intact, and the reader would think that the whole
file is therefore valid. Bad!

This occurs if the reader copies the initial bytes from the page, then
the truncation process catches up and zeros out some bytes, but then the
reader catches up and beats the truncation process to the end of the
file.

I'm not advocating more locking in read() -- there's no need, and it is
quite important that it is fast! But I would very much appreciate an
understanding of the rules that relate reading, writing and truncating
processes. How much ordering & atomicity can I depend on? Anything at all?

cheers,
-- Jamie

2002-04-12 21:39:32

by David Miller

Subject: Re: [PATCH] zerocopy NFS updated

From: Jamie Lokier <[email protected]>
Date: Fri, 12 Apr 2002 22:22:52 +0100

> I'm not advocating more locking in read() -- there's no need, and it is
> quite important that it is fast! But I would very much appreciate an
> understanding of the rules that relate reading, writing and truncating
> processes. How much ordering & atomicity can I depend on? Anything at all?

Basically none it appears :-)

If you need to depend upon a consistent snapshot of what some other
thread writes into a file, you must have some locking protocol to use
to synchronize with that other thread.

2002-04-12 21:47:05

by David Miller

Subject: Re: [PATCH] zerocopy NFS updated

From: Andi Kleen <[email protected]>
Date: Fri, 12 Apr 2002 14:35:59 +0200

> On Fri, Apr 12, 2002 at 09:30:11PM +0900, Hirokazu Takahashi wrote:
> > I analysis this situation, read systemcall doesn't lock anything
> > -- no page lock, no semaphore lock -- while someone truncates
> > files partially.
> > It will often happens in case of pagefault in copy_user() to
> > copy file data to user space.
>
> I don't see it as a big problem and would just leave it as it is
> (for NFS and local)

I agree with Andi. You can basically throw away my whole argument
about this. Applications that require synchronization between the
writer of file contents and reader of file contents must do some
kind of locking amongst themselves at user level.

2002-04-13 00:22:01

by Jamie Lokier

Subject: Re: [PATCH] zerocopy NFS updated

David S. Miller wrote:
> I'm not advocating more locking in read() -- there's no need, and it is
> quite important that it is fast! But I would very much appreciate an
> understanding of the rules that relate reading, writing and truncating
> processes. How much ordering & atomicity can I depend on? Anything at all?
>
> Basically none it appears :-)
>
> If you need to depend upon a consistent snapshot of what some other
> thread writes into a file, you must have some locking protocol to use
> to synchronize with that other thread.

Darn, I was hoping to avoid system calls.
Perhaps it's good fortune that futexes just arrived :-)

In some ways, it seems entirely reasonable for truncate() to behave as
if it were writing zeros. That is, after all, what you see there if the
file is expanded later with a hole.

I wonder if it is reasonable to depend on that -- i.e. I'll only ever
see zeros, not, say, random bytes or ones or something. I'm sure that's
so with the current kernel, and probably all of them ever (except for
bugs), but I wonder whether it's OK to rely on that.
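
For what it's worth, a small userspace demo of the behaviour in question
(file name arbitrary, error handling trimmed): truncating down and then
extending again exposes zeros, exactly as if a hole had been written there.

#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char buf[8192];
        int fd = open("/tmp/trunc-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);

        memset(buf, 'x', sizeof(buf));
        write(fd, buf, sizeof(buf));        /* file full of 'x' */
        ftruncate(fd, 1);                   /* drop everything past byte 0 */
        ftruncate(fd, sizeof(buf));         /* extend again: a hole appears */

        pread(fd, buf, sizeof(buf), 0);
        printf("byte 0 = '%c', byte 100 = %d\n", buf[0], buf[100]);
        /* prints: byte 0 = 'x', byte 100 = 0 */

        close(fd);
        unlink("/tmp/trunc-demo");
        return 0;
}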

-- Jamie

2002-04-13 06:39:56

by Andi Kleen

Subject: Re: [PATCH] zerocopy NFS updated

> I wonder if it is reasonable to depend on that: -- i.e. I'll only ever
> see zeros, not say random bytes, or ones or something. I'm sure that's
> so with the current kernel, and probably all of them ever (except for
> bugs) but I wonder whether it's ok to rely on that.

With truncates you should only ever see zeros. If you want this guarantee
over system crashes you need to make sure to use the right file system
though (e.g. ext2 or reiserfs without the ordered data mode patches or
ext3 in writeback mode could give you junk if the system crashes at the
wrong time). Still, depending on only seeing zeroes
seems a bit fragile to me (what happens when the disk dies, for
example?); using some other locking protocol is probably safer.

-Andi

2002-04-13 08:01:33

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi,

Thanks to Andrew Theurer for his help.
He posted the results of testing my patches to the [email protected]
list; we get great performance.

> I tried the patch with great performance improvement! I ran my nfs read
> test (48 clients read 200 MB file from one 4-way SMP NFS server) and
> compared your patches to regular 2.5.7. Regular 2.5.7 resulted in 87 MB/sec
> with 100% CPU utilization. Your patch resulted 130 MB/sec with 82% CPU
> utilization! This is very good! I took profiles, and as expected,
> csum_copy and file_read_actor were gone with the patch. Sar reported nearly
> 40 MB/sec per gigabit adapter (there are 4) during the test. That is the
> most I have seen so far. Soon I will be doing some lock analysis to make
> sure we don't have any locking problems. Also, I will see if there is
> anyone here at IBM LTC that can assist with your development of zerocopy on
> UDP. Thanks for the patch!
>
> Andrew Theurer

2002-04-13 18:53:09

by Chris Wedgwood

Subject: Re: [PATCH] zerocopy NFS updated

On Fri, Apr 12, 2002 at 02:31:50PM -0700, David S. Miller wrote:

> If you need to depend upon a consistent snapshot of what some
> other thread writes into a file, you must have some locking
> protocol to use to synchronize with that other thread.

Appends of small writes (for whatever reason) seem to be atomic;
AFAIK nobody gets corrupt Apache logs, for example.



--cw

2002-04-13 19:26:47

by Eric W. Biederman

Subject: Re: [PATCH] zerocopy NFS updated

Andi Kleen <[email protected]> writes:

> > I wonder if it is reasonable to depend on that: -- i.e. I'll only ever
> > see zeros, not say random bytes, or ones or something. I'm sure that's
> > so with the current kernel, and probably all of them ever (except for
> > bugs) but I wonder whether it's ok to rely on that.
>
> With truncates you should only ever see zeros. If you want this guarantee
> over system crashes you need to make sure to use the right file system
> though (e.g. ext2 or reiserfs without the ordered data mode patches or
> ext3 in writeback mode could give you junk if the system crashes at the
> wrong time). Still depending on only seeing zeroes would
> seem to be a bit fragile on me (what happens when the disk dies for
> example?), using some other locking protocol is probably more safe.

Could the garbage from ext3 in writeback mode be considered an
information leak? I know that is why most places in the kernel
initialize pages to 0. So you don't accidentally see what another
user put there.

Eric

2002-04-13 19:37:05

by Andi Kleen

Subject: Re: [PATCH] zerocopy NFS updated

On Sat, Apr 13, 2002 at 01:19:46PM -0600, Eric W. Biederman wrote:
> Could the garbage from ext3 in writeback mode be considered an
> information leak? I know that is why most places in the kernel
> initialize pages to 0. So you don't accidentally see what another
> user put there.

Yes it could. But then ext2/ffs have the same problem, and so far people have
been able to live with that.

-Andi

2002-04-13 20:41:12

by Eric W. Biederman

Subject: Re: [PATCH] zerocopy NFS updated

Andi Kleen <[email protected]> writes:

> On Sat, Apr 13, 2002 at 01:19:46PM -0600, Eric W. Biederman wrote:
> > Could the garbage from ext3 in writeback mode be considered an
> > information leak? I know that is why most places in the kernel
> > initialize pages to 0. So you don't accidentally see what another
> > user put there.
>
> Yes it could. But then ext2/ffs have the same problem and so far people were
> able to live on with that.

The reason I asked is that the description sounded specific to ext3. Also,
with ext3 a supported way to shut down is to just pull the power on the
machine, and the filesystem comes back to life without a full fsck.

So if this can happen when all you need to do is replay the journal, I
have issues with it. If it only happens in the case of a damaged
filesystem, I don't.

Eric

2002-04-14 00:08:08

by Keith Owens

Subject: Re: [PATCH] zerocopy NFS updated

On Sat, 13 Apr 2002 11:52:49 -0700,
Chris Wedgwood <[email protected]> wrote:
>On Fri, Apr 12, 2002 at 02:31:50PM -0700, David S. Miller wrote:
>
> If you need to depend upon a consistent snapshot of what some
> other thread writes into a file, you must have some locking
> protocol to use to synchronize with that other thread.
>
>Appends of small-writes (for whatever reason) seems to be atomic,
>AFAIK nobody gets corrupt apache logs for example.

Write in append mode must be atomic in the kernel. Whether a user
space write in append mode is atomic or not depends on how many write()
syscalls it takes to pass the data into the kernel. Each write()
append will be atomic but multiple writes can be interleaved.

2002-04-14 08:20:06

by Chris Wedgwood

Subject: Re: [PATCH] zerocopy NFS updated

On Sun, Apr 14, 2002 at 10:07:56AM +1000, Keith Owens wrote:

> Write in append mode must be atomic in the kernel. Whether a user
> space write in append mode is atomic or not depends on how many
> write() syscalls it takes to pass the data into the kernel. Each
> write() append will be atomic but multiple writes can be
> interleaved.

Up to what size? I assume I cannot assume O_APPEND atomicity for
(say) 100M writes?


--cw

2002-04-14 08:41:02

by Keith Owens

Subject: Re: [PATCH] zerocopy NFS updated

On Sun, 14 Apr 2002 01:19:46 -0700,
Chris Wedgwood <[email protected]> wrote:
>On Sun, Apr 14, 2002 at 10:07:56AM +1000, Keith Owens wrote:
>
> Write in append mode must be atomic in the kernel. Whether a user
> space write in append mode is atomic or not depends on how many
> write() syscalls it takes to pass the data into the kernel. Each
> write() append will be atomic but multiple writes can be
> interleaved.
>
>Up to what size? I assume I cannot assume O_APPEND atomicity for
>(say) 100M writes?

Atomic on that inode, not atomic wrt other I/O to other inodes. Most
write operations use generic_file_write() which grabs the inode semaphore.
No other writes (or indeed any other I/O) can proceed on the inode
until this write completes and releases the semaphore.

I suppose that some filesystem could use its own write method that
releases the lock during the write operation. I would not trust my
data to such filesystems; they violate SUSv2.

"If the O_APPEND flag of the file status flags is set, the file
offset shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing
the file offset and the write operation"
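
As a minimal userspace sketch of what that guarantee buys you (path and
record text are arbitrary examples): emit one complete record per write()
call on an O_APPEND descriptor, and concurrent appenders cannot interleave
within it.

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        const char record[] = "one complete log line per write() call\n";
        int fd = open("/tmp/append-demo.log",
                      O_WRONLY | O_CREAT | O_APPEND, 0644);

        /* a single write(): the kernel positions at EOF and copies the data
         * while holding the inode semaphore, so other appenders to the same
         * file cannot interleave within this record.  Splitting a record
         * across several write()s would lose that guarantee. */
        write(fd, record, sizeof(record) - 1);

        close(fd);
        return 0;
}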

2002-04-15 01:30:45

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hello, David

If you don't mind, could you give me some advice about
the sendpage mechanism?

I'd like to implement sendpage for the UDP stack, which NFS uses heavily.
It may improve the performance of NFS over UDP dramatically.

I wonder if, instead of sendpage, there were a "SENDPAGES" interface
between the socket layer and the inet layer, so we could send several pages
atomically with low overhead.
It would also make it easier for RPC over UDP to send
multiple pages as one UDP packet.

What do you think about this approach?

davem> I don't see it as a big problem and would just leave it as it is
davem> (for NFS and local)
davem>
davem> I agree with Andi. You can basically throw away my whole argument
davem> about this. Applications that require synchonization between the
davem> writer of file contents and reader of file contents must do some
davem> kind of locking amongst themselves at user level.

OK.

Regards,
Hirokazu Takahashi

2002-04-15 04:31:04

by David Miller

Subject: Re: [PATCH] zerocopy NFS updated

From: Hirokazu Takahashi <[email protected]>
Date: Mon, 15 Apr 2002 10:30:13 +0900 (JST)

> I'd like to implement sendpage for the UDP stack, which NFS uses heavily.
> It may improve the performance of NFS over UDP dramatically.
>
> I wonder if, instead of sendpage, there were a "SENDPAGES" interface
> between the socket layer and the inet layer, so we could send several pages
> atomically with low overhead.
> It would also make it easier for RPC over UDP to send
> multiple pages as one UDP packet.
>
> What do you think about this approach?

A sendpages mechanism will not be implemented.

You must implement UDP sendfile() one page at a time, by building up
an SKB with multiple calls, similar to TCP with the TCP_CORK socket option
set.

For datagram sockets, define a temporary SKB hung off of struct sock.
Define a UDP_CORK socket option which begins the "queue data only"
state.

All sendmsg()/sendfile() calls append to the temporary SKB; the first
sendmsg()/sendfile() call to UDP will create this sock->skb. The first
call may be sendmsg(), but subsequent calls for that SKB must be
sendfile() calls. If this pattern of calls is broken, the SKB is sent.

A call setting the UDP_CORK socket option to zero actually sends the SKB
being built.

The normal usage will be:

setsockopt(fd, UDP_CORK, 1);
sendmsg(fd, sunrpc_headers, sizeof(sunrpc_headers));
sendfile(fd, ...);
setsockopt(fd, UDP_CORK, 0);
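
For what it's worth, a minimal userspace sketch of that usage pattern,
assuming a UDP_CORK option shaped like TCP_CORK (it does not exist yet, so
the constant is defined locally with the proposed meaning) and a socket
already connect()ed to the client; the buffer and descriptor names are
illustrative only:

#include <netinet/in.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef UDP_CORK
#define UDP_CORK 1              /* assumed value, mirroring the proposal */
#endif

static void send_udp_reply(int sock, const void *hdr, size_t hdrlen,
                           int filefd, off_t off, size_t len)
{
        int one = 1, zero = 0;

        /* cork: queue data instead of emitting datagrams */
        setsockopt(sock, IPPROTO_UDP, UDP_CORK, &one, sizeof(one));

        send(sock, hdr, hdrlen, 0);             /* SUNRPC + NFS headers */
        sendfile(sock, filefd, &off, len);      /* page-cache payload */

        /* uncork: headers plus pages go out as a single datagram */
        setsockopt(sock, IPPROTO_UDP, UDP_CORK, &zero, sizeof(zero));
}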

2002-04-16 00:15:41

by Mike Fedyk

Subject: Re: [PATCH] zerocopy NFS updated

On Thu, Apr 11, 2002 at 03:16:16PM +0200, Andi Kleen wrote:
> On Thu, Apr 11, 2002 at 04:00:37PM -0200, Denis Vlasenko wrote:
> > On 11 April 2002 09:36, David S. Miller wrote:
> > > No, you must block truncate operations on the file until the client
> > > ACK's the nfsd read request if you wish to use sendfile() with
> > > nfsd.
> >
> > Which shouldn't be a big performance problem unless I am unaware
> > of some real-life applications doing heavy truncates.
>
> Every unlink does a truncate. There are applications that delete files
> a lot.

Is this true at the filesystem level or only in memory? If so, I could
imagine that it would make it much harder to undelete a file when you don't
even know how big it was (file set to 0 size)...

Why is this required? Could someone say quickly (as I'm sure it's probably
quite complex) or point me to some references?

2002-04-16 01:03:39

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi, David

Thank you for your advice!

davem> Sendpages mechanism will not be implemented.
davem>
davem> You must implement UDP sendfile() one page at a time, by building up
davem> an SKB with multiple calls similar to TCP with TCP_CORK socket option
davem> set.
davem>
davem> For datagram sockets, define temporary SKB hung off of struct sock.
davem> Define UDP_CORK socket option which begins the "queue data only"
davem> state.
davem>
davem> All sendmsg()/sendfile() calls append to temporary SKB, first
davem> sendmsg()/sendfile() call to UDP will create this sock->skb. First
davem> call may be sendmsg() but subsequent calls for that SKB must be
davem> sendfile() calls. If this pattern of calls is broken, SKB is sent.
davem>
davem> Call to set UDP_CORK socket option to zero actually sends the SKB
davem> being built.
davem>
davem> The normal usage will be:
davem>
davem> setsockopt(fd, UDP_CORK, 1);
davem> sendmsg(fd, sunrpc_headers, sizeof(sunrpc_headers));
davem> sendfile(fd, ...);
davem> setsockopt(fd, UDP_CORK, 0);

Yes, that seems to be the most general way.
OK, I'll do it this way first.

In the kernel, I'd probably implement it as follows:

        put an RPC header and an NFS header in "bufferA";
        down(semaphore);
        sendmsg(bufferA, MSG_MORE);
        for (each page of fileC)
                sock->ops->sendpage(page, islastpage ? 0 : MSG_MORE);
        up(semaphore);

The semaphore is required to serialize sending data, as many knfsd kthreads
use the same socket.

Actually, I'd like to implement it like the following code, but unfortunately
it wouldn't work on a server's UDP socket, as the socket has no specific
destination address at all, and sendpage has no argument to specify one.
It's not so good....

        put an RPC header and an NFS header on "pageB";
        down(semaphore);
        sock->ops->sendpage(pageB, MSG_MORE);
        for (each page of fileC)
                sock->ops->sendpage(page, islastpage ? 0 : MSG_MORE);
        up(semaphore);


Thank you,
Hirokazu Takahashi

2002-04-16 01:41:23

by Jakob Oestergaard

Subject: Re: [PATCH] zerocopy NFS updated

On Tue, Apr 16, 2002 at 10:03:02AM +0900, Hirokazu Takahashi wrote:
> Hi, David
>
...
>
> Yes, it seems to be the most general way.
> OK, I'll do this way first of all.
>
> In the kernel, probaboly I'd impelement as following:
>
> put a RPC header and a NFS header on "bufferA";
> down(semaphore);
> sendmsg(bufferA, MSG_MORE);
> for (eache pages of fileC)
> sock->opt->sendpage(page, islastpage ? 0 : MSG_MORE)
> up(semaphore);
>
> the semaphore is required to serialize sending data as many knfsd kthreads
> use the same socket.

Won't this serialize too much ? I mean, consider the situation where we
have file-A and file-B completely in cache, while file-C needs to be
read from the physical disk.

Three different clients (A, B and C) request file-A, file-B and file-C
respectively. The send of file-C is started first, and the sends of files
A and B (which could commence immediately and complete at near wire-speed)
will now have to wait (leaving the NIC idle) until file-C is read from
the disks.

Even if it's not the entire file but only a single NFS request (probably 8kB),
one disk seek (7ms) is still around 85 kB, or 10 8kB NFS requests (at 100Mbit).

Or am I misunderstanding ? Will your UDP sendpage() queue the requests ?

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2002-04-16 02:21:01

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi,

jakob> Won't this serialize too much ? I mean, consider the situation where we
jakob> have file-A and file-B completely in cache, while file-C needs to be
jakob> read from the physical disk.
jakob>
jakob> Three different clients (A, B and C) request file-A, file-B and file-C
jakob> respectively. The send of file-C is started first, and the sends of files
jakob> A and B (which could commence immediately and complete at near wire-speed)
jakob> will now have to wait (leaving the NIC idle) until file-C is read from
jakob> the disks.
jakob>
jakob> Even if it's not the entire file but only a single NFS request (probably 8kB),
jakob> one disk seek (7ms) is still around 85 kB, or 10 8kB NFS requests (at 100Mbit).
jakob>
jakob> Or am I misunderstanding ? Will your UDP sendpage() queue the requests ?

No problem.
In my implementation, a knfsd first grabs all the pages -- the part
of file-C needed to reply to the NFS client -- and only then starts to
send them. It won't block any other knfsds during disk I/O.

Thank you,
Hirokazu Takahashi.

2002-04-16 15:39:07

by Oliver Xymoron

Subject: Re: [PATCH] zerocopy NFS updated

On Mon, 15 Apr 2002, Mike Fedyk wrote:

> On Thu, Apr 11, 2002 at 03:16:16PM +0200, Andi Kleen wrote:
> > On Thu, Apr 11, 2002 at 04:00:37PM -0200, Denis Vlasenko wrote:
> > > On 11 April 2002 09:36, David S. Miller wrote:
> > > > No, you must block truncate operations on the file until the client
> > > > ACK's the nfsd read request if you wish to use sendfile() with
> > > > nfsd.
> > >
> > > Which shouldn't be a big performance problem unless I am unaware
> > > of some real-life applications doing heavy truncates.
> >
> > Every unlink does a truncate. There are applications that delete files
> > a lot.
>
> Is this true at the filesystem level or only in memory? If so, I could
> immagine that it would make it much harder to undelete a file when you don't
> even know how big it was (file set to 0 size)...
>
> Why is this required? Could someone say quickly (as I'm sure it's probably
> quite complex) or point me to some references?

Truncate is used to return the formerly used blocks to the free pool. It
is possible (and preferable) to avoid flushing out the modified file
metadata (inode and indirect blocks) for the deleted file, but
recoverability of deleted files has never been high on the priority list.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-04-18 05:02:46

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi,

I've been thinking about your comment, and I realized it was a good
suggestion.
There is no problem with zerocopy NFS, but if you want to
use UDP sendfile for streaming or something like that, you wouldn't
get good performance.

jakob> > the semaphore is required to serialize sending data as many knfsd kthreads
jakob> > use the same socket.
jakob>
jakob> Won't this serialize too much ? I mean, consider the situation where we
jakob> have file-A and file-B completely in cache, while file-C needs to be
jakob> read from the physical disk.
jakob>
jakob> Three different clients (A, B and C) request file-A, file-B and file-C
jakob> respectively. The send of file-C is started first, and the sends of files
jakob> A and B (which could commence immediately and complete at near wire-speed)
jakob> will now have to wait (leaving the NIC idle) until file-C is read from
jakob> the disks.
jakob>
jakob> Even if it's not the entire file but only a single NFS request (probably 8kB),
jakob> one disk seek (7ms) is still around 85 kB, or 10 8kB NFS requests (at 100Mbit).
jakob>
jakob> Or am I misunderstanding ? Will your UDP sendpage() queue the requests ?

There may be many threads on a streaming server, and they would share
the same UDP socket. The UDP_CORK mechanism requires a semaphore to serialize
sending data, and the threads would block each other for a long time because
sendfile() might make them sleep in block I/O, as you said.

You may say we can use MSG_MORE instead of UDP_CORK.
If we use it, we have to make a separate queue for each destination, i.e. for
each client. But we can't link pages to the queue, as sendfile() has no
argument specifying the destination.

client UDP sockets
+---------+
|dest:123 |---------+
|         |         |
+---------+         |           server
                    V          UDP socket
+---------+        +---------+ <--- thread1
|dest:123 |------->|src:123  | <--- thread2
|         |        |dest:ANY | <--- thread3
+---------+        +---------+ <--- thread4
                    A
+---------+         |
|dest:123 |---------+
|         |
+---------+

Shall I make multiple queues based on pid instead of destination address?
Any ideas are welcome!

Thank you,
Hirokazu Takahashi

2002-04-18 07:58:16

by Jakob Oestergaard

Subject: Re: [PATCH] zerocopy NFS updated

On Thu, Apr 18, 2002 at 02:01:55PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> I've been thinking about your comment, and I realized it was a good
> suggestion.
> There are no problem with the zerocopy NFS, but If you want to
> use UDP sendfile for streaming or something like that, you wouldn't
> get good performance.

Hi again,

So the problem is that it is too easy to use UDP sendfile "poorly", right?

Your NFS threads don't have the problem because you make sure that pages are in
core prior to the sendfile call, but not every developer may think that far...

...
>
> Shall I make a multiple queue based on pid instead of destitation address ?
> Any idea is welcome!

Ok, so here are some ideas. I'm no expert, so if something below seems subtle,
it's more likely to be plain stupid rather than something really clever ;)


In order to keep sendfile as simple as possible, perhaps one could just make
it fail if not all pages are in core.

So, your NFS send routine would be something like


retry:
        submit_read_requests
        await_io_completion

        rc = sendfile(..)
        if (rc == -EFAULT)
                goto retry

(I suppose even the retry is optional - this being UDP, the packet could be
dropped anywhere anyway. The rationale behind retrying immediately is that
"almost" all pages probably are in core)

That would keep sendfile simple, and force its users to think of a clever
way to make sure the pages are ready (and about what to do if they aren't).


This is obviously not something one can do from userspace. There, I think that
your suggestion with a queue per pid seems like a nice solution. What I worry
about is, if the machine is under heavy memory pressure and the queue entries
start piling up - if sendfile is not clever (or somehow lets the VM figure out
what to do with the requests), the queues will be competing against each other,
taking even longer to finish...

Perhaps userspace would call some sendfile wrapper routine that would do the
queue management, while kernel threads that are sufficiently clever by
themselves will just call the lightweight sendfile.

Or, will the queue management be simpler and less dangerous than I think ? :)


Cheers,

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2002-04-18 08:53:58

by Trond Myklebust

Subject: Re: [PATCH] zerocopy NFS updated

>>>>> " " == Hirokazu Takahashi <[email protected]> writes:

> Hi, I've been thinking about your comment, and I realized it
> was a good suggestion. There are no problem with the zerocopy
> NFS, but If you want to use UDP sendfile for streaming or
> something like that, you wouldn't get good performance.

Surely one can work around this in userland without inventing a load
of ad-hoc schemes in the kernel socket layer?

If one doesn't want to create a pool of sockets in order to service
the different threads, one can use generic methods such as
sys_readahead() in order to ensure that the relevant data gets paged
in prior to hogging the socket.

There is no difference between UDP and TCP sendfile() in this respect.
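
For example, a rough sketch of that pattern (the lock and function names
here are assumptions for illustration; readahead() is the Linux-specific
syscall wrapper):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sys/sendfile.h>
#include <sys/types.h>

static pthread_mutex_t sock_lock = PTHREAD_MUTEX_INITIALIZER;

static void stream_chunk(int sock, int filefd, off_t off, size_t len)
{
        /* ask the kernel to pull the range into the page cache first */
        readahead(filefd, off, len);

        /* only now serialize on the shared socket; the pages are very
         * likely in core, so the lock is held only briefly */
        pthread_mutex_lock(&sock_lock);
        sendfile(sock, filefd, &off, len);
        pthread_mutex_unlock(&sock_lock);
}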

Cheers,
Trond

2002-04-19 03:22:40

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi,

> > Hi, I've been thinking about your comment, and I realized it
> > was a good suggestion. There are no problem with the zerocopy
> > NFS, but If you want to use UDP sendfile for streaming or
> > something like that, you wouldn't get good performance.
>
> Surely one can work around this in userland without inventing a load
> of ad-hoc schemes in the kernel socket layer?
>
> If one doesn't want to create a pool of sockets in order to service
> the different threads, one can use generic methods such as
> sys_readahead() in order to ensure that the relevant data gets paged
> in prior to hogging the socket.

That makes sense.
It would work well enough in many cases, though it would be hard to
make sure that the data really is in core before sendfile().

> There is no difference between UDP and TCP sendfile() in this respect.

Yes.
And it seems to be more important for UDP sendfile():
processes or threads sharing the same UDP socket would affect each other,
while processes or threads on TCP sockets don't care about this, as a TCP
connection is peer to peer.

Thank you,
Hirokazu Takahashi.

2002-04-19 09:19:04

by Trond Myklebust

Subject: Re: [PATCH] zerocopy NFS updated

On Friday 19. April 2002 05:21, Hirokazu Takahashi wrote:

> And it seems to be more important on UDP sendfile().
> processes or threads sharing the same UDP socket would affect each other,
> while processes or threads on TCP sockets don't care about it as TCP
> connection is peer to peer.

No. It is not the lack of peer-to-peer connections that gives rise to the
bottleneck, but the idea of several threads multiplexing sendfile() through a
single socket. Given a bad program design, it can be done over TCP too.

The conclusion is that the programmer really ought to choose a different
design. For multimedia streaming, for instance, it makes sense to use 1 UDP
socket per thread rather than to multiplex the output through one socket.

Cheers,
Trond

2002-04-20 07:48:28

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi,

> > And it seems to be more important on UDP sendfile().
> > processes or threads sharing the same UDP socket would affect each other,
> > while processes or threads on TCP sockets don't care about it as TCP
> > connection is peer to peer.
>
> No. It is not the lack of peer-to-peer connections that gives rise to the
> bottleneck, but the idea of several threads multiplexing sendfile() through a
> single socket. Given a bad program design, it can be done over TCP too.
>
> The conclusion is that the programmer really ought to choose a different
> design. For multimedia streaming, for instance, it makes sense to use 1 UDP
> socket per thread rather than to multiplex the output through one socket.

You mean, create UDP sockets which have the same port number?
Yes, we can if we use setsockopt(SO_REUSEADDR).
And it could lead to less contention between CPUs.
Sounds good!
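
For example, a minimal sketch of creating one such per-thread socket
(function name invented; error handling omitted); each thread would call
this with the same port:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

static int make_worker_socket(unsigned short port)
{
        struct sockaddr_in addr;
        int one = 1;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        return fd;      /* one of these per thread, all bound to the same port */
}

(Note the delivery quirk Terje reports later in this thread about which of
the sockets bound this way actually receives incoming datagrams.)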

Thank you,
Hirokazu Takahashi.

2002-04-20 10:15:25

by Hirokazu Takahashi

Subject: Re: [PATCH] zerocopy NFS updated

Hi,

> With all this talk on serialization on UDP, I have a question. First, let
> me explain the situation. I have an NFS test which calls 48 clients to read
> the same 200 MB file on the same server. I record the time for all the
> clients to finish and then calculate the total throughput. The server is a
> 4-way IA32. (I used this test to measure the zerocopy/tcp/nfs patch) Now,
> right before the test, the 200 MB file is created on the server, so there is
> no disk IO at all during the test. It's just a very simple cached read.
> Now, when the clients use udp, I can only get a run queue length of 1, and I
> have confirmed there is only one nfsd thread in svc_process() at one time,
> and I am 65% idle. With tcp, I can get all nfsd threads running, and max all
> CPUs. Am I experiencing a bottleneck/serialization due to a single UDP
> socket?

What version are you using?
The 2.5.8 kernel has a problem with readahead in NFSD;
it doesn't work at all.

It should be easy to fix.

Thank you,
Hirokazu Takahashi.

2002-04-20 15:46:03

by Andrew Theurer

Subject: Re: [PATCH] zerocopy NFS updated

> Hi,
>
> > With all this talk on serialization on UDP, I have a question. First, let
> > me explain the situation. I have an NFS test which calls 48 clients to read
> > the same 200 MB file on the same server. I record the time for all the
> > clients to finish and then calculate the total throughput. The server is a
> > 4-way IA32. (I used this test to measure the zerocopy/tcp/nfs patch) Now,
> > right before the test, the 200 MB file is created on the server, so there is
> > no disk IO at all during the test. It's just a very simple cached read.
> > Now, when the clients use udp, I can only get a run queue length of 1, and I
> > have confirmed there is only one nfsd thread in svc_process() at one time,
> > and I am 65% idle. With tcp, I can get all nfsd threads running, and max all
> > CPUs. Am I experiencing a bottleneck/serialization due to a single UDP
> > socket?
>
> What version are you using?
> The 2.5.8 kernel has a problem with readahead in NFSD;
> it doesn't work at all.

I have this problem on every version I have used, including 2.4.18, 2.4.18
w/ Niel's patches, 2.5.6, and 2.5.7. One other thing I forgot to mention:
If I set the number of resident nfsd threads to "2", I can get 2 nfsd
threads running at once (nfsd_busy = 2), along with ~30% improvement in
throughput. If I use any other qty of resident nfsd threads, I always get
exactly 1 nfsd thread running (nfsd_busy = 1) during this test. With tcp
there is no serialization at all. I can get nearly 48 nfsd threads busy
with the 48 clients all reading at once.

-Andrew

2002-04-24 23:09:31

by Mike Fedyk

Subject: Re: [PATCH] zerocopy NFS updated

On Sat, Apr 13, 2002 at 02:34:12PM -0600, Eric W. Biederman wrote:
> Andi Kleen <[email protected]> writes:
>
> > On Sat, Apr 13, 2002 at 01:19:46PM -0600, Eric W. Biederman wrote:
> > > Could the garbage from ext3 in writeback mode be considered an
> > > information leak? I know that is why most places in the kernel
> > > initialize pages to 0. So you don't accidentally see what another
> > > user put there.
> >
> > Yes it could. But then ext2/ffs have the same problem and so far people were
> > able to live on with that.
>
> The reason I asked, is the description sounded specific to ext3. Also
> with ext3 a supported way to shutdown is to just pull the power on the
> machine. And the filesystem comes back to life without a full fsck.
>
> So if this can happen when all you need is to replay the journal, I
> have issues with it. If this happens in the case of damaged
> filesystem I don't.
>

Actually, with ext3 the only mode that will keep this from happening is,
IIRC, data=journal. In ordered or writeback mode there is a window where the
pages will be zeroed in memory, but not on disk.

Admittedly, the time window is largest in writeback mode, smaller in ordered,
and smallest (non-existent?) in data journaling mode.

Mike

2002-04-25 12:37:49

by Terje Eggestad

Subject: Possible bug with UDP and SO_REUSEADDR. Was Re: [PATCH] zerocopy NFS updated

Seeing this mail from Hirokazu triggered my curiosity; I'd never
contemplated using SO_REUSEADDR on a UDP socket.

However, when I write a test server that sits in a blocking wait on a UDP
socket and start two instances of the server, it's ALWAYS the server
started last that gets the UDP message, even if it's not in a blocking
wait and the first started server is.

Smells like a bug to me; this behavior doesn't make much sense.

Using stock 2.4.17.

TJ


On Sat, 2002-04-20 at 09:47, Hirokazu Takahashi wrote:
> Hi,
>
> > > And it seems to be more important on UDP sendfile().
> > > processes or threads sharing the same UDP socket would affect each other,
> > > while processes or threads on TCP sockets don't care about it as TCP
> > > connection is peer to peer.
> >
> > No. It is not the lack of peer-to-peer connections that gives rise to the
> > bottleneck, but the idea of several threads multiplexing sendfile() through a
> > single socket. Given a bad program design, it can be done over TCP too.
> >
> > The conclusion is that the programmer really ought to choose a different
> > design. For multimedia streaming, for instance, it makes sense to use 1 UDP
> > socket per thread rather than to multiplex the output through one socket.
>
> You mean, create UDP sockets which have the same port number?
> Yes we can if we use setsockopt(SO_REUSEADDR).
> And it could lead less contention between CPUs.
> Sounds good!
>
> Thank you,
> Hirokazu Takahashi.
--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

2002-04-25 17:13:37

by Andreas Dilger

Subject: Re: [PATCH] zerocopy NFS updated

On Apr 24, 2002 16:11 -0700, Mike Fedyk wrote:
> Actually, with ext3 the only mode IIRC is data=journal that will keep this
> from happening. In ordered or writeback mode there is a window where the
> pages will be zeroed in memory, but not on disk.
>
> Admittedly, the time window is largest in writeback mode, smaller in ordered
> and smallest (non-existant?) in data journaling mode.

One thing you are forgetting is that with data=ordered mode, the inode
itself is not updated until the data has been written to the disk. So
technically you are correct - with ordered mode there is a window where
pages are updated in memory but not on disk, but if you crash during
that window the inode size will be the old size so you will still not be
able to access the un-zero'd data on disk.

It is only with data=writeback that this could be a problem, because
there is no ordering between updating the inode and writing the data
to disk. That's why there is only a real benefit to using
data=writeback for applications like databases and such where the file
size doesn't change and you are writing into the middle of the file.
In many cases, data=ordered is actually faster than data=writeback.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-04-26 02:52:47

by David Miller

Subject: Re: Possible bug with UDP and SO_REUSEADDR. Was Re: [PATCH] zerocopy NFS updated

From: Terje Eggestad <[email protected]>
Date: 25 Apr 2002 14:37:44 +0200

However, writing a test server that stands in a blocking wait on a UDP
socket and starting two instances of the server, it's ALWAYS the
last-started server that gets the UDP message, even if it's not in a
blocking wait and the first-started server is.

Smells like a bug to me; this behavior doesn't make much sense.

Using stock 2.4.17.

Can you post your test server/client application so that I
don't have to write it myself and guess how you did things?

Thanks.

2002-04-26 07:39:01

by Terje Eggestad

[permalink] [raw]
Subject: Re: Possible bug with UDP and SO_REUSEADDR. Was Re: [PATCH] zerocopy NFS updated

'course



On Fri, 2002-04-26 at 04:43, David S. Miller wrote:
> From: Terje Eggestad <[email protected]>
> Date: 25 Apr 2002 14:37:44 +0200
>
> However, writing a test server that stands in a blocking wait on a UDP
> socket and starting two instances of the server, it's ALWAYS the
> last-started server that gets the UDP message, even if it's not in a
> blocking wait and the first-started server is.
>
> Smells like a bug to me; this behavior doesn't make much sense.
>
> Using stock 2.4.17.
>
> Can you post your test server/client application so that I
> don't have to write it myself and guess how you did things?
>
> Thanks.
--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________


Attachments:
client.c (1.25 kB)
server.c (1.26 kB)

2002-04-29 00:41:59

by David Schwartz

[permalink] [raw]
Subject: Re: Possible bug with UDP and SO_REUSEADDR.



On Thu, 25 Apr 2002 19:43:01 -0700 (PDT), David S. Miller wrote:
>From: Terje Eggestad <[email protected]>
>Date: 25 Apr 2002 14:37:44 +0200
>
>However, writing a test server that stands in a blocking wait on a UDP
>socket and starting two instances of the server, it's ALWAYS the
>last-started server that gets the UDP message, even if it's not in a
>blocking wait and the first-started server is.
>
>Smells like a bug to me; this behavior doesn't make much sense.
>
>Using stock 2.4.17.
>
>Can you post your test server/client application so that I
>don't have to write it myself and guess how you did things?

There are really two possibilities:

1) The two instances are cooperating closely together and should be sharing
a socket (not each opening one), or

2) The two instances are not cooperating closely together and each own their
own socket. For all the kernel knows, they don't even know about each other.

In the first case, it's logical for whichever one happens to try to read
first to get the/a datagram. In the second case, it's logical for the kernel
to pick one and give it all the data.

DS


2002-04-29 08:06:47

by Terje Eggestad

[permalink] [raw]
Subject: Re: Possible bug with UDP and SO_REUSEADDR.

On Mon, 2002-04-29 at 02:41, David Schwartz wrote:
>
>
> On Thu, 25 Apr 2002 19:43:01 -0700 (PDT), David S. Miller wrote:
> >From: Terje Eggestad <[email protected]>
> >Date: 25 Apr 2002 14:37:44 +0200
> >
> >However, writing a test server that stands in a blocking wait on a UDP
> >socket and starting two instances of the server, it's ALWAYS the
> >last-started server that gets the UDP message, even if it's not in a
> >blocking wait and the first-started server is.
> >
> >Smells like a bug to me; this behavior doesn't make much sense.
> >
> >Using stock 2.4.17.
> >
> >Can you post your test server/client application so that I
> >don't have to write it myself and guess how you did things?
>
> There are really two possibilities:
>
> 1) The two instances are cooperating closely together and should be sharing
> a socket (not each opening one), or
>
> 2) The two instances are not cooperating closely together and each own their
> own socket. For all the kernel knows, they don't even know about each other.
>
> In the first case, it's logical for whichever one happens to try to read
> first to get the/a datagram. In the second case, it's logical for the kernel
> to pick one and give it all the data.
>
> DS
>

IMHO, in the second case it's logical for the kernel NOT to allow the
second one to bind to the port at all. Which is what it actually does;
that's the normal case. When you set the SO_REUSEADDR flag on the
socket, you're telling the kernel that we're in case 1).

TJ

--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

2002-04-29 08:44:51

by David Schwartz

[permalink] [raw]
Subject: Re: Possible bug with UDP and SO_REUSEADDR.


>> 1) The two instances are cooperating closely together and should be
>>sharing
>>a socket (not each opening one), or
>>
>> 2) The two instances are not cooperating closely together and each own
>>their
>>own socket. For all the kernel knows, they don't even know about each
>>other.
>>
>> In the first case, it's logical for whichever one happens to try to read
>>first to get the/a datagram. In the second case, it's logical for the
>>kernel
>>to pick one and give it all the data.
>>
>> DS

>IMHO, in the second case it's logical for the kernel NOT to allow the
>second one to bind to the port at all. Which is what it actually does;
>that's the normal case. When you set the SO_REUSEADDR flag on the
>socket, you're telling the kernel that we're in case 1).
>
>TJ

NO. When you set the SO_REUSEADDR, you are telling the kernel that you
intend to share your port with *someone*, but not who. The kernel has no way
to know that two processes that bind to the same UDP port with SO_REUSEADDR
are the two that were intended to cooperate with each other. For all it
knows, one is a foo intended to cooperate with other foo's and the other is a
bar intended to cooperate with other bar's.

That's why if you mean to share, you should share the actual socket
descriptor rather than trying to reference the same transport endpoint with
two different sockets.

Of course, in this case you don't even need SO_REUSEADDR/SO_REUSEPORT since
you only actually open the endpoint once.
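
A rough sketch of that approach (hypothetical code, not taken from
anything posted in this thread; port 9999 is an arbitrary choice): open
the endpoint once, then fork(), so both processes read from the very
same socket and no SO_REUSEADDR is involved at all.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
        int s;
        struct sockaddr_in sin;
        char buf[1024];

        /* Open and bind the UDP endpoint exactly once... */
        s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) { perror("socket"); exit(1); }
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        sin.sin_port = htons(9999);             /* arbitrary example port */
        if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
                perror("bind");
                exit(1);
        }

        /* ...then fork(); parent and child now share the very same socket,
         * and the kernel wakes whichever of them is blocked in recvfrom(). */
        if (fork() < 0) { perror("fork"); exit(1); }

        for (;;) {
                ssize_t n = recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);
                if (n < 0) { perror("recvfrom"); exit(1); }
                printf("pid %d got %zd bytes\n", (int)getpid(), n);
        }
}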

DS


2002-04-29 10:03:22

by Terje Eggestad

[permalink] [raw]
Subject: Re: Possible bug with UDP and SO_REUSEADDR.

On Mon, 2002-04-29 at 10:44, David Schwartz wrote:
>
> >> 1) The two instances are cooperating closely together and should be
> >>sharing
> >>a socket (not each opening one), or
> >>
> >> 2) The two instances are not cooperating closely together and each own
> >>their
> >>own socket. For all the kernel knows, they don't even know about each
> >>other.
> >>
> >> In the first case, it's logical for whichever one happens to try to read
> >>first to get the/a datagram. In the second case, it's logical for the
> >>kernel
> >>to pick one and give it all the data.
> >>
> >> DS
>
> >IMHO, in the second case it's logical for the kernel NOT to allow the
> >second one to bind to the port at all. Which is what it actually does;
> >that's the normal case. When you set the SO_REUSEADDR flag on the
> >socket, you're telling the kernel that we're in case 1).
> >
> >TJ
>
> NO. When you set the SO_REUSEADDR, you are telling the kernel that you
> intend to share your port with *someone*, but not who. The kernel has no way
> to know that two processes that bind to the same UDP port with SO_REUSEADDR
> are the two that were intended to cooperate with each other. For all it
> knows, one is a foo intended to cooperate with other foo's and the other is a
> bar intended to cooperate with other bar's.
>
> That's why if you mean to share, you should share the actual socket
> descriptor rather than trying to reference the same transport endpoint with
> two different sockets.
>
> Of course, in this case you don't even need SO_REUSEADDR/SO_REUSEPORT since
> you only actually open the endpoint once.
>

Well, first of all, I picked up "Unix Network Programming, Networking
APIs: Sockets and XTI" by W. Richard Stevens. This is discussed on
p. 195-196 (with reference to "TCP/IP Illustrated" Vol 2, p. 777-779,
which I don't have at hand). According to Stevens, duplicate binding to
the same address (IP + port) is a multicast/broadcast feature, and the
test code I published here a few mails ago is actually illegal on hosts
that
a) don't implement multicast, or
b) implement SO_REUSEPORT (which Linux, as of now, doesn't).

FYI: in case b), using SO_REUSEPORT to bind a duplicate address works
the same way SO_REUSEADDR does now: all parties must set the flag.

Stevens further remarks that when a unicast datagram is received on the
port only one socket shall receive it, *** and which one is
implementation specific. ***!!!

*** So the current implementation is NOT a bug. *** (If you believe
Stevens that is :-) I do.).


I even agree that the *proper* way for two or more programs to share a
UDP port is to share the socket; it just creates an issue about who shall
create the AF_UNIX socket used to pass the descriptor, and what happens
when the owner of the AF_UNIX socket dies (the others will, after all,
most likely continue). Not to mention the extra code needed in the
programs to implement the descriptor-passing algorithm.
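
For illustration only, a minimal sketch of such descriptor passing over
an AF_UNIX socket with SCM_RIGHTS (hypothetical code, not the algorithm
of any program discussed here; socketpair() plus fork() stands in for
the named AF_UNIX socket and the ownership handshake a real
implementation would need, and port 9999 is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/wait.h>

/* Send one open descriptor over an AF_UNIX socket. */
static int send_fd(int unix_sock, int fd)
{
        struct msghdr msg;
        struct iovec iov;
        char byte = 0;
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr *cmsg;

        memset(&msg, 0, sizeof(msg));
        memset(cbuf, 0, sizeof(cbuf));
        iov.iov_base = &byte;
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
}

/* Receive one descriptor from an AF_UNIX socket; returns -1 on failure. */
static int recv_fd(int unix_sock)
{
        struct msghdr msg;
        struct iovec iov;
        char byte;
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr *cmsg;
        int fd = -1;

        memset(&msg, 0, sizeof(msg));
        iov.iov_base = &byte;
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        if (recvmsg(unix_sock, &msg, 0) < 0)
                return -1;
        cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg && cmsg->cmsg_type == SCM_RIGHTS)
                memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
        return fd;
}

int main(void)
{
        int sp[2], udp;
        struct sockaddr_in sin;

        /* The AF_UNIX channel; a real setup would use a named AF_UNIX
         * socket that the second program connects to. */
        socketpair(AF_UNIX, SOCK_STREAM, 0, sp);

        /* The UDP endpoint, opened and bound exactly once. */
        udp = socket(AF_INET, SOCK_DGRAM, 0);
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        sin.sin_port = htons(9999);             /* arbitrary example port */
        bind(udp, (struct sockaddr *)&sin, sizeof(sin));

        if (fork() == 0) {
                /* Second program: receives the already-bound socket and
                 * could now recvfrom() on it. */
                int shared = recv_fd(sp[1]);
                printf("child received descriptor %d\n", shared);
                exit(0);
        }
        send_fd(sp[0], udp);
        wait(NULL);
        return 0;
}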


However, I still can't see any *practical* use of having one program
(me) bind the port, deliberately share it, and another program (you)
come along wanting to share it, with all unicast datagrams then being
passed to you, not to me. If I haven't subscribed to any multicast
addresses and no one is sending bcasts, there is no point in me being
alive.

Can you come up with a real-life situation where this makes sense?


Like I said, it's currently not a bug, and IMHO the behavior should be
changed only if SO_REUSEPORT is implemented.



> DS
>
>

TJ

--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

2002-04-29 10:38:27

by David Schwartz

[permalink] [raw]
Subject: Re: Possible bug with UDP and SO_REUSEADDR.


>However, I still can't see any *practical* use of having one program
>(me) bind the port, deliberately share it, and another program (you)
>come along wanting to share it, with all unicast datagrams then being
>passed to you, not to me. If I haven't subscribed to any multicast
>addresses and no one is sending bcasts, there is no point in me being
>alive.
>
>Can you come up with a real-life situation where this makes sense?

Absolutely. This is actually used in cases where you have a 'default'
handler for a protocol that is built into a larger program but want to keep
the option to 'override' it with a program with more sophisticated behavior
from time to time. In this case, the last socket should get all the data
until it goes away.

DS




2002-04-29 14:20:33

by Terje Eggestad

[permalink] [raw]
Subject: Re: Possible bug with UDP and SO_REUSEADDR.

On Mon, 2002-04-29 at 12:38, David Schwartz wrote:
>
> >However, I still can't see any *practical* use of having one program
> >(me) bind the port, deliberately share it, and another program (you)
> >come along wanting to share it, with all unicast datagrams then being
> >passed to you, not to me. If I haven't subscribed to any multicast
> >addresses and no one is sending bcasts, there is no point in me being
> >alive.
> >
> >Can you come up with a real-life situation where this makes sense?
>
> Absolutely. This is actually used in cases where you have a 'default'
> handler for a protocol that is built into a larger program but want to keep
> the option to 'override' it with a program with more sophisticated behavior
> from time to time. In this case, the last socket should get all the data
> until it goes away.
>
> DS
>

First of all, since we're in agreement that the current behavior is NOT
a bug, this discussion is pretty pointless; however, I'm getting worked up.

In all fairness, I have a colleague who did an implementation of TCP/IP
a decade ago, and he agrees that the current logic is the way
implementations have worked. Thus we're less likely to break things by
leaving them the way they are.

However, your logic is broken.
First of all, I asked for a case where it makes sense, not where it has
moronically been done that way. If you review your own argument:

> That's why if you mean to share, you should share the actual socket
> descriptor rather than trying to reference the same transport endpoint
> with two different sockets.

The program that wants to "override" should connect to the first one
over an AF_UNIX socket and get the descriptor, and the first one should
be told not to read from the UDP socket until the AF_UNIX connection to
the overrider is broken/disconnected.

Since, according to Stevens, what happens here is implementation
specific, the "overriding" you describe is non-portable.

If you look at your other argument:
> NO. When you set the SO_REUSEADDR, you are telling the
> kernel that you intend to share your port with *someone*, but not who.
> The kernel has no way to know that two processes that bind to the same
> UDP port with SO_REUSEADDR are the two that were intended to
> cooperate with each other. For all it knows, one is a foo intended to
> cooperate with other foo's and the other is a bar intended to
> cooperate with other bar's.

The logical deduction from this is that you should never, ever, bind to
the same address for unicast, since the kernel doesn't have sufficient
information to route the datagram correctly. I *COULD* agree that it
should be illegal to bind twice to the same address.

Trouble is, right now that is actually legal...

TJ

>
>
>
--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________