2008-01-08 22:28:13

by Brent Casavant

[permalink] [raw]
Subject: AF_UNIX MSG_PEEK bug?

Hello,

I was coding an application which passes variable-length messages
over an AF_UNIX SOCK_STREAM socket. As such I would pass a message
length embedded as the first element in the message, use recv(...,MSG_PEEK)
to determine the message length, and perform the necessary allocation
on the receiving end before performing the real recv().

However, the program would occasionally get into a situation where
a call to recv(sockfd, &buf, len, MSG_PEEK) returns some number
of bytes less than the requested length, and persists in this state
(i.e. retrying the call continues to return the same amount of data)
even when more than sufficient data is known to have been successfully
written to the socket.

By my reading of the af_unix.c kernel code, I believe the problem lies in
unix_stream_recvmsg(). Here we find that, under MSG_PEEK, only a single
skb is pulled from the receive queue and the data copied into the user
buffer. This fails to account for the fact that this skb could contain less
data than is necessary to satisfy the MSG_PEEK request. The "short" skb
is then pushed back onto the queue, and thus a retried MSG_PEEK will again
return a short response. The only way to clear this condition is to
perform a recv() without specifying MSG_PEEK.

I'll be happy to submit a patch to fix this, but would like to first
confirm with others that A) this is incorrect behavior for MSG_PEEK
semantics and B) my approach to the fix sounds reasonable.

What I propose as a fix is, for MSG_PEEK operations, dequeueing enough
skbs to satisfy the operation (checking of course that there is enough
data in the queue in the first place), copying out the requested data
to the user buffer, then pushing those skbs back onto the receive queue.

I'm not terribly familiar with the Linux networking code, so if the
above approach sounds misguided, please feel free to let me know.

Below you'll find an example program (not the real program, I know
there's some no-nos in here, this was written simply to demonstrate
the bug) that demonstrates the problem. It fails quite reliably and
quickly on an SGI Altix IA64, though given the nature of the problem I
don't believe it's architecture-specific. Running the program with
no arguments produces output similar to the following:

consumer: recv of message size is not progressing at time 1199829455094356.
consumer: want 8 bytes, but only receiving 5.
consumer: SIOCOUTQ=0, SIOCINQ=202421
consumer: had 36985 successes
producer: sendmsg failed at time 1199829455094571

The meaning of the first two lines is obvious (UNIX time in microseconds).
The third line gives the output and input queue lengths for the AF_UNIX
socket at the time of the error message, to confirm that there is indeed
sufficient data available for satisfying the MSG_PEEK operation. The
fourth line details how many iterations of the main loop of the program
were successful until we got hung up on the MSG_PEEK. The final line
simply confirms that the producer failed as a result of the consumer
closing the socket, not the other way around.

Thanks,
Brent Casavant

#include <errno.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/un.h>

#include <linux/sockios.h>

#define SOCKNAME "/tmp/peektest.sock"
#define MAXLEN 2048
#define NUMMSG 64
#define ERRLOOPS 50000

struct sockmsg {
size_t length;
char payload[MAXLEN];
};

struct sockmsg messages[NUMMSG];
struct iovec iov[NUMMSG];

void
do_consumer(char *socketname) {
struct stat sb;
int status;
int sockfd, fd;
struct sockaddr_un su;
ssize_t bytes;
struct msghdr msg = { .msg_name = NULL, .msg_namelen = 0,
.msg_iov = iov, .msg_iovlen = 1,
.msg_control = NULL, .msg_controllen = 0,
.msg_flags = 0 };
int successes = 0;

/* Wait for socket to become available */
do {
sleep(1);
status = stat(socketname, &sb);
} while ((status < 0) && (ENOENT == errno));
if ((status < 0) && (ENOENT != errno)) {
printf("consumer: stat failed\n");
return;
}
if (!S_ISSOCK(sb.st_mode)) {
printf("consumer: %s is not a socket\n", socketname);
return;
}

/* Connect to the producer */
sockfd = socket(AF_UNIX, SOCK_STREAM, 0);
if (sockfd < 0) {
printf("consumer: socket() failed\n");
return;
}
su.sun_family = PF_UNIX;
strcpy(su.sun_path, socketname);
if (connect(sockfd, (struct sockaddr*) &su, SUN_LEN(&su)) < 0) {
printf("consumer: connect() failed\n");
return;
}

/* Main loop here. Inspect the first size_t in the incoming
* message to determine the message length. Then read in
* the entire message.
*/
do {
size_t msglen;
int loops;

/* XXX: Here's where the trouble occurs. Even though the
* socket gets filled up quite plentifully by the producer,
* the recv(..., MSG_PEEK) operation eventually will get
* stuck returning less bytes than requested and never
* return any more.
* Note that including or not MSG_WAITALL in the flags
* makes no difference.
*/
/* Get length of next message */
loops = 0;
do {
int outq, inq;
bytes = recv(sockfd, &msglen, sizeof(size_t),
MSG_PEEK|MSG_WAITALL);
if (loops++ > ERRLOOPS) {
struct timeval tv;
gettimeofday(&tv, NULL);
printf("consumer: recv of message size is "
"not progressing at time %ld.\n",
tv.tv_sec*1000000 + tv.tv_usec);
printf("consumer: want %d bytes, but only "
"receiving %d.\n", sizeof(size_t),
bytes);
ioctl(sockfd, SIOCOUTQ, &outq);
ioctl(sockfd, SIOCINQ, &inq);
printf("consumer: SIOCOUTQ=%d, SIOCINQ=%d\n",
outq, inq);
printf("consumer: had %d successes\n",
successes);
return;
}
/* Just in case, give the producer a little time to
* fill the queue if we're not making progress.
*/
if ((bytes < sizeof(size_t)) && (0 == (loops % 10000)))
sleep(1);
} while (bytes < sizeof(size_t));

/* Receive the message */
iov[0].iov_len = msglen;
bytes = recvmsg(sockfd, &msg, MSG_WAITALL);
if (bytes < 0) {
printf("consumer: recvmsg failed\n");
return;
} else if (bytes != msglen) {
printf("consumer: short recvmsg\n");
return;
}
/* Discard what we receive, we don't really care
* about it.
*/
successes++;
} while(1);
}

void
do_producer(char *socketname) {
int sockfd, fd;
struct sockaddr_un su;
struct msghdr msg = { .msg_name = NULL, .msg_namelen = 0,
.msg_iov = iov, .msg_iovlen = NUMMSG,
.msg_control = NULL, .msg_controllen = 0,
.msg_flags = 0 };
ssize_t sent;

/* Remove any leftover socket with each run */
unlink(SOCKNAME);

/* Create/bind/listen/accept socket */
sockfd = socket(AF_UNIX, SOCK_STREAM, 0);
if (sockfd < 0) {
printf("producer: socket() failed\n");
return;
}
su.sun_family = PF_UNIX;
strcpy(su.sun_path, socketname);
if (bind(sockfd, (struct sockaddr*) &su, SUN_LEN(&su)) < 0) {
printf("producer: bind() failed\n");
return;
}
if (listen(sockfd, 5) < 0) {
printf("producer: listen() failed\n");
return;
}
while ((fd = accept(sockfd, NULL, 0)) < 0)
printf("producer: accept() failed\n");

/* Don't leave it laying around when we exit */
unlink(SOCKNAME);

/* Main loop here. Set message lengths, and send them over the
* socket.
*/
srand(getpid());
do {
int i;
size_t tosend;
struct timeval tv;

/* Set up each IOV/message */
tosend = 0;
for (i = 0; i < msg.msg_iovlen; i++) {
/* Send random lengths of data */
messages[i].length =
(rand() % MAXLEN) + sizeof(size_t);
iov[i].iov_len = messages[i].length;
tosend += messages[i].length;
}

/* Send the message */
sent = sendmsg(fd, &msg, MSG_NOSIGNAL);
if (sent < 0) {
gettimeofday(&tv, NULL);
printf("producer: sendmsg failed at time %ld\n",
tv.tv_sec*1000000 + tv.tv_usec);
return;
}
if (sent < tosend) {
gettimeofday(&tv, NULL);
printf("producer: short sendmsg at time %ld\n",
tv.tv_sec*1000000 + tv.tv_usec);
return;
}
} while(1);
}

int
main(int argc, char *argv[]) {
pid_t pid;
int i;

/* Initialize messages and IOVs */
for (i = 0; i < NUMMSG; i++) {
memset(&messages[i], 0, sizeof(struct sockmsg));
iov[i].iov_base = &messages[i];
iov[i].iov_len = sizeof(struct sockmsg);
}

/* With no arguments, start both producer and consumer */
if (argc < 2) {
if ((pid = fork()) < 0) {
printf("Problem with fork\n");
return 1;
} else if (0 == pid)
do_producer(SOCKNAME);
else
do_consumer(SOCKNAME);
return 0;
}

/* Otherwise start the specified role */
if (0 == strcmp(argv[1], "producer"))
do_producer(SOCKNAME);
else if (0 == strcmp(argv[1], "consumer"))
do_consumer(SOCKNAME);
else {
printf("Error: Must specify \"producer\" or \"consumer\" "
"on command line.\n");
return 1;
}

return 0;
}

--
Brent Casavant All music is folk music. I ain't
[email protected] never heard a horse sing a song.
Silicon Graphics, Inc. -- Louis Armstrong


2008-01-08 22:40:36

by Rick Jones

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

Potential bugs notwithstanding, given that this is a STREAM socket, and
as such shouldn't (I hope, or I'm eating toes for dinner again) have
side effects like tossing the rest of a datagram, why are you using
MSG_PEEK? Why not simply read the N bytes of the message that will have
the message length with a normal read/recv, and then read that many
bytes in the next call?

rick jones

2008-01-08 22:54:21

by Tom Spink

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

On 08/01/2008, Rick Jones <[email protected]> wrote:
> Potential bugs notwithstanding, given that this is a STREAM socket, and
> as such shouldn't (I hope, or I'm eating toes for dinner again) have
> side effects like tossing the rest of a datagram, why are you using
> MSG_PEEK? Why not simply read the N bytes of the message that will have
> the message length with a normal read/recv, and then read that many
> bytes in the next call?
>
> rick jones
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

Hi,

Where in the code is the message length being sent across the socket?

--
Regards,
Tom Spink
University of Edinburgh

2008-01-08 23:19:06

by Brent Casavant

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

On Tue, 8 Jan 2008, Tom Spink wrote:

> Where in the code is the message length being sent across the socket?

In do_producer(), there are the following lines in the main loop:

/* Send random lengths of data */
messages[i].length = (rand() % MAXLEN) + sizeof(size_t);
iov[i].iov_len = messages[i].length;

The entire "struct sockmsg" is sent across the socket, so the first
size_t in each message contains the length of the entire message
(including the size_t). This size gets picked up at the
recv(...,MSG_PEEK) line in do_consumer().

Thanks,
Brent

--
Brent Casavant All music is folk music. I ain't
[email protected] never heard a horse sing a song.
Silicon Graphics, Inc. -- Louis Armstrong

2008-01-08 23:20:29

by Brent Casavant

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

On Tue, 8 Jan 2008, Rick Jones wrote:

> Potential bugs notwithstanding, given that this is a STREAM socket, and as
> such shouldn't (I hope, or I'm eating toes for dinner again) have side effects
> like tossing the rest of a datagram, why are you using MSG_PEEK? Why not
> simply read the N bytes of the message that will have the message length with
> a normal read/recv, and then read that many bytes in the next call?

That's entirely reasonable, and probably a worthwhile change to make.
But, as you say, it doesn't change whether or not this is a bug in
the MSG_PEEK code.

With a small bit of complication I certainly can do what you suggest.

The initial reasoning was that this made it easy to handle the case where
the caller of the library routine (my code which stumbled on this was
part of a small library I wrote as part of the application) did not supply
a sufficiently sized buffer for reception of the next message. The "easy"
way to do this was a MSG_PEEK to validate the buffer size against the
message size, followed by a regular recv() if the buffer is large enough.
To do what you suggest (which effectively works around the bug) is certainly
possible, but also requires maintaining state between calls to the
"get message" routine, as I need to track whether or not the size has already
been read from the next message on the stream socket.

The change isn't terribly difficult, but wasn't the initial idea.

Plus it's always good to flush a bug out of hiding, right? ;)

Brent

--
Brent Casavant All music is folk music. I ain't
[email protected] never heard a horse sing a song.
Silicon Graphics, Inc. -- Louis Armstrong

2008-01-08 23:40:15

by Tom Spink

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

On 08/01/2008, Brent Casavant <[email protected]> wrote:
> On Tue, 8 Jan 2008, Tom Spink wrote:
>
> > Where in the code is the message length being sent across the socket?
>
> In do_producer(), there are the following lines in the main loop:
>
> /* Send random lengths of data */
> messages[i].length = (rand() % MAXLEN) + sizeof(size_t);
> iov[i].iov_len = messages[i].length;
>
> The entire "struct sockmsg" is sent across the socket, so the first
> size_t in each message contains the length of the entire message
> (including the size_t). This size gets picked up at the
> recv(...,MSG_PEEK) line in do_consumer().
>
> Thanks,
> Brent
>
> --
> Brent Casavant All music is folk music. I ain't
> [email protected] never heard a horse sing a song.
> Silicon Graphics, Inc. -- Louis Armstrong
>

Hi,

But you're not consuming the size_t on the other end. You're only
peeking it, i.e. you're doing the recv to peek at the message, but
never calling recv to remove that data from the queue... or am I
missing something?

--
Regards,
Tom Spink
University of Edinburgh

2008-01-08 23:46:28

by Tom Spink

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

On 08/01/2008, Tom Spink <[email protected]> wrote:
> On 08/01/2008, Brent Casavant <[email protected]> wrote:
> > On Tue, 8 Jan 2008, Tom Spink wrote:
> >
> > > Where in the code is the message length being sent across the socket?
> >
> > In do_producer(), there are the following lines in the main loop:
> >
> > /* Send random lengths of data */
> > messages[i].length = (rand() % MAXLEN) + sizeof(size_t);
> > iov[i].iov_len = messages[i].length;
> >
> > The entire "struct sockmsg" is sent across the socket, so the first
> > size_t in each message contains the length of the entire message
> > (including the size_t). This size gets picked up at the
> > recv(...,MSG_PEEK) line in do_consumer().
> >
> > Thanks,
> > Brent
> >
> > --
> > Brent Casavant All music is folk music. I ain't
> > [email protected] never heard a horse sing a song.
> > Silicon Graphics, Inc. -- Louis Armstrong
> >
>
> Hi,
>
> But you're not consuming the size_t on the other end. You're only
> peeking it, i.e. you're doing the recv to peek at the message, but
> never calling recv to remove that data from the queue... or am I
> missing something?
>
> --
> Regards,
> Tom Spink
> University of Edinburgh
>

Ach. I *am* missing something... and what I'm missing is my
understanding of the sendmsg and recvmsg calls.

A quick Google has sorted me out.

--
Regards,
Tom Spink
University of Edinburgh

2008-01-09 00:08:46

by Brent Casavant

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

On Tue, 8 Jan 2008, Tom Spink wrote:

> Ach. I *am* missing something... and what I'm missing is my
> understanding of the sendmsg and recvmsg calls.

No problem. We all have those days. :)

Brent

--
Brent Casavant All music is folk music. I ain't
[email protected] never heard a horse sing a song.
Silicon Graphics, Inc. -- Louis Armstrong

2008-01-09 11:04:49

by Tetsuo Handa

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

Hello.

Brent Casavant wrote:
> However, the program would occasionally get into a situation where
> a call to recv(sockfd, &buf, len, MSG_PEEK) returns some number
> of bytes less than the requested length, and persists in this state
> (i.e. retrying the call continues to return the same amount of data)
> even when more than sufficient data is known to have been successfully
> written to the socket.
Did you try MSG_WAITALL flag? See "man 2 recv".
A TCP socket handles data in bytes.
You cannot complain if the amount received by recv() is smaller than expected
unless you use MSG_WAITALL flag.

Thanks.

2008-01-09 18:01:55

by Brent Casavant

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

On Wed, 9 Jan 2008, Tetsuo Handa wrote:

> Did you try MSG_WAITALL flag? See "man 2 recv".
> A TCP socket handles data in bytes.
> You cannot complain if the amount received by recv() is smaller than expected
> unless you use MSG_WAITALL flag.

Yes. It made no difference, as noted in the comments in the
provided test program.

Brent

--
Brent Casavant All music is folk music. I ain't
[email protected] never heard a horse sing a song.
Silicon Graphics, Inc. -- Louis Armstrong

2008-01-10 00:58:47

by Herbert Xu

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

Brent Casavant <[email protected]> wrote:
>
>> Did you try MSG_WAITALL flag? See "man 2 recv".
>> A TCP socket handles data in bytes.
>> You cannot complain if the amount received by recv() is smaller than expected
>> unless you use MSG_WAITALL flag.
>
> Yes. It made no difference, as noted in the comments in the
> provided test program.

The POSIX text for MSG_WAITALL actually says that when used in
conjunction with MSG_PEEK it may not return the full data.

Having said that, I do agree that having TCP and AF_UNIX behave
in the same way is a plus. However, if you really want this to
happen it would help if you had attached a patch :)

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2008-01-10 01:19:20

by Brent Casavant

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

On Thu, 10 Jan 2008, Herbert Xu wrote:

> The POSIX text for MSG_WAITALL actually says that when used in
> conjunction with MSG_PEEK it may not return the full data.

That's fine. The problem is that the peek operation returns less
data than requested even when sufficient data is available on the
receive queue.

> However, if you really want this to
> happen it would help if you had attached a patch :)

A patch is definitely in progress. I'm a little confused as
to the difference between unix_detach_fds() and scm_fp_dup()
in the MSG_PEEK versus !MSG_PEEK paths in unix_stream_recvmsg(),
however once I figure that out a patch should be forthcoming.

Brent

--
Brent Casavant All music is folk music. I ain't
[email protected] never heard a horse sing a song.
Silicon Graphics, Inc. -- Louis Armstrong

2008-01-10 02:50:25

by Brent Casavant

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

On Thu, 10 Jan 2008, Herbert Xu wrote:

> Having said that, I do agree that having TCP and AF_UNIX behave
> in the same way is a plus. However, if you really want this to
> happen it would help if you had attached a patch :)

The following patch appears to fix the problem. However, I would
really appreciate if someone more familiar with UNIX credential
passing could review what I did here and provide some feedback.
It wasn't at all clear to me why the MSG_PEEK and !MSG_PEEK code
paths were taking different actions regarding credentials, so I
unified them to behave the same way as the !MSG_PEEK code path.

I haven't tested credential passing under this new code yet, but
hope to do so tomorrow. Again, a review with an eye towards
that area would be most appreciated.

The various test cases I had which tripped this bug now pass without
incident. Also, if netstat is to be believed, quite a few programs
which utilize AF_UNIX sockets (e.g. hald, dbus, acpid) are running
very contentedly on the test system, so at least I didn't severely
break anything, possible UNIX credential issues notwithstanding.

Note: I'm not proposing this patch be integrated yet -- I'm throwing
it out here as a starting point.

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 060bba4..2ffdf5b 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -50,6 +50,9 @@
* Arnaldo C. Melo : Remove MOD_{INC,DEC}_USE_COUNT,
* the core infrastructure is doing that
* for all net proto families now (2.5.69+)
+ * Brent Casavant : SOCK_STREAM MSG_PEEK should peek
+ * far enough ahead to satisfy the request
+ * rather than stop after the first skb.
*
*
* Known differences from reference BSD that was tested:
@@ -1750,6 +1753,8 @@ static int unix_stream_recvmsg(struct ki
int target;
int err = 0;
long timeo;
+ struct sk_buff *skb;
+ struct sk_buff_head peek_stack;

err = -EINVAL;
if (sk->sk_state != TCP_ESTABLISHED)
@@ -1759,6 +1764,9 @@ static int unix_stream_recvmsg(struct ki
if (flags&MSG_OOB)
goto out;

+ if (flags & MSG_PEEK)
+ skb_queue_head_init(&peek_stack);
+
target = sock_rcvlowat(sk, flags&MSG_WAITALL, size);
timeo = sock_rcvtimeo(sk, flags&MSG_DONTWAIT);

@@ -1778,7 +1786,6 @@ static int unix_stream_recvmsg(struct ki
do
{
int chunk;
- struct sk_buff *skb;

unix_state_lock(sk);
skb = skb_dequeue(&sk->sk_receive_queue);
@@ -1845,39 +1852,31 @@ static int unix_stream_recvmsg(struct ki
copied += chunk;
size -= chunk;

- /* Mark read part of skb as used */
- if (!(flags & MSG_PEEK))
- {
- skb_pull(skb, chunk);
-
- if (UNIXCB(skb).fp)
- unix_detach_fds(siocb->scm, skb);
+ /* Credential passing */
+ if (UNIXCB(skb).fp)
+ unix_detach_fds(siocb->scm, skb);

- /* put the skb back if we didn't use it up.. */
+ if (!(flags & MSG_PEEK)) {
+ /* Mark read part of skb as used */
+ skb_pull(skb, chunk);
+ /* Return unused portion or free skb */
if (skb->len)
- {
skb_queue_head(&sk->sk_receive_queue, skb);
- break;
- }
-
- kfree_skb(skb);
-
- if (siocb->scm->fp)
- break;
- }
- else
- {
- /* It is questionable, see note in unix_dgram_recvmsg.
- */
- if (UNIXCB(skb).fp)
- siocb->scm->fp = scm_fp_dup(UNIXCB(skb).fp);
+ else
+ kfree_skb(skb);
+ } else
+ __skb_queue_head(&peek_stack, skb);

- /* put message back and return */
- skb_queue_head(&sk->sk_receive_queue, skb);
+ /* Stop early when passed credentials are encountered */
+ if (siocb->scm->fp)
break;
- }
} while (size);

+ /* Push all peeked skbs back onto receive queue */
+ if (flags & MSG_PEEK)
+ while ((skb = __skb_dequeue(&peek_stack)))
+ skb_queue_head(&sk->sk_receive_queue, skb);
+
mutex_unlock(&u->readlock);
scm_recv(sock, msg, siocb->scm, flags);
out:

2008-01-10 22:36:17

by Brent Casavant

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

Here's what I think is a better patch. Or maybe just simpler.

However, I'm still unsure what the effect of this patch on
file descriptor passing might be. Reading the prior code,
and the parallel portions/comments in unix_dgram_recvmsg(),
it looks like there's been a lot of uncertainty as to how
file descriptor passing should be handled durning MSG_PEEK
operations. To quote:

/* It is questionable: on PEEK we could:
- do not return fds - good, but too simple 8)
- return fds, and do not return them on read (old strategy,
apparently wrong)
- clone fds (I chose it for now, it is the most universal
solution)

POSIX 1003.1g does not actually define this clearly
at all. POSIX 1003.1g doesn't define a lot of things
clearly however!
*/

With this patch, passed file descriptors are ignored during MSG_PEEK.
This is essentially the first case in the comment above. What I
can't seem to figure out is why this is incorrect. I suspect there's
some history here that I can't find via Google, mailing list archives,
or revision logs.

So, that said, here's a cleaner patch. It's still not ready for
application until the file descriptor passing is better understood.

Thanks,
Brent

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 060bba4..6d6cdb4 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1750,6 +1750,8 @@ static int unix_stream_recvmsg(struct ki
int target;
int err = 0;
long timeo;
+ struct sk_buff *skb;
+ struct sk_buff_head peek_stack;

err = -EINVAL;
if (sk->sk_state != TCP_ESTABLISHED)
@@ -1759,6 +1761,9 @@ static int unix_stream_recvmsg(struct ki
if (flags&MSG_OOB)
goto out;

+ if (flags & MSG_PEEK)
+ skb_queue_head_init(&peek_stack);
+
target = sock_rcvlowat(sk, flags&MSG_WAITALL, size);
timeo = sock_rcvtimeo(sk, flags&MSG_DONTWAIT);

@@ -1778,7 +1783,6 @@ static int unix_stream_recvmsg(struct ki
do
{
int chunk;
- struct sk_buff *skb;

unix_state_lock(sk);
skb = skb_dequeue(&sk->sk_receive_queue);
@@ -1864,19 +1868,14 @@ static int unix_stream_recvmsg(struct ki

if (siocb->scm->fp)
break;
- }
- else
- {
- /* It is questionable, see note in unix_dgram_recvmsg.
- */
- if (UNIXCB(skb).fp)
- siocb->scm->fp = scm_fp_dup(UNIXCB(skb).fp);
+ } else
+ __skb_queue_head(&peek_stack, skb);
+ } while (size);

- /* put message back and return */
+ /* Push all peeked skbs back onto receive queue */
+ if (flags & MSG_PEEK)
+ while ((skb = __skb_dequeue(&peek_stack)))
skb_queue_head(&sk->sk_receive_queue, skb);
- break;
- }
- } while (size);

mutex_unlock(&u->readlock);
scm_recv(sock, msg, siocb->scm, flags);

2008-01-10 22:39:21

by Alan

[permalink] [raw]
Subject: Re: AF_UNIX MSG_PEEK bug?

> and the parallel portions/comments in unix_dgram_recvmsg(),
> it looks like there's been a lot of uncertainty as to how
> file descriptor passing should be handled durning MSG_PEEK
> operations. To quote:

The specs basically don't answer the question. What is critical is that
the behaviour does not change compared with older Linux releases.