2005-04-25 17:20:24

by Olivier Galibert

[permalink] [raw]
Subject: tcp_sendpage and page allocation lifetime vs. iscsi

I have a problem with the iscsi driver (both 4.x and 5.x) and scsi
tape I'm not sure how to solve. It may linked to some specific
characteristics of the tg3 network driver.

What happens is, from what I can trace:
1- st alloc_pages a bunch of pages for buffering

2- st sends a bunch of them to iscsi for writing (32K is common when
labelling a tape for instance)

3- iscsi sends whatever header is needed followed by the data using
tcp_sendpage

4- tcp_sendpage copies from of the pages but get_page() others,
probably depending on the state of the socket buffer. It returns
immediatly anyway, leaving some pages with an elevated count (which, I
guess, it will eventually decrement again)

5- iscsi returns to st

6- st reuses the buffer immediatly, and/or frees it if the device is
closed. Silent corruption in one case, bad_page in __free_page_ok
called from normalize_buffer in the other.

I'm going to complete my traces to be sure that's really what's going
on (I don't have a log immediatly after sendpage yet). But in any
case, what would the solution be?

OG.


2005-04-25 19:22:25

by Avi Kivity

[permalink] [raw]
Subject: Re: tcp_sendpage and page allocation lifetime vs. iscsi

Olivier Galibert wrote:

>I have a problem with the iscsi driver (both 4.x and 5.x) and scsi
>tape I'm not sure how to solve. It may linked to some specific
>characteristics of the tg3 network driver.
>
>What happens is, from what I can trace:
>1- st alloc_pages a bunch of pages for buffering
>
>2- st sends a bunch of them to iscsi for writing (32K is common when
> labelling a tape for instance)
>
>3- iscsi sends whatever header is needed followed by the data using
> tcp_sendpage
>
>4- tcp_sendpage copies from of the pages but get_page() others,
> probably depending on the state of the socket buffer. It returns
> immediatly anyway, leaving some pages with an elevated count (which, I
> guess, it will eventually decrement again)
>
>5- iscsi returns to st
>
>6- st reuses the buffer immediatly, and/or frees it if the device is
> closed. Silent corruption in one case, bad_page in __free_page_ok
> called from normalize_buffer in the other.
>
>I'm going to complete my traces to be sure that's really what's going
>on (I don't have a log immediatly after sendpage yet). But in any
>case, what would the solution be?
>
>
>
you need a completion to tell you when your buffer has been sent. you
can use the kiocb parameter to tcp_sendmsg, as it has a completion.
however, tcp_sendmsg does not appear to use it.

in effect, you need tcp aio, but the mainline kernel does not support it
yet.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2005-04-25 19:34:41

by David Miller

[permalink] [raw]
Subject: Re: tcp_sendpage and page allocation lifetime vs. iscsi

On Mon, 25 Apr 2005 22:11:16 +0300
Avi Kivity <[email protected]> wrote:

> you need a completion to tell you when your buffer has been sent. you
> can use the kiocb parameter to tcp_sendmsg, as it has a completion.
> however, tcp_sendmsg does not appear to use it.
>
> in effect, you need tcp aio, but the mainline kernel does not support it
> yet.

Or, he could simply not try to reuse the private buffer he is
giving to TCP.

2005-04-25 19:43:37

by Avi Kivity

[permalink] [raw]
Subject: Re: tcp_sendpage and page allocation lifetime vs. iscsi

David S. Miller wrote:

>On Mon, 25 Apr 2005 22:11:16 +0300
>Avi Kivity <[email protected]> wrote:
>
>
>
>>you need a completion to tell you when your buffer has been sent. you
>>can use the kiocb parameter to tcp_sendmsg, as it has a completion.
>>however, tcp_sendmsg does not appear to use it.
>>
>>in effect, you need tcp aio, but the mainline kernel does not support it
>>yet.
>>
>>
>
>Or, he could simply not try to reuse the private buffer he is
>giving to TCP.
>
>
you are describing a memory leak. at some point he must free (or
otherwise reuse) these pages.

theoretically he could peek at the tcp sequence number, but an
event-driven, protocol-agnostic completion seems better to me.

* light goes on *

yes, if he frees the pages immediately after tcp_sendpage, then the
reference count would remain elevated until tcp completes sending these
pages. so the sequence

allocate pages
fill with data
tcp_sendpage()
free pages

should be safe?

(I am still wishing for tcp aio, though)

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2005-04-25 19:45:59

by David Miller

[permalink] [raw]
Subject: Re: tcp_sendpage and page allocation lifetime vs. iscsi

On Mon, 25 Apr 2005 22:43:19 +0300
Avi Kivity <[email protected]> wrote:

> yes, if he frees the pages immediately after tcp_sendpage, then the
> reference count would remain elevated until tcp completes sending these
> pages.

That's what I meant, sorry for the confusion.

2005-04-25 22:06:58

by Olivier Galibert

[permalink] [raw]
Subject: Re: tcp_sendpage and page allocation lifetime vs. iscsi

On Mon, Apr 25, 2005 at 12:19:53PM -0700, David S. Miller wrote:
> Or, he could simply not try to reuse the private buffer he is
> giving to TCP.

'cept the buffer is managed by scsi/st.c, not the iscsi driver. So
it's either changing the read/write buffer control rules scsi uses, or
systematically copying the data before sending. Not that good
performance-wise, especially since most of the time tcp_sendpage does
not want to grab the page.

Do you think possible to extent the sendpage api to add some kind of
"don't get the pages, copy them if you need them" flag?

OG.

2005-04-25 22:17:17

by David Miller

[permalink] [raw]
Subject: Re: tcp_sendpage and page allocation lifetime vs. iscsi

On Tue, 26 Apr 2005 00:06:03 +0200
Olivier Galibert <[email protected]> wrote:

> Do you think possible to extent the sendpage api to add some kind of
> "don't get the pages, copy them if you need them" flag?

No, not really.

Do you happen to run the scsi->done() function from iscsi
as soon as the write over the TCP socket completes returns
success? That is likely what is causing the problem.

When you call scsi->done(), the buffer is effectively released
and the scsi/st.c driver can legally reuse it once you've done
that.

tcp_sendpages() is really meant to be invoked for page cache
pages, or temporary pages cons'd up specifically for that
send call. Just look at what TCP sendmsg does, for example.
It carves up a per-socket cached PAGE to put the user's data
into.

You could do something similar in iSCSI and for now I highly
suggest that is what you do.

You could also:

1) set TCP_CORK to 1
2) tcp_sendmsg() the scsi tape data
3) tcp_sendpage() to remaining pages
4) set TCP_CORK to 0

so that tcp_sendmsg() does all the data copying for you.

Finally, you could also use "SIOCOUTQ" ioctl to watch the
write buffer get released. Call it once before you do the send,
save that value, then after your send wait for it to hit
or pass the old value you saved.

In short, you're using an API in a way it was never designed
to be used. We don't lock pages, and that is a deliberate
design decision. When we send pages over the wire using
TCP sendpages out of the page cache, the file contents _CAN_
change mid-send, but that's OK because the card calculates
the packet checksums so no data corruption nor quality of
implementation issues arise as a result.

Again, this behavior and these mechanics were deliberately
made to function this way.

2005-04-25 22:31:12

by Olivier Galibert

[permalink] [raw]
Subject: Re: tcp_sendpage and page allocation lifetime vs. iscsi

On Mon, Apr 25, 2005 at 03:08:40PM -0700, David S. Miller wrote:
> You could do something similar in iSCSI and for now I highly
> suggest that is what you do.

Ok, I'll try to implement that.


> In short, you're using an API in a way it was never designed
> to be used.

Heh, I don't, I'm just debugging it ;-)

It's interesting though that both the 4.x guys (linux-iscsi) and the
5.x ones (openiscsi, now both have merged) did the exact same mistake
while afaik their developments were independant. There is a
documentation issue there. If you could put what you just said in
Documentation/*, it should hopefully prevent this kind of
hard-to-track mistakes in the future.

OG.

2005-04-29 17:10:02

by Dmitry Yusupov

[permalink] [raw]
Subject: Re: tcp_sendpage and page allocation lifetime vs. iscsi

On Mon, 2005-04-25 at 15:08 -0700, David S. Miller wrote:
> On Tue, 26 Apr 2005 00:06:03 +0200
> Olivier Galibert <[email protected]> wrote:
>
> > Do you think possible to extent the sendpage api to add some kind of
> > "don't get the pages, copy them if you need them" flag?
>
> No, not really.
>
> Do you happen to run the scsi->done() function from iscsi
> as soon as the write over the TCP socket completes returns
> success? That is likely what is causing the problem.

Your guess is correct. If this happen Olivier will see corruption sooner
or later. But this should never happen, unless we have a subtle bug in
iscsi_tcp.c which needs to be verified and fixed.

Since SCSI tape device usually causing a lot of sense data, we probably
have a bug around sense data path processing. And this is not related to
TCP API abuse in anyway. I would recommend to continue this discussion
on http://www.open-iscsi.org mailing list.

Dima