2003-09-08 16:58:30

by Sven Köhler

Subject: [NBD] patch and documentation

The NBD-protocol

How it works:
To use the NBD features of the kernel you must compile the kernel module
called "Network block device support" (CONFIG_BLK_DEV_NBD). After you've
loaded the module, the devices will appear in /dev/nbd/ if you're using
devfs. If you're not using devfs, you may need to create the device
nodes with "mknod". All NBD devices have major ID 43; /dev/nbd/0
has minor ID 0, /dev/nbd/1 has minor ID 1, and so on.
To connect an NBD device to a remote server you need to install the NBD
tools, downloadable at http://nbd.sf.net or http://sf.net/projects/nbd. Run
"nbd-client <host> <tcp-port> /dev/nbd/0" to connect the device /dev/nbd/0
to the remote server. nbd-client opens a TCP connection to the server
and waits for some initial data which contains the size of the device.
Then the handle of the TCP connection is transferred to the kernel for
further use. nbd-client forks into the background, because the handle of
the TCP connection would be closed by the kernel if nbd-client exited.
The tools also contain a very basic NBD server which enables you to use
any file or device as an NBD.

The protocol:
The protocol runs on top of TCP/IP. Both client and server send packets
to each other. The server must send an init-packet to the client when a
client connects. After that the server just receives request-packets from
the client and sends back reply-packets until the connection is closed or a
disconnect-request is received.

The amount of data that can be read or written with one request
is limited to 128KB.

The current implementation of the NBD protocol in the Linux kernel
sends multiple requests without waiting for replies, so it makes sense
for the server to handle requests in parallel.

Constants:
INIT_PASSWD = "NBDMAGIC"
INIT_MAGIC = 0x0000420281861253
REQUEST_MAGIC = 0x25609513
REPLY_MAGIC = 0x67446698

REQUEST_READ = 0
REQUEST_WRITE = 1
REQUEST_DISCONNECT = 2

Packets:

init-packet:
passwd : 8 bytes (string) = INIT_PASSWD
magic : 8 bytes (integer) = INIT_MAGIC
size : 8 bytes (integer) = size of the device in bytes
reserved : 128 bytes (filled with zeros)

request-packet:
magic : 4 bytes (integer) = REQUEST_MAGIC
type : 4 bytes (integer) = REQUEST_READ, REQUEST_WRITE or REQUEST_DISCONNECT
handle : 8 bytes (integer) = request-handle
from : 8 bytes (integer) = start offset for read/write-operation in bytes
length : 4 bytes (integer) = length of the read/write-operation in bytes
data : x bytes (only for write request, x = length field of this packet)

reply-packet:
magic : 4 bytes (integer) = REPLY_MAGIC
error : 4 bytes (integer) = errorcode (0 = no error)
handle : 8 bytes (integer) = copy of request-handle
data : x bytes (only for reply to read request and if no error occurred,
x = length field of the request packet)

All integer values are unsigned and stored in network byte order (big-endian).
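The packet layouts above can be expressed compactly with Python's struct module. The following is an illustrative sketch of the on-wire encoding described by this document (it is not code from the NBD tools, and the function names are made up for illustration):

```python
import struct

# Wire-format constants from the spec above
INIT_PASSWD = b"NBDMAGIC"
INIT_MAGIC = 0x0000420281861253
REQUEST_MAGIC = 0x25609513
REPLY_MAGIC = 0x67446698
REQUEST_READ, REQUEST_WRITE, REQUEST_DISCONNECT = 0, 1, 2

# All integers are unsigned, network byte order (">" in struct notation).
INIT_FMT = ">8sQQ128x"   # passwd, magic, size, 128 reserved zero bytes
REQUEST_FMT = ">IIQQI"   # magic, type, handle, from, length
REPLY_FMT = ">IIQ"       # magic, error, handle

def pack_init(size):
    """Build the 152-byte init-packet announcing the device size."""
    return struct.pack(INIT_FMT, INIT_PASSWD, INIT_MAGIC, size)

def pack_request(type_, handle, offset, length):
    """Build the 28-byte request header (write data follows separately)."""
    return struct.pack(REQUEST_FMT, REQUEST_MAGIC, type_, handle, offset, length)

def unpack_reply(buf):
    """Parse the 16-byte reply header; read data follows separately."""
    magic, error, handle = struct.unpack(REPLY_FMT, buf)
    assert magic == REPLY_MAGIC
    return error, handle
```

The header sizes fall straight out of the field table: 8+8+8+128 = 152 bytes for the init-packet, 4+4+8+8+4 = 28 bytes for a request, and 4+4+8 = 16 bytes for a reply.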


Attachments:
nbd-patch.gz (487.00 B)
protocol.txt (3.05 kB)

2003-09-08 19:47:13

by Sven Köhler

Subject: Re: [NBD] patch and documentation

> Paul, do you think these docs could be added to the end of
> Documentation/nbd.txt?

Well, I noticed that I used tabs in some places; sorry for that. The
tabs should be replaced by 4 spaces.

> The patch also looks harmless enough for applying ;-).

I take it as a compliment.

2003-09-08 19:40:41

by Pavel Machek

Subject: Re: [NBD] patch and documentation

Hi!

Paul, do you think these docs could be added to the end of
Documentation/nbd.txt?

The patch also looks harmless enough for applying ;-).
Pavel

> The patch I attached defines max_sectors for each NBD device to be 256
> sectors, which is 128KB as described in the protocol description I
> attached. The value 255, which is the default in 2.4 kernels, is not
> optimal. See my other posting on the lkml for why.
> The patch seems to work fine. My server implementation receives one
> 128KB request as expected instead of two requests of 127KB and 1KB.
>
> Even if 256 is the new max_sectors default of kernel 2.6, the patch
> should be applied, since the value should be part of the protocol
> specification and therefore part of nbd.c.
>
> The documentation I attached should be published somewhere, for example
> on nbd.sf.net, since we didn't find one source where all the information
> is collected.
>
> Thx
> Sven



--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-09-08 20:05:59

by Paul Clements

Subject: Re: [NBD] patch and documentation

Pavel Machek wrote:

> Paul, do you think these docs could be added to the end of
> Documentation/nbd.txt?

I've made some changes to 2.6 nbd.txt, so let me look at integrating
this into those changes and also putting them into 2.4.

> The patch also looks harmless enough for applying ;-).

Harmless enough, although I'm not sure it really makes that much
difference. The max_sectors being set to 255 doesn't, by itself, explain
the back and forth 127k, 1k request thing. Typically what you'll see is
127k, 127k, 127k, etc. and then some odd sized request at the end. Or
the device gets unplugged anyway at some point and there are odd sized
requests scattered throughout...that's especially going to be true if
the reads or writes are from an actual disk, rather than /dev/null. It
may be just coincidence that setting max_sectors to 256 actually helps.
Also, are we sure that all those requests you're seeing are of the same
type (all reads, all writes)?

--
Paul



2003-09-08 20:22:00

by Sven Köhler

Subject: Re: [NBD] patch and documentation

>>The patch also looks harmless enough for applying ;-).
>
> Harmless enough, although I'm not sure it really makes that much
> difference. The max_sectors being set to 255 doesn't, by itself, explain
> the back and forth 127k, 1k request thing. Typically what you'll see is
> 127k, 127k, 127k, etc. and then some odd sized request at the end. Or
> the device gets unplugged anyway at some point and there are odd sized
> requests scattered throughout...that's especially going to be true if
> the reads or writes are from an actual disk, rather than /dev/null. I
> may be just coincidence that setting max_sectors to 256 actually helps.
> Also, are we sure that all those requests you're seeing are of the same
> type (all reads, all writes)?

Well, I guess the cache uses a value of 256 sectors to do read-ahead and
such. I used "dd if=/dev/nbd/0 of=/dev/null bs=X" with both X=1 and X=1M,
both with the same result. That the 1-byte requests join together into
bigger ones can only be explained by read-ahead strategies.
Anyway, the result is always the same:

without patch: 127KB, 1KB, 127KB, 1KB
with patch: 128KB, 128KB, 128KB

As long as dd doesn't write, I'm sure that I didn't see any write
requests. In addition, it is a very regular pattern.
If it is really the case that the cache reads 256 sectors and the
default limit is 255, then this would also happen for all other
block devices. In addition, it would be a good thing to check whether the
cache takes max_sectors into account while determining the
amount of sectors it reads ahead.
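The effect described here can be illustrated with a little arithmetic. This is a deliberately simplified model, not the kernel's actual request-merging logic: a 256-sector chunk cannot pass a 255-sector max_sectors cap in one piece, so it must be split into at least two requests.

```python
SECTOR_SIZE = 512

def split_chunk(total_sectors, max_sectors):
    """Split a contiguous chunk into requests no larger than max_sectors."""
    parts = []
    while total_sectors > 0:
        n = min(total_sectors, max_sectors)
        parts.append(n)
        total_sectors -= n
    return parts

# A 128KB read-ahead chunk is 256 sectors of 512 bytes.
chunk = (128 * 1024) // SECTOR_SIZE     # 256 sectors
print(split_chunk(chunk, 255))          # two requests: [255, 1]
print(split_chunk(chunk, 256))          # one request:  [256]
```

The exact sizes the kernel produces also depend on merging and alignment, but the point stands: any cap below 256 sectors forces the 128KB chunk into at least two requests per round trip.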


2003-09-08 21:11:21

by Paul Clements

Subject: Re: [NBD] patch and documentation

Sven Köhler wrote:

> Well, i guess the cache uses a value of 256 sectors to do read-ahead and
> such.

Well, it sounds like the real problem here is the vm_max_readahead
setting, then. Try this:

cat /proc/sys/vm/max-readahead

Probably, it's set to 31 on your system.

Try something like the following:

echo "126" > /proc/sys/vm/max-readahead

I think that should help out.

--
Paul

2003-09-08 22:08:48

by Sven Köhler

Subject: Re: [NBD] patch and documentation

>>Well, i guess the cache uses a value of 256 sectors to do read-ahead and
>>such.
>
> Well it sounds like the real problem here is the vm_max_readahead
> setting then. Try this:

I will try it, although I think that I'm using the default values.

Anyway: the NBD module should set max_sectors to a certain value - I
chose 256 sectors. Perhaps Pavel or Paul may decide to use a higher or
smaller value. A limit should be part of the protocol, or negotiated when
connecting to the server (which is not possible without changing the
protocol).

2003-09-08 22:17:49

by Sven Köhler

Subject: Re: [NBD] patch and documentation

Another idea would be to be able to specify max_sectors while
connecting an NBD. That would add an optional parameter to the nbd-client
command line (like it is possible for the blocksize).

2003-09-08 22:21:26

by Pavel Machek

Subject: Re: [NBD] patch and documentation

Hi!

> Another idea would be to be able to specify max_sectors while
> connecting an NBD. That would add an optional parameter to the nbd-client
> command line (like it is possible for the blocksize).

I do not see why it should be configurable...

Pavel


2003-09-08 22:17:56

by Pavel Machek

Subject: Re: [NBD] patch and documentation

Hi!

> >>Well, i guess the cache uses a value of 256 sectors to do read-ahead and
> >>such.
> >
> >Well it sounds like the real problem here is the vm_max_readahead
> >setting then. Try this:
>
> I will try it, although I think that I'm using the default values.
>
> Anyway: the NBD module should set max_sectors to a certain value - I
> chose 256 sectors. Perhaps Pavel or Paul may decide to use a higher or
> smaller value. A limit should be part of the protocol, or negotiated when
> connecting to the server (which is not possible without changing the
> protocol).

I do not see a reason it should be negotiated. IMNSHO we should simply
say in the protocol that no request may be bigger than 1MB, make sure
that the kernel does <=128KB requests, and make sure nbd-servers can
handle 1MB, and be done with that.
Pavel


2003-09-08 22:30:38

by Sven Köhler

Subject: Re: [NBD] patch and documentation

>>Another idea would be to be able to specify max_sectors while
>>connecting an NBD. That would add an optional parameter to the nbd-client
>>command line (like it is possible for the blocksize).
>
> I do not see why it should be configurable...

We may come to regret a particular fixed value, although I agree that
1MB should be sufficient for the future.

2003-09-08 23:28:49

by Pavel Machek

Subject: Re: [NBD] patch and documentation

Hi!

> >>Another idea would be to be able to specify max_sectors while
> >>connecting an NBD. That would add an optional parameter to the nbd-client
> >>command line (like it is possible for the blocksize).
> >
> >I do not see why it should be configurable...
>
> We may come to regret a particular fixed value, although I agree that
> 1MB should be sufficient for the future.

I believe that 1MB is good, and good enough for the near future. If that
ever proves to be a problem, we can add a handshake at that point. But I
do not believe it will be necessary.
Pavel


2003-09-09 00:16:59

by Sven Köhler

Subject: Re: [NBD] patch and documentation

> I believe that 1MB is good, and good enough for the near future. If that
> ever proves to be a problem, we can add a handshake at that point. But I
> do not believe it will be necessary.

Well, we shouldn't discuss the max_sectors problem (or whatever we
should call it) too much. I must seem like a bullhead to you.

2003-09-09 00:48:39

by Paul Clements

Subject: Re: [NBD] patch and documentation

Pavel Machek wrote:
>
> Hi!
>
> > >>Another idea would be to be able to specify max_sectors while
> > >>connecting an NBD. That would add an optional parameter to the nbd-client
> > >>command line (like it is possible for the blocksize).
> > >
> > >I do not see why it should be configurable...
> >
> > We may come to regret a particular fixed value, although I agree that
> > 1MB should be sufficient for the future.
>
> I believe that 1MB is good, and good enough for the near future. If that
> ever proves to be a problem, we can add a handshake at that point. But I
> do not believe it will be necessary.

But whoever said the buffer in the nbd-server had to be statically
allocated? I have a version of nbd-server that is modified to handle any
size request that the client side throws at it -- if the buffer is not
large enough, it simply reallocates it.

--
Paul

2003-09-09 00:57:08

by Sven Köhler

Subject: Re: [NBD] patch and documentation

>>I believe that 1MB is good, and good enough for the near future. If that
>>ever proves to be a problem, we can add a handshake at that point. But I
>>do not believe it will be necessary.
>
> But, who ever said the buffer in the nbd-server had to be statically
> allocated? I have a version of nbd-server that is modified to handle any
> size request that the client side throws at it -- if the buffer is not
> large enough, it simply reallocates it.

Well, imagine that somebody develops a server implementation (which is
what some friends and I did in the past few days); then it is just
good to know that there is a limit for the length field of the
request-packet. If there is no limit, the server implementation has to
be able to answer requests of any size. That is very complicated.
For example, if the server has only 64MB of memory, and the kernel were
able to send a 128MB request, how would you handle that?
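A grow-on-demand receive buffer of the kind Paul describes can be combined with the hard cap Sven wants. This is a hypothetical sketch (the class name and default sizes are made up for illustration), not code from Paul's modified nbd-server:

```python
class RequestBuffer:
    """Receive buffer that grows on demand, up to a hard protocol limit."""

    def __init__(self, initial=64 * 1024, limit=1024 * 1024):
        self.limit = limit           # e.g. the 1MB cap Pavel suggests
        self.buf = bytearray(initial)

    def ensure(self, length):
        """Return a writable view of `length` bytes, reallocating if needed."""
        if length > self.limit:
            # Without a cap, a buggy or malicious client could make the
            # server try to allocate e.g. 128MB on a 64MB machine.
            raise ValueError("request length %d exceeds protocol limit" % length)
        if length > len(self.buf):
            self.buf = bytearray(length)   # reallocate, like realloc()
        return memoryview(self.buf)[:length]
```

The cap answers Sven's 64MB-server objection: the server still reallocates freely below the limit, but can reject an oversized length field before touching memory.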

2003-09-09 06:14:48

by Sven Köhler

Subject: Re: [NBD] patch and documentation

> echo "126" > /proc/sys/vm/max-readahead
> I think that should help out.

Well, it didn't help. Still the same funny pattern of 127KB and 1KB
requests ;-)