2007-08-07 14:38:07

by Steve Wise

Subject: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

Networking experts,

I'd like input on the patch below, and help in solving this bug
properly. iWARP devices that support both native stack TCP and iWARP
(aka RDMA over TCP/IP/Ethernet) connections on the same interface need
the fix below or some similar fix to the RDMA connection manager.

This is a BUG in the Linux RDMA-CMA code as it stands today.

Here is the issue:

Consider an mpi cluster running mvapich2. And the cluster runs
MPI/Sockets jobs concurrently with MPI/RDMA jobs. It is possible,
without the patch below, for MPI/Sockets processes to mistakenly get
incoming RDMA connections and vice versa. The way mvapich2 works is
that the ranks all bind and listen to a random port (retrying new random
ports if the bind fails with "in use"). Once they get a free port and
bind/listen, they advertise that port number to the peers to do
connection setup. Currently, without the patch below, the mpi/rdma
processes can end up binding/listening to the _same_ port number as the
mpi/sockets processes running over the native tcp stack. This is due to
duplicate port spaces for native stack TCP and the rdma cm's RDMA_PS_TCP
port space. If this happens, then the connections can get screwed up.
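
For illustration, here is a minimal userspace sketch of the bind/advertise
pattern described above (not mvapich2 code, just the shape of it): pick a
random port, retry on "in use", then listen and advertise the port to the
peers.

/* Sketch of "bind to a random port, retry on EADDRINUSE". */
#include <arpa/inet.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int port;

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	srand(time(NULL) ^ getpid());
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);

	for (;;) {
		port = 1024 + rand() % (65536 - 1024);	/* random non-privileged port */
		addr.sin_port = htons(port);
		if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
			break;				/* got a free port */
		if (errno != EADDRINUSE) {		/* retry only on "in use" */
			perror("bind");
			return 1;
		}
	}
	listen(fd, 16);
	printf("listening on port %d\n", port);		/* advertised to the peers */
	close(fd);
	return 0;
}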

The correct solution in my mind is to use the host stack's TCP port
space for _all_ RDMA_PS_TCP port allocations. The patch below is a
minimal delta to unify the port spaces by using the kernel stack to bind
ports. This is done by allocating a kernel socket and binding to the
appropriate local addr/port. It also allows the kernel stack to pick
ephemeral ports by virtue of just passing in port 0 on the kernel bind
operation.

There has been a discussion already on the RDMA list if anyone is
interested:

http://www.mail-archive.com/[email protected]/msg05162.html


Thanks,

Steve.


---

RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

This is needed for iwarp providers that support native and rdma
connections over the same interface.

Signed-off-by: Steve Wise <[email protected]>
---

drivers/infiniband/core/cma.c | 27 ++++++++++++++++++++++++++-
1 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9e0ab04..e4d2d7f 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -111,6 +111,7 @@ struct rdma_id_private {
struct rdma_cm_id id;

struct rdma_bind_list *bind_list;
+ struct socket *sock;
struct hlist_node node;
struct list_head list;
struct list_head listen_list;
@@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
kfree(bind_list);
}
mutex_unlock(&lock);
+ if (id_priv->sock)
+ sock_release(id_priv->sock);
}

void rdma_destroy_id(struct rdma_cm_id *id)
@@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
return 0;
}

+static int cma_get_tcp_port(struct rdma_id_private *id_priv)
+{
+ int ret;
+ struct socket *sock;
+
+ ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
+ if (ret)
+ return ret;
+ ret = sock->ops->bind(sock,
+ (struct sockaddr *)&id_priv->id.route.addr.src_addr,
+ ip_addr_size(&id_priv->id.route.addr.src_addr));
+ if (ret) {
+ sock_release(sock);
+ return ret;
+ }
+ id_priv->sock = sock;
+ return 0;
+}
+
static int cma_get_port(struct rdma_id_private *id_priv)
{
struct idr *ps;
@@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
break;
case RDMA_PS_TCP:
ps = &tcp_ps;
+ ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
+ if (ret)
+ goto out;
break;
case RDMA_PS_UDP:
ps = &udp_ps;
@@ -1815,7 +1840,7 @@ static int cma_get_port(struct rdma_id_p
else
ret = cma_use_port(ps, id_priv);
mutex_unlock(&lock);
-
+out:
return ret;
}



2007-08-07 14:55:17

by Evgeniy Polyakov

Subject: Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

Hi Steve.

On Tue, Aug 07, 2007 at 09:37:41AM -0500, Steve Wise ([email protected]) wrote:
> +static int cma_get_tcp_port(struct rdma_id_private *id_priv)
> +{
> + int ret;
> + struct socket *sock;
> +
> + ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
> + if (ret)
> + return ret;
> + ret = sock->ops->bind(sock,
> + (struct sockaddr *)&id_priv->id.route.addr.src_addr,
> + ip_addr_size(&id_priv->id.route.addr.src_addr));

If we get away from talk about broken offloading, this one will result in
the case where usual network dataflow can enter private rdma land, i.e.
after bind succeeds this socket is accessible via any other network
device. Is it intended?
And this is quite noticeable overhead per rdma connection, btw.

--
Evgeniy Polyakov

2007-08-07 15:06:40

by Steve Wise

Subject: Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.



Evgeniy Polyakov wrote:
> Hi Steve.
>
> On Tue, Aug 07, 2007 at 09:37:41AM -0500, Steve Wise ([email protected]) wrote:
>> +static int cma_get_tcp_port(struct rdma_id_private *id_priv)
>> +{
>> + int ret;
>> + struct socket *sock;
>> +
>> + ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
>> + if (ret)
>> + return ret;
>> + ret = sock->ops->bind(sock,
>> + (struct sockaddr *)&id_priv->id.route.addr.src_addr,
>> + ip_addr_size(&id_priv->id.route.addr.src_addr));
>
> If we get away from talk about broken offloading, this one will result in
> the case where usual network dataflow can enter private rdma land, i.e.
> after bind succeeds this socket is accessible via any other network
> device. Is it intended?
> And this is quite noticeable overhead per rdma connection, btw.
>

I'm not sure I understand your question. What do you mean by
"accessible"? The intention is to _just_ reserve the addr/port.
The socket struct alloc and bind was a simple way to do this. I
assume we'll have to come up with a better way though, namely
providing a low level interface to the port space allocator that
allows both rdma and the host tcp stack to share the space without
requiring a socket struct for rdma connections.
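
To make that concrete, here is a toy userspace model of such a shared
allocator. This is hypothetical code, not an existing kernel interface:
both the host TCP stack and the rdma-cm reserve ports from one table, and
port 0 means "pick an ephemeral port for me".

/* Toy model of a port space shared by the host stack and the rdma-cm. */
#include <stdio.h>

#define PORT_MAX 65536

enum port_owner { PORT_FREE = 0, PORT_OWNER_TCP, PORT_OWNER_RDMA };

static unsigned char port_table[PORT_MAX];	/* one shared port space */

/* Reserve a specific port, or any free ephemeral port if port == 0.
 * Returns the port number, or -1 if it is already taken. */
static int port_space_reserve(int port, enum port_owner owner)
{
	if (port == 0) {
		for (port = 49152; port < PORT_MAX; port++)	/* ephemeral range */
			if (port_table[port] == PORT_FREE)
				break;
		if (port == PORT_MAX)
			return -1;
	} else if (port_table[port] != PORT_FREE) {
		return -1;
	}
	port_table[port] = owner;
	return port;
}

static void port_space_release(int port)
{
	if (port > 0 && port < PORT_MAX)
		port_table[port] = PORT_FREE;
}

int main(void)
{
	int p1 = port_space_reserve(5001, PORT_OWNER_TCP);  /* host stack bind */
	int p2 = port_space_reserve(5001, PORT_OWNER_RDMA); /* rdma bind: must fail */
	int p3 = port_space_reserve(0, PORT_OWNER_RDMA);    /* rdma ephemeral bind */

	printf("tcp got %d, rdma reuse attempt got %d, rdma ephemeral got %d\n",
	       p1, p2, p3);
	port_space_release(p1);
	port_space_release(p3);
	return 0;
}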

Or maybe we'll come up a different and better solution to this issue...

Steve.

2007-08-07 15:39:55

by Evgeniy Polyakov

Subject: Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

On Tue, Aug 07, 2007 at 10:06:29AM -0500, Steve Wise ([email protected]) wrote:
> >On Tue, Aug 07, 2007 at 09:37:41AM -0500, Steve Wise
> >([email protected]) wrote:
> >>+static int cma_get_tcp_port(struct rdma_id_private *id_priv)
> >>+{
> >>+ int ret;
> >>+ struct socket *sock;
> >>+
> >>+ ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
> >>+ if (ret)
> >>+ return ret;
> >>+ ret = sock->ops->bind(sock,
> >>+ (struct sockaddr *)&id_priv->id.route.addr.src_addr,
> >>+ ip_addr_size(&id_priv->id.route.addr.src_addr));
> >
> >If we get away from talk about broken offloading, this one will result in
> >the case where usual network dataflow can enter private rdma land, i.e.
> >after bind succeeds this socket is accessible via any other network
> >device. Is it intended?
> >And this is quite noticeable overhead per rdma connection, btw.
> >
>
> I'm not sure I understand your question? What do you mean by
> "accessible"? The intention is to _just_ reserve the addr/port.

The RDMA ->bind() above ends up in tcp_v4_get_port(), which will only add the
socket into bhash, but that is only accessible to new sockets created for
listening connections or by an explicit bind; network traffic checks only the
listening and established hashes, which are not affected by the above change,
so it was a false alarm from my side. It does allow one to 'grab' a port and
forbid its possible reuse.
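
To illustrate the point with a toy model (not kernel code; it only assumes
the usual split between the bind hash, consulted by bind(), and the
listening/established hashes, consulted for incoming traffic):

/* A socket that is bound but never listens reserves the port without
 * ever seeing traffic: demux looks at the listening/established tables,
 * bind() collision checks look at the bind table. */
#include <stdbool.h>
#include <stdio.h>

struct sock {
	int port;
	bool in_bhash;		/* reserved the port via bind() */
	bool in_listen_hash;	/* actually called listen() */
};

#define NSOCKS 2
static struct sock socks[NSOCKS] = {
	{ .port = 8000, .in_bhash = true, .in_listen_hash = false }, /* rdma-cm reservation */
	{ .port = 9000, .in_bhash = true, .in_listen_hash = true },  /* normal listener */
};

/* Demux for an incoming SYN: only the listening hash is searched. */
static struct sock *lookup_listener(int dport)
{
	for (int i = 0; i < NSOCKS; i++)
		if (socks[i].in_listen_hash && socks[i].port == dport)
			return &socks[i];
	return NULL;
}

/* bind() collision check: only the bind hash matters here. */
static bool port_in_use(int port)
{
	for (int i = 0; i < NSOCKS; i++)
		if (socks[i].in_bhash && socks[i].port == port)
			return true;
	return false;
}

int main(void)
{
	printf("SYN to 8000 finds a listener: %s\n", lookup_listener(8000) ? "yes" : "no");
	printf("SYN to 9000 finds a listener: %s\n", lookup_listener(9000) ? "yes" : "no");
	printf("bind(8000) would collide:     %s\n", port_in_use(8000) ? "yes" : "no");
	return 0;
}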

--
Evgeniy Polyakov

2007-08-09 18:50:27

by Steve Wise

Subject: Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

Any more comments?


Steve Wise wrote:
> Networking experts,
>
> I'd like input on the patch below, and help in solving this bug
> properly. iWARP devices that support both native stack TCP and iWARP
> (aka RDMA over TCP/IP/Ethernet) connections on the same interface need
> the fix below or some similar fix to the RDMA connection manager.
>
> This is a BUG in the Linux RDMA-CMA code as it stands today.
>
> Here is the issue:
>
> Consider an mpi cluster running mvapich2. And the cluster runs
> MPI/Sockets jobs concurrently with MPI/RDMA jobs. It is possible,
> without the patch below, for MPI/Sockets processes to mistakenly get
> incoming RDMA connections and vice versa. The way mvapich2 works is
> that the ranks all bind and listen to a random port (retrying new random
> ports if the bind fails with "in use"). Once they get a free port and
> bind/listen, they advertise that port number to the peers to do
> connection setup. Currently, without the patch below, the mpi/rdma
> processes can end up binding/listening to the _same_ port number as the
> mpi/sockets processes running over the native tcp stack. This is due to
> duplicate port spaces for native stack TCP and the rdma cm's RDMA_PS_TCP
> port space. If this happens, then the connections can get screwed up.
>
> The correct solution in my mind is to use the host stack's TCP port
> space for _all_ RDMA_PS_TCP port allocations. The patch below is a
> minimal delta to unify the port spaces by using the kernel stack to bind
> ports. This is done by allocating a kernel socket and binding to the
> appropriate local addr/port. It also allows the kernel stack to pick
> ephemeral ports by virtue of just passing in port 0 on the kernel bind
> operation.
>
> There has been a discussion already on the RDMA list if anyone is
> interested:
>
> http://www.mail-archive.com/[email protected]/msg05162.html
>
>
> Thanks,
>
> Steve.
>
>
> ---
>
> RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
>
> This is needed for iwarp providers that support native and rdma
> connections over the same interface.
>
> Signed-off-by: Steve Wise <[email protected]>
> ---
>
> drivers/infiniband/core/cma.c | 27 ++++++++++++++++++++++++++-
> 1 files changed, 26 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 9e0ab04..e4d2d7f 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -111,6 +111,7 @@ struct rdma_id_private {
> struct rdma_cm_id id;
>
> struct rdma_bind_list *bind_list;
> + struct socket *sock;
> struct hlist_node node;
> struct list_head list;
> struct list_head listen_list;
> @@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
> kfree(bind_list);
> }
> mutex_unlock(&lock);
> + if (id_priv->sock)
> + sock_release(id_priv->sock);
> }
>
> void rdma_destroy_id(struct rdma_cm_id *id)
> @@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
> return 0;
> }
>
> +static int cma_get_tcp_port(struct rdma_id_private *id_priv)
> +{
> + int ret;
> + struct socket *sock;
> +
> + ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
> + if (ret)
> + return ret;
> + ret = sock->ops->bind(sock,
> + (struct sockaddr *)&id_priv->id.route.addr.src_addr,
> + ip_addr_size(&id_priv->id.route.addr.src_addr));
> + if (ret) {
> + sock_release(sock);
> + return ret;
> + }
> + id_priv->sock = sock;
> + return 0;
> +}
> +
> static int cma_get_port(struct rdma_id_private *id_priv)
> {
> struct idr *ps;
> @@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
> break;
> case RDMA_PS_TCP:
> ps = &tcp_ps;
> + ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
> + if (ret)
> + goto out;
> break;
> case RDMA_PS_UDP:
> ps = &udp_ps;
> @@ -1815,7 +1840,7 @@ static int cma_get_port(struct rdma_id_p
> else
> ret = cma_use_port(ps, id_priv);
> mutex_unlock(&lock);
> -
> +out:
> return ret;
> }
>
>

2007-08-09 21:40:55

by Sean Hefty

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

Steve Wise wrote:
> Any more comments?

Does anyone have ideas on how to reserve the port space without using a
struct socket?

- Sean

2007-08-09 21:55:52

by David Miller

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

From: Sean Hefty <[email protected]>
Date: Thu, 09 Aug 2007 14:40:16 -0700

> Steve Wise wrote:
> > Any more comments?
>
> Does anyone have ideas on how to reserve the port space without using a
> struct socket?

How about we just remove the RDMA stack altogether? I am not at all
kidding. If you guys can't stay in your sand box and need to cause
problems for the normal network stack, it's unacceptable. We were
told all along that if RDMA went into the tree none of this kind of
stuff would be an issue.

These are exactly the kinds of problems that people like myself
were dreading. These subsystems have no business using the TCP port
space of the Linux software stack, absolutely none.

After TCP port reservation, what's next? It seems an at least
bi-monthly event that the RDMA folks need to put their fingers
into something else in the normal networking stack. No more.

I will NACK any patch that opens up sockets to eat up ports or
anything stupid like that.

2007-08-09 23:23:03

by Sean Hefty

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

> How about we just remove the RDMA stack altogether? I am not at all
> kidding. If you guys can't stay in your sand box and need to cause
> problems for the normal network stack, it's unacceptable. We were
> told all along that if RDMA went into the tree none of this kind of
> stuff would be an issue.

There are currently two RDMA solutions available. Each solution has
different requirements and uses the normal network stack differently.
Infiniband uses its own transport. iWarp runs over TCP.

We have tried to leverage the existing infrastructure where it makes sense.

> After TCP port reservation, what's next? It seems an at least
> bi-monthly event that the RDMA folks need to put their fingers
> into something else in the normal networking stack. No more.

Currently, the RDMA stack uses its own port space. This causes a
problem for iWarp, and is the problem Steve is trying to solve. I'm
not an iWarp guru, so I don't know what options exist. Can iWarp use
its own address family? Identify specific IP addresses for iWarp use?
Restrict iWarp to specific port numbers? Let the app control the
correct operation? I don't know.

Steve merely defined a problem and suggested a possible solution. He's
looking for constructive help trying to solve the problem.

- Sean

2007-08-15 14:42:54

by Steve Wise

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.



David Miller wrote:
> From: Sean Hefty <[email protected]>
> Date: Thu, 09 Aug 2007 14:40:16 -0700
>
>> Steve Wise wrote:
>>> Any more comments?
>> Does anyone have ideas on how to reserve the port space without using a
>> struct socket?
>
> How about we just remove the RDMA stack altogether? I am not at all
> kidding. If you guys can't stay in your sand box and need to cause
> problems for the normal network stack, it's unacceptable. We were
> told all along that if RDMA went into the tree none of this kind of
> stuff would be an issue.

I think removing the RDMA stack is the wrong thing to do, and you
shouldn't just threaten to yank entire subsystems because you don't like
the technology. Let's keep this constructive, can we? RDMA should get
the respect of any other technology in Linux. Maybe it's a niche in your
opinion, but come on, there are more RDMA users than, say, the sparc64
port. Eh?

>
> These are exactly the kinds of problems that people like myself
> were dreading. These subsystems have no business using the TCP port
> space of the Linux software stack, absolutely none.
>

Ok, although IMO it's the correct solution. But I'll propose other
solutions below. I ask for your feedback (and everyone's!) on these
alternate solutions.

> After TCP port reservation, what's next? It seems an at least
> bi-monthly event that the RDMA folks need to put their fingers
> into something else in the normal networking stack. No more.
>

The only other change requested and committed, if I recall correctly, was
for netevents, and that enabled both Infiniband and iWARP to integrate
with the neighbour subsystem. I think that was a useful and needed
change. Prior to that, these subsystems were snooping ARP replies to
trigger events. That was back in 2.6.18 or 2.6.19 I think...

> I will NACK any patch that opens up sockets to eat up ports or
> anything stupid like that.

Got it.

Here are alternate solutions that avoid the need to share the port space:

Solution 1)

1) admins must set up an alias interface on the iwarp device for use with
rdma. This interface will have to be on a separate subnet from the "TCP
used" interface, and have a canonical name that indicates it is "for rdma
only", like eth2:iw or eth2:rdma. There can be many of these per device.

2) admins make sure their sockets/tcp services don't use the interface
configured in #1, and that their rdma services do use said interface.

3) iwarp providers must translate binds to ipaddr 0.0.0.0 into the
associated "for rdma only" ip addresses (a sketch of this remapping follows
after this list). They can do this by searching
for all aliases of the canonical name that are aliases of the TCP
interface for their nic device. Or: somehow not handle incoming
connections to any address but the "for rdma use" addresses and instead
pass them up and not offload them.

This will avoid the collisions as long as the above steps are followed.
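
A rough sketch of the remapping in 3), with a made-up alias address
(hypothetical provider code, not from any driver):

/* An iwarp provider remaps a bind to 0.0.0.0 onto its "rdma only" alias
 * address, so offloaded listens never claim connections destined for the
 * native-stack addresses. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

static void remap_wildcard_bind(struct sockaddr_in *bind_addr,
				const struct in_addr *rdma_only_addr)
{
	if (bind_addr->sin_addr.s_addr == htonl(INADDR_ANY))
		bind_addr->sin_addr = *rdma_only_addr;
}

int main(void)
{
	struct sockaddr_in bind_addr;
	struct in_addr rdma_only;
	char buf[INET_ADDRSTRLEN];

	inet_pton(AF_INET, "192.168.10.2", &rdma_only);	/* eth2:iw alias (example) */
	memset(&bind_addr, 0, sizeof(bind_addr));
	bind_addr.sin_family = AF_INET;
	bind_addr.sin_port = htons(7000);
	bind_addr.sin_addr.s_addr = htonl(INADDR_ANY);	/* app bound to 0.0.0.0 */

	remap_wildcard_bind(&bind_addr, &rdma_only);
	inet_ntop(AF_INET, &bind_addr.sin_addr, buf, sizeof(buf));
	printf("offloaded listen now bound to %s:7000\n", buf);
	return 0;
}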


Solution 2)

Another possibility would be for the driver to create two net devices
(and hence two interface names) like "eth2" and "iw2", and artificially
separate the RDMA stuff that way.

These two solutions are similar in that they create an "rdma only" interface.

Pros:
- is not intrusive into the core networking code
- very minimal changes needed, and only in the iwarp providers' code, who are
the ones with this problem
- makes it clear which subnets are RDMA only

Cons:
- relies on system admin to set it up correctly.
- native stack can still "use" this rdma-only interface and the same
port space issue will exist.


For the record, here are possible port-sharing solutions Dave sez he'll NAK:

Solution NAK-1)

The rdma-cma just allocates a socket and binds it to reserve TCP ports.

Pros:
- minimal changes needed to implement (always a plus in my mind :)
- simple, clean, and it works (KISS)
- if no RDMA is in use, there is no impact on the native stack
- no need for a separate RDMA interface

Cons:
- wastes memory
- puts a TCP socket in the "CLOSED" state in the pcb tables.
- Dave will NAK it :)

Solution NAK-2)

Create a low-level sockets-agnostic port allocation service that is
shared by both TCP and RDMA. This way, the rdma-cm can reserve ports in
an efficient manner instead of doing it via kernel_bind() using a sock
struct.

Pros:
- probably the correct solution (my opinion :) if we went down the path
of sharing port space
- if no RDMA is in use, there is no impact on the native stack
- no need for a separate RDMA interface

Cons:

- very intrusive change because the port allocation stuff is tightly
bound to the host stack and sock struct, etc.
- Dave will NAK it :)


Steve.

2007-08-16 02:27:24

by Jeff Garzik

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

Steve Wise wrote:
>
>
> David Miller wrote:
>> From: Sean Hefty <[email protected]>
>> Date: Thu, 09 Aug 2007 14:40:16 -0700
>>
>>> Steve Wise wrote:
>>>> Any more comments?
>>> Does anyone have ideas on how to reserve the port space without using
>>> a struct socket?
>>
>> How about we just remove the RDMA stack altogether? I am not at all
>> kidding. If you guys can't stay in your sand box and need to cause
>> problems for the normal network stack, it's unacceptable. We were
>> told all along that if RDMA went into the tree none of this kind of
>> stuff would be an issue.
>
> I think removing the RDMA stack is the wrong thing to do, and you
> shouldn't just threaten to yank entire subsystems because you don't like
> the technology. Lets keep this constructive, can we? RDMA should get
> the respect of any other technology in Linux. Maybe its a niche in your
> opinion, but come on, there's more RDMA users than say, the sparc64
> port. Eh?

It's not about being a niche. It's about creating a maintainable
software net stack that has predictable behavior.

Needing to reach out of the RDMA sandbox and reserve net stack resources
away from itself travels a path we've consistently avoided.


>> I will NACK any patch that opens up sockets to eat up ports or
>> anything stupid like that.
>
> Got it.

Ditto for me as well.

Jeff


2007-08-16 03:11:32

by Roland Dreier

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

> Needing to reach out of the RDMA sandbox and reserve net stack
> resources away from itself travels a path we've consistently avoided.

Where did the idea of an "RDMA sandbox" come from? Obviously no one
disagrees with keeping things clean and maintainable, but the idea
that RDMA is a second-class citizen that doesn't get any input into
the evolution of the networking code seems kind of offensive to me.

- R.

2007-08-16 03:27:43

by Hefty, Sean

Subject: RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space.

>It's not about being a niche. It's about creating a maintainable
>software net stack that has predictable behavior.
>
>Needing to reach out of the RDMA sandbox and reserve net stack resources
>away from itself travels a path we've consistently avoided.

We need to ensure that we're also creating a maintainable kernel. RDMA doesn't
use sockets, but that doesn't mean it's not part of the networking support
provided by the Linux kernel. Making blanket statements that RDMA should stay
within a sandbox is equivalent to saying that RDMA should duplicate any network
related functionality that it might need.

>>> I will NACK any patch that opens up sockets to eat up ports or
>>> anything stupid like that.
>
>Ditto for me as well.

I agree that using a socket is the wrong approach, but my guess is that it was
suggested as a possibility because of the attempt to keep RDMA in its 'sandbox'.
The iWarp architecture implements RDMA over TCP; it just doesn't use sockets.
The Linux network stack doesn't easily support this possibility. Are there any
reasonable ways to enable this to the degree necessary for iWarp?

- Sean

2007-08-16 14:09:25

by Tom Tucker

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

On Wed, 2007-08-15 at 22:26 -0400, Jeff Garzik wrote:

[...snip...]

> > I think removing the RDMA stack is the wrong thing to do, and you
> > shouldn't just threaten to yank entire subsystems because you don't like
> > the technology. Lets keep this constructive, can we? RDMA should get
> > the respect of any other technology in Linux. Maybe its a niche in your
> > opinion, but come on, there's more RDMA users than say, the sparc64
> > port. Eh?
>
> It's not about being a niche. It's about creating a maintainable
> software net stack that has predictable behavior.

Isn't RDMA _part_ of the "software net stack" within Linux? Why isn't
making RDMA stable, supportable and maintainable as important as
any other subsystem?

>
> Needing to reach out of the RDMA sandbox and reserve net stack resources
> away from itself travels a path we've consistently avoided.
>
>
> >> I will NACK any patch that opens up sockets to eat up ports or
> >> anything stupid like that.
> >
> > Got it.
>
> Ditto for me as well.
>
> Jeff
>
>

2007-08-16 21:18:09

by David Miller

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

From: Tom Tucker <[email protected]>
Date: Thu, 16 Aug 2007 08:43:11 -0500

> Isn't RDMA _part_ of the "software net stack" within Linux?

It very much is not so.

When using RDMA you lose the capability to do packet shaping,
classification, and all the other wonderful networking facilities
you've grown to love and use over the years.

I'm glad this is a surprise to you, because it illustrates the
point some of us keep trying to make about technologies like
this.

Imagine if you didn't know any of this: you purchase and begin to
deploy a huge piece of RDMA infrastructure, then you get the mandate
from IT that you need to add firewalling on the RDMA connections at
the host level, and "oh shit," you can't.

This is why none of us core networking developers like RDMA at all.
It's totally not integrated with the rest of the Linux stack and on
top of that it even gets in the way. It's an aberration, an eyesore,
and a constant source of consternation.

2007-08-17 19:53:45

by Roland Dreier

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

> > Isn't RDMA _part_ of the "software net stack" within Linux?

> It very much is not so.

This is just nit-picking. You can draw the boundary of the "software
net stack" wherever you want, but I think Sean's point was just that
RDMA drivers already are part of Linux, and we all want them to get
better.

> When using RDMA you lose the capability to do packet shaping,
> classification, and all the other wonderful networking facilities
> you've grown to love and use over the years.

Same thing with TSO and LRO and who knows what else. I know you're
going to make a distinction between "stateless" and "stateful"
offloads, but really it's just an arbitrary distinction between things
you like and things you don't.

> Imagine if you didn't know any of this, you purchase and begin to
> deploy a huge piece of RDMA infrastructure, you then get the mandate
> from IT that you need to add firewalling on the RDMA connections at
> the host level, and "oh shit" you can't?

It's ironic that you bring up firewalling. I've had vendors of iWARP
hardware tell me they would *love* to work with the community to make
firewalling work better for RDMA connections. But instead we get the
catch-22 of your changing arguments -- first, you won't even consider
changes that might help RDMA work better in the name of
maintainability; then you have to protect poor, ignorant users from
accidentally using RDMA because of some problem or another; and then
when someone tries to fix some of the problems you mention, it's back
to step one.

Obviously some decisions have been prejudged here, so I guess this
moves to the realm of politics. I have plenty of interesting
technical stuff, so I'll leave it to the people with a horse in the
race to find ways to twist your arm.

- R.

2007-08-17 21:28:15

by David Miller

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

From: Roland Dreier <[email protected]>
Date: Fri, 17 Aug 2007 12:52:39 -0700

> > When using RDMA you lose the capability to do packet shaping,
> > classification, and all the other wonderful networking facilities
> > you've grown to love and use over the years.
>
> Same thing with TSO and LRO and who knows what else.

Not true at all. Full classification and filtering still is usable
with TSO and LRO.

2007-08-17 23:31:43

by Roland Dreier

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

> > > When using RDMA you lose the capability to do packet shaping,
> > > classification, and all the other wonderful networking facilities
> > > you've grown to love and use over the years.
> >
> > Same thing with TSO and LRO and who knows what else.
>
> Not true at all. Full classification and filtering still is usable
> with TSO and LRO.

Well, obviously with TSO and LRO the packets that the stack sends or
receives are not the same as what's on the wire. Whether that breaks
your wonderful networking facilities or not depends on the specifics
of the particular facility I guess -- for example shaping is clearly
broken by TSO. (And people can wonder what the packet trains TSO
creates do to congestion control on the internet, but the netdev crowd
has already decided that TSO is "good" and RDMA is "bad")

- R.

2007-08-18 00:00:44

by David Miller

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

From: Roland Dreier <[email protected]>
Date: Fri, 17 Aug 2007 16:31:07 -0700

> > > > When using RDMA you lose the capability to do packet shaping,
> > > > classification, and all the other wonderful networking facilities
> > > > you've grown to love and use over the years.
> > >
> > > Same thing with TSO and LRO and who knows what else.
> >
> > Not true at all. Full classification and filtering still is usable
> > with TSO and LRO.
>
> Well, obviously with TSO and LRO the packets that the stack sends or
> receives are not the same as what's on the wire. Whether that breaks
> your wonderful networking facilities or not depends on the specifics
> of the particular facility I guess -- for example shaping is clearly
> broken by TSO. (And people can wonder what the packet trains TSO
> creates do to congestion control on the internet, but the netdev crowd
> has already decided that TSO is "good" and RDMA is "bad")

This is also a series of falsehoods. All packet filtering,
queue management, and packet scheduling facilities work perfectly
fine and as designed with both LRO and TSO.

When problems come up, they are bugs, and we fix them.

Please stop spreading this FUD about TSO and LRO.

The fact is that RDMA bypasses the whole stack so that supporting
these facilities is not even _POSSIBLE_. With stateless offloads it
is possible to support all of these facilities, and we do.

2007-08-18 05:23:24

by Roland Dreier

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

> This is also a series of falsehoods. All packet filtering,
> queue management, and packet scheduling facilities work perfectly
> fine and as designed with both LRO and TSO.

I'm not sure I follow. Perhaps "broken" was too strong a word to use,
but if you pass a huge segment to a NIC with TSO, then you've given
the NIC control of scheduling the packets that end up getting put on
the wire. If your software packet scheduling is operating at a bigger
scale, then things work fine, but I don't see how you can say that TSO
doesn't lead to head-of-line blocking etc at short time scales. And
yes of course I agree you can make sure things work by using short
segments or not using TSO at all.

Similarly with LRO the packets that get passed to the stack are not
the packets that were actually on the wire. Sure, most filtering will
work fine but eg are you sure your RTT estimates aren't going to get
screwed up and cause some subtle bug? And I could trot out all the
same bugaboos that are brought up about RDMA and warn darkly about
security problems with bugs in NIC hardware that after all has to
parse and rewrite TCP and IP packets.

Also, looking at the complexity and bug-fixing effort that go into
making TSO work vs the really pretty small gain it gives also makes
part of me wonder whether the noble proclamations about
maintainability are always taken to heart.

Of course I know everything I just wrote is wrong because I forgot to
refer to the crucial axiom that stateless == good && RDMA == bad.
And sometimes it's unfortunate that in Linux when there's disagreement
about something, the default action is *not* to do something.

Sorry for prolonging this argument. Dave, I should say that I
appreciate all the work you've done in helping build the most kick-ass
networking stack in history. And as I said before, I have plenty of
interesting work to do however this turns out, so I'll try to leave
any further arguing to people who actually have a dog in this fight.

- R.

2007-08-18 06:44:20

by David Miller

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

From: Roland Dreier <[email protected]>
Date: Fri, 17 Aug 2007 22:23:01 -0700

> Also, looking at the complexity and bug-fixing effort that go into
> making TSO work vs the really pretty small gain it gives also makes
> part of me wonder whether the noble proclamations about
> maintainability are always taken to heart.

The cpu and bus utilization improvements of TSO on the sender side are
more than significant. Ask anyone who looks closely at this.

For example, as part of his batching work Krishna Kumar has been
posting lots of numbers lately on the netdev list, I'm sure he can
post more specific numbers comparing the current stack in the case of
TSO disabled vs. TSO enabled if that is what you need to see how
beneficial TSO in fact is.

If TSO is such a lose why does pretty much every ethernet chip vendor
implement it in hardware? If you say it's just because Microsoft
defines TSO in their NDI, that's a total cop-out. It really does help
performance a lot. Why did the Xen folks bother making generic
software TSO infrastructure for the kernel for the benefit of their
virtualization network device? Why would someone as bright as Herbert
Xu even bother to implement that stuff if TSO gives a "pretty small
gain"?

Similarly for LRO and this isn't defined in NDI at all. Vendors are
going so far as to put full flow tables in their chips in order to do
LRO better.

Using the bugs and issues we've run into while implementing TSO as
evidence there is something wrong with it is a total straw man. Look
how many times the filesystem page cache has been rewritten over the
years.

Use the TSO problems as more of an example of how shitty a programmer
I must be. :)

Just be realistic and accept that RDMA is a point in time solution,
and like any other such technology takes flexibility away from users.

Horizontal scaling of cpus up to huge arity cores, network devices
using large numbers of transmit and receive queues and classification
based queue selection, are all going to work to make things like RDMA
even more irrelevant than they already are.

If you can't see that this is the future, you have my condolences.
Because frankly, the signs are all around that this is where things
are going.

The work doesn't belong in these special purpose devices, they belong
in the far-end-node compute resources, and our computers are getting
more and more of these general purpose compute engines every day.
We will be constantly moving away from specialized solutions and
towards those which solve large classes of problems for large groups
of people.

2007-08-19 07:01:21

by Hefty, Sean

Subject: RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space.

>Just be realistic and accept that RDMA is a point in time solution,
>and like any other such technology takes flexibility away from users.

All technologies are just point in time solutions. While management is
important, shouldn't the customers decide how important it is relative to their
problems? Whether some future technology will be better matters little if a
problem needs to be solved today.

>If you can't see that this is the future, you have my condolences.
>Because frankly, the signs are all around that this is where things
>are going.

Adding a bazillion cores to a processor doesn't do a thing to help memory
bandwidth.

Millions of Infiniband ports are in operation today. Over 25% of the top 500
supercomputers use Infiniband. The formation of the OpenFabrics Alliance was
pushed and has been continuously funded by an RDMA customer - the US National
Labs. RDMA technologies are backed by Cisco, IBM, Intel, QLogic, Sun, Voltaire,
Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys, Emulex, Hitachi, NEC, Fujitsu,
LSI, SGI, Sandia, and at least two dozen other companies. IDC expects
Infiniband adapter revenue to triple between 2006 and 2011, and switch revenue
to increase six-fold (combined revenues of 1 billion).

Customers see real benefits using channel based architectures. Do all customers
need it? Of course not. Is it a niche? Yes, but I would say that about any
10+ gig network. That doesn't mean that it hasn't become essential for some
customers.

- Sean

2007-08-19 07:23:49

by David Miller

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space.

From: "Sean Hefty" <[email protected]>
Date: Sun, 19 Aug 2007 00:01:07 -0700

> Millions of Infiniband ports are in operation today. Over 25% of the top 500
> supercomputers use Infiniband. The formation of the OpenFabrics Alliance was
> pushed and has been continuously funded by an RDMA customer - the US National
> Labs. RDMA technologies are backed by Cisco, IBM, Intel, QLogic, Sun, Voltaire,
> Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys, Emulex, Hitachi, NEC, Fujitsu,
> LSI, SGI, Sandia, and at least two dozen other companies. IDC expects
> Infiniband adapter revenue to triple between 2006 and 2011, and switch revenue
> to increase six-fold (combined revenues of 1 billion).

Scale these numbers with reality and usage.

These vendors pour huge amounts of money into a relatively small
number of extremely large cluster installations. Besides the folks
doing nuke and whole-earth simulations at some government lab, nobody
cares. And part of the investment is not being done wholly for smart
economic reasons, but also largely publicity purposes.

So present your great Infiniband numbers with that being admitted up
front, ok?

Its relevance to Linux as a general purpose operating system that
should be "good enough" for 99% of the world is close to NIL.

People have been pouring tons of money and research into doing stupid
things to make clusters go fast, and in such a way that make zero
sense for general purpose operating systems, for ages. RDMA is just
one such example.

BTW, I find it ironic that you mention memory bandwidth as a retort,
as Roland's favorite stateless offload devil, TSO, deals explicitly
with lowering the per-packet BUS bandwidth usage of TCP. LRO
offloading does likewise.

2007-08-21 01:17:15

by Roland Dreier

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

[TSO / LRO discussion snipped -- it's not the main point so no sense
spending energy arguing about it]

> Just be realistic and accept that RDMA is a point in time solution,
> and like any other such technology takes flexibility away from users.
>
> Horizontal scaling of cpus up to huge arity cores, network devices
> using large numbers of transmit and receive queues and classification
> based queue selection, are all going to work to make things like RDMA
> even more irrelevant than they already are.

To me there is a real fundamental difference between RDMA and
traditional SOCK_STREAM / SOCK_DATAGRAM networking, namely that
messages can carry the address where they're supposed to be
delivered (what the IETF calls "direct data placement"). And on top
of that you can build one-sided operations aka put/get aka RDMA.

And direct data placement really does give you a factor of two at
least, because otherwise you're stuck receiving the data in one
buffer, looking at some of the data at least, and then figuring out
where to copy it. And memory bandwidth is if anything becoming more
valuable; maybe LRO + header splitting + page remapping tricks can get
you somewhere but as NCPUS grows then it seems the TLB shootdown cost
of page flipping is only going to get worse.
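
As a toy illustration of what direct data placement buys (this models the
concept only, not any particular wire format): each message names the
destination buffer and an offset, so payloads land in their final location
even when they arrive out of order, with no staging copy.

/* Toy model of direct data placement. */
#include <stdio.h>
#include <string.h>

struct ddp_segment {
	unsigned int tag;	/* which registered buffer */
	size_t offset;		/* where in that buffer */
	const char *data;
	size_t len;
};

static char registered_buf[1][32];	/* buffers advertised to the peer */

static void ddp_place(const struct ddp_segment *seg)
{
	/* No intermediate buffer: copy straight to the final location. */
	memcpy(registered_buf[seg->tag] + seg->offset, seg->data, seg->len);
}

int main(void)
{
	/* Segments arrive out of order but still land in the right place. */
	struct ddp_segment segs[] = {
		{ .tag = 0, .offset = 6, .data = "world", .len = 5 },
		{ .tag = 0, .offset = 0, .data = "hello ", .len = 6 },
	};

	for (size_t i = 0; i < sizeof(segs) / sizeof(segs[0]); i++)
		ddp_place(&segs[i]);

	printf("buffer 0: %s\n", registered_buf[0]);
	return 0;
}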

Don't get too hung up on the fact that current iWARP (RDMA over IP)
implementations are using TCP offload -- to me that is just a side
effect of doing enough processing on the NIC side of the PCI bus to be
able to do direct data placement. InfiniBand with competely different
transport, link and physical layers is one way to implement RDMA
without TCP offload and I'm sure there will be others -- eg Intel's
IOAT stuff could probably evolve to the point where you could
implement iWARP with software TCP and the data placement offloaded to
some DMA engine.

- R.

2007-08-21 07:00:23

by David Miller

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

From: Roland Dreier <[email protected]>
Date: Mon, 20 Aug 2007 18:16:54 -0700

> And direct data placement really does give you a factor of two at
> least, because otherwise you're stuck receiving the data in one
> buffer, looking at some of the data at least, and then figuring out
> where to copy it. And memory bandwidth is if anything becoming more
> valuable; maybe LRO + header splitting + page remapping tricks can get
> you somewhere but as NCPUS grows then it seems the TLB shootdown cost
> of page flipping is only going to get worse.

As Herbert has said already, people can code for this just like
they have to code for RDMA.

There is no fundamental difference from converting an application
to sendfile or similar.

The only thing this needs is a
"recvmsg_I_dont_care_where_the_data_is()" call. There are no alignment
issues unless you are trying to push this data directly into the
page cache.
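
A toy model of the idea, just to sketch the interface shape (no such call
exists; the name and signature here are made up): instead of the caller
supplying a buffer to copy into, the receive call hands back the memory
where the card already accumulated this flow's data.

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Pretend this page was filled by the NIC for one flow. */
static char flow_page[4096] = "payload accumulated per-flow by the card";

/* The caller gets a pointer to the data instead of providing a buffer. */
static ssize_t recvmsg_I_dont_care_where_the_data_is(int fd, void **data_out)
{
	(void)fd;		/* toy model: the socket is ignored */
	*data_out = flow_page;
	return strlen(flow_page);
}

int main(void)
{
	void *data;
	ssize_t len = recvmsg_I_dont_care_where_the_data_is(0, &data);

	printf("received %zd bytes at %p: %.20s...\n", len, data, (char *)data);
	return 0;
}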

Couple this with a card that makes sure that on a per-page basis, only
data for a particular flow (or group of flows) will accumulate.

People already make cards that can do stuff like this, it can be done
statelessly with an on-chip dynamically maintained flow table.

And best yet it doesn't turn off every feature in the networking nor
bypass it for the actual protocol processing.

2007-08-28 19:38:31

by Roland Dreier

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

Sorry for the long latency, I was at the beach all last week.

> > And direct data placement really does give you a factor of two at
> > least, because otherwise you're stuck receiving the data in one
> > buffer, looking at some of the data at least, and then figuring out
> > where to copy it. And memory bandwidth is if anything becoming more
> > valuable; maybe LRO + header splitting + page remapping tricks can get
> > you somewhere but as NCPUS grows then it seems the TLB shootdown cost
> > of page flipping is only going to get worse.

> As Herbert has said already, people can code for this just like
> they have to code for RDMA.

No argument, you need to change the interface to take advantage of RDMA.

> There is no fundamental difference from converting an application
> to sendfile or similar.

Yes, on the transmit side, there's not much difference from sendfile
or splice, although RDMA may give a slightly nicer interface that also
gives basically the equivalent of AIO.

> The only thing this needs is a
> "recvmsg_I_dont_care_where_the_data_is()" call. There are no alignment
> issues unless you are trying to push this data directly into the
> page cache.

I don't understand how this gives you the same thing as direct data
placement (DDP). There are many situations where the sender knows
where the data has to go and if there's some way to pass that to the
receiver, so that info can be used in the receive path to put the data
in the right place, the receiver can save a copy. This is
fundamentally the same "offload" that an FC HBA does -- the SCSI
midlayer queues up commands like "read block A and put the data at
address X" and "read block B and put the data at address Y" and the
HBA matches tags on incoming data to put the blocks at the right
addresses, even if block B is received before block A.

RFC 4297 has some discussion of the various approaches, and while you
might not agree with their conclusions, it is interesting reading.

> Couple this with a card that makes sure that on a per-page basis, only
> data for a particular flow (or group of flows) will accumulate.

It seems that the NIC would also have to look into a TCP stream (and
handle out of order segments etc) to find message boundaries for this
to be equivalent to what an RDMA NIC does.

- R.

2007-08-28 20:43:34

by David Miller

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

From: Roland Dreier <[email protected]>
Date: Tue, 28 Aug 2007 12:38:07 -0700

> It seems that the NIC would also have to look into a TCP stream (and
> handle out of order segments etc) to find message boundaries for this
> to be equivalent to what an RDMA NIC does.

It would work for data that accumulates in-order, give or take a small
window, just like LRO does.

2007-10-08 21:55:04

by Steve Wise

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.



David Miller wrote:
> From: Sean Hefty <[email protected]>
> Date: Thu, 09 Aug 2007 14:40:16 -0700
>
>> Steve Wise wrote:
>>> Any more comments?
>> Does anyone have ideas on how to reserve the port space without using a
>> struct socket?
>
> How about we just remove the RDMA stack altogether? I am not at all
> kidding. If you guys can't stay in your sand box and need to cause
> problems for the normal network stack, it's unacceptable. We were
> told all along that if RDMA went into the tree none of this kind of
> stuff would be an issue.
>
> These are exactly the kinds of problems that people like myself
> were dreading. These subsystems have no business using the TCP port
> space of the Linux software stack, absolutely none.
>
> After TCP port reservation, what's next? It seems an at least
> bi-monthly event that the RDMA folks need to put their fingers
> into something else in the normal networking stack. No more.
>
> I will NACK any patch that opens up sockets to eat up ports or
> anything stupid like that.

Hey Dave,

The hack to use a socket and bind it to claim the port was just for
demonstrating the idea. The correct solution, IMO, is to enhance the
core low level 4-tuple allocation services to be more generic (eg: not
be tied to a struct sock). Then the host tcp stack and the host rdma
stack can allocate TCP/iWARP ports/4tuples from this common exported
service and share the port space. This allocation service could also be
used by other deep adapters like iscsi adapters if needed.

Will you NAK such a solution if I go implement it and submit for review?
The dual ip subnet solution really sux, and I'm trying one more time
to see if you will entertain the common port space solution, if done
correctly.

Thanks,

Steve.

2007-10-09 13:44:33

by James Lentini

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.


On Mon, 8 Oct 2007, Steve Wise wrote:

> The correct solution, IMO, is to enhance the core low level 4-tuple
> allocation services to be more generic (eg: not be tied to a struct
> sock). Then the host tcp stack and the host rdma stack can allocate
> TCP/iWARP ports/4tuples from this common exported service and share
> the port space. This allocation service could also be used by other
> deep adapters like iscsi adapters if needed.

As a developer of an RDMA ULP, NFS-RDMA, I like this approach because
it will simplify the configuration of an RDMA device and the services
that use it.

2007-10-10 21:02:59

by Sean Hefty

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

> The hack to use a socket and bind it to claim the port was just for
> demonstrating the idea. The correct solution, IMO, is to enhance the
> core low level 4-tuple allocation services to be more generic (eg: not
> be tied to a struct sock). Then the host tcp stack and the host rdma
> stack can allocate TCP/iWARP ports/4tuples from this common exported
> service and share the port space. This allocation service could also be
> used by other deep adapters like iscsi adapters if needed.

Since iWarp runs on top of TCP, the port space is really the same.
FWIW, I agree that this proposal is the correct solution to support iWarp.

- Sean

2007-10-10 23:05:06

by David Miller

Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

From: Sean Hefty <[email protected]>
Date: Wed, 10 Oct 2007 14:01:07 -0700

> > The hack to use a socket and bind it to claim the port was just for
> > demonstrating the idea. The correct solution, IMO, is to enhance the
> > core low level 4-tuple allocation services to be more generic (eg: not
> > be tied to a struct sock). Then the host tcp stack and the host rdma
> > stack can allocate TCP/iWARP ports/4tuples from this common exported
> > service and share the port space. This allocation service could also be
> > used by other deep adapters like iscsi adapters if needed.
>
> Since iWarp runs on top of TCP, the port space is really the same.
> FWIW, I agree that this proposal is the correct solution to support iWarp.

But you can be sure it's not going to happen, sorry.

It would mean that we'd need to export the entire TCP socket table so
that when iWARP connections are created you can search to make sure
there is not an existing full 4-tuple that is the same.

It is not just about local TCP ports.

iWARP needs to live in its separate little container and not
contaminate the rest of the networking; this is the deal. Any
such suggested change which breaks that deal will be NACK'd by all of
the core networking developers.