LinuxLists.cc - InfiniBand/RDMA merge plans for 2.6.24

2007-09-13 17:57:49

Subject: InfiniBand/RDMA merge plans for 2.6.24

With 2.6.24 probably opening in the not-too-distant future, it's
probably a good time to review what my plans are for when the merge
window opens.

At the kernel summit, we discussed patch review (doing a web search
for "kernel summit" "reviewed-by:" should turn up lots of info on
this). Due to an unfortunate combination of vacation and conference
travel, summer colds, and other inconveniences, I am very backed up on
reviewing. And in any case, I've allowed too much code review to be
dumped on me -- when there are dozens of people working on IB and RDMA
stuff, it obviously doesn't work to expect me to do all the reviewing.

Unfortunately, due to the length of the backlog and the fact that
2.6.23 seems fairly close, some of the things listed below are going
to miss the 2.6.24 merge window. So, although the plan is to phase in
requiring "Reviewed-by:" gently, for this merge, if you can get
someone other than me to review your work, then the chances of it
being merged increase dramatically. I'm talking about a real review--
ideally, someone independent (from another company would be good) who
is willing to provide a "Reviewed-by:" line that means the reviewer
has really looked at and thought about the patch. There should be a
mailing list thread you can point me at where the reviewer comments on
the patch and a new version of that patch addressing all comments is
posted (or in exceptional cases, where the patch is perfect to start
with, where the reviewer says the patch is great).

For example, given the number of IPoIB changes pending, it might be a
good idea for the people submitting them to get together and trade
reviews (ie "If you review my patch, I'll review your patch"). There
are a few cases where getting a review may not be necessary. First of
all, trivial and obvious patches don't need a review. It's a
judgement call what is trivial or obvious, and it's always a good idea
to provide a changelog that makes it clear why a patch is trivial and
obviously correct. Second, hardware driver patches may not make sense
to anyone outside of the company whose hardware the driver is for.
Still, in this case, an internal Reviewed-by: would be nice, and also
a changelog that explains the reason for the change always helps
(don't just tell me what your patch does, but also explain what the
patch fixes and what the impact of the current situation is).

Anyway, here are all the pending things that I'm aware of. As usual,
if something isn't already in my tree and isn't listed below, I
probably missed it or dropped it by mistake. Please remind me again
in that case.

Core:

- My user_mad P_Key index support patch. I'll test the ioctl to
change to the new mode and merge this I guess, since Hal and Sean
have tested this out.

- A fix to the user_mad 32-bit big-endian userspace 64/32 problem
with the method_mask when registering agents. I'll write a patch
to handle this in a way that doesn't change the ABI for anything
other than the broken case and hope to get someone to review this
so it can be merged.

- Sean's QoS changes. These look fine at first glance, and I just
plan to understand the backwards compatibility story (ie how this
works with an old SM) and merge. Anyone who objects let me know.

- Sean's IB CM MRA interface changes. Don't know at this point. It
seems OK but I'm not clear on what if any real-world improvement
this gives us.

ULPs:

- Pradeep's IPoIB CM support for devices that don't have SRQs. I
think the basic approach makes sense (I don't think faking SRQs at
some other layer is really feasible) and I need to find time to
look at the details to see if the current patch looks workable. I'm
likely to merge this; getting an independent Reviewed-by: would
certainly be appreciated too.

- Moni's IPoIB bonding support. This seems mostly an issue of
getting the core bonding maintainer's attention. However getting a
Reviewed-by: for the IPoIB changes wouldn't hurt too.

- Rolf's IPoIB MGID scope changes. Certainly we want to fix this
issue but the specific changes need review.

- Eli and Michael's IPoIB stateless offload (checksum offload, LSO,
LRO, etc). It's a big series that makes quite a few core changes.
I think it needs some careful review and is probably at risk of
missing this merge window. Sorting in order of invasiveness so we
can merge at least some of it (if splitting it makes sense) might
be a good idea.

HW specific:

- I already merged patches to enable MSI-X by default for mthca and
mlx4. I hope there aren't too many systems that get hosed if a
MSI-X interrupt is generated.

- Jack and Michael's mlx4 FMR support. Will merge I guess, although
I do hope to have time to address the DMA API abuse that is being
copied from mthca, so that mlx4 and mthca work in Xen domU.

- ehca patch queue. Will merge, pending fixes for the few minor
issues I commented on.

- Steve's mthca router mode support. Would be nice to see a review
from someone at Mellanox.

- Arthur's mthca doorbell alignment fixes. I will experiment with a
few different approaches and post what I like (and fix mlx4 as
well). I hope Arthur can review.

- Michael's mlx4 WQE shrinking patch. Not sure yet; I'll reply to
the latest patch directly.

Here are a few topics that I believe will not be ready in time for the
2.6.24 window and will need to wait for 2.6.25:

- Multiple CQ event vector support. I haven't seen any discussions
about how ULPs or userspace apps should decide which vector to use,
and hence no progress has been made since we deferred this during
the 2.6.23 merge window.

- XRC. Given the length of the backlog above and the fact that a
first draft of this code has not been posted yet, I don't see any
way that we could have something this major ready in time.

Here is the complete list of patches I have in my for-2.6.24 branch
waiting for the merge window so far. Mostly I haven't merged anything
big out of my backlog, so this is essentially all

Ali Ayoub (1):
IB/sa: Error handling thinko fix

Anton Blanchard (3):
IB/fmr_pool: Clean up some error messages in fmr_pool.c
IB/ehca: Make output clearer by removing some debug messages
IB/ehca: Export module parameters in sysfs

Dotan Barak (1):
mlx4_core: Use enum value GO_BIT_TIMEOUT_MSECS

Eli Cohen (2):
IPoIB: Fix typo to end statement with ';' instead of ','
IPoIB: Fix error path memory leak

Michael S. Tsirkin (2):
mlx4_core: Enable MSI-X by default
IB/mthca: Enable MSI-X by default

Peter Oruba (1):
IB/mthca: Use PCI-X/PCI-Express read control interfaces

Roland Dreier (6):
IPoIB: Make sure no receives are handled when stopping device
IB: find_first_zero_bit() takes unsigned pointer
mlx4_core: Don't free special QPs in QP number bitmap
IB/mlx4: Use set_data_seg() in mlx4_ib_post_recv()
IB/ehca: Include <linux/mutex.h> from ehca_classes.h
IB/mlx4: Fix up SRQ limit_watermark endianness

Steve Wise (1):
RDMA/cxgb3: Make the iw_cxgb3 module parameters writable

2007-09-13 18:04:46

by Steve Wise

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

Hey Roland,

I was about to post v2 of my patch to avoid port space collisions with
the native stack. Can we get that 2.6.24? It is high priority IMO.
I've tried to solicit review on it, but I think folks are reluctant... ;-)

Steve.

Roland Dreier wrote:
> With 2.6.24 probably opening in the not-too-distant future, it's
> probably a good time to review what my plans are for when the merge
> window opens.
>
> At the kernel summit, we discussed patch review (doing a web search
> for "kernel summit" "reviewed-by:" should turn up lots of info on
> this). Due to an unfortunate combination of vacation and conference
> travel, summer colds, and other inconveniences, I am very backed up on
> reviewing. And in any case, I've allowed too much code review to be
> dumped on me -- when there are dozens of people working on IB and RDMA
> stuff, it obviously doesn't work to expect me to do all the reviewing.
>
> Unfortunately, due to the length of the backlog and the fact that
> 2.6.23 seems fairly close, some of the things listed below are going
> to miss the 2.6.24 merge window. So, although the plan is to phase in
> requiring "Reviewed-by:" gently, for this merge, if you can get
> someone other than me to review your work, then the chances of it
> being merged increase dramatically. I'm talking about a real review--
> ideally, someone independent (from another company would be good) who
> is willing to provide a "Reviewed-by:" line that means the reviewer
> has really looked at and thought about the patch. There should be a
> mailing list thread you can point me at where the reviewer comments on
> the patch and a new version of that patch addressing all comments is
> posted (or in exceptional cases, where the patch is perfect to start
> with, where the reviewer says the patch is great).
>
> For example, given the number of IPoIB changes pending, it might be a
> good idea for the people submitting them to get together and trade
> reviews (ie "If you review my patch, I'll review your patch"). There
> are a few cases where getting a review may not be necessary. First of
> all, trivial and obvious patches don't need a review. It's a
> judgement call what is trivial or obvious, and it's always a good idea
> to provide a changelog that makes it clear why a patch is trivial and
> obviously correct. Second, hardware driver patches may not make sense
> to anyone outside of the company whose hardware the driver is for.
> Still, in this case, an internal Reviewed-by: would be nice, and also
> a changelog that explains the reason for the change always helps
> (don't just tell me what your patch does, but also explain what the
> patch fixes and what the impact of the current situation is).
>
> Anyway, here are all the pending things that I'm aware of. As usual,
> if something isn't already in my tree and isn't listed below, I
> probably missed it or dropped it by mistake. Please remind me again
> in that case.
>
> Core:
>
> - My user_mad P_Key index support patch. I'll test the ioctl to
> change to the new mode and merge this I guess, since Hal and Sean
> have tested this out.
>
> - A fix to the user_mad 32-bit big-endian userspace 64/32 problem
> with the method_mask when registering agents. I'll write a patch
> to handle this in a way that doesn't change the ABI for anything
> other than the broken case and hope to get someone to review this
> so it can be merged.
>
> - Sean's QoS changes. These look fine at first glance, and I just
> plan to understand the backwards compatibility story (ie how this
> works with an old SM) and merge. Anyone who objects let me know.
>
> - Sean's IB CM MRA interface changes. Don't know at this point. It
> seems OK but I'm not clear on what if any real-world improvement
> this gives us.
>
> ULPs:
>
> - Pradeep's IPoIB CM support for devices that don't have SRQs. I
> think the basic approach makes sense (I don't think faking SRQs at
> some other layer is really feasible) and I need to find time to
> look at the details to see if the current patch looks workable. I'm
> likely to merge this; getting an independent Reviewed-by: would
> certainly be appreciated too.
>
> - Moni's IPoIB bonding support. This seems mostly an issue of
> getting the core bonding maintainer's attention. However getting a
> Reviewed-by: for the IPoIB changes wouldn't hurt too.
>
> - Rolf's IPoIB MGID scope changes. Certainly we want to fix this
> issue but the specific changes need review.
>
> - Eli and Michael's IPoIB stateless offload (checksum offload, LSO,
> LRO, etc). It's a big series that makes quite a few core changes.
> I think it needs some careful review and is probably at risk of
> missing this merge window. Sorting in order of invasiveness so we
> can merge at least some of it (if splitting it makes sense) might
> be a good idea.
>
> HW specific:
>
> - I already merged patches to enable MSI-X by default for mthca and
> mlx4. I hope there aren't too many systems that get hosed if a
> MSI-X interrupt is generated.
>
> - Jack and Michael's mlx4 FMR support. Will merge I guess, although
> I do hope to have time to address the DMA API abuse that is being
> copied from mthca, so that mlx4 and mthca work in Xen domU.
>
> - ehca patch queue. Will merge, pending fixes for the few minor
> issues I commented on.
>
> - Steve's mthca router mode support. Would be nice to see a review
> from someone at Mellanox.
>
> - Arthur's mthca doorbell alignment fixes. I will experiment with a
> few different approaches and post what I like (and fix mlx4 as
> well). I hope Arthur can review.
>
> - Michael's mlx4 WQE shrinking patch. Not sure yet; I'll reply to
> the latest patch directly.
>
> Here are a few topics that I believe will not be ready in time for the
> 2.6.24 window and will need to wait for 2.6.25:
>
> - Multiple CQ event vector support. I haven't seen any discussions
> about how ULPs or userspace apps should decide which vector to use,
> and hence no progress has been made since we deferred this during
> the 2.6.23 merge window.
>
> - XRC. Given the length of the backlog above and the fact that a
> first draft of this code has not been posted yet, I don't see any
> way that we could have something this major ready in time.
>
> Here is the complete list of patches I have in my for-2.6.24 branch
> waiting for the merge window so far. Mostly I haven't merged anything
> big out of my backlog, so this is essentially all
>
> Ali Ayoub (1):
> IB/sa: Error handling thinko fix
>
> Anton Blanchard (3):
> IB/fmr_pool: Clean up some error messages in fmr_pool.c
> IB/ehca: Make output clearer by removing some debug messages
> IB/ehca: Export module parameters in sysfs
>
> Dotan Barak (1):
> mlx4_core: Use enum value GO_BIT_TIMEOUT_MSECS
>
> Eli Cohen (2):
> IPoIB: Fix typo to end statement with ';' instead of ','
> IPoIB: Fix error path memory leak
>
> Michael S. Tsirkin (2):
> mlx4_core: Enable MSI-X by default
> IB/mthca: Enable MSI-X by default
>
> Peter Oruba (1):
> IB/mthca: Use PCI-X/PCI-Express read control interfaces
>
> Roland Dreier (6):
> IPoIB: Make sure no receives are handled when stopping device
> IB: find_first_zero_bit() takes unsigned pointer
> mlx4_core: Don't free special QPs in QP number bitmap
> IB/mlx4: Use set_data_seg() in mlx4_ib_post_recv()
> IB/ehca: Include <linux/mutex.h> from ehca_classes.h
> IB/mlx4: Fix up SRQ limit_watermark endianness
>
> Steve Wise (1):
> RDMA/cxgb3: Make the iw_cxgb3 module parameters writable
> _______________________________________________
> general mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

2007-09-13 18:20:54

by Hefty, Sean

[permalink] [raw]

Subject: RE: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

> - My user_mad P_Key index support patch. I'll test the ioctl to
> change to the new mode and merge this I guess, since Hal and Sean
> have tested this out.

I can give this patch a reviewed-by: too, and I will also try to review a couple
of the pending ipoib patches.

> - Sean's QoS changes. These look fine at first glance, and I just
> plan to understand the backwards compatibility story (ie how this
> works with an old SM) and merge. Anyone who objects let me know.

The new QoS fields fall into fields that are currently reserved, which should be
ignored by an older SM. I've only tested this against openSM however.

> - Sean's IB CM MRA interface changes. Don't know at this point. It
> seems OK but I'm not clear on what if any real-world improvement
> this gives us.

This patch was generated in response to an Intel MPI issue. We've seen MPI take
several minutes to respond to a connection request during the middle of large
application runs. When this happens, the active side times out the connection.
In OFED, we added module parameters to adjust the rdma_cm connection timeout on
the active side, but I believe that sending an MRA from the passive side is a
better solution.

- Sean

2007-09-13 18:22:33

by Shirley Ma

[permalink] [raw]

Subject: Re: InfiniBand/RDMA merge plans for 2.6.24

Hello Roland,

Since ehca can support 4K MTU, we would like to see a patch in
IPoIB to allow link MTU to be up to 4K instead of current 2K for 2.6.24
kernel. The idea is IPoIB link MTU will pick up a return value from SM's
default broadcast MTU. This patch should be a small patch, I hope you are
OK with this.

Thanks
Shirley

2007-09-13 18:56:47

by Jeff Garzik

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

Steve Wise wrote:
> I was about to post v2 of my patch to avoid port space collisions with
> the native stack. Can we get that 2.6.24? It is high priority IMO.
> I've tried to solicit review on it, but I think folks are reluctant... ;-)

Well, if it involves /sharing/ port space with the native stack, i.e.
where port 1234 is IB but 1235 is Linux, pretty much all the networking
devs have NAK'd that approach AFAICS.

Jeff

2007-09-13 18:59:32

by Steve Wise

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

Jeff Garzik wrote:
> Steve Wise wrote:
>> I was about to post v2 of my patch to avoid port space collisions with
>> the native stack. Can we get that 2.6.24? It is high priority IMO.
>> I've tried to solicit review on it, but I think folks are reluctant...
>> ;-)
>
> Well, if it involves /sharing/ port space with the native stack, i.e.
> where port 1234 is IB but 1235 is Linux, pretty much all the networking
> devs have NAK'd that approach AFAICS.
>

Jeff, I posted a fix that doesn't do this. No port sharing. The iwarp
device will use its own ip address and subnet to avoid collisions. You
should review the patch when I post v2.

Thanks,

Steve.

2007-09-13 19:55:39

by Jeff Garzik

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

Steve Wise wrote:
> Jeff Garzik wrote:
>> Steve Wise wrote:
>>> I was about to post v2 of my patch to avoid port space collisions
>>> with the native stack. Can we get that 2.6.24? It is high priority
>>> IMO. I've tried to solicit review on it, but I think folks are
>>> reluctant... ;-)

>> Well, if it involves /sharing/ port space with the native stack, i.e.
>> where port 1234 is IB but 1235 is Linux, pretty much all the
>> networking devs have NAK'd that approach AFAICS.

> Jeff, I posted a fix that doesn't do this. No port sharing. The iwarp
> device will use its own ip address and subnet to avoid collisions. You
> should review the patch when I post v2.

Sounds promising, then! :)

Jeff

2007-09-13 21:00:59

by Roland Dreier

[permalink] [raw]

Subject: Re: InfiniBand/RDMA merge plans for 2.6.24

> Since ehca can support 4K MTU, we would like to see a patch in
> IPoIB to allow link MTU to be up to 4K instead of current 2K for 2.6.24
> kernel. The idea is IPoIB link MTU will pick up a return value from SM's
> default broadcast MTU. This patch should be a small patch, I hope you are
> OK with this.

It's actually not small, since it turns the skb allocation into a
4100-byte buffer, which ends up being more than 1 page usually, which
means it fails if memory is fragmented.

Anyway given the backlog anything substantial that hasn't been posted
already is almost surely going to have to wait until 2.6.25.

2007-09-13 21:03:20

by Roland Dreier

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

> > - My user_mad P_Key index support patch. I'll test the ioctl to
> > change to the new mode and merge this I guess, since Hal and Sean
> > have tested this out.
>
> I can give this patch a reviewed-by: too, and I will also try to review a couple
> of the pending ipoib patches.

Thanks!

> > - Sean's QoS changes. These look fine at first glance, and I just
> > plan to understand the backwards compatibility story (ie how this
> > works with an old SM) and merge. Anyone who objects let me know.
>
> The new QoS fields fall into fields that are currently reserved, which should be
> ignored by an older SM. I've only tested this against openSM however.

That seems OK -- I'm OK with breaking things if an SM is clearly buggy
(and not ignoring fields that are defined to be ignored in the spec
would certainly be a clear bug to me).

> This patch was generated in response to an Intel MPI issue. We've seen MPI take
> several minutes to respond to a connection request during the middle of large
> application runs. When this happens, the active side times out the connection.
> In OFED, we added module parameters to adjust the rdma_cm connection timeout on
> the active side, but I believe that sending an MRA from the passive side is a
> better solution.

OK -- just to make sure I'm understanding what you're saying: have you
confirmed that your proposed patches actually fix the issue?

- R.

2007-09-13 21:11:42

by Roland Dreier

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

> I was about to post v2 of my patch to avoid port space collisions with
> the native stack. Can we get that 2.6.24? It is high priority
> IMO. I've tried to solicit review on it, but I think folks are
> reluctant... ;-)

I would like to get this in, but I'm still at least a little
reluctant, since we would be committing to a user interface that seems
a little awkward at best, so I'd like to try and find something
better. Just to summarize my understanding:

- your patch requires the administration to configure an ethX:iwY
alias address to use iwarp. (By the way is there anything other
than "don't do that" that avoids assigning the same address to the
iwarp alias and a non-iwarp interface?)

- it would be nicer to create the alias automatically, but an alias
without an address doesn't make sense. Creating a whole separate
net device causes problems because the iwarp stuff still needs to
use the main net device to do ARP etc.

- so I'm out of better ideas but I still want to push back a little
before we commit to something ugly.

I've been meaning to track down the bnx2 iscsi offload patch to look
and see if this issue is addressed, since the same problem seems to
exist: it seems an iscsi connection and a main stack tcp connection
might share the same 4-tuple unless something is done to avoid that
happening.

Also, I think it behooves us to get some agreement on this approach
with NetEffect and Kanoj (NetXen?) at least, since their iwarp drivers
seem to be imminent.

- R.

2007-09-13 21:12:48

by Roland Dreier

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

> Well, if it involves /sharing/ port space with the native stack,
> i.e. where port 1234 is IB but 1235 is Linux, pretty much all the
> networking devs have NAK'd that approach AFAICS.

Just to be clear, InfiniBand has no problem; the issue is port
collisions involving iWARP connections.

- R.

2007-09-13 22:02:32

by Michael Chan

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

On Thu, 2007-09-13 at 14:11 -0700, Roland Dreier wrote:

>
> I've been meaning to track down the bnx2 iscsi offload patch to look
> and see if this issue is addressed, since the same problem seems to
> exist: it seems an iscsi connection and a main stack tcp connection
> might share the same 4-tuple unless something is done to avoid that
> happening.
>

iSCSI does not do passive listens, only active connections to the
target. But you're right, the port space is still shared between iSCSI
and the main stack. We currently rely on user apps binding to the main
stack to reserve certain ephemeral ports, and telling the iSCSI driver
which ports to use.

2007-09-14 12:56:12

by Evgeniy Polyakov

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

On Thu, Sep 13, 2007 at 01:59:21PM -0500, Steve Wise ([email protected]) wrote:
> >Well, if it involves /sharing/ port space with the native stack, i.e.
> >where port 1234 is IB but 1235 is Linux, pretty much all the networking
> >devs have NAK'd that approach AFAICS.
>
> Jeff, I posted a fix that doesn't do this. No port sharing. The iwarp
> device will use its own ip address and subnet to avoid collisions. You
> should review the patch when I post v2.

Could you please resend it, since I missed it in netdev@.

--
Evgeniy Polyakov

2007-09-14 16:09:26

by Roland Dreier

[permalink] [raw]

Subject: Re: InfiniBand/RDMA merge plans for 2.6.24

> The patch is just needed to pick up broadcast MTU size instead of hard
> coding 2K right now. SKB allocation shouldn't be different with Ethernet
> Jambo Frame and IPoIB-CM which 64K MTU. I don't understand why it's
> different. Could you please explain this?

It's exactly the same problem as ethernet jumbo frames. A web search
for '"order 1" failure e1000' might be interesting.

IPoIB CM handles this properly by gathering together single pages in
skbs' fragment lists.

- R.

2007-09-14 16:18:46

by Roland Dreier

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

> > I've been meaning to track down the bnx2 iscsi offload patch to look
> > and see if this issue is addressed, since the same problem seems to
> > exist: it seems an iscsi connection and a main stack tcp connection
> > might share the same 4-tuple unless something is done to avoid that
> > happening.

> iSCSI does not do passive listens, only active connections to the
> target. But you're right, the port space is still shared between iSCSI
> and the main stack. We currently rely on user apps binding to the main
> stack to reserve certain ephemeral ports, and telling the iSCSI driver
> which ports to use.

Got it... I wasn't thinking that clearly, but it is clear that a full
4-tuple collision with only active connections is quite unlikely. I
guess you would have to make both an offloaded and a non-offloaded
iSCSI connection to the same target and get really unlucky with
ephemeral port allocation. So in practice I guess it's not an issue
at all with your driver yet.

However, do you have any plans to support iSCSI offload for targets?
Also, looking at the first CNIC patch, I can't help but notice that
you seem to have at least some support for iWARP there. How does the
CNIC look? Does it share the same interface/addresses as the
non-offload NIC, or does it create a completely separate netdevice?

I want to make sure that whatever solution we come up with for cxgb3
doesn't cause problems for you. And of course if you have a better
idea than what Steve has come up with, that would be great :)

- R.

2007-09-14 17:46:16

by Hefty, Sean

[permalink] [raw]

Subject: RE: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

>OK -- just to make sure I'm understanding what you're saying: have you
>confirmed that your proposed patches actually fix the issue?

Not directly. I cannot easily test kernel patches on our larger, production
clusters. We've seen the issue with specific applications on 512 and 1024
cores, but I've only been able to test the patch on a 48-core cluster. I have
verified that it successfully increases the timeout to where it *should* work,
but cannot absolutely confirm that it will fix the problem. I'm unlikely to
know that until the production clusters move to an OFED release (1.3?)
containing this patch.

- Sean

2007-09-14 18:36:25

by Shirley Ma

[permalink] [raw]

Subject: Re: InfiniBand/RDMA merge plans for 2.6.24

> IPoIB CM handles this properly by gathering together single pages in
> skbs' fragment lists.
>
> - R.

Then can we reuse IPoIB CM code here?

Thanks
Shirley

2007-09-14 20:12:22

by Michael Chan

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

On Fri, 2007-09-14 at 09:18 -0700, Roland Dreier wrote:

> However, do you have any plans to support iSCSI offload for targets?
> Also, looking at the first CNIC patch, I can't help but notice that
> you seem to have at least some support for iWARP there. How does the
> CNIC look? Does it share the same interface/addresses as the
> non-offload NIC, or does it create a completely separate netdevice?

We will support iWARP in the future and it should be similar to the way
we do iSCSI - using the same interface/addresses as the bnx2 NIC.

>
> I want to make sure that whatever solution we come up with for cxgb3
> doesn't cause problems for you. And of course if you have a better
> idea than what Steve has come up with, that would be great :)
>

We are looking at these discussions with great interest. If we have any
new ideas, we'll definitely let everyone know. Thanks.

2007-09-15 14:04:14

by Steve Wise

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

Roland Dreier wrote:
> > I was about to post v2 of my patch to avoid port space collisions with
> > the native stack. Can we get that 2.6.24? It is high priority
> > IMO. I've tried to solicit review on it, but I think folks are
> > reluctant... ;-)
>
> I would like to get this in, but I'm still at least a little
> reluctant, since we would be committing to a user interface that seems
> a little awkward at best, so I'd like to try and find something
> better. Just to summarize my understanding:
>
> - your patch requires the administration to configure an ethX:iwY
> alias address to use iwarp. (By the way is there anything other
> than "don't do that" that avoids assigning the same address to the
> iwarp alias and a non-iwarp interface?)
>

Nope. Its totally up to the admin to create the ethX:iwY interface
-and- to segment his services so host TCP runs on the ethX subnet(s) and
the iwarp rdma ones run on ethX:iwY subnet(s). Without changing the
core network serices, I don't see any way around this.

> - it would be nicer to create the alias automatically, but an alias
> without an address doesn't make sense. Creating a whole separate
> net device causes problems because the iwarp stuff still needs to
> use the main net device to do ARP etc.
>

I do log a warning if an iwarp application binds to address 0.0.0.0 and
there are no ethX:iwY address available.

> - so I'm out of better ideas but I still want to push back a little
> before we commit to something ugly.
>

Me 2. :-(

> I've been meaning to track down the bnx2 iscsi offload patch to look
> and see if this issue is addressed, since the same problem seems to
> exist: it seems an iscsi connection and a main stack tcp connection
> might share the same 4-tuple unless something is done to avoid that
> happening.
>
> Also, I think it behooves us to get some agreement on this approach
> with NetEffect and Kanoj (NetXen?) at least, since their iwarp drivers
> seem to be imminent.
>
> - R.

2007-09-16 09:50:33

by Or Gerlitz

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

Roland Dreier wrote:
> With 2.6.24 probably opening in the not-too-distant future, it's
> probably a good time to review what my plans are for when the merge
> window opens.

> Core:
> - Sean's QoS changes. These look fine at first glance, and I just
> plan to understand the backwards compatibility story (ie how this
> works with an old SM) and merge. Anyone who objects let me know.

Hi Roland,

I have reviewed the qos patches and provided comments which were
deployed in v2 of the series. I also tested it (ipoib and iser which is
rdma-cm based) against the Voltaire SM/SA to see that nothing was
broken. I will send you a "reviewed by:" signature.

> ULPs:

> [ofa-general] [PATCH RFC] IB/ipoib: enable IGMP for userpsace multicast IB apps
The IGMP enabling patch posted by me on September 2nd isn't on your list
http://lists.openfabrics.org/pipermail/general/2007-September/040250.html
can you add it?

> - Moni's IPoIB bonding support. This seems mostly an issue of
> getting the core bonding maintainer's attention. However getting a
> Reviewed-by: for the IPoIB changes wouldn't hurt too.

Jay Vosburgh, the bonding driver maintainer just sent an ack on all
patch series. As for the IPoIB changes, there are three patches, where
two of them, namely
> [PATCH 02/11] IB/ipoib: Notify the world before doing unregister
> [PATCH 04/11] IB/ipoib: Verify address handle validity on send
are handling a corner-case problems pointed by Michael Tsirkin.
Michael, will you be able to look on it and provide a reviewed-by
signature? the third patch
> [PATCH 03/11] IB/ipoib: Bound the net device to the ipoib_neigh structue
is somehow much more simple, I don't think more review is needed for it.

> - Eli and Michael's IPoIB stateless offload (checksum offload, LSO,
> LRO, etc). It's a big series that makes quite a few core changes.
> I think it needs some careful review and is probably at risk of
> missing this merge window. Sorting in order of invasiveness so we
> can merge at least some of it (if splitting it makes sense) might
> be a good idea.

Just for the record, the 'etc' above relates to the interrupt moderation
support (mlx4, core, ipoib {config through ethertool, usage). Among
other things, what is not clear to me here is if/how this goes
hand-in-hand with NAPI.

As you saw the patch adding checksum offload support had a long thread,
and I think the discussion has reached the point where Michael is
waiting for your take on it.

As for the LSO, LRO patches, I did not see any review comment.

I will see that I can review from the series, to begin with, will send
Eli some comments and questions.

> HW specific:
> - Jack and Michael's mlx4 FMR support. Will merge I guess, although
> I do hope to have time to address the DMA API abuse that is being
> copied from mthca, so that mlx4 and mthca work in Xen domU.

This patch series is somehow important as without them iser is useless
over connectx. Can be nice if you merge this and at max fix the abuse later.

Or.

2007-09-17 21:47:55

by Roland Dreier

[permalink] [raw]

Subject: Re: InfiniBand/RDMA merge plans for 2.6.24

> > IPoIB CM handles this properly by gathering together single pages in
> > skbs' fragment lists.

> Then can we reuse IPoIB CM code here?

Yes, if possible, refactoring things so that the rx skb allocation
code becomes common between CM and non-CM would definitely make sense.

2007-09-17 22:11:38

by Roland Dreier

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

> The IGMP enabling patch posted by me on September 2nd isn't on your list
> http://lists.openfabrics.org/pipermail/general/2007-September/040250.html
> can you add it?

Yes, I lost that somehow. I will add it to my list of things to take
a look at (no opinion yet).

- R.

2007-09-18 07:05:33

by jackm

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

On Thursday 13 September 2007 20:57, Roland Dreier wrote:
> HW specific:
>
> ?- I already merged patches to enable MSI-X by default for mthca and
> ? ?mlx4. ?I hope there aren't too many systems that get hosed if a
> ? ?MSI-X interrupt is generated.
>
> ?- Jack and Michael's mlx4 FMR support. ?Will merge I guess, although
> ? ?I do hope to have time to address the DMA API abuse that is being
> ? ?copied from mthca, so that mlx4 and mthca work in Xen domU.
>
> ?- ehca patch queue. ?Will merge, pending fixes for the few minor
> ? ?issues I commented on.
>
> ?- Steve's mthca router mode support. ?Would be nice to see a review
> ? ?from someone at Mellanox.
>
> ?- Arthur's mthca doorbell alignment fixes. ?I will experiment with a
> ? ?few different approaches and post what I like (and fix mlx4 as
> ? ?well). ?I hope Arthur can review.
>
> ?- Michael's mlx4 WQE shrinking patch. ?Not sure yet; I'll reply to
> ? ?the latest patch directly.
>
Missing from this list (IMPORTANT patch!):
[ofa-general] [PATCH 2 of 2] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists
(Posted by me to list on Sept 4)
{patch header:
This is an addendum to Roland's commit 0e6e74162164d908edf7889ac66dca09e7505745
(June 18). This addendum adds prefetch headroom marking processing for s/g segments.

We write s/g segments in reverse order into the WQE, in order to guarantee
that the first dword of all cachelines containing s/g segments is written last
(overwriting the headroom invalidation pattern). The entire cacheline will thus
contain valid data when the invalidation pattern is overwritten.
}
This patch series (1 of 2 is for libmlx4, the same issue).
============================================================

Also, I'm now posting (in a separate post) the following patch to mlx4, which is important:
display the following device information via sysfs:
board_id, fw_ver, hw_rev, hca_type.

The info is displayed under directory /sys/class/infiniband/mlx4_x, where x is
the pci bus sequence number (starting from zero).

This patch makes information available to ibstat and ibv_devinfo under the
same directory as is used for tavor/arbel/sinai -- thus requiring no userspace
modifications.

- Jack

2007-09-18 16:35:13

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: InfiniBand/RDMA merge plans for 2.6.24

> Quoting Roland Dreier <[email protected]>:
> Subject: InfiniBand/RDMA merge plans for 2.6.24
>
> With 2.6.24 probably opening in the not-too-distant future, it's
> probably a good time to review what my plans are for when the merge
> window opens.

Roland, could you merge the common TX CQ patch please?
It actually fixes a real problem.

--
MST

2007-09-18 17:18:20

by Roland Dreier

[permalink] [raw]

Subject: Re: InfiniBand/RDMA merge plans for 2.6.24

> Roland, could you merge the common TX CQ patch please?
> It actually fixes a real problem.

Yes, I will, but it collides with the net-2.6.24 NAPI rework I think,
so it may not go in until a few days after the merge window.

Have you verified that the patch cures the interrupt overload issues?

2007-09-18 17:23:23

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: InfiniBand/RDMA merge plans for 2.6.24

> Quoting Roland Dreier <[email protected]>:
> Subject: Re: InfiniBand/RDMA merge plans for 2.6.24
>
> > Roland, could you merge the common TX CQ patch please?
> > It actually fixes a real problem.
>
> Yes, I will, but it collides with the net-2.6.24 NAPI rework I think,
> so it may not go in until a few days after the merge window.
>
> Have you verified that the patch cures the interrupt overload issues?

Yes.

--
MST

2007-10-02 18:27:20

by Roland Dreier

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

> >OK -- just to make sure I'm understanding what you're saying: have you
> >confirmed that your proposed [CM MRA] patches actually fix the issue?
>
> Not directly. I cannot easily test kernel patches on our larger, production
> clusters. We've seen the issue with specific applications on 512 and 1024
> cores, but I've only been able to test the patch on a 48-core cluster. I have
> verified that it successfully increases the timeout to where it *should* work,
> but cannot absolutely confirm that it will fix the problem. I'm unlikely to
> know that until the production clusters move to an OFED release (1.3?)
> containing this patch.

Umm... this is a difficult situation for me to merge the changes then.
We're changing the CM retry behavior blind here. How do we know that
the MRA changes don't make the scalability issue worse?

- R.

2007-10-02 18:51:09

by Hefty, Sean

[permalink] [raw]

Subject: RE: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

>Umm... this is a difficult situation for me to merge the changes then.
>We're changing the CM retry behavior blind here. How do we know that
>the MRA changes don't make the scalability issue worse?

What's currently upstream doesn't work for Intel MPI on our larger clusters.
The connection requests time out on the active side before the passive side can
respond.

The OFED release works because it provides a kernel patch to make the timeout a
module parameter. I'm trying to avoid adding a module parameter, and the MRA is
designed for this situation.

I tested this by simulating a slow passive side responder, and it worked as
expected for those tests. Using an MRA does add another MAD to the CM exchange,
which is why it is sent only after seeing a duplicate request. Alternatively,
we can take the OFED module parameter patch.

- Sean

2007-10-03 18:44:23

by Shirley Ma

[permalink] [raw]

Subject: Re: InfiniBand/RDMA merge plans for 2.6.24

Roland Dreier <[email protected]> wrote on 09/17/2007 02:47:42 PM:

> > > IPoIB CM handles this properly by gathering together single pages
in
> > > skbs' fragment lists.
>
> > Then can we reuse IPoIB CM code here?
>
> Yes, if possible, refactoring things so that the rx skb allocation
> code becomes common between CM and non-CM would definitely make sense.

IPoIB-CM rx skb allocation is not generic to be used by UD, it allocates
more buffers than needed if mtu is not 64K, and doesn't query the real
max_num_sg from the device. I am thinking to have a generic skb allocation
in IPoIB based on matrix of (ipoib-mtu-size, page-size, max_num_sg,
head-size).

Thanks
Shirley

2007-10-05 23:11:25

by Roland Dreier

[permalink] [raw]

Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

> I tested this by simulating a slow passive side responder, and it worked as
> expected for those tests. Using an MRA does add another MAD to the CM exchange,
> which is why it is sent only after seeing a duplicate request. Alternatively,
> we can take the OFED module parameter patch.

What the heck, I added this for 2.6.24. If it doesn't work out we can
back it out.

- R.