2008-07-14 05:16:19

by Roland Dreier

[permalink] [raw]
Subject: InfiniBand/RDMA merge plans for 2.6.27

I've been a little lazy about writing this, and 2.6.26 is already out,
so it's time to review what my plans are for when the merge window opens.

I have to say that I've been somewhat disappointed with the quality of
things this cycle. Maybe I've just been in a tetchy mood these days,
but before you send me an email saying, "this is ready to go, when are
you going to merge it," please try doing some of the following:

- Run your patch through checkpatch.pl so I don't have to nag you to
fix trivial issues (or spend time fixing them myself).

- Read your patch over so I don't see a memory leak or deadlock as
soon as I look at it.

- Build your patch with sparse checking ("C=2 CF=-D__CHECK_ENDIAN__")
and make sure it doesn't introduce new warnings. (A big bonus in
goodwill for sending patches that fix old warnings.)

- Test your patch on a kernel with things like slab debugging and
lockdep turned on.
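
As a rough illustration of what the endian check in the sparse item
catches, here is a minimal userspace sketch. It assumes the same trick
the kernel headers use: under sparse (which defines __CHECKER__) a
big-endian integer gets a distinct "bitwise" type, so storing a
CPU-order value into it without a conversion draws a warning. Every
name prefixed with sketch_ is hypothetical and not a kernel API, and
htonl()/ntohl() stand in for cpu_to_be32()/be32_to_cpu().

```c
#include <stdint.h>
#include <arpa/inet.h>  /* htonl(), ntohl() */

/*
 * Under sparse these attributes make the typedef below a distinct
 * "bitwise" type; under an ordinary compiler they expand to nothing.
 */
#ifdef __CHECKER__
#define sketch_bitwise __attribute__((bitwise))
#define sketch_force   __attribute__((force))
#else
#define sketch_bitwise
#define sketch_force
#endif

typedef uint32_t sketch_bitwise sketch_be32;

/* Convert a CPU-order value to big-endian (assumed stand-in helper). */
static inline sketch_be32 sketch_cpu_to_be32(uint32_t v)
{
    return (sketch_force sketch_be32)htonl(v);
}

/* Convert a big-endian value back to CPU order. */
static inline uint32_t sketch_be32_to_cpu(sketch_be32 v)
{
    return ntohl((sketch_force uint32_t)v);
}

/* Hypothetical hardware descriptor with a big-endian key field. */
struct sketch_seg {
    sketch_be32 mem_key;
};

static inline void sketch_set_key(struct sketch_seg *seg, uint32_t rkey)
{
    seg->mem_key = sketch_cpu_to_be32(rkey);  /* converted, so no warning */
    /* seg->mem_key = rkey;   <- the kind of line the endian check flags */
}
```

A plain compiler accepts both assignments in sketch_set_key(); only the
sparse pass distinguishes them, which is why it has to be run
explicitly before sending a patch.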

And while you're waiting for me to get to your patch, I sure wouldn't
mind if you read and commented on someone else's patch. None of this
means you shouldn't remind me about pending patches, since I often
lose track of things and drop them accidentally.

Anyway, here are all the pending things that I'm aware of. As usual,
if something isn't already in my tree and isn't listed below, I
probably missed it or dropped it by mistake. Please remind me again
in that case.

Core:

- I'm waiting to merge the RDMA_CM_EVENT_ADDR_CHANGE changes that
depend on core networking changes until such changes are upstream.
Or, please remind me when that happens.

- Jack's XRC patch set. I think we're getting closer to converging
here, and I hope to get this merged but we're getting down to the
wire, so we'll see.

ULPs:

- I merged a decent amount of stuff for IPoIB, including LRO,
"loopback blocking," and better handling of fabric events, and I
don't think I have anything pending.

HW specific:

- Yevgeny's mlx4 changes. We'll see how much time is left after I
get done with XRC (which is before this on my list) but to be
honest I'm not sure how mergeable a lot of this is without the
mlx4_en patches that actually use it.

- I've been working on memory management extensions support for mlx4,
but I'm not sure if it will be ready in time. Firmware for this
may not be released for a while so it ain't urgent anyway.

Here are a few topics that I believe will not be ready in time for the
2.6.27 window and will need to wait for 2.6.28 at least:

- Remove LLTX from IPoIB. I haven't had time to finish this yet, so
I guess it will probably wait for 2.6.28 now...

- Multiple CQ event vector support. No one has convinced me that we
know how ULPs or userspace apps should decide which vector to use,
and hence little progress has been made since we deferred this
during the 2.6.23 merge window.

Here are all the patches I already have in my for-2.6.27 branch:

Christophe Jaillet (1):
RDMA/nes: Remove unnecessary memset()

Dotan Barak (1):
RDMA: Improve include file coding style

Eli Cohen (12):
IB/mlx4: Optimize QP stamping
IPoIB: Copy small received SKBs in connected mode
IB/mlx4: Configure QPs' max message size based on real device capability
IB/mlx4: Pass congestion management class MADs to the HCA
IPoIB: Remove unused IPOIB_MCAST_STARTED code
IPoIB: Remove priv->mcast_mutex
IPoIB: Only set Q_Key once: after joining broadcast group
IPoIB: Use rtnl lock/unlock when changing device flags
IPoIB: Use dev_set_mtu() to change mtu
IPoIB/cm: Reduce connected mode TX object size
IPoIB: Double default RX/TX ring sizes
IB/mlx4: Use kzalloc() for new QPs so flags are initialized to 0

Joachim Fenkes (2):
IB/ehca: Reject receive work requests if QP is in RESET state
IB/ehca: Make device table externally visible

Jon Mason (1):
RDMA/cxgb3: Propagate HW page size capabilities

Moni Shoua (2):
IB/sa: Fail requests made while creating new SM AH
IPoIB: Refresh paths instead of flushing them on SM change events

Or Gerlitz (2):
RDMA/addr: Keep pointer to netdevice in struct rdma_dev_addr
RDMA/cma: Simplify locking needed for serialization of callbacks

Ralph Campbell (2):
IB/core: Reset to error QP state transition is not allowed
IB/ipath: Use IEEE OUI for vendor_id reported by ibv_query_device()

Robert P. J. Day (1):
IB/ipath: Simplify code using ARRAY_SIZE() macro

Roland Dreier (13):
IB/srp: Remove use of cached P_Key/GID queries
RDMA: Remove subversion $Id tags
IB/mthca: Remove extra code for RESET->ERR QP state transition
IB/mlx4: Remove extra code for RESET->ERR QP state transition
RDMA/cxgb3: Remove write-only iwch_rnic_attributes fields
RDMA/cma: Add missing newlines to printk()s
IPoIB/cm: Fix racy use of receive WR/SGL in ipoib_cm_post_receive_nonsrq()
RDMA/nes: Encapsulate logic nes_put_cqp_request()
RDMA/nes: Get rid of ring_doorbell parameter of nes_post_cqp_request()
IPoIB: Get rid of ipoib_mcast_detach() wrapper
IB/mthca: Remove "stop" flag for catastrophic error polling timer
IB/mthca: Use round_jiffies() for catastrophic error polling timer
IB/mthca: Fix check of max_send_sge for special QPs

Ron Livne (3):
IB/core: Add support for multicast loopback blocking
IB/mlx4: Add support for blocking multicast loopback packets
IPoIB: Use multicast loopback blocking if available

Sean Hefty (1):
RDMA: Fix license text

Stefan Roscher (1):
IB/ehca: In case of lost interrupts, trigger EOI to reenable interrupts

Steve Wise (8):
RDMA/core: Add memory management extensions support
RDMA/cxgb3: MEM_MGT_EXTENSIONS support
RDMA/cxgb3: Fix up some ib_device_attr fields
RDMA/core: Add iWARP protocol statistics attributes in sysfs
RDMA/cxgb3: Add support for protocol statistics
RDMA/cxgb3: Set rkey field for new memory windows in iwch_alloc_mw()
RDMA/core: Add local DMA L_Key support
RDMA/cxgb3: Fixes for zero STag

Vladimir Sokolovsky (2):
IPoIB: add LRO support
mlx4_core: Use MOD_STAT_CFG command to get minimal page size


2008-07-14 13:54:16

by Eli Cohen

[permalink] [raw]
Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.27

On Sun, Jul 13, 2008 at 10:16:08PM -0700, Roland Dreier wrote:
>
> - I merged a decent amount of stuff for IPoIB, including LRO,
> "loopback blocking," and better handling of fabric events, and I
> don't think I have anything pending.

There is this patch that I did not receive your response for. We think
it's reasonable to do this - what do you think?

http://lists.openfabrics.org/pipermail/general/2008-July/052460.html

2008-07-14 16:43:18

by Tziporet Koren

[permalink] [raw]
Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.27

Roland Dreier wrote:
> Core:
>
> - I'm waiting to merge the RDMA_CM_EVENT_ADDR_CHANGE changes that
> depend on core networking changes until such changes are upstream.
> Or, please remind me when that happens.
>
> - Jack's XRC patch set. I think we're getting closer to converging
> here, and I hope to get this merged but we're getting down to the
> wire, so we'll see.
>
I hope we can get those in.
> HW specific:
>
> - Yevgeny's mlx4 changes. We'll see how much time is left after I
> get done with XRC (which is before this on my list) but to be
> honest I'm not sure how mergeable a lot of this is without the
> mlx4_en patches that actually use it.
>
We just posted the mlx4_en patches, and we need to coordinate merging
them together.
> - I've been working on memory management extensions support for mlx4,
> but I'm not sure if it will be ready in time. Firmware for this
> may not be released for a while so it ain't urgent anyway.
>
We are testing the patches and we already have FW that enables them.
I agree it's not urgent but it would be good to have it, so ULPs that are
interested can be tested over IB too.
> Here are a few topics that I believe will not be ready in time for the
> 2.6.27 window and will need to wait for 2.6.28 at least:
>
> - Multiple CQ event vector support. No one has convinced me that we
> know how ULPs or userspace apps should decide which vector to use,
> and hence little progress has been made since we deferred this
> during the 2.6.23 merge window.
>
We should make progress on this one even if we missed 2.6.27; in
particular we need it for RSS, and I know RDS can also gain from it.

Tziporet

2008-07-15 06:44:59

by jackm

[permalink] [raw]
Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.27

On Monday 14 July 2008 08:16, Roland Dreier wrote:
> HW specific:
>
> - Yevgeny's mlx4 changes. We'll see how much time is left after I
> get done with XRC (which is before this on my list) but to be
> honest I'm not sure how mergeable a lot of this is without the
> mlx4_en patches that actually use it.
>
> - I've been working on memory management extensions support for mlx4,
> but I'm not sure if it will be ready in time. Firmware for this
> may not be released for a while so it ain't urgent anyway.
>
Roland, what about the fw diagnostic patch for the mlx4 driver?
http://lists.openfabrics.org/pipermail/general/2008-June/051655.html

Can you put this in 2.6.27, too?

- Jack

2008-07-15 06:52:25

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.27

> There is this patch that I did not receive your response for. We think
> it's reasonable to do this - what do you think?
>
> http://lists.openfabrics.org/pipermail/general/2008-July/052460.html

I'm ambivalent. Seems like a minor usability improvement in some cases,
but on the other hand it seems a little strange to change the MTU behind
the user's back. Maybe I'll stick it in -- I need to think about it.

2008-07-15 06:52:43

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.27

> Roland, what about the fw diagnostic patch for the mlx4 driver?
> http://lists.openfabrics.org/pipermail/general/2008-June/051655.html

Oh yeah, I'll reply on that thread.

2008-07-15 19:11:39

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.27

> > - I've been working on memory management extensions support for mlx4,
> > but I'm not sure if it will be ready in time. Firmware for this
> > may not be released for a while so it ain't urgent anyway.

> We are testing the patches and we already have FW that enables them.
> I agree it's not urgent but it would be good to have it, so ULPs that
> are interested can be tested over IB too.

By the way, for those outside of Mellanox, here is my patch that adds mem
mgt extensions and local lkey support to mlx4. Comments appreciated...

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 299f208..0b191a4 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -637,6 +637,7 @@ repoll:
case MLX4_OPCODE_SEND_IMM:
wc->wc_flags |= IB_WC_WITH_IMM;
case MLX4_OPCODE_SEND:
+ case MLX4_OPCODE_SEND_INVAL:
wc->opcode = IB_WC_SEND;
break;
case MLX4_OPCODE_RDMA_READ:
@@ -657,6 +658,12 @@ repoll:
case MLX4_OPCODE_LSO:
wc->opcode = IB_WC_LSO;
break;
+ case MLX4_OPCODE_FMR:
+ wc->opcode = IB_WC_FAST_REG_MR;
+ break;
+ case MLX4_OPCODE_LOCAL_INVAL:
+ wc->opcode = IB_WC_LOCAL_INV;
+ break;
}
} else {
wc->byte_len = be32_to_cpu(cqe->byte_cnt);
@@ -667,6 +674,11 @@ repoll:
wc->wc_flags = IB_WC_WITH_IMM;
wc->ex.imm_data = cqe->immed_rss_invalid;
break;
+ case MLX4_RECV_OPCODE_SEND_INVAL:
+ wc->opcode = IB_WC_RECV;
+ wc->wc_flags = IB_WC_WITH_INVALIDATE;
+ wc->ex.invalidate_rkey = be32_to_cpu(cqe->immed_rss_invalid);
+ break;
case MLX4_RECV_OPCODE_SEND:
wc->opcode = IB_WC_RECV;
wc->wc_flags = 0;
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index bcf5064..38d6907 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -104,6 +104,12 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
props->device_cap_flags |= IB_DEVICE_UD_IP_CSUM;
if (dev->dev->caps.max_gso_sz)
props->device_cap_flags |= IB_DEVICE_UD_TSO;
+ if (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_RESERVED_LKEY)
+ props->device_cap_flags |= IB_DEVICE_LOCAL_DMA_LKEY;
+ if ((dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_LOCAL_INV) &&
+ (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_REMOTE_INV) &&
+ (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_FAST_REG_WR))
+ props->device_cap_flags |= IB_DEVICE_MEM_MGT_EXTENSIONS;

props->vendor_id = be32_to_cpup((__be32 *) (out_mad->data + 36)) &
0xffffff;
@@ -127,6 +133,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
props->max_srq = dev->dev->caps.num_srqs - dev->dev->caps.reserved_srqs;
props->max_srq_wr = dev->dev->caps.max_srq_wqes - 1;
props->max_srq_sge = dev->dev->caps.max_srq_sge;
+ props->max_fast_reg_page_list_len = PAGE_SIZE / sizeof (u64);
props->local_ca_ack_delay = dev->dev->caps.local_ca_ack_delay;
props->atomic_cap = dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_ATOMIC ?
IB_ATOMIC_HCA : IB_ATOMIC_NONE;
@@ -565,6 +572,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
strlcpy(ibdev->ib_dev.name, "mlx4_%d", IB_DEVICE_NAME_MAX);
ibdev->ib_dev.owner = THIS_MODULE;
ibdev->ib_dev.node_type = RDMA_NODE_IB_CA;
+ ibdev->ib_dev.local_dma_lkey = dev->caps.reserved_lkey;
ibdev->ib_dev.phys_port_cnt = dev->caps.num_ports;
ibdev->ib_dev.num_comp_vectors = 1;
ibdev->ib_dev.dma_device = &dev->pdev->dev;
@@ -627,6 +635,9 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
ibdev->ib_dev.get_dma_mr = mlx4_ib_get_dma_mr;
ibdev->ib_dev.reg_user_mr = mlx4_ib_reg_user_mr;
ibdev->ib_dev.dereg_mr = mlx4_ib_dereg_mr;
+ ibdev->ib_dev.alloc_fast_reg_mr = mlx4_ib_alloc_fast_reg_mr;
+ ibdev->ib_dev.alloc_fast_reg_page_list = mlx4_ib_alloc_fast_reg_page_list;
+ ibdev->ib_dev.free_fast_reg_page_list = mlx4_ib_free_fast_reg_page_list;
ibdev->ib_dev.attach_mcast = mlx4_ib_mcg_attach;
ibdev->ib_dev.detach_mcast = mlx4_ib_mcg_detach;
ibdev->ib_dev.process_mad = mlx4_ib_process_mad;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index c4cf5b6..d26a913 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -83,6 +83,11 @@ struct mlx4_ib_mr {
struct ib_umem *umem;
};

+struct mlx4_ib_fast_reg_page_list {
+ struct ib_fast_reg_page_list ibfrpl;
+ dma_addr_t map;
+};
+
struct mlx4_ib_fmr {
struct ib_fmr ibfmr;
struct mlx4_fmr mfmr;
@@ -199,6 +204,11 @@ static inline struct mlx4_ib_mr *to_mmr(struct ib_mr *ibmr)
return container_of(ibmr, struct mlx4_ib_mr, ibmr);
}

+static inline struct mlx4_ib_fast_reg_page_list *to_mfrpl(struct ib_fast_reg_page_list *ibfrpl)
+{
+ return container_of(ibfrpl, struct mlx4_ib_fast_reg_page_list, ibfrpl);
+}
+
static inline struct mlx4_ib_fmr *to_mfmr(struct ib_fmr *ibfmr)
{
return container_of(ibfmr, struct mlx4_ib_fmr, ibfmr);
@@ -239,6 +249,11 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
u64 virt_addr, int access_flags,
struct ib_udata *udata);
int mlx4_ib_dereg_mr(struct ib_mr *mr);
+struct ib_mr *mlx4_ib_alloc_fast_reg_mr(struct ib_pd *pd,
+ int max_page_list_len);
+struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct ib_device *ibdev,
+ int page_list_len);
+void mlx4_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);

int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);
int mlx4_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata);
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 68e9248..db2086f 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -183,6 +183,76 @@ int mlx4_ib_dereg_mr(struct ib_mr *ibmr)
return 0;
}

+struct ib_mr *mlx4_ib_alloc_fast_reg_mr(struct ib_pd *pd,
+ int max_page_list_len)
+{
+ struct mlx4_ib_dev *dev = to_mdev(pd->device);
+ struct mlx4_ib_mr *mr;
+ int err;
+
+ mr = kmalloc(sizeof *mr, GFP_KERNEL);
+ if (!mr)
+ return ERR_PTR(-ENOMEM);
+
+ err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, 0, 0, 0,
+ max_page_list_len, 0, &mr->mmr);
+ if (err)
+ goto err_free;
+
+ err = mlx4_mr_enable(dev->dev, &mr->mmr);
+ if (err)
+ goto err_mr;
+
+ return &mr->ibmr;
+
+err_mr:
+ mlx4_mr_free(dev->dev, &mr->mmr);
+
+err_free:
+ kfree(mr);
+ return ERR_PTR(err);
+}
+
+struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct ib_device *ibdev,
+ int page_list_len)
+{
+ struct mlx4_ib_dev *dev = to_mdev(ibdev);
+ struct mlx4_ib_fast_reg_page_list *mfrpl;
+ int size = page_list_len * sizeof (u64);
+
+ if (size > PAGE_SIZE)
+ return ERR_PTR(-EINVAL);
+
+ mfrpl = kmalloc(sizeof *mfrpl, GFP_KERNEL);
+ if (!mfrpl)
+ return ERR_PTR(-ENOMEM);
+
+ mfrpl->ibfrpl.page_list = dma_alloc_coherent(&dev->dev->pdev->dev,
+ size, &mfrpl->map,
+ GFP_KERNEL);
+ if (!mfrpl->ibfrpl.page_list)
+ goto err_free;
+
+ WARN_ON(mfrpl->map & 0x3f);
+
+ return &mfrpl->ibfrpl;
+
+err_free:
+ kfree(mfrpl);
+ return ERR_PTR(-ENOMEM);
+}
+
+void mlx4_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list)
+{
+ struct mlx4_ib_dev *dev = to_mdev(page_list->device);
+ struct mlx4_ib_fast_reg_page_list *mfrpl = to_mfrpl(page_list);
+ int size = page_list->max_page_list_len * sizeof (u64);
+
+ dma_free_coherent(&dev->dev->pdev->dev, size, page_list->page_list,
+ mfrpl->map);
+ kfree(mfrpl);
+}
+
struct ib_fmr *mlx4_ib_fmr_alloc(struct ib_pd *pd, int acc,
struct ib_fmr_attr *fmr_attr)
{
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 91590e7..47ec68d 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -78,6 +78,9 @@ static const __be32 mlx4_ib_opcode[] = {
[IB_WR_RDMA_READ] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_READ),
[IB_WR_ATOMIC_CMP_AND_SWP] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_CS),
[IB_WR_ATOMIC_FETCH_AND_ADD] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_FA),
+ [IB_WR_SEND_WITH_INV] = __constant_cpu_to_be32(MLX4_OPCODE_SEND_INVAL),
+ [IB_WR_LOCAL_INV] = __constant_cpu_to_be32(MLX4_OPCODE_LOCAL_INVAL),
+ [IB_WR_FAST_REG_MR] = __constant_cpu_to_be32(MLX4_OPCODE_FMR),
};

static struct mlx4_ib_sqp *to_msqp(struct mlx4_ib_qp *mqp)
@@ -987,6 +990,10 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pdn);
context->params1 = cpu_to_be32(MLX4_IB_ACK_REQ_FREQ << 28);

+ /* Set "fast registration enabled" for all kernel QPs */
+ if (!qp->ibqp.uobject)
+ context->params1 |= cpu_to_be32(1 << 11);
+
if (attr_mask & IB_QP_RNR_RETRY) {
context->params1 |= cpu_to_be32(attr->rnr_retry << 13);
optpar |= MLX4_QP_OPTPAR_RNR_RETRY;
@@ -1333,6 +1340,38 @@ static int mlx4_wq_overflow(struct mlx4_ib_wq *wq, int nreq, struct ib_cq *ib_cq
return cur + nreq >= wq->max_post;
}

+static __be32 convert_access(int acc)
+{
+ return (acc & IB_ACCESS_REMOTE_ATOMIC ? cpu_to_be32(MLX4_WQE_FMR_PERM_ATOMIC) : 0) |
+ (acc & IB_ACCESS_REMOTE_WRITE ? cpu_to_be32(MLX4_WQE_FMR_PERM_REMOTE_WRITE) : 0) |
+ (acc & IB_ACCESS_REMOTE_READ ? cpu_to_be32(MLX4_WQE_FMR_PERM_REMOTE_READ) : 0) |
+ (acc & IB_ACCESS_LOCAL_WRITE ? cpu_to_be32(MLX4_WQE_FMR_PERM_LOCAL_WRITE) : 0) |
+ cpu_to_be32(MLX4_WQE_FMR_PERM_LOCAL_READ);
+}
+
+static void set_fmr_seg(struct mlx4_wqe_fmr_seg *fseg, struct ib_send_wr *wr)
+{
+ struct mlx4_ib_fast_reg_page_list *mfrpl = to_mfrpl(wr->wr.fast_reg.page_list);
+
+ fseg->flags = convert_access(wr->wr.fast_reg.access_flags);
+ fseg->mem_key = cpu_to_be32(wr->wr.fast_reg.rkey);
+ fseg->buf_list = cpu_to_be64(mfrpl->map);
+ fseg->start_addr = cpu_to_be64(wr->wr.fast_reg.iova_start);
+ fseg->reg_len = cpu_to_be64(wr->wr.fast_reg.length);
+ fseg->offset = 0; /* XXX -- is this just for ZBVA? */
+ fseg->page_size = cpu_to_be32(wr->wr.fast_reg.page_shift);
+ fseg->reserved[0] = 0;
+ fseg->reserved[1] = 0;
+}
+
+static void set_local_inv_seg(struct mlx4_wqe_local_inval_seg *iseg, u32 rkey)
+{
+ iseg->flags = 0;
+ iseg->mem_key = cpu_to_be32(rkey);
+ iseg->guest_id = 0;
+ iseg->pa = 0;
+}
+
static __always_inline void set_raddr_seg(struct mlx4_wqe_raddr_seg *rseg,
u64 remote_addr, u32 rkey)
{
@@ -1434,6 +1473,21 @@ static int build_lso_seg(struct mlx4_lso_seg *wqe, struct ib_send_wr *wr,
return 0;
}

+static __be32 send_ieth(struct ib_send_wr *wr)
+{
+ switch (wr->opcode) {
+ case IB_WR_SEND_WITH_IMM:
+ case IB_WR_RDMA_WRITE_WITH_IMM:
+ return wr->ex.imm_data;
+
+ case IB_WR_SEND_WITH_INV:
+ return cpu_to_be32(wr->ex.invalidate_rkey);
+
+ default:
+ return 0;
+ }
+}
+
int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
struct ib_send_wr **bad_wr)
{
@@ -1480,11 +1534,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
MLX4_WQE_CTRL_TCP_UDP_CSUM) : 0) |
qp->sq_signal_bits;

- if (wr->opcode == IB_WR_SEND_WITH_IMM ||
- wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
- ctrl->imm = wr->ex.imm_data;
- else
- ctrl->imm = 0;
+ ctrl->imm = send_ieth(wr);

wqe += sizeof *ctrl;
size = sizeof *ctrl / 16;
@@ -1516,6 +1566,18 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
size += sizeof (struct mlx4_wqe_raddr_seg) / 16;
break;

+ case IB_WR_LOCAL_INV:
+ set_local_inv_seg(wqe, wr->ex.invalidate_rkey);
+ wqe += sizeof (struct mlx4_wqe_local_inval_seg);
+ size += sizeof (struct mlx4_wqe_local_inval_seg) / 16;
+ break;
+
+ case IB_WR_FAST_REG_MR:
+ set_fmr_seg(wqe, wr);
+ wqe += sizeof (struct mlx4_wqe_fmr_seg);
+ size += sizeof (struct mlx4_wqe_fmr_seg) / 16;
+ break;
+
default:
/* No extra segments required for sends */
break;
diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c
index 2b5006b..1180fca 100644
--- a/drivers/net/mlx4/fw.c
+++ b/drivers/net/mlx4/fw.c
@@ -198,7 +198,7 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
#define QUERY_DEV_CAP_C_MPT_ENTRY_SZ_OFFSET 0x8e
#define QUERY_DEV_CAP_MTT_ENTRY_SZ_OFFSET 0x90
#define QUERY_DEV_CAP_D_MPT_ENTRY_SZ_OFFSET 0x92
-#define QUERY_DEV_CAP_BMME_FLAGS_OFFSET 0x97
+#define QUERY_DEV_CAP_BMME_FLAGS_OFFSET 0x94
#define QUERY_DEV_CAP_RSVD_LKEY_OFFSET 0x98
#define QUERY_DEV_CAP_MAX_ICM_SZ_OFFSET 0xa0

@@ -373,12 +373,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
}
}

- if (dev_cap->bmme_flags & 1)
- mlx4_dbg(dev, "Base MM extensions: yes "
- "(flags %d, rsvd L_Key %08x)\n",
- dev_cap->bmme_flags, dev_cap->reserved_lkey);
- else
- mlx4_dbg(dev, "Base MM extensions: no\n");
+ mlx4_dbg(dev, "Base MM extensions: flags %08x, rsvd L_Key %08x\n",
+ dev_cap->bmme_flags, dev_cap->reserved_lkey);

/*
* Each UAR has 4 EQ doorbells; so if a UAR is reserved, then
diff --git a/drivers/net/mlx4/fw.h b/drivers/net/mlx4/fw.h
index a0e046c..fbf0e22 100644
--- a/drivers/net/mlx4/fw.h
+++ b/drivers/net/mlx4/fw.h
@@ -98,7 +98,7 @@ struct mlx4_dev_cap {
int cmpt_entry_sz;
int mtt_entry_sz;
int resize_srq;
- u8 bmme_flags;
+ u32 bmme_flags;
u32 reserved_lkey;
u64 max_icm_sz;
int max_gso_sz;
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index d373601..8e1d24c 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -158,6 +158,8 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
dev->caps.max_msg_sz = dev_cap->max_msg_sz;
dev->caps.page_size_cap = ~(u32) (dev_cap->min_page_sz - 1);
dev->caps.flags = dev_cap->flags;
+ dev->caps.bmme_flags = dev_cap->bmme_flags;
+ dev->caps.reserved_lkey = dev_cap->reserved_lkey;
dev->caps.stat_rate_support = dev_cap->stat_rate_support;
dev->caps.max_gso_sz = dev_cap->max_gso_sz;

diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c
index 03a9abc..66c6f84 100644
--- a/drivers/net/mlx4/mr.c
+++ b/drivers/net/mlx4/mr.c
@@ -47,7 +47,7 @@ struct mlx4_mpt_entry {
__be32 flags;
__be32 qpn;
__be32 key;
- __be32 pd;
+ __be32 pd_flags;
__be64 start;
__be64 length;
__be32 lkey;
@@ -61,11 +61,15 @@ struct mlx4_mpt_entry {
} __attribute__((packed));

#define MLX4_MPT_FLAG_SW_OWNS (0xfUL << 28)
+#define MLX4_MPT_FLAG_FREE (0x3UL << 28)
#define MLX4_MPT_FLAG_MIO (1 << 17)
#define MLX4_MPT_FLAG_BIND_ENABLE (1 << 15)
#define MLX4_MPT_FLAG_PHYSICAL (1 << 9)
#define MLX4_MPT_FLAG_REGION (1 << 8)

+#define MLX4_MPT_PD_FLAG_FAST_REG (1 << 26)
+#define MLX4_MPT_PD_FLAG_EN_INV (3 << 24)
+
#define MLX4_MTT_FLAG_PRESENT 1

#define MLX4_MPT_STATUS_SW 0xF0
@@ -314,21 +318,30 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr)

memset(mpt_entry, 0, sizeof *mpt_entry);

- mpt_entry->flags = cpu_to_be32(MLX4_MPT_FLAG_SW_OWNS |
- MLX4_MPT_FLAG_MIO |
+ mpt_entry->flags = cpu_to_be32(MLX4_MPT_FLAG_MIO |
MLX4_MPT_FLAG_REGION |
mr->access);

mpt_entry->key = cpu_to_be32(key_to_hw_index(mr->key));
- mpt_entry->pd = cpu_to_be32(mr->pd);
+ mpt_entry->pd_flags = cpu_to_be32(mr->pd | MLX4_MPT_PD_FLAG_EN_INV);
mpt_entry->start = cpu_to_be64(mr->iova);
mpt_entry->length = cpu_to_be64(mr->size);
mpt_entry->entity_size = cpu_to_be32(mr->mtt.page_shift);
+
if (mr->mtt.order < 0) {
mpt_entry->flags |= cpu_to_be32(MLX4_MPT_FLAG_PHYSICAL);
mpt_entry->mtt_seg = 0;
- } else
+ } else {
mpt_entry->mtt_seg = cpu_to_be64(mlx4_mtt_addr(dev, &mr->mtt));
+ }
+
+ if (mr->mtt.order >= 0 && mr->mtt.page_shift == 0) {
+ /* fast register MR in free state */
+ mpt_entry->flags |= cpu_to_be32(MLX4_MPT_FLAG_FREE);
+ mpt_entry->pd_flags |= cpu_to_be32(MLX4_MPT_PD_FLAG_FAST_REG);
+ } else {
+ mpt_entry->flags |= cpu_to_be32(MLX4_MPT_FLAG_SW_OWNS);
+ }

err = mlx4_SW2HW_MPT(dev, mailbox,
key_to_hw_index(mr->key) & (dev->caps.num_mpts - 1));
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 81b3dd5..655ea0d 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -68,6 +68,14 @@ enum {
MLX4_DEV_CAP_FLAG_UD_MCAST = 1 << 21
};

+enum {
+ MLX4_BMME_FLAG_LOCAL_INV = 1 << 6,
+ MLX4_BMME_FLAG_REMOTE_INV = 1 << 7,
+ MLX4_BMME_FLAG_TYPE_2_WIN = 1 << 9,
+ MLX4_BMME_FLAG_RESERVED_LKEY = 1 << 10,
+ MLX4_BMME_FLAG_FAST_REG_WR = 1 << 11,
+};
+
enum mlx4_event {
MLX4_EVENT_TYPE_COMP = 0x00,
MLX4_EVENT_TYPE_PATH_MIG = 0x01,
@@ -184,6 +192,8 @@ struct mlx4_caps {
u32 max_msg_sz;
u32 page_size_cap;
u32 flags;
+ u32 bmme_flags;
+ u32 reserved_lkey;
u16 stat_rate_support;
u8 port_width_cap[MLX4_MAX_PORTS + 1];
int max_gso_sz;
diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h
index 7f128b2..5cb34fb 100644
--- a/include/linux/mlx4/qp.h
+++ b/include/linux/mlx4/qp.h
@@ -233,6 +233,14 @@ struct mlx4_wqe_bind_seg {
__be64 length;
};

+enum {
+ MLX4_WQE_FMR_PERM_LOCAL_READ = 1 << 27,
+ MLX4_WQE_FMR_PERM_LOCAL_WRITE = 1 << 28,
+ MLX4_WQE_FMR_PERM_REMOTE_READ = 1 << 29,
+ MLX4_WQE_FMR_PERM_REMOTE_WRITE = 1 << 30,
+ MLX4_WQE_FMR_PERM_ATOMIC = 1 << 31
+};
+
struct mlx4_wqe_fmr_seg {
__be32 flags;
__be32 mem_key;
@@ -255,11 +263,9 @@ struct mlx4_wqe_fmr_ext_seg {
};

struct mlx4_wqe_local_inval_seg {
- u8 flags;
- u8 reserved1[3];
+ __be32 flags;
__be32 mem_key;
- u8 reserved2[3];
- u8 guest_id;
+ __be32 guest_id;
__be64 pa;
};
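
The capability gating done in the mlx4_ib_query_device() hunk of the
patch above can be modeled in a few lines of standalone C. This is only
a sketch: the BMME bit positions are copied from the MLX4_BMME_FLAG_*
enum in the patch, but the SKETCH_CAP_* values are made-up stand-ins
for the real IB_DEVICE_* flags, and sketch_caps_from_bmme() is a
hypothetical helper, not a function in the driver.

```c
#include <stdint.h>

/* Bit positions copied from the MLX4_BMME_FLAG_* enum in the patch. */
enum {
    SKETCH_BMME_LOCAL_INV     = 1 << 6,
    SKETCH_BMME_REMOTE_INV    = 1 << 7,
    SKETCH_BMME_TYPE_2_WIN    = 1 << 9,
    SKETCH_BMME_RESERVED_LKEY = 1 << 10,
    SKETCH_BMME_FAST_REG_WR   = 1 << 11,
};

/* Hypothetical stand-ins for IB_DEVICE_* bits; the values are invented. */
enum {
    SKETCH_CAP_LOCAL_DMA_LKEY     = 1 << 0,
    SKETCH_CAP_MEM_MGT_EXTENSIONS = 1 << 1,
};

/*
 * Translate firmware BMME flags into device capability bits.
 * The reserved L_Key is reported on its own; the memory management
 * extensions bit is reported only when local invalidate, remote
 * invalidate and fast-register work requests are all available.
 */
static inline uint32_t sketch_caps_from_bmme(uint32_t bmme_flags)
{
    const uint32_t need = SKETCH_BMME_LOCAL_INV |
                          SKETCH_BMME_REMOTE_INV |
                          SKETCH_BMME_FAST_REG_WR;
    uint32_t caps = 0;

    if (bmme_flags & SKETCH_BMME_RESERVED_LKEY)
        caps |= SKETCH_CAP_LOCAL_DMA_LKEY;
    if ((bmme_flags & need) == need)
        caps |= SKETCH_CAP_MEM_MGT_EXTENSIONS;

    return caps;
}
```

Note that the type 2 window flag is defined but plays no part in the
decision, matching the patch, where MLX4_BMME_FLAG_TYPE_2_WIN is not
tested in mlx4_ib_query_device().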