2014-02-21 21:53:31

by Jiri Kosina

Subject: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

This was originally a patch from Matthew Finlay <[email protected]> that
addressed a problem whereby NFS writes would enter uninterruptible sleep
forever. The issue happened when using NFS over IPoIB. This is not a
recommended configuration as RDMA is preferred but it is still a valid
configuration and is important to have in situations where the NFS server
does not support RDMA. The problem encountered was described as follows:

It's not memory reclamation that is the problem as such. There is
an indirect dependency between network filesystems writing back
pages and ipoib_cm_tx_init() due to how a kworker is used. Page
reclaim cannot make forward progress until ipoib_cm_tx_init()
succeeds and it is stuck in page reclaim itself waiting for network
transmission. Ordinarily this situation may be avoided by having
the caller use GFP_NOFS but ipoib_cm_tx_init() does not have
that information.

The patch has been ported to newer kernels by Mel Gorman and later ported
further by Jiri Kosina.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
---


I'd like to get confirmation from Matt that he's fine with having his
Signed-off-by: on this, but he's unfortunately not responding to any of my
queries.

Any ideas for a cleaner fix are more than welcome. We've been carrying this
patch in the SUSE kernel tree to fix a real reported issue for quite some time
already.
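
As a reading aid for the diff below: nearly every hunk repeats the same
decision, namely "if the QP was created with IB_QP_CREATE_USE_GFP_NOFS,
allocate with GFP_NOFS, otherwise with GFP_KERNEL". The following is only a
minimal user-space sketch of that decision. The numeric values are stand-ins
chosen for illustration (they are not the kernel's definitions), and the two
helper functions are hypothetical and do not appear in the patch.

```c
/* Stand-in types and values for illustration only. The real gfp_t,
 * GFP_KERNEL and GFP_NOFS come from kernel headers and differ from
 * these. */
typedef unsigned int gfp_t;
#define GFP_KERNEL ((gfp_t)0x1)
#define GFP_NOFS   ((gfp_t)0x2)

/* Assumed bit position; the patch as posted does not show the value. */
#define IB_QP_CREATE_USE_GFP_NOFS (1u << 3)

/* Hypothetical helper: map QP creation flags to an allocation mask.
 * The patch open-codes this as
 *     create_flags & IB_QP_CREATE_USE_GFP_NOFS ? GFP_NOFS : GFP_KERNEL
 * which parses the same way, because '&' binds tighter than '?:'. */
static gfp_t qp_gfp_mask(unsigned int create_flags)
{
	return (create_flags & IB_QP_CREATE_USE_GFP_NOFS) ?
		GFP_NOFS : GFP_KERNEL;
}

/* Hypothetical helper mirroring the 'int use_gfp_nofs' parameter that
 * the patch adds to the mlx4 core functions: any non-zero value,
 * including the raw masked flag bit, selects GFP_NOFS. */
static gfp_t gfp_from_int(int use_gfp_nofs)
{
	return use_gfp_nofs ? GFP_NOFS : GFP_KERNEL;
}
```

Callers that pass a literal 0 (the CQ, SRQ and Ethernet paths in the diff)
therefore keep their existing GFP_KERNEL behaviour.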


drivers/infiniband/hw/mlx4/cq.c | 6 ++--
drivers/infiniband/hw/mlx4/qp.c | 36 +++++++++++++++-----
drivers/infiniband/hw/mlx4/srq.c | 7 ++--
drivers/infiniband/ulp/ipoib/ipoib_cm.c | 16 ++++++++-
drivers/net/ethernet/mellanox/mlx4/alloc.c | 35 ++++++++++++-------
drivers/net/ethernet/mellanox/mlx4/cq.c | 4 +-
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 6 ++--
drivers/net/ethernet/mellanox/mlx4/en_tx.c | 2 +-
drivers/net/ethernet/mellanox/mlx4/icm.c | 10 ++++--
drivers/net/ethernet/mellanox/mlx4/icm.h | 3 +-
drivers/net/ethernet/mellanox/mlx4/mlx4.h | 4 +-
drivers/net/ethernet/mellanox/mlx4/mr.c | 17 +++++----
drivers/net/ethernet/mellanox/mlx4/qp.c | 21 ++++++-----
.../net/ethernet/mellanox/mlx4/resource_tracker.c | 4 +-
drivers/net/ethernet/mellanox/mlx4/srq.c | 4 +-
include/linux/mlx4/device.h | 10 +++--
16 files changed, 117 insertions(+), 68 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index cc40f08..661185a 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -102,7 +102,7 @@ static int mlx4_ib_alloc_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *
int err;

err = mlx4_buf_alloc(dev->dev, nent * dev->dev->caps.cqe_size,
- PAGE_SIZE * 2, &buf->buf);
+ PAGE_SIZE * 2, &buf->buf, 0);

if (err)
goto out;
@@ -113,7 +113,7 @@ static int mlx4_ib_alloc_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *
if (err)
goto err_buf;

- err = mlx4_buf_write_mtt(dev->dev, &buf->mtt, &buf->buf);
+ err = mlx4_buf_write_mtt(dev->dev, &buf->mtt, &buf->buf, 0);
if (err)
goto err_mtt;

@@ -209,7 +209,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector

uar = &to_mucontext(context)->uar;
} else {
- err = mlx4_db_alloc(dev->dev, &cq->db, 1);
+ err = mlx4_db_alloc(dev->dev, &cq->db, 1, 0);
if (err)
goto err_cq;

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index d8f4d1f..1379ee7 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -744,14 +744,18 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
goto err;

if (qp_has_rq(init_attr)) {
- err = mlx4_db_alloc(dev->dev, &qp->db, 0);
+ err = mlx4_db_alloc(dev->dev, &qp->db, 0,
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS);
if (err)
goto err;

*qp->db.db = 0;
}

- if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf)) {
+ if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf,
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS)) {
err = -ENOMEM;
goto err_db;
}
@@ -761,12 +765,20 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
if (err)
goto err_buf;

- err = mlx4_buf_write_mtt(dev->dev, &qp->mtt, &qp->buf);
+ err = mlx4_buf_write_mtt(dev->dev, &qp->mtt, &qp->buf,
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS);
if (err)
goto err_mtt;

- qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof (u64), GFP_KERNEL);
- qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof (u64), GFP_KERNEL);
+ qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof (u64),
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS ?
+ GFP_NOFS : GFP_KERNEL);
+ qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof (u64),
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS ?
+ GFP_NOFS : GFP_KERNEL);

if (!qp->sq.wrid || !qp->rq.wrid) {
err = -ENOMEM;
@@ -797,7 +809,8 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
goto err_proxy;
}

- err = mlx4_qp_alloc(dev->dev, qpn, &qp->mqp);
+ err = mlx4_qp_alloc(dev->dev, qpn, &qp->mqp,
+ init_attr->create_flags & IB_QP_CREATE_USE_GFP_NOFS);
if (err)
goto err_qpn;

@@ -1024,10 +1037,12 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK |
MLX4_IB_SRIOV_TUNNEL_QP |
MLX4_IB_SRIOV_SQP |
- MLX4_IB_QP_NETIF))
+ MLX4_IB_QP_NETIF |
+ IB_QP_CREATE_USE_GFP_NOFS))
return ERR_PTR(-EINVAL);

- if (init_attr->create_flags & IB_QP_CREATE_NETIF_QP) {
+ if (init_attr->create_flags & IB_QP_CREATE_NETIF_QP &&
+ init_attr->create_flags & ~IB_QP_CREATE_USE_GFP_NOFS) {
if (init_attr->qp_type != IB_QPT_UD)
return ERR_PTR(-EINVAL);
}
@@ -1054,7 +1069,10 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
case IB_QPT_RC:
case IB_QPT_UC:
case IB_QPT_RAW_PACKET:
- qp = kzalloc(sizeof *qp, GFP_KERNEL);
+ qp = kzalloc(sizeof *qp,
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS ?
+ GFP_NOFS : GFP_KERNEL);
if (!qp)
return ERR_PTR(-ENOMEM);
/* fall through */
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 60c5fb0..17552c0 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -134,13 +134,14 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
if (err)
goto err_mtt;
} else {
- err = mlx4_db_alloc(dev->dev, &srq->db, 0);
+ err = mlx4_db_alloc(dev->dev, &srq->db, 0, 0);
if (err)
goto err_srq;

*srq->db.db = 0;

- if (mlx4_buf_alloc(dev->dev, buf_size, PAGE_SIZE * 2, &srq->buf)) {
+ if (mlx4_buf_alloc(dev->dev, buf_size, PAGE_SIZE * 2,
+ &srq->buf, 0)) {
err = -ENOMEM;
goto err_db;
}
@@ -165,7 +166,7 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
if (err)
goto err_buf;

- err = mlx4_buf_write_mtt(dev->dev, &srq->mtt, &srq->buf);
+ err = mlx4_buf_write_mtt(dev->dev, &srq->mtt, &srq->buf, 0);
if (err)
goto err_mtt;

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 1377f85..b6dd279 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -48,6 +48,13 @@ MODULE_PARM_DESC(max_nonsrq_conn_qp,
"Max number of connected-mode QPs per interface "
"(applied only if shared receive queue is not available)");

+int ipoib_use_gfp_nofs = 0;
+
+module_param_named(use_gfp_nofs, ipoib_use_gfp_nofs, int, 0444);
+MODULE_PARM_DESC(use_gfp_nofs,
+ "Use GFP_NOFS flags when allocating memory during the TX "
+ "path for CM. This should be used when running NFS over IPoIB.");
+
#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
static int data_debug_level;

@@ -1030,7 +1037,9 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
.cap.max_send_sge = 1,
.sq_sig_type = IB_SIGNAL_ALL_WR,
.qp_type = IB_QPT_RC,
- .qp_context = tx
+ .qp_context = tx,
+ .create_flags = ipoib_use_gfp_nofs ?
+ IB_QP_CREATE_USE_GFP_NOFS : 0
};

return ib_create_qp(priv->pd, &attr);
@@ -1104,12 +1113,15 @@ static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn,
struct ipoib_dev_priv *priv = netdev_priv(p->dev);
int ret;

- p->tx_ring = vzalloc(ipoib_sendq_size * sizeof *p->tx_ring);
+ p->tx_ring = __vmalloc(ipoib_sendq_size * sizeof *p->tx_ring,
+ ipoib_use_gfp_nofs ? GFP_NOFS : GFP_KERNEL,
+ PAGE_KERNEL);
if (!p->tx_ring) {
ipoib_warn(priv, "failed to allocate tx ring\n");
ret = -ENOMEM;
goto err_tx;
}
+ memset(p->tx_ring, 0, ipoib_sendq_size * sizeof *p->tx_ring);

p->qp = ipoib_cm_create_tx_qp(p->dev, p);
if (IS_ERR(p->qp)) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/alloc.c b/drivers/net/ethernet/mellanox/mlx4/alloc.c
index c3ad464..60ca7f1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/alloc.c
+++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c
@@ -171,7 +171,7 @@ void mlx4_bitmap_cleanup(struct mlx4_bitmap *bitmap)
*/

int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
- struct mlx4_buf *buf)
+ struct mlx4_buf *buf, int use_gfp_nofs)
{
dma_addr_t t;

@@ -180,7 +180,9 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
buf->npages = 1;
buf->page_shift = get_order(size) + PAGE_SHIFT;
buf->direct.buf = dma_alloc_coherent(&dev->pdev->dev,
- size, &t, GFP_KERNEL);
+ size, &t,
+ use_gfp_nofs ?
+ GFP_NOFS : GFP_KERNEL);
if (!buf->direct.buf)
return -ENOMEM;

@@ -200,14 +202,16 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
buf->npages = buf->nbufs;
buf->page_shift = PAGE_SHIFT;
buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
- GFP_KERNEL);
+ use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
if (!buf->page_list)
return -ENOMEM;

for (i = 0; i < buf->nbufs; ++i) {
buf->page_list[i].buf =
dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE,
- &t, GFP_KERNEL);
+ &t,
+ use_gfp_nofs ?
+ GFP_NOFS : GFP_KERNEL);
if (!buf->page_list[i].buf)
goto err_free;

@@ -218,7 +222,8 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,

if (BITS_PER_LONG == 64) {
struct page **pages;
- pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL);
+ pages = kmalloc(sizeof *pages * buf->nbufs,
+ use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
if (!pages)
goto err_free;
for (i = 0; i < buf->nbufs; ++i)
@@ -260,11 +265,12 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf)
}
EXPORT_SYMBOL_GPL(mlx4_buf_free);

-static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device)
+static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device,
+ int use_gfp_nofs)
{
struct mlx4_db_pgdir *pgdir;

- pgdir = kzalloc(sizeof *pgdir, GFP_KERNEL);
+ pgdir = kzalloc(sizeof *pgdir, use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
if (!pgdir)
return NULL;

@@ -272,7 +278,9 @@ static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device)
pgdir->bits[0] = pgdir->order0;
pgdir->bits[1] = pgdir->order1;
pgdir->db_page = dma_alloc_coherent(dma_device, PAGE_SIZE,
- &pgdir->db_dma, GFP_KERNEL);
+ &pgdir->db_dma,
+ use_gfp_nofs ?
+ GFP_NOFS : GFP_KERNEL);
if (!pgdir->db_page) {
kfree(pgdir);
return NULL;
@@ -312,7 +320,8 @@ found:
return 0;
}

-int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order)
+int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order,
+ int use_gfp_nofs)
{
struct mlx4_priv *priv = mlx4_priv(dev);
struct mlx4_db_pgdir *pgdir;
@@ -324,7 +333,7 @@ int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order)
if (!mlx4_alloc_db_from_pgdir(pgdir, db, order))
goto out;

- pgdir = mlx4_alloc_db_pgdir(&(dev->pdev->dev));
+ pgdir = mlx4_alloc_db_pgdir(&(dev->pdev->dev), use_gfp_nofs);
if (!pgdir) {
ret = -ENOMEM;
goto out;
@@ -376,13 +385,13 @@ int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
{
int err;

- err = mlx4_db_alloc(dev, &wqres->db, 1);
+ err = mlx4_db_alloc(dev, &wqres->db, 1, 0);
if (err)
return err;

*wqres->db.db = 0;

- err = mlx4_buf_alloc(dev, size, max_direct, &wqres->buf);
+ err = mlx4_buf_alloc(dev, size, max_direct, &wqres->buf, 0);
if (err)
goto err_db;

@@ -391,7 +400,7 @@ int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
if (err)
goto err_buf;

- err = mlx4_buf_write_mtt(dev, &wqres->mtt, &wqres->buf);
+ err = mlx4_buf_write_mtt(dev, &wqres->mtt, &wqres->buf, 0);
if (err)
goto err_mtt;

diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index 0487121..9727175 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -173,11 +173,11 @@ int __mlx4_cq_alloc_icm(struct mlx4_dev *dev, int *cqn)
if (*cqn == -1)
return -ENOMEM;

- err = mlx4_table_get(dev, &cq_table->table, *cqn);
+ err = mlx4_table_get(dev, &cq_table->table, *cqn, 0);
if (err)
goto err_out;

- err = mlx4_table_get(dev, &cq_table->cmpt_table, *cqn);
+ err = mlx4_table_get(dev, &cq_table->cmpt_table, *cqn, 0);
if (err)
goto err_put;
return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 890922c..ee77284 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -944,7 +944,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv *priv, int qpn,
if (!context)
return -ENOMEM;

- err = mlx4_qp_alloc(mdev->dev, qpn, qp);
+ err = mlx4_qp_alloc(mdev->dev, qpn, qp, 0);
if (err) {
en_err(priv, "Failed to allocate qp #%x\n", qpn);
goto out;
@@ -984,7 +984,7 @@ int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv)
en_err(priv, "Failed reserving drop qpn\n");
return err;
}
- err = mlx4_qp_alloc(priv->mdev->dev, qpn, &priv->drop_qp);
+ err = mlx4_qp_alloc(priv->mdev->dev, qpn, &priv->drop_qp, 0);
if (err) {
en_err(priv, "Failed allocating drop qp\n");
mlx4_qp_release_range(priv->mdev->dev, qpn, 1);
@@ -1043,7 +1043,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
}

/* Configure RSS indirection qp */
- err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, &rss_map->indir_qp);
+ err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, &rss_map->indir_qp, 0);
if (err) {
en_err(priv, "Failed to allocate RSS indirection QP\n");
goto rss_err;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 1345703..2f123bf 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -124,7 +124,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
ring->buf_size, (unsigned long long) ring->wqres.buf.direct.map);

ring->qpn = qpn;
- err = mlx4_qp_alloc(mdev->dev, ring->qpn, &ring->qp);
+ err = mlx4_qp_alloc(mdev->dev, ring->qpn, &ring->qp, 0);
if (err) {
en_err(priv, "Failed allocating qp %d\n", ring->qpn);
goto err_map;
diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 5fbf492..c83b4e6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -245,7 +245,8 @@ int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev)
MLX4_CMD_TIME_CLASS_B, MLX4_CMD_NATIVE);
}

-int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj)
+int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj,
+ int use_gfp_nofs)
{
u32 i = (obj & (table->num_obj - 1)) /
(MLX4_TABLE_CHUNK_SIZE / table->obj_size);
@@ -259,7 +260,10 @@ int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj)
}

table->icm[i] = mlx4_alloc_icm(dev, MLX4_TABLE_CHUNK_SIZE >> PAGE_SHIFT,
- (table->lowmem ? GFP_KERNEL : GFP_HIGHUSER) |
+ (table->lowmem ?
+ (use_gfp_nofs ?
+ GFP_NOFS : GFP_KERNEL) :
+ GFP_HIGHUSER) |
__GFP_NOWARN, table->coherent);
if (!table->icm[i]) {
ret = -ENOMEM;
@@ -356,7 +360,7 @@ int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
u32 i;

for (i = start; i <= end; i += inc) {
- err = mlx4_table_get(dev, table, i);
+ err = mlx4_table_get(dev, table, i, 0);
if (err)
goto fail;
}
diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.h b/drivers/net/ethernet/mellanox/mlx4/icm.h
index dee67fa..2be6ac5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.h
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.h
@@ -71,7 +71,8 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
gfp_t gfp_mask, int coherent);
void mlx4_free_icm(struct mlx4_dev *dev, struct mlx4_icm *icm, int coherent);

-int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj);
+int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj,
+ int use_gfp_nofs);
void mlx4_table_put(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj);
int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
u32 start, u32 end);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index 6b65f77..2d73e12 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -882,7 +882,7 @@ void mlx4_cleanup_cq_table(struct mlx4_dev *dev);
void mlx4_cleanup_qp_table(struct mlx4_dev *dev);
void mlx4_cleanup_srq_table(struct mlx4_dev *dev);
void mlx4_cleanup_mcg_table(struct mlx4_dev *dev);
-int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn);
+int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs);
void __mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn);
int __mlx4_cq_alloc_icm(struct mlx4_dev *dev, int *cqn);
void __mlx4_cq_free_icm(struct mlx4_dev *dev, int cqn);
@@ -890,7 +890,7 @@ int __mlx4_srq_alloc_icm(struct mlx4_dev *dev, int *srqn);
void __mlx4_srq_free_icm(struct mlx4_dev *dev, int srqn);
int __mlx4_mpt_reserve(struct mlx4_dev *dev);
void __mlx4_mpt_release(struct mlx4_dev *dev, u32 index);
-int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index);
+int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs);
void __mlx4_mpt_free_icm(struct mlx4_dev *dev, u32 index);
u32 __mlx4_alloc_mtt_range(struct mlx4_dev *dev, int order);
void __mlx4_free_mtt_range(struct mlx4_dev *dev, u32 first_seg, int order);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c
index 2483585..5fa9371 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mr.c
@@ -364,14 +364,14 @@ static void mlx4_mpt_release(struct mlx4_dev *dev, u32 index)
__mlx4_mpt_release(dev, index);
}

-int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
+int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs)
{
struct mlx4_mr_table *mr_table = &mlx4_priv(dev)->mr_table;

- return mlx4_table_get(dev, &mr_table->dmpt_table, index);
+ return mlx4_table_get(dev, &mr_table->dmpt_table, index, use_gfp_nofs);
}

-static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
+static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs)
{
u64 param = 0;

@@ -382,7 +382,7 @@ static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
MLX4_CMD_TIME_CLASS_A,
MLX4_CMD_WRAPPED);
}
- return __mlx4_mpt_alloc_icm(dev, index);
+ return __mlx4_mpt_alloc_icm(dev, index, use_gfp_nofs);
}

void __mlx4_mpt_free_icm(struct mlx4_dev *dev, u32 index)
@@ -469,7 +469,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr)
struct mlx4_mpt_entry *mpt_entry;
int err;

- err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mr->key));
+ err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mr->key), 0);
if (err)
return err;

@@ -627,13 +627,14 @@ int mlx4_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
EXPORT_SYMBOL_GPL(mlx4_write_mtt);

int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
- struct mlx4_buf *buf)
+ struct mlx4_buf *buf, int use_gfp_nofs)
{
u64 *page_list;
int err;
int i;

- page_list = kmalloc(buf->npages * sizeof *page_list, GFP_KERNEL);
+ page_list = kmalloc(buf->npages * sizeof *page_list,
+ use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
if (!page_list)
return -ENOMEM;

@@ -680,7 +681,7 @@ int mlx4_mw_enable(struct mlx4_dev *dev, struct mlx4_mw *mw)
struct mlx4_mpt_entry *mpt_entry;
int err;

- err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mw->key));
+ err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mw->key), 0);
if (err)
return err;

diff --git a/drivers/net/ethernet/mellanox/mlx4/qp.c b/drivers/net/ethernet/mellanox/mlx4/qp.c
index 61d64eb..c6db326 100644
--- a/drivers/net/ethernet/mellanox/mlx4/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx4/qp.c
@@ -272,29 +272,29 @@ void mlx4_qp_release_range(struct mlx4_dev *dev, int base_qpn, int cnt)
}
EXPORT_SYMBOL_GPL(mlx4_qp_release_range);

-int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
+int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs)
{
struct mlx4_priv *priv = mlx4_priv(dev);
struct mlx4_qp_table *qp_table = &priv->qp_table;
int err;

- err = mlx4_table_get(dev, &qp_table->qp_table, qpn);
+ err = mlx4_table_get(dev, &qp_table->qp_table, qpn, use_gfp_nofs);
if (err)
goto err_out;

- err = mlx4_table_get(dev, &qp_table->auxc_table, qpn);
+ err = mlx4_table_get(dev, &qp_table->auxc_table, qpn, use_gfp_nofs);
if (err)
goto err_put_qp;

- err = mlx4_table_get(dev, &qp_table->altc_table, qpn);
+ err = mlx4_table_get(dev, &qp_table->altc_table, qpn, use_gfp_nofs);
if (err)
goto err_put_auxc;

- err = mlx4_table_get(dev, &qp_table->rdmarc_table, qpn);
+ err = mlx4_table_get(dev, &qp_table->rdmarc_table, qpn, use_gfp_nofs);
if (err)
goto err_put_altc;

- err = mlx4_table_get(dev, &qp_table->cmpt_table, qpn);
+ err = mlx4_table_get(dev, &qp_table->cmpt_table, qpn, use_gfp_nofs);
if (err)
goto err_put_rdmarc;

@@ -316,7 +316,7 @@ err_out:
return err;
}

-static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
+static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs)
{
u64 param = 0;

@@ -326,7 +326,7 @@ static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
MLX4_CMD_ALLOC_RES, MLX4_CMD_TIME_CLASS_A,
MLX4_CMD_WRAPPED);
}
- return __mlx4_qp_alloc_icm(dev, qpn);
+ return __mlx4_qp_alloc_icm(dev, qpn, use_gfp_nofs);
}

void __mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn)
@@ -355,7 +355,8 @@ static void mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn)
__mlx4_qp_free_icm(dev, qpn);
}

-int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp)
+int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp,
+ int use_gfp_nofs)
{
struct mlx4_priv *priv = mlx4_priv(dev);
struct mlx4_qp_table *qp_table = &priv->qp_table;
@@ -366,7 +367,7 @@ int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp)

qp->qpn = qpn;

- err = mlx4_qp_alloc_icm(dev, qpn);
+ err = mlx4_qp_alloc_icm(dev, qpn, use_gfp_nofs);
if (err)
return err;

diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 57428a0..007434d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -1490,7 +1490,7 @@ static int qp_alloc_res(struct mlx4_dev *dev, int slave, int op, int cmd,
return err;

if (!fw_reserved(dev, qpn)) {
- err = __mlx4_qp_alloc_icm(dev, qpn);
+ err = __mlx4_qp_alloc_icm(dev, qpn, 0);
if (err) {
res_abort_move(dev, slave, RES_QP, qpn);
return err;
@@ -1577,7 +1577,7 @@ static int mpt_alloc_res(struct mlx4_dev *dev, int slave, int op, int cmd,
if (err)
return err;

- err = __mlx4_mpt_alloc_icm(dev, mpt->key);
+ err = __mlx4_mpt_alloc_icm(dev, mpt->key, 0);
if (err) {
res_abort_move(dev, slave, RES_MPT, id);
return err;
diff --git a/drivers/net/ethernet/mellanox/mlx4/srq.c b/drivers/net/ethernet/mellanox/mlx4/srq.c
index 98faf87..2cd51a3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/srq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/srq.c
@@ -103,11 +103,11 @@ int __mlx4_srq_alloc_icm(struct mlx4_dev *dev, int *srqn)
if (*srqn == -1)
return -ENOMEM;

- err = mlx4_table_get(dev, &srq_table->table, *srqn);
+ err = mlx4_table_get(dev, &srq_table->table, *srqn, 0);
if (err)
goto err_out;

- err = mlx4_table_get(dev, &srq_table->cmpt_table, *srqn);
+ err = mlx4_table_get(dev, &srq_table->cmpt_table, *srqn, 0);
if (err)
goto err_put;
return 0;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 5edd2c6..de2fcf5 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -826,7 +826,7 @@ static inline int mlx4_is_slave(struct mlx4_dev *dev)
}

int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
- struct mlx4_buf *buf);
+ struct mlx4_buf *buf, int use_gfp_nofs);
void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf);
static inline void *mlx4_buf_offset(struct mlx4_buf *buf, int offset)
{
@@ -863,9 +863,10 @@ int mlx4_mw_enable(struct mlx4_dev *dev, struct mlx4_mw *mw);
int mlx4_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
int start_index, int npages, u64 *page_list);
int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
- struct mlx4_buf *buf);
+ struct mlx4_buf *buf, int use_gfp_nofs);

-int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order);
+int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order,
+ int use_gfp_nofs);
void mlx4_db_free(struct mlx4_dev *dev, struct mlx4_db *db);

int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
@@ -881,7 +882,8 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq);
int mlx4_qp_reserve_range(struct mlx4_dev *dev, int cnt, int align, int *base);
void mlx4_qp_release_range(struct mlx4_dev *dev, int base_qpn, int cnt);

-int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp);
+int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp,
+ int use_gfp_nofs);
void mlx4_qp_free(struct mlx4_dev *dev, struct mlx4_qp *qp);

int mlx4_srq_alloc(struct mlx4_dev *dev, u32 pdn, u32 cqn, u16 xrcdn,

--
Jiri Kosina
SUSE Labs


2014-02-25 21:52:47

by Or Gerlitz

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On Fri, Feb 21, 2014 at 11:53 PM, Jiri Kosina <[email protected]> wrote:
>>
>> This was originally a patch from Matthew Finlay <[email protected]> that
>> addressed a problem whereby NFS writes would enter uninterruptible sleep
>> forever. The issue happened when using NFS over IPoIB. This is not a
>> recommended configuration as RDMA is preferred but it is still a valid
>> configuration and is important to have in situations where the NFS server
>> does not support RDMA. The problem encountered was described as follows:
>> .



And what happens if you use IPoIB datagram mode? Is the patch needed
there, and if so, why?


Also, the patch uses a new QP creation flag, IB_QP_CREATE_USE_GFP_NOFS,
but it doesn't touch include/rdma/ib_verbs.h, nor do I see this flag
defined anywhere in the patch. Does it compile? How?

2014-02-25 22:11:13

by Jiri Kosina

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On Tue, 25 Feb 2014, Or Gerlitz wrote:

> > This was originally a patch from Matthew Finlay <[email protected]> that
> > addressed a problem whereby NFS writes would enter uninterruptible sleep
> > forever. The issue happened when using NFS over IPoIB. This is not a
> > recommended configuration as RDMA is preferred but it is still a valid
> > configuration and is important to have in situations where the NFS server
> > does not support RDMA. The problem encountered was described as follows:
> >
> And what happens if you use IPoIB datagram mode? Is the patch needed
> there, and if so, why?

First, thanks a lot for looking into this.

I admittedly am no infiniband expert, but my understanding is that in
principle Connected/Datagram mode is about MTU and checksum offloading,
but the TX path is the same. Please correct me if I am wrong.

> Also, the patch uses a new QP creation flag, IB_QP_CREATE_USE_GFP_NOFS,
> but it doesn't touch include/rdma/ib_verbs.h, nor do I see this flag
> defined anywhere in the patch. Does it compile? How?

That's my fault, I forgot 'git add', therefore my tree was building, but
include/rdma/ib_verbs.h was missing in git index. Updated patch below,
sorry for the noise.




From: Jiri Kosina <[email protected]>
Subject: [PATCH] mlx4: Use all GFP_NOFS calls during the ipoib TX path when creating the QP

This was a patch from Matthew Finlay <[email protected]> that addressed a
problem whereby NFS writes would enter uninterruptible sleep forever. The
issue happened when using NFS over IPoIB. This is not a recommended
configuration as RDMA is preferred but it is still a valid configuration and is
important to have in situations where the NFS server does not support RDMA.
The problem encountered was described as follows:

It's not memory reclamation that is the problem as such. There is
an indirect dependency between network filesystems writing back
pages and ipoib_cm_tx_init() due to how a kworker is used. Page
reclaim cannot make forward progress until ipoib_cm_tx_init()
succeeds and it is stuck in page reclaim itself waiting for network
transmission. Ordinarily this situation may be avoided by having
the caller use GFP_NOFS but ipoib_cm_tx_init() does not have
that information.

The patch has been ported to newer kernels by Mel Gorman and later
ported further by Jiri Kosina.
I'd like to get confirmation from Matt that he's fine with having
his Signed-off-by: on this, but he's not responding to my queries.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
---
drivers/infiniband/hw/mlx4/cq.c | 6 ++--
drivers/infiniband/hw/mlx4/qp.c | 36 +++++++++++++++-----
drivers/infiniband/hw/mlx4/srq.c | 7 ++--
drivers/infiniband/ulp/ipoib/ipoib_cm.c | 16 ++++++++-
drivers/net/ethernet/mellanox/mlx4/alloc.c | 35 ++++++++++++-------
drivers/net/ethernet/mellanox/mlx4/cq.c | 4 +-
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 6 ++--
drivers/net/ethernet/mellanox/mlx4/en_tx.c | 2 +-
drivers/net/ethernet/mellanox/mlx4/icm.c | 10 ++++--
drivers/net/ethernet/mellanox/mlx4/icm.h | 3 +-
drivers/net/ethernet/mellanox/mlx4/mlx4.h | 4 +-
drivers/net/ethernet/mellanox/mlx4/mr.c | 17 +++++----
drivers/net/ethernet/mellanox/mlx4/qp.c | 21 ++++++-----
.../net/ethernet/mellanox/mlx4/resource_tracker.c | 4 +-
drivers/net/ethernet/mellanox/mlx4/srq.c | 4 +-
include/linux/mlx4/device.h | 10 +++--
include/rdma/ib_verbs.h | 1 +
17 files changed, 118 insertions(+), 68 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index cc40f08..661185a 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -102,7 +102,7 @@ static int mlx4_ib_alloc_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *
int err;

err = mlx4_buf_alloc(dev->dev, nent * dev->dev->caps.cqe_size,
- PAGE_SIZE * 2, &buf->buf);
+ PAGE_SIZE * 2, &buf->buf, 0);

if (err)
goto out;
@@ -113,7 +113,7 @@ static int mlx4_ib_alloc_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *
if (err)
goto err_buf;

- err = mlx4_buf_write_mtt(dev->dev, &buf->mtt, &buf->buf);
+ err = mlx4_buf_write_mtt(dev->dev, &buf->mtt, &buf->buf, 0);
if (err)
goto err_mtt;

@@ -209,7 +209,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector

uar = &to_mucontext(context)->uar;
} else {
- err = mlx4_db_alloc(dev->dev, &cq->db, 1);
+ err = mlx4_db_alloc(dev->dev, &cq->db, 1, 0);
if (err)
goto err_cq;

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index d8f4d1f..1379ee7 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -744,14 +744,18 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
goto err;

if (qp_has_rq(init_attr)) {
- err = mlx4_db_alloc(dev->dev, &qp->db, 0);
+ err = mlx4_db_alloc(dev->dev, &qp->db, 0,
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS);
if (err)
goto err;

*qp->db.db = 0;
}

- if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf)) {
+ if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf,
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS)) {
err = -ENOMEM;
goto err_db;
}
@@ -761,12 +765,20 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
if (err)
goto err_buf;

- err = mlx4_buf_write_mtt(dev->dev, &qp->mtt, &qp->buf);
+ err = mlx4_buf_write_mtt(dev->dev, &qp->mtt, &qp->buf,
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS);
if (err)
goto err_mtt;

- qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof (u64), GFP_KERNEL);
- qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof (u64), GFP_KERNEL);
+ qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof (u64),
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS ?
+ GFP_NOFS : GFP_KERNEL);
+ qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof (u64),
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS ?
+ GFP_NOFS : GFP_KERNEL);

if (!qp->sq.wrid || !qp->rq.wrid) {
err = -ENOMEM;
@@ -797,7 +809,8 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
goto err_proxy;
}

- err = mlx4_qp_alloc(dev->dev, qpn, &qp->mqp);
+ err = mlx4_qp_alloc(dev->dev, qpn, &qp->mqp,
+ init_attr->create_flags & IB_QP_CREATE_USE_GFP_NOFS);
if (err)
goto err_qpn;

@@ -1024,10 +1037,12 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK |
MLX4_IB_SRIOV_TUNNEL_QP |
MLX4_IB_SRIOV_SQP |
- MLX4_IB_QP_NETIF))
+ MLX4_IB_QP_NETIF |
+ IB_QP_CREATE_USE_GFP_NOFS))
return ERR_PTR(-EINVAL);

- if (init_attr->create_flags & IB_QP_CREATE_NETIF_QP) {
+ if (init_attr->create_flags & IB_QP_CREATE_NETIF_QP &&
+ init_attr->create_flags & ~IB_QP_CREATE_USE_GFP_NOFS) {
if (init_attr->qp_type != IB_QPT_UD)
return ERR_PTR(-EINVAL);
}
@@ -1054,7 +1069,10 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
case IB_QPT_RC:
case IB_QPT_UC:
case IB_QPT_RAW_PACKET:
- qp = kzalloc(sizeof *qp, GFP_KERNEL);
+ qp = kzalloc(sizeof *qp,
+ init_attr->create_flags &
+ IB_QP_CREATE_USE_GFP_NOFS ?
+ GFP_NOFS : GFP_KERNEL);
if (!qp)
return ERR_PTR(-ENOMEM);
/* fall through */
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 60c5fb0..17552c0 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -134,13 +134,14 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
if (err)
goto err_mtt;
} else {
- err = mlx4_db_alloc(dev->dev, &srq->db, 0);
+ err = mlx4_db_alloc(dev->dev, &srq->db, 0, 0);
if (err)
goto err_srq;

*srq->db.db = 0;

- if (mlx4_buf_alloc(dev->dev, buf_size, PAGE_SIZE * 2, &srq->buf)) {
+ if (mlx4_buf_alloc(dev->dev, buf_size, PAGE_SIZE * 2,
+ &srq->buf, 0)) {
err = -ENOMEM;
goto err_db;
}
@@ -165,7 +166,7 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
if (err)
goto err_buf;

- err = mlx4_buf_write_mtt(dev->dev, &srq->mtt, &srq->buf);
+ err = mlx4_buf_write_mtt(dev->dev, &srq->mtt, &srq->buf, 0);
if (err)
goto err_mtt;

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 1377f85..b6dd279 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -48,6 +48,13 @@ MODULE_PARM_DESC(max_nonsrq_conn_qp,
"Max number of connected-mode QPs per interface "
"(applied only if shared receive queue is not available)");

+int ipoib_use_gfp_nofs = 0;
+
+module_param_named(use_gfp_nofs, ipoib_use_gfp_nofs, int, 0444);
+MODULE_PARM_DESC(use_gfp_nofs,
+ "Use GFP_NOFS flags when allocating memory during the TX "
+ "path for CM. This should be used when running NFS over IPoIB.");
+
#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
static int data_debug_level;

@@ -1030,7 +1037,9 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
.cap.max_send_sge = 1,
.sq_sig_type = IB_SIGNAL_ALL_WR,
.qp_type = IB_QPT_RC,
- .qp_context = tx
+ .qp_context = tx,
+ .create_flags = ipoib_use_gfp_nofs ?
+ IB_QP_CREATE_USE_GFP_NOFS : 0
};

return ib_create_qp(priv->pd, &attr);
@@ -1104,12 +1113,15 @@ static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn,
struct ipoib_dev_priv *priv = netdev_priv(p->dev);
int ret;

- p->tx_ring = vzalloc(ipoib_sendq_size * sizeof *p->tx_ring);
+ p->tx_ring = __vmalloc(ipoib_sendq_size * sizeof *p->tx_ring,
+ ipoib_use_gfp_nofs ? GFP_NOFS : GFP_KERNEL,
+ PAGE_KERNEL);
if (!p->tx_ring) {
ipoib_warn(priv, "failed to allocate tx ring\n");
ret = -ENOMEM;
goto err_tx;
}
+ memset(p->tx_ring, 0, ipoib_sendq_size * sizeof *p->tx_ring);

p->qp = ipoib_cm_create_tx_qp(p->dev, p);
if (IS_ERR(p->qp)) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/alloc.c b/drivers/net/ethernet/mellanox/mlx4/alloc.c
index c3ad464..60ca7f1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/alloc.c
+++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c
@@ -171,7 +171,7 @@ void mlx4_bitmap_cleanup(struct mlx4_bitmap *bitmap)
*/

int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
- struct mlx4_buf *buf)
+ struct mlx4_buf *buf, int use_gfp_nofs)
{
dma_addr_t t;

@@ -180,7 +180,9 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
buf->npages = 1;
buf->page_shift = get_order(size) + PAGE_SHIFT;
buf->direct.buf = dma_alloc_coherent(&dev->pdev->dev,
- size, &t, GFP_KERNEL);
+ size, &t,
+ use_gfp_nofs ?
+ GFP_NOFS : GFP_KERNEL);
if (!buf->direct.buf)
return -ENOMEM;

@@ -200,14 +202,16 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
buf->npages = buf->nbufs;
buf->page_shift = PAGE_SHIFT;
buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
- GFP_KERNEL);
+ use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
if (!buf->page_list)
return -ENOMEM;

for (i = 0; i < buf->nbufs; ++i) {
buf->page_list[i].buf =
dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE,
- &t, GFP_KERNEL);
+ &t,
+ use_gfp_nofs ?
+ GFP_NOFS : GFP_KERNEL);
if (!buf->page_list[i].buf)
goto err_free;

@@ -218,7 +222,8 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,

if (BITS_PER_LONG == 64) {
struct page **pages;
- pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL);
+ pages = kmalloc(sizeof *pages * buf->nbufs,
+ use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
if (!pages)
goto err_free;
for (i = 0; i < buf->nbufs; ++i)
@@ -260,11 +265,12 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf)
}
EXPORT_SYMBOL_GPL(mlx4_buf_free);

-static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device)
+static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device,
+ int use_gfp_nofs)
{
struct mlx4_db_pgdir *pgdir;

- pgdir = kzalloc(sizeof *pgdir, GFP_KERNEL);
+ pgdir = kzalloc(sizeof *pgdir, use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
if (!pgdir)
return NULL;

@@ -272,7 +278,9 @@ static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device)
pgdir->bits[0] = pgdir->order0;
pgdir->bits[1] = pgdir->order1;
pgdir->db_page = dma_alloc_coherent(dma_device, PAGE_SIZE,
- &pgdir->db_dma, GFP_KERNEL);
+ &pgdir->db_dma,
+ use_gfp_nofs ?
+ GFP_NOFS : GFP_KERNEL);
if (!pgdir->db_page) {
kfree(pgdir);
return NULL;
@@ -312,7 +320,8 @@ found:
return 0;
}

-int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order)
+int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order,
+ int use_gfp_nofs)
{
struct mlx4_priv *priv = mlx4_priv(dev);
struct mlx4_db_pgdir *pgdir;
@@ -324,7 +333,7 @@ int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order)
if (!mlx4_alloc_db_from_pgdir(pgdir, db, order))
goto out;

- pgdir = mlx4_alloc_db_pgdir(&(dev->pdev->dev));
+ pgdir = mlx4_alloc_db_pgdir(&(dev->pdev->dev), use_gfp_nofs);
if (!pgdir) {
ret = -ENOMEM;
goto out;
@@ -376,13 +385,13 @@ int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
{
int err;

- err = mlx4_db_alloc(dev, &wqres->db, 1);
+ err = mlx4_db_alloc(dev, &wqres->db, 1, 0);
if (err)
return err;

*wqres->db.db = 0;

- err = mlx4_buf_alloc(dev, size, max_direct, &wqres->buf);
+ err = mlx4_buf_alloc(dev, size, max_direct, &wqres->buf, 0);
if (err)
goto err_db;

@@ -391,7 +400,7 @@ int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
if (err)
goto err_buf;

- err = mlx4_buf_write_mtt(dev, &wqres->mtt, &wqres->buf);
+ err = mlx4_buf_write_mtt(dev, &wqres->mtt, &wqres->buf, 0);
if (err)
goto err_mtt;

diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index 0487121..9727175 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -173,11 +173,11 @@ int __mlx4_cq_alloc_icm(struct mlx4_dev *dev, int *cqn)
if (*cqn == -1)
return -ENOMEM;

- err = mlx4_table_get(dev, &cq_table->table, *cqn);
+ err = mlx4_table_get(dev, &cq_table->table, *cqn, 0);
if (err)
goto err_out;

- err = mlx4_table_get(dev, &cq_table->cmpt_table, *cqn);
+ err = mlx4_table_get(dev, &cq_table->cmpt_table, *cqn, 0);
if (err)
goto err_put;
return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 890922c..ee77284 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -944,7 +944,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv *priv, int qpn,
if (!context)
return -ENOMEM;

- err = mlx4_qp_alloc(mdev->dev, qpn, qp);
+ err = mlx4_qp_alloc(mdev->dev, qpn, qp, 0);
if (err) {
en_err(priv, "Failed to allocate qp #%x\n", qpn);
goto out;
@@ -984,7 +984,7 @@ int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv)
en_err(priv, "Failed reserving drop qpn\n");
return err;
}
- err = mlx4_qp_alloc(priv->mdev->dev, qpn, &priv->drop_qp);
+ err = mlx4_qp_alloc(priv->mdev->dev, qpn, &priv->drop_qp, 0);
if (err) {
en_err(priv, "Failed allocating drop qp\n");
mlx4_qp_release_range(priv->mdev->dev, qpn, 1);
@@ -1043,7 +1043,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
}

/* Configure RSS indirection qp */
- err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, &rss_map->indir_qp);
+ err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, &rss_map->indir_qp, 0);
if (err) {
en_err(priv, "Failed to allocate RSS indirection QP\n");
goto rss_err;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 1345703..2f123bf 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -124,7 +124,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
ring->buf_size, (unsigned long long) ring->wqres.buf.direct.map);

ring->qpn = qpn;
- err = mlx4_qp_alloc(mdev->dev, ring->qpn, &ring->qp);
+ err = mlx4_qp_alloc(mdev->dev, ring->qpn, &ring->qp, 0);
if (err) {
en_err(priv, "Failed allocating qp %d\n", ring->qpn);
goto err_map;
diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 5fbf492..c83b4e6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -245,7 +245,8 @@ int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev)
MLX4_CMD_TIME_CLASS_B, MLX4_CMD_NATIVE);
}

-int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj)
+int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj,
+ int use_gfp_nofs)
{
u32 i = (obj & (table->num_obj - 1)) /
(MLX4_TABLE_CHUNK_SIZE / table->obj_size);
@@ -259,7 +260,10 @@ int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj)
}

table->icm[i] = mlx4_alloc_icm(dev, MLX4_TABLE_CHUNK_SIZE >> PAGE_SHIFT,
- (table->lowmem ? GFP_KERNEL : GFP_HIGHUSER) |
+ (table->lowmem ?
+ (use_gfp_nofs ?
+ GFP_NOFS : GFP_KERNEL) :
+ GFP_HIGHUSER) |
__GFP_NOWARN, table->coherent);
if (!table->icm[i]) {
ret = -ENOMEM;
@@ -356,7 +360,7 @@ int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
u32 i;

for (i = start; i <= end; i += inc) {
- err = mlx4_table_get(dev, table, i);
+ err = mlx4_table_get(dev, table, i, 0);
if (err)
goto fail;
}
diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.h b/drivers/net/ethernet/mellanox/mlx4/icm.h
index dee67fa..2be6ac5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.h
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.h
@@ -71,7 +71,8 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
gfp_t gfp_mask, int coherent);
void mlx4_free_icm(struct mlx4_dev *dev, struct mlx4_icm *icm, int coherent);

-int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj);
+int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj,
+ int use_gfp_nofs);
void mlx4_table_put(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj);
int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
u32 start, u32 end);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index 6b65f77..2d73e12 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -882,7 +882,7 @@ void mlx4_cleanup_cq_table(struct mlx4_dev *dev);
void mlx4_cleanup_qp_table(struct mlx4_dev *dev);
void mlx4_cleanup_srq_table(struct mlx4_dev *dev);
void mlx4_cleanup_mcg_table(struct mlx4_dev *dev);
-int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn);
+int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs);
void __mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn);
int __mlx4_cq_alloc_icm(struct mlx4_dev *dev, int *cqn);
void __mlx4_cq_free_icm(struct mlx4_dev *dev, int cqn);
@@ -890,7 +890,7 @@ int __mlx4_srq_alloc_icm(struct mlx4_dev *dev, int *srqn);
void __mlx4_srq_free_icm(struct mlx4_dev *dev, int srqn);
int __mlx4_mpt_reserve(struct mlx4_dev *dev);
void __mlx4_mpt_release(struct mlx4_dev *dev, u32 index);
-int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index);
+int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs);
void __mlx4_mpt_free_icm(struct mlx4_dev *dev, u32 index);
u32 __mlx4_alloc_mtt_range(struct mlx4_dev *dev, int order);
void __mlx4_free_mtt_range(struct mlx4_dev *dev, u32 first_seg, int order);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c
index 2483585..5fa9371 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mr.c
@@ -364,14 +364,14 @@ static void mlx4_mpt_release(struct mlx4_dev *dev, u32 index)
__mlx4_mpt_release(dev, index);
}

-int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
+int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs)
{
struct mlx4_mr_table *mr_table = &mlx4_priv(dev)->mr_table;

- return mlx4_table_get(dev, &mr_table->dmpt_table, index);
+ return mlx4_table_get(dev, &mr_table->dmpt_table, index, use_gfp_nofs);
}

-static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
+static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs)
{
u64 param = 0;

@@ -382,7 +382,7 @@ static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
MLX4_CMD_TIME_CLASS_A,
MLX4_CMD_WRAPPED);
}
- return __mlx4_mpt_alloc_icm(dev, index);
+ return __mlx4_mpt_alloc_icm(dev, index, use_gfp_nofs);
}

void __mlx4_mpt_free_icm(struct mlx4_dev *dev, u32 index)
@@ -469,7 +469,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr)
struct mlx4_mpt_entry *mpt_entry;
int err;

- err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mr->key));
+ err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mr->key), 0);
if (err)
return err;

@@ -627,13 +627,14 @@ int mlx4_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
EXPORT_SYMBOL_GPL(mlx4_write_mtt);

int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
- struct mlx4_buf *buf)
+ struct mlx4_buf *buf, int use_gfp_nofs)
{
u64 *page_list;
int err;
int i;

- page_list = kmalloc(buf->npages * sizeof *page_list, GFP_KERNEL);
+ page_list = kmalloc(buf->npages * sizeof *page_list,
+ use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
if (!page_list)
return -ENOMEM;

@@ -680,7 +681,7 @@ int mlx4_mw_enable(struct mlx4_dev *dev, struct mlx4_mw *mw)
struct mlx4_mpt_entry *mpt_entry;
int err;

- err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mw->key));
+ err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mw->key), 0);
if (err)
return err;

diff --git a/drivers/net/ethernet/mellanox/mlx4/qp.c b/drivers/net/ethernet/mellanox/mlx4/qp.c
index 61d64eb..c6db326 100644
--- a/drivers/net/ethernet/mellanox/mlx4/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx4/qp.c
@@ -272,29 +272,29 @@ void mlx4_qp_release_range(struct mlx4_dev *dev, int base_qpn, int cnt)
}
EXPORT_SYMBOL_GPL(mlx4_qp_release_range);

-int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
+int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs)
{
struct mlx4_priv *priv = mlx4_priv(dev);
struct mlx4_qp_table *qp_table = &priv->qp_table;
int err;

- err = mlx4_table_get(dev, &qp_table->qp_table, qpn);
+ err = mlx4_table_get(dev, &qp_table->qp_table, qpn, use_gfp_nofs);
if (err)
goto err_out;

- err = mlx4_table_get(dev, &qp_table->auxc_table, qpn);
+ err = mlx4_table_get(dev, &qp_table->auxc_table, qpn, use_gfp_nofs);
if (err)
goto err_put_qp;

- err = mlx4_table_get(dev, &qp_table->altc_table, qpn);
+ err = mlx4_table_get(dev, &qp_table->altc_table, qpn, use_gfp_nofs);
if (err)
goto err_put_auxc;

- err = mlx4_table_get(dev, &qp_table->rdmarc_table, qpn);
+ err = mlx4_table_get(dev, &qp_table->rdmarc_table, qpn, use_gfp_nofs);
if (err)
goto err_put_altc;

- err = mlx4_table_get(dev, &qp_table->cmpt_table, qpn);
+ err = mlx4_table_get(dev, &qp_table->cmpt_table, qpn, use_gfp_nofs);
if (err)
goto err_put_rdmarc;

@@ -316,7 +316,7 @@ err_out:
return err;
}

-static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
+static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs)
{
u64 param = 0;

@@ -326,7 +326,7 @@ static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
MLX4_CMD_ALLOC_RES, MLX4_CMD_TIME_CLASS_A,
MLX4_CMD_WRAPPED);
}
- return __mlx4_qp_alloc_icm(dev, qpn);
+ return __mlx4_qp_alloc_icm(dev, qpn, use_gfp_nofs);
}

void __mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn)
@@ -355,7 +355,8 @@ static void mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn)
__mlx4_qp_free_icm(dev, qpn);
}

-int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp)
+int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp,
+ int use_gfp_nofs)
{
struct mlx4_priv *priv = mlx4_priv(dev);
struct mlx4_qp_table *qp_table = &priv->qp_table;
@@ -366,7 +367,7 @@ int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp)

qp->qpn = qpn;

- err = mlx4_qp_alloc_icm(dev, qpn);
+ err = mlx4_qp_alloc_icm(dev, qpn, use_gfp_nofs);
if (err)
return err;

diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 57428a0..007434d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -1490,7 +1490,7 @@ static int qp_alloc_res(struct mlx4_dev *dev, int slave, int op, int cmd,
return err;

if (!fw_reserved(dev, qpn)) {
- err = __mlx4_qp_alloc_icm(dev, qpn);
+ err = __mlx4_qp_alloc_icm(dev, qpn, 0);
if (err) {
res_abort_move(dev, slave, RES_QP, qpn);
return err;
@@ -1577,7 +1577,7 @@ static int mpt_alloc_res(struct mlx4_dev *dev, int slave, int op, int cmd,
if (err)
return err;

- err = __mlx4_mpt_alloc_icm(dev, mpt->key);
+ err = __mlx4_mpt_alloc_icm(dev, mpt->key, 0);
if (err) {
res_abort_move(dev, slave, RES_MPT, id);
return err;
diff --git a/drivers/net/ethernet/mellanox/mlx4/srq.c b/drivers/net/ethernet/mellanox/mlx4/srq.c
index 98faf87..2cd51a3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/srq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/srq.c
@@ -103,11 +103,11 @@ int __mlx4_srq_alloc_icm(struct mlx4_dev *dev, int *srqn)
if (*srqn == -1)
return -ENOMEM;

- err = mlx4_table_get(dev, &srq_table->table, *srqn);
+ err = mlx4_table_get(dev, &srq_table->table, *srqn, 0);
if (err)
goto err_out;

- err = mlx4_table_get(dev, &srq_table->cmpt_table, *srqn);
+ err = mlx4_table_get(dev, &srq_table->cmpt_table, *srqn, 0);
if (err)
goto err_put;
return 0;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 5edd2c6..de2fcf5 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -826,7 +826,7 @@ static inline int mlx4_is_slave(struct mlx4_dev *dev)
}

int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
- struct mlx4_buf *buf);
+ struct mlx4_buf *buf, int use_gfp_nofs);
void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf);
static inline void *mlx4_buf_offset(struct mlx4_buf *buf, int offset)
{
@@ -863,9 +863,10 @@ int mlx4_mw_enable(struct mlx4_dev *dev, struct mlx4_mw *mw);
int mlx4_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
int start_index, int npages, u64 *page_list);
int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
- struct mlx4_buf *buf);
+ struct mlx4_buf *buf, int use_gfp_nofs);

-int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order);
+int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order,
+ int use_gfp_nofs);
void mlx4_db_free(struct mlx4_dev *dev, struct mlx4_db *db);

int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
@@ -881,7 +882,8 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq);
int mlx4_qp_reserve_range(struct mlx4_dev *dev, int cnt, int align, int *base);
void mlx4_qp_release_range(struct mlx4_dev *dev, int base_qpn, int cnt);

-int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp);
+int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp,
+ int use_gfp_nofs);
void mlx4_qp_free(struct mlx4_dev *dev, struct mlx4_qp *qp);

int mlx4_srq_alloc(struct mlx4_dev *dev, u32 pdn, u32 cqn, u16 xrcdn,
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 6793f32..f39001c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -643,6 +643,7 @@ enum ib_qp_type {
enum ib_qp_create_flags {
IB_QP_CREATE_IPOIB_UD_LSO = 1 << 0,
IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK = 1 << 1,
+ IB_QP_CREATE_USE_GFP_NOFS = 1 << 2,
IB_QP_CREATE_NETIF_QP = 1 << 5,
/* reserve bits 26-31 for low level drivers' internal use */
IB_QP_CREATE_RESERVED_START = 1 << 26,
--
Jiri Kosina
SUSE Labs

2014-02-25 22:20:08

by Or Gerlitz

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On Wed, Feb 26, 2014 at 12:11 AM, Jiri Kosina <[email protected]> wrote:
> On Tue, 25 Feb 2014, Or Gerlitz wrote:

>> And what happens if you use IPoIB datagram mode, is/why the patch is
>> needed there?

> I admittedly am no infiniband expert, but my understanding is that in
> principle Connected/Datagram mode is about MTU and checksum offloading,

yes, the differences between the modes relate to these aspects, however

> but the TX path is the same. Please correct me if I am wrong.

no, note that your patch only touched drivers/infiniband/ulp/ipoib/ipoib_cm.c
which is basically compiled out unless you set CONFIG_INFINIBAND_IPOIB_CM,
so surely the TX paths for the datagram and connected modes are
different.

2014-02-25 22:40:44

by Jiri Kosina

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On Wed, 26 Feb 2014, Or Gerlitz wrote:

> >> And what happens if you use IPoIB datagram mode, is/why the patch is
> >> needed there?
>
> > I admittedly am no infiniband expert, but my understanding is that in
> > principle Connected/Datagram mode is about MTU and checksum offloading,
>
> yes, the differences between the mode relate to these aspects, however

Thanks for confirming.

> > but the TX path is the same. Please correct me if I am wrong.
>
> no, note that your patch only touched drivers/infiniband/ulp/ipoib/ipoib_cm.c
> which is basically compiled out unless you set CONFIG_INFINIBAND_IPOIB_CM,
> so surely the TX paths for the datagram and connected modes are
> different.

Yes, but for datagram mode, the tx_ring is allocated in a completely
different way (not from kworker), so this might be a non-issue, right? I
will have to look into it more deeply to be really sure; if you can
provide your insight, that'd be helpful.

--
Jiri Kosina
SUSE Labs

2014-02-25 22:48:59

by Or Gerlitz

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On Wed, Feb 26, 2014 at 12:40 AM, Jiri Kosina <[email protected]> wrote:
> On Wed, 26 Feb 2014, Or Gerlitz wrote:
>
>>>> And what happens if you use IPoIB datagram mode, is/why the patch is
>>>> needed there?

>>> I admittedly am no infiniband expert, but my understanding is that in
>>> principle Connected/Datagram mode is about MTU and checksum
>>> offloading


>> yes, the differences between the mode relate to these aspects, however

> Thanks for confirming

Even if they are different, I still don't see why you wouldn't use
datagram mode if the problem hits you only in connected mode. E.g.
datagram mode supports LSO/GRO and TX/RX checksum offloads, which should
compensate for its smaller MTU compared to connected mode.


>> > but the TX path is the same. Please correct me if I am wrong.

>> no, note that your patch only touched drivers/infiniband/ulp/ipoib/ipoib_cm.c
>> which is basically compiled out unless you set CONFIG_INFINIBAND_IPOIB_CM,
>> so surely the TX paths for the datagram and connected modes are
>> different.

> Yes, but for datagram mode, the tx_ring is allocated in a completely
> different way (not from kworker), so this might be a non-issue, right? I
> will have to look into it more deeply to be really sure; if you can
> provide your insight, that'd be helpful.


Note that even when operating in connected mode, the ipoib net-device
instance can speak in datagram mode with remote nodes that don't
support connected mode and/or when sending multicast -- specifically,
ipoib_dev_init() does the setup of the TX ring. Maybe you can just try
this out and see if it works?

2014-02-25 22:55:16

by Jiri Kosina

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On Wed, 26 Feb 2014, Or Gerlitz wrote:

> Still, even if different, I still don't see why not use datagram mode
> if the problem hits you only for connected mode. E.g datagram mode
> supports LSO/GRO and TX/RX checksum offloads which should cover on the
> smaller MTU vs. connected mode
[ ... snip ... ]
> > Yes, but for datagram mode, the tx_ring is allocated in a completely
> > different way (not from kworker), so this might be a non-issue, right? I
> > will have to look into it more deeply to be really sure; if you can
> > provide your insight, that'd be helpful.
>
> Note that even when operating in connected mode, the ipoib net-device
> instance can speak in datagram mode with remote nodes that don't
> support connected mode and/or when sending multicast -- specifically,
> ipoib_dev_init() does the setup of the TX ring. Maybe you can just try
> this out and see if it works?

That definitely can be verified, and I am putting it on my TODO list.

But let's make sure that we don't diverge from the original problem too
much. Simple fact is that the deadlock is there when using connected mode,
and there is nothing preventing users from using it this way, therefore I
believe it should be fixed one way or another.

If the problem is still there in datagram mode (which, as far as my
understanding of the code goes, is not the case), it should be fixed as
well, but that's a different story.

Thanks,

--
Jiri Kosina
SUSE Labs

2014-02-26 21:18:52

by Or Gerlitz

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On Wed, Feb 26, 2014 at 12:55 AM, Jiri Kosina <[email protected]> wrote:
> On Wed, 26 Feb 2014, Or Gerlitz wrote:
[...]
> That definitely can be verified, and I am putting it on my TODO list.

OK, thanks


> But let's make sure that we don't diverge from the original problem too
> much. Simple fact is that the deadlock is there when using connected mode,
> and there is nothing preventing users from using it this way, therefore I
> believe it should be fixed one way or another.

the patch is titled "mlx4:" -- do you expect the problem to come
into play only when ipoib connected mode runs over the mlx4 driver?
what about mlx5 or other upstream IB drivers?

I'll be looking into the details of the problem/solution, but one way
or another, exposing this as a module parameter sounds more like a hack....

Do we have a way to tell a net-device instance that it should do its
memory allocations in a NOFS manner? if not, shouldn't we come up with
a more general injection method?

2014-02-27 09:48:27

by Jiri Kosina

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On Wed, 26 Feb 2014, Or Gerlitz wrote:

> > But let's make sure that we don't diverge from the original problem too
> > much. Simple fact is that the deadlock is there when using connected mode,
> > and there is nothing preventing users from using it this way, therefore I
> > believe it should be fixed one way or another.
>
> the patch is titled with "mlx4:" -- do you expect the problem to come
> into play only when ipoib connected mode runs over the mlx4 driver?
> what's about mlx5 or other upstream IB drivers?

Honestly, I have no idea. I am pretty sure that Mellanox folks have a much
better understanding of the mlx* driver internals than I do. I tried to
figure out where mlx5 stands in this respect, but I don't even see
where ipoib_cm_tx->tx_ring is being allocated there.

> I'll be looking on the details of the problem/solution,

Awesome, thanks a lot, that's highly appreciated.

> Do we have a way to tell a net-device instance they should do their
> memory allocations in a NOFS manner? if not, shouldn't we come up with
> more general injection method?

I don't think we do, and it indeed should be rather easy to add. The
more challenging part of the problem is where (and based on which data)
the flag would actually be set up on the netdevice so that it's not
a horrible layering violation.

--
Jiri Kosina
SUSE Labs

2014-02-27 09:58:57

by Or Gerlitz

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On 27/02/2014 11:48, Jiri Kosina wrote:
> On Wed, 26 Feb 2014, Or Gerlitz wrote:
>
>>> But let's make sure that we don't diverge from the original problem too
>>> much. Simple fact is that the deadlock is there when using connected mode,
>>> and there is nothing preventing users from using it this way, therefore I
>>> believe it should be fixed one way or another.
>> the patch is titled with "mlx4:" -- do you expect the problem to come
>> into play only when ipoib connected mode runs over the mlx4 driver?
>> what's about mlx5 or other upstream IB drivers?
> Honestly, I have no idea. I am pretty sure that Mellanox folks have much
> better understanding of the mlx* driver internals than I do. I tried to
> figure out where mlx5 is standing in this respect, but I don't even see
> where ipoib_cm_tx->tx_ring is being allocated there.

ipoib is coded over the verbs API (include/rdma/ib_verbs.h) --- so
tracking the path from ipoib through the verbs API into mlx4 should be
a similar exercise to doing so for mlx5, but let's first treat the
higher-level elements involved with this patch.

Can you shed some light on why the problem happens only for NFS, and not
for example with other IP/TCP storage protocols?

For example, do you expect it to happen with iSCSI/TCP too? The Linux
iSCSI initiator first opens a TCP socket from user space to the target,
then does the login exchange over this socket, and later provides the
socket to the kernel iSCSI code to use as the back-end of a SCSI block
device registered with the SCSI midlayer.


>
>> I'll be looking on the details of the problem/solution,
> Awesome, thanks a lot, that's highly appreciated.
>
>> Do we have a way to tell a net-device instance they should do their
>> memory allocations in a NOFS manner? if not, shouldn't we come up with
>> more general injection method?
> I don't think we have, and it indeed should be rather easy to add. The
> more challenging part of the problem is where (and based on which data)
> the flag would actually be set up on the netdevice so that it's not
> horrible layering violation.
>

I assume that, in the same manner that netdevices advertise features to
the networking core, the core can provide them with operating directives
after they register themselves.

2014-02-27 10:42:39

by Jiri Kosina

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On Thu, 27 Feb 2014, Or Gerlitz wrote:

> ipoib is coded over the verbs API (include/rdma/ib_verbs.h), so tracking
> the path from ipoib through the verbs API into mlx4 should be a similar
> exercise to doing so for mlx5, but let's first treat the higher-level
> elements involved with this patch.
>
> Can you shed some light on why the problem happens only for NFS, and not,
> for example, with other IP/TCP storage protocols?
>
> For example, do you expect it to happen with iSCSI/TCP too? The Linux
> iSCSI initiator first opens a TCP socket from user space to the target,
> then does the login exchange over this socket, and later hands the
> socket to the kernel iSCSI code to use as the back end of a SCSI block
> device registered with the SCSI midlayer.

Frankly, no idea. There was a problem with swapping over NFS, as writeback
was deadlocked with memory reclaim (memory needs to be allocated so that
swap could be accessed to reclaim memory). That's fixed by allocating the
buffers from PF_MEMALLOC reserve, introduced by Mel's and Peter's patchset
back in 3.9 or so. Oh, and the same has been done for swapping over NBD,
btw. Maybe iSCSI needs similar treatment, maybe it has it already, I
haven't checked. We haven't seen a bug report for that, though.

> > I don't think we have, and it indeed should be rather easy to add. The
> > more challenging part of the problem is where (and based on which
> > data) the flag would actually be set up on the netdevice so that it's
> > not horrible layering violation.
>
> I assume that in the same manner netdevices advertise features to the
> networking core, the core can provide them operating directives after
> they register themselves.

Whatever suits you best. To sum it up:

- mlx4 is confirmed to have this problem, and we know how that problem
happens -- see the paragraph in the changelog explaining the dependency
between memory reclaim and allocation of the TX ring

- we have a workaround which requires human interaction in order
to provide the information whether GFP_NOFS should be used or not

- I can very well understand why Mellanox would see that as a hack, but if
a more comprehensive fix is necessary, I'd expect those who understand
the code the best to come up with a solution/proposal. I'd assume that
you don't want to keep code with a known and easily triggerable
deadlock out there unfixed.

- where I see the potential for a layering violation in any 'general'
solution is that it's the filesystem that has to be "talking" to the
underlying netdevice, i.e. you'll have to make the filesystem
netdevice-aware, right?

Thanks,

--
Jiri Kosina
SUSE Labs

2014-04-24 17:03:20

by Jiri Kosina

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On Tue, 11 Mar 2014, Or Gerlitz wrote:

> > mode on any IB device. In connected mode, a packet send can trigger
> > establishing a new connection, which will allocate a new QP, which in
> > particular will allocate memory for the QP in the low-level IB device
> > driver. Currently I'm positive that every driver will do GFP_KERNEL
> > allocations when allocating a QP (ehca does both a GFP_KERNEL
> > kmem_cache allocation and vmalloc in internal_create_qp(), mlx5 and
> > mthca are similar to mlx4 and qib does vmalloc() in qib_create_qp()).
> > So this patch needs to be extended to the other 4 IB device drivers in
> > the tree.
> >
> > Also, I don't think GFP_NOFS is enough -- it seems we need GFP_NOIO,
> > since we could be swapping to a block device over iSCSI over IPoIB-CM,
> > so even non-FS stuff could deadlock.
> >
> > I don't think it makes any sense to have a "do_not_deadlock" module
> > parameter, especially one that defaults to "false." If this is the
> > right thing to do, then we should just unconditionally do it.
> >
> > It does seem that only using GFP_NOIO when we really need to would be
> > a very difficult problem--how can we carry information about whether a
> > particular packet is involved in freeing memory through all the layers
> > of, say, NFS, TCP, IPSEC, bonding, &c?
>
> Agree with all the above... next,
>
> If we don't have a way to nicely overcome the layering violations here,
> let's change IPoIB so that it always asks the IB driver to allocate QPs
> used for Connected Mode in a GFP_NOIO manner. To be practical, I suggest
> the following:
>
> 1. Add new QP creation flag IB_QP_CREATE_USE_GFP to the existing creation
> flags of struct ib_qp_init_attr
> and a new "gfp_t gfp" field to that structure too
>
> 2. in the IPoIB CM code, do the vzalloc allocation for a new connection in
> a GFP_NOIO manner and issue the call to create the QP with the
> IB_QP_CREATE_USE_GFP flag set and GFP_NOIO in the gfp field
>
> 3. If the QP creation fails, with -EINVAL, issue a warning and retry the QP
> creation attempt without the GFP setting
>
> 4. implement in the mlx4 driver the support for GFP directives on QP creation
>
> 5. for the rest of the IB drivers, return -EINVAL if IB_QP_CREATE_USE_GFP is
> set
>
> This will provide a working solution for mlx4 users and allow support to
> be added gradually for the rest of the IB drivers.
>
> as for proper patch planning
>
> patch #1 / items 1 and 5
> patch #2 / item 4
> patch #3 / item 3
>
> Re item 5 -- I checked, and the ehca, ipath and mthca drivers already
> return -EINVAL if provided with any creation flag, so you only need to
> patch the qib driver in qib_create_qp() to do that as well, which is trivial.
>
> As for the rest of the code, you practically have it all by now, just need to
> port the mlx4 changes you did to the suggested framework, remove the module
> param (which you don't like either) and add the new gfp_t field to
> ib_qp_init_attr
>
> So sounds like a plan that makes sense?

Hi everybody,

seems like this fell through the cracks, probably because it wasn't clearly
stated *who* would be preparing the patch(es) implementing the ideas
above.

My understanding was that it'd be Mellanox, given that they basically own
the driver, have the best testing coverage compared to the very limited
testing I can do, and will be pushing it upstream anyway.

I believe all the above can be easily created on top of the original patch
I sent.

So ... was there a misunderstanding on who is going to do it? :)

Thanks,

--
Jiri Kosina
SUSE Labs
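
A minimal sketch of the five-item plan quoted in the message above, written as a userspace C model rather than kernel code. The flag name IB_QP_CREATE_USE_GFP and the "gfp" field come from the proposal itself; the flag's numeric value, the GFP_* values, struct qp_init_attr and every function name below are hypothetical stand-ins chosen for illustration and are not the real verbs API. It only shows the intended control flow: ask for a GFP_NOIO QP, and if the driver answers -EINVAL, warn and retry without the directive.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for kernel definitions; values are illustrative. */
typedef unsigned int gfp_t;
#define GFP_KERNEL 0x1u
#define GFP_NOIO   0x2u

/* Flag name taken from the proposal (item 1); the bit value is assumed. */
#define IB_QP_CREATE_USE_GFP (1u << 0)

/* Simplified model of the init attributes with the proposed new field. */
struct qp_init_attr {
	unsigned int create_flags;
	gfp_t gfp;
};

typedef int (*create_qp_fn)(const struct qp_init_attr *attr, gfp_t *used);

/* Item 4: a driver that honours the GFP directive (mlx4 in the plan). */
static int create_qp_gfp_aware(const struct qp_init_attr *attr, gfp_t *used)
{
	*used = (attr->create_flags & IB_QP_CREATE_USE_GFP) ? attr->gfp
							     : GFP_KERNEL;
	return 0;
}

/* Item 5: a driver without support rejects the flag with -EINVAL. */
static int create_qp_legacy(const struct qp_init_attr *attr, gfp_t *used)
{
	if (attr->create_flags & IB_QP_CREATE_USE_GFP)
		return -EINVAL;
	*used = GFP_KERNEL;
	return 0;
}

/* Items 2 and 3: request GFP_NOIO first, fall back with a warning. */
static int cm_create_tx_qp(create_qp_fn create_qp, gfp_t *used)
{
	struct qp_init_attr attr = {
		.create_flags = IB_QP_CREATE_USE_GFP,
		.gfp = GFP_NOIO,
	};
	int ret;

	assert(create_qp != NULL);

	ret = create_qp(&attr, used);
	if (ret == -EINVAL) {
		fprintf(stderr, "QP creation with GFP directive not supported, retrying without it\n");
		attr.create_flags &= ~IB_QP_CREATE_USE_GFP;
		attr.gfp = GFP_KERNEL;
		ret = create_qp(&attr, used);
	}
	return ret;
}
```

In this model, calling cm_create_tx_qp(create_qp_gfp_aware, &used) leaves used set to GFP_NOIO, while cm_create_tx_qp(create_qp_legacy, &used) prints the warning and ends up with GFP_KERNEL. That second case is the one that still carries the deadlock risk discussed in the thread, which is why the plan warns rather than falling back silently.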

2014-04-24 20:01:18

by Or Gerlitz

Subject: Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

On Thu, Apr 24, 2014 at 8:03 PM, Jiri Kosina <[email protected]> wrote:
>
> On Tue, 11 Mar 2014, Or Gerlitz wrote:


[...]
>
> > So sounds like a plan that makes sense?
>
> Hi everybody, seems like this fell through the cracks,


Hi Jiri,

I sent you a private note on Mar 19th saying "are you on track with this
for 3.15? the merge window is coming soon and you have 99% of what you
need -- let's get there", which seems to be the piece that fell between
the cracks.

>
> probably because it wasn't clearly stated *who* will be preparing patch(es)
> that'd be implementing the ideas above. My understanding was that it'd be
> Mellanox, given that they basically own the driver,



To be precise, we partially own the mlx4 HW driver (Roland is the
mlx4_core/ib author and maintainer; we have been asking for a few years
to get a co-maintainer hat there, with no success so far). The problem
you have described can happen with any HW driver, but as stated earlier,
the suggested plan will fix ipoib and mlx4, and after that more HW
drivers can be enhanced to support it too.


>
> have the best testing coverage compared to very very limited
> testing coverage I can do, and will be pushing it upstream anyway.
>
> I believe all the above can be easily created on top of the original
> patch I sent.



Indeed, so I will take a look next week and let you know if it works for me.


>
>
> So ... was there a misunderstanding on who is going to do it? :)
>