Date: Mon, 7 May 2018 22:58:58 +0200
From: Alexandru Moise <00moses.alexander00@gmail.com>
To: monis@mellanox.com, dledford@redhat.com, jgg@ziepe.ca,
	linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org,
	yanjun.zhu@oracle.com
Subject: [PATCH v3] nvmet,rxe: defer ip datagram sending to tasklet
Message-ID: <20180507205858.GA11664@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

This addresses 3 separate problems:

1. When using NVMe over Fabrics we may end up sending IP packets in
interrupt context; this work should be deferred to a tasklet.

[   50.939957] WARNING: CPU: 3 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x1f/0xa0
[   50.942602] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G W 4.17.0-rc3-ARCH+ #104
[   50.945466] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
[   50.948163] RIP: 0010:__local_bh_enable_ip+0x1f/0xa0
[   50.949631] RSP: 0018:ffff88009c183900 EFLAGS: 00010006
[   50.951029] RAX: 0000000080010403 RBX: 0000000000000200 RCX: 0000000000000001
[   50.952636] RDX: 0000000000000000 RSI: 0000000000000200 RDI: ffffffff817e04ec
[   50.954278] RBP: ffff88009c183910 R08: 0000000000000001 R09: 0000000000000614
[   50.956000] R10: ffffea00021d5500 R11: 0000000000000001 R12: ffffffff817e04ec
[   50.957779] R13: 0000000000000000 R14: ffff88009566f400 R15: ffff8800956c7000
[   50.959402] FS:  0000000000000000(0000) GS:ffff88009c180000(0000) knlGS:0000000000000000
[   50.961552] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.963798] CR2: 000055c4ec0ccac0 CR3: 0000000002209001 CR4: 00000000000606e0
[   50.966121] Call Trace:
[   50.966845]  <IRQ>
[   50.967497]  __dev_queue_xmit+0x62d/0x690
[   50.968722]  dev_queue_xmit+0x10/0x20
[   50.969894]  neigh_resolve_output+0x173/0x190
[   50.971244]  ip_finish_output2+0x2b8/0x370
[   50.972527]  ip_finish_output+0x1d2/0x220
[   50.973785]  ? ip_finish_output+0x1d2/0x220
[   50.975010]  ip_output+0xd4/0x100
[   50.975903]  ip_local_out+0x3b/0x50
[   50.976823]  rxe_send+0x74/0x120
[   50.977702]  rxe_requester+0xe3b/0x10b0
[   50.978881]  ? ip_local_deliver_finish+0xd1/0xe0
[   50.980260]  rxe_do_task+0x85/0x100
[   50.981386]  rxe_run_task+0x2f/0x40
[   50.982470]  rxe_post_send+0x51a/0x550
[   50.983591]  nvmet_rdma_queue_response+0x10a/0x170
[   50.985024]  __nvmet_req_complete+0x95/0xa0
[   50.986287]  nvmet_req_complete+0x15/0x60
[   50.987469]  nvmet_bio_done+0x2d/0x40
[   50.988564]  bio_endio+0x12c/0x140
[   50.989654]  blk_update_request+0x185/0x2a0
[   50.990947]  blk_mq_end_request+0x1e/0x80
[   50.991997]  nvme_complete_rq+0x1cc/0x1e0
[   50.993171]  nvme_pci_complete_rq+0x117/0x120
[   50.994355]  __blk_mq_complete_request+0x15e/0x180
[   50.995988]  blk_mq_complete_request+0x6f/0xa0
[   50.997304]  nvme_process_cq+0xe0/0x1b0
[   50.998494]  nvme_irq+0x28/0x50
[   50.999572]  __handle_irq_event_percpu+0xa2/0x1c0
[   51.000986]  handle_irq_event_percpu+0x32/0x80
[   51.002356]  handle_irq_event+0x3c/0x60
[   51.003463]  handle_edge_irq+0x1c9/0x200
[   51.004473]  handle_irq+0x23/0x30
[   51.005363]  do_IRQ+0x46/0xd0
[   51.006182]  common_interrupt+0xf/0xf
[   51.007129]  </IRQ>

2. Work must always be offloaded to a tasklet when using NVMEoF, in
order to resolve the lock ordering problem between the neigh->ha_lock
seqlock and the nvme queue lock:

[   77.833783]  Possible interrupt unsafe locking scenario:
[   77.833783]
[   77.835831]        CPU0                    CPU1
[   77.837129]        ----                    ----
[   77.838313]   lock(&(&n->ha_lock)->seqcount);
[   77.839550]                                local_irq_disable();
[   77.841377]                                lock(&(&nvmeq->q_lock)->rlock);
[   77.843222]                                lock(&(&n->ha_lock)->seqcount);
[   77.845178]   <Interrupt>
[   77.846298]     lock(&(&nvmeq->q_lock)->rlock);
[   77.847986]
[   77.847986]  *** DEADLOCK ***

3. The same goes for the lock ordering between sch->q.lock and the
nvme queue lock:

[   47.634271]  Possible interrupt unsafe locking scenario:
[   47.634271]
[   47.636452]        CPU0                    CPU1
[   47.637861]        ----                    ----
[   47.639285]   lock(&(&sch->q.lock)->rlock);
[   47.640654]                                local_irq_disable();
[   47.642451]                                lock(&(&nvmeq->q_lock)->rlock);
[   47.644521]                                lock(&(&sch->q.lock)->rlock);
[   47.646480]   <Interrupt>
[   47.647263]     lock(&(&nvmeq->q_lock)->rlock);
[   47.648492]
[   47.648492]  *** DEADLOCK ***

Using NVMEoF after this patch seems to finally be stable; without it,
rxe eventually deadlocks the whole system and causes RCU stalls.

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
---
v2->v3: Abandoned the idea of just offloading work to a tasklet in irq
context. NVMEoF needs the work to always be offloaded in order to
function. I am unsure how large an impact this has on other rxe users.

 drivers/infiniband/sw/rxe/rxe_verbs.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 2cb52fd48cf1..564453060c54 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -761,7 +761,6 @@ static int rxe_post_send_kernel(struct rxe_qp *qp, struct ib_send_wr *wr,
 	unsigned int mask;
 	unsigned int length = 0;
 	int i;
-	int must_sched;
 
 	while (wr) {
 		mask = wr_opcode_mask(wr->opcode, qp);
@@ -791,14 +790,7 @@ static int rxe_post_send_kernel(struct rxe_qp *qp, struct ib_send_wr *wr,
 		wr = wr->next;
 	}
 
-	/*
-	 * Must sched in case of GSI QP because ib_send_mad() hold irq lock,
-	 * and the requester call ip_local_out_sk() that takes spin_lock_bh.
-	 */
-	must_sched = (qp_type(qp) == IB_QPT_GSI) ||
-			(queue_count(qp->sq.queue) > 1);
-
-	rxe_run_task(&qp->req.task, must_sched);
+	rxe_run_task(&qp->req.task, 1);
 	if (unlikely(qp->req.state == QP_STATE_ERROR))
 		rxe_run_task(&qp->comp.task, 1);
 
@@ -822,7 +814,7 @@ static int rxe_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 
 	if (qp->is_user) {
 		/* Utilize process context to do protocol processing */
-		rxe_run_task(&qp->req.task, 0);
+		rxe_run_task(&qp->req.task, 1);
 		return 0;
 	} else
 		return rxe_post_send_kernel(qp, wr, bad_wr);
-- 
2.17.0