Subject: Re: [PATCH v3] nvmet,rxe: defer ip datagram sending to tasklet
To: Alexandru Moise <00moses.alexander00@gmail.com>, monis@mellanox.com, dledford@redhat.com, jgg@ziepe.ca, linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20180507205858.GA11664@gmail.com>
From: Yanjun Zhu
Organization: Oracle Corporation
Message-ID: <562873bf-075c-f148-c528-dddf188113c2@oracle.com>
Date: Tue, 8 May 2018 11:09:18 +0800
In-Reply-To: <20180507205858.GA11664@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2018/5/8 4:58, Alexandru Moise wrote:
> This addresses 3 separate problems:
>
> 1. When using NVME over Fabrics we may end up sending IP
> packets in interrupt context, we should defer this work
> to a tasklet.
>
> [   50.939957] WARNING: CPU: 3 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x1f/0xa0
> [   50.942602] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G W 4.17.0-rc3-ARCH+ #104
> [   50.945466] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
> [   50.948163] RIP: 0010:__local_bh_enable_ip+0x1f/0xa0
> [   50.949631] RSP: 0018:ffff88009c183900 EFLAGS: 00010006
> [   50.951029] RAX: 0000000080010403 RBX: 0000000000000200 RCX: 0000000000000001
> [   50.952636] RDX: 0000000000000000 RSI: 0000000000000200 RDI: ffffffff817e04ec
> [   50.954278] RBP: ffff88009c183910 R08: 0000000000000001 R09: 0000000000000614
> [   50.956000] R10: ffffea00021d5500 R11: 0000000000000001 R12: ffffffff817e04ec
> [   50.957779] R13: 0000000000000000 R14: ffff88009566f400 R15: ffff8800956c7000
> [   50.959402] FS:  0000000000000000(0000) GS:ffff88009c180000(0000) knlGS:0000000000000000
> [   50.961552] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   50.963798] CR2: 000055c4ec0ccac0 CR3: 0000000002209001 CR4: 00000000000606e0
> [   50.966121] Call Trace:
> [   50.966845]  <IRQ>
> [   50.967497]  __dev_queue_xmit+0x62d/0x690
> [   50.968722]  dev_queue_xmit+0x10/0x20
> [   50.969894]  neigh_resolve_output+0x173/0x190
> [   50.971244]  ip_finish_output2+0x2b8/0x370
> [   50.972527]  ip_finish_output+0x1d2/0x220
> [   50.973785]  ? ip_finish_output+0x1d2/0x220
> [   50.975010]  ip_output+0xd4/0x100
> [   50.975903]  ip_local_out+0x3b/0x50
> [   50.976823]  rxe_send+0x74/0x120
> [   50.977702]  rxe_requester+0xe3b/0x10b0
> [   50.978881]  ? ip_local_deliver_finish+0xd1/0xe0
> [   50.980260]  rxe_do_task+0x85/0x100
> [   50.981386]  rxe_run_task+0x2f/0x40
> [   50.982470]  rxe_post_send+0x51a/0x550
> [   50.983591]  nvmet_rdma_queue_response+0x10a/0x170
> [   50.985024]  __nvmet_req_complete+0x95/0xa0
> [   50.986287]  nvmet_req_complete+0x15/0x60
> [   50.987469]  nvmet_bio_done+0x2d/0x40
> [   50.988564]  bio_endio+0x12c/0x140
> [   50.989654]  blk_update_request+0x185/0x2a0
> [   50.990947]  blk_mq_end_request+0x1e/0x80
> [   50.991997]  nvme_complete_rq+0x1cc/0x1e0
> [   50.993171]  nvme_pci_complete_rq+0x117/0x120
> [   50.994355]  __blk_mq_complete_request+0x15e/0x180
> [   50.995988]  blk_mq_complete_request+0x6f/0xa0
> [   50.997304]  nvme_process_cq+0xe0/0x1b0
> [   50.998494]  nvme_irq+0x28/0x50
> [   50.999572]  __handle_irq_event_percpu+0xa2/0x1c0
> [   51.000986]  handle_irq_event_percpu+0x32/0x80
> [   51.002356]  handle_irq_event+0x3c/0x60
> [   51.003463]  handle_edge_irq+0x1c9/0x200
> [   51.004473]  handle_irq+0x23/0x30
> [   51.005363]  do_IRQ+0x46/0xd0
> [   51.006182]  common_interrupt+0xf/0xf
> [   51.007129]  </IRQ>
>
> 2. Work must always be offloaded to tasklet when using NVMEoF in order to
> solve lock ordering between neigh->ha_lock seqlock and the nvme queue lock:
>
> [   77.833783]  Possible interrupt unsafe locking scenario:
> [   77.833783]
> [   77.835831]        CPU0                    CPU1
> [   77.837129]        ----                    ----
> [   77.838313]   lock(&(&n->ha_lock)->seqcount);
> [   77.839550]                                local_irq_disable();
> [   77.841377]                                lock(&(&nvmeq->q_lock)->rlock);
> [   77.843222]                                lock(&(&n->ha_lock)->seqcount);
> [   77.845178]   <Interrupt>
> [   77.846298]     lock(&(&nvmeq->q_lock)->rlock);
> [   77.847986]
> [   77.847986]  *** DEADLOCK ***
>
> 3. Same goes for the lock ordering between sch->q.lock and nvme queue lock:
>
> [   47.634271]  Possible interrupt unsafe locking scenario:
> [   47.634271]
> [   47.636452]        CPU0                    CPU1
> [   47.637861]        ----                    ----
> [   47.639285]   lock(&(&sch->q.lock)->rlock);
> [   47.640654]                                local_irq_disable();
> [   47.642451]                                lock(&(&nvmeq->q_lock)->rlock);
> [   47.644521]                                lock(&(&sch->q.lock)->rlock);
> [   47.646480]   <Interrupt>
> [   47.647263]     lock(&(&nvmeq->q_lock)->rlock);
> [   47.648492]
> [   47.648492]  *** DEADLOCK ***
>
> Using NVMEoF after this patch seems to finally be stable, without it,
> rxe eventually deadlocks the whole system and causes RCU stalls.
>
> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
> ---
> v2->v3: Abandoned the idea of just offloading work to tasklet in irq
> context. NVMEoF needs the work to always be offloaded in order to
> function. I am unsure how large an impact this has on other rxe users.
>
>  drivers/infiniband/sw/rxe/rxe_verbs.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
> index 2cb52fd48cf1..564453060c54 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> @@ -761,7 +761,6 @@ static int rxe_post_send_kernel(struct rxe_qp *qp, struct ib_send_wr *wr,
>  	unsigned int mask;
>  	unsigned int length = 0;
>  	int i;
> -	int must_sched;
>
>  	while (wr) {
>  		mask = wr_opcode_mask(wr->opcode, qp);
> @@ -791,14 +790,7 @@ static int rxe_post_send_kernel(struct rxe_qp *qp, struct ib_send_wr *wr,
>  		wr = wr->next;
>  	}
>
> -	/*
> -	 * Must sched in case of GSI QP because ib_send_mad() hold irq lock,
> -	 * and the requester call ip_local_out_sk() that takes spin_lock_bh.
> -	 */
> -	must_sched = (qp_type(qp) == IB_QPT_GSI) ||
> -			(queue_count(qp->sq.queue) > 1);
> -
> -	rxe_run_task(&qp->req.task, must_sched);
> +	rxe_run_task(&qp->req.task, 1);
>  	if (unlikely(qp->req.state == QP_STATE_ERROR))
>  		rxe_run_task(&qp->comp.task, 1);
>
> @@ -822,7 +814,7 @@ static int rxe_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
>
>  	if (qp->is_user) {
>  		/* Utilize process context to do protocol processing */
> -		rxe_run_task(&qp->req.task, 0);
> +		rxe_run_task(&qp->req.task, 1);

Hi,

From your description, it seems that all the problems are caused by NVME kernel.
Am I correct?

It seems that this block handles process context. If we keep
"rxe_run_task(&qp->req.task, 0);", Do all the problems still occur?

Zhu Yanjun

>  		return 0;
>  	} else
>  		return rxe_post_send_kernel(qp, wr, bad_wr);
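
For readers following the thread, here is a rough userspace sketch of the difference between passing 0 and 1 as the second argument to rxe_run_task() when the caller is already in hard-IRQ context, as in the nvme_irq trace quoted above. It is only a model under stated assumptions: every model_* name, the pending array and the in_hardirq flag are hypothetical stand-ins invented for illustration, not the kernel's rxe or tasklet code.

```c
/* Userspace model only. None of these names exist in the kernel. */
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_PENDING 8

typedef void (*model_fn)(void *arg);

struct model_task {
	model_fn fn;
	void *arg;
};

static struct model_task *pending[MODEL_MAX_PENDING];
static int n_pending;
static int in_hardirq;        /* 1 while the simulated IRQ handler runs */
static int sends_in_hardirq;  /* sends attempted from simulated IRQ context */
static int sends_total;

/* Stand-in for the requester sending an IP datagram. */
static void model_send(void *arg)
{
	(void)arg;
	if (in_hardirq)
		sends_in_hardirq++;
	sends_total++;
}

/* sched != 0: queue the work for later; sched == 0: run it inline. */
static void model_run_task(struct model_task *t, int sched)
{
	if (sched) {
		assert(n_pending < MODEL_MAX_PENDING);
		pending[n_pending++] = t;
	} else {
		t->fn(t->arg);
	}
}

/* Stand-in for deferred processing once the hard IRQ handler has returned. */
static void model_run_pending(void)
{
	for (int i = 0; i < n_pending; i++)
		pending[i]->fn(pending[i]->arg);
	n_pending = 0;
}

/* Simulated completion interrupt that posts a send, like the quoted trace. */
static void model_irq_handler(struct model_task *t, int sched)
{
	in_hardirq = 1;
	model_run_task(t, sched);
	in_hardirq = 0;
	model_run_pending();
}

/* Returns how many sends happened while in simulated hard-IRQ context. */
int model_demo(int sched)
{
	struct model_task t = { model_send, NULL };

	n_pending = 0;
	sends_in_hardirq = 0;
	sends_total = 0;
	model_irq_handler(&t, sched);
	assert(sends_total == 1); /* the send happens exactly once either way */
	return sends_in_hardirq;
}
```

In this model, model_demo(0) reports one send issued from IRQ context and model_demo(1) reports none, since the send only runs after the handler returns. The model does not cover the process-context path (qp->is_user) that the question above is about, and it says nothing about the lockdep reports.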