Date: Wed, 1 Mar 2017 15:51:24 +0100
From: Christoph Hellwig
To: Noa Osherovich
Cc: hch@lst.de, sagi@grimberg.me, linux-rdma@vger.kernel.org, Majd Dibbiny, tj@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: Poll CQ syncing problem
Message-ID: <20170301145124.GA12121@lst.de>
References: <3ba1baab-e2ac-358d-3b3b-ff4a27405c93@mellanox.com>
In-Reply-To: <3ba1baab-e2ac-358d-3b3b-ff4a27405c93@mellanox.com>

On Wed, Mar 01, 2017 at 04:30:26PM +0200, Noa Osherovich wrote:
> Analysis:
> Since ib_comp_wq isn't single threaded, two works can run in parallel for the same CQ,
> executing __ib_process_cq.

They shouldn't.  Each CQ has a single work_struct, and any given
work_struct should only be executing once at any given time:

"Note that the flag ``WQ_NON_REENTRANT`` no longer exists as all workqueues
are now non-reentrant - any work item is guaranteed to be executed by at
most one worker system-wide at any given time."

> Since this function isn't thread safe and the wc array is shared, it causes a data corruption
> which eventually crashes in the MAD layer due to a double list_del of the same element.

This should not be the case.  What kernel version are you testing and
does it contain any patches touching core kernel code?
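
For illustration only, the non-reentrancy rule quoted above can be sketched as a toy, single-threaded userspace model. This is not kernel code: every `model_*` name is hypothetical, the handler merely stands in for a CQ poller that re-queues itself, and the model assumes the guarantee rather than proving it. It shows two behaviours: queueing an item that is already pending is rejected, and a re-queue made while the handler runs is executed afterwards, never concurrently.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: none of these names are kernel APIs. */
struct model_work {
    bool pending;     /* queued but not yet started */
    int  active;      /* executions currently in flight */
    int  max_active;  /* high-water mark of 'active' */
    int  runs;        /* completed executions */
    void (*func)(struct model_work *);
};

/* Models queueing: returns false if the item is already pending. */
static bool model_queue_work(struct model_work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    return true;
}

/*
 * Models a single worker draining the item. The pending flag is cleared
 * before the handler starts, so the handler may re-queue itself; that
 * re-queue is then run by the same loop after the handler returns.
 */
static void model_worker_drain(struct model_work *w)
{
    while (w->pending) {
        assert(w->active == 0);   /* never entered while already running */
        w->pending = false;
        w->active++;
        if (w->active > w->max_active)
            w->max_active = w->active;
        w->func(w);
        w->active--;
        w->runs++;
    }
}

/* Hypothetical handler: re-queues itself a fixed number of times. */
static int model_requeues_left;

static void model_cq_poll(struct model_work *w)
{
    if (model_requeues_left > 0) {
        model_requeues_left--;
        model_queue_work(w);      /* accepted: pending was cleared */
        model_queue_work(w);      /* rejected: already pending again */
    }
}

/* Runs the scenario and returns the peak number of concurrent executions. */
static int model_run_demo(struct model_work *w)
{
    *w = (struct model_work){ .func = model_cq_poll };
    model_requeues_left = 2;
    model_queue_work(w);          /* first queueing is accepted */
    model_queue_work(w);          /* duplicate queueing is rejected */
    model_worker_drain(w);
    return w->max_active;
}
```

Calling `model_run_demo()` on a fresh `struct model_work` executes the handler serially, once for the initial queueing and once per accepted re-queue, so the peak concurrency it reports stays at one.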