Date: Wed, 25 Mar 2020 12:07:39 -0700
From: "Paul E. McKenney"
To: Cong Wang
Cc: Thomas Gleixner, syzbot, David Miller, Jamal Hadi Salim, Jiri Pirko,
    Jakub Kicinski, LKML, Linux Kernel Network Developers, syzkaller-bugs
Subject: Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
Message-ID: <20200325190739.GA19865@paulmck-ThinkPad-P72>
Reply-To: paulmck@kernel.org
References: <000000000000742e9e05a10170bc@google.com>
 <87a74arown.fsf@nanos.tec.linutronix.de>
 <87ftdypyec.fsf@nanos.tec.linutronix.de>
 <875zeuftwm.fsf@nanos.tec.linutronix.de>
 <20200324020504.GR3199@paulmck-ThinkPad-P72>
User-Agent: Mutt/1.9.4 (2018-02-28)
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 25, 2020 at 11:53:51AM -0700, Cong Wang wrote:
> On Mon, Mar 23, 2020 at 7:05 PM Paul E. McKenney wrote:
> >
> > On Tue, Mar 24, 2020 at 02:01:13AM +0100, Thomas Gleixner wrote:
> > > Cong Wang writes:
> > > > On Mon, Mar 23, 2020 at 2:14 PM Thomas Gleixner wrote:
> > > >> > We use an ordered workqueue for tc filters, so these two
> > > >> > works are executed in the same order as they are queued.
> > > >>
> > > >> The workqueue is ordered, but look how the work is queued on the
> > > >> work queue:
> > > >>
> > > >>    tcf_queue_work()
> > > >>      queue_rcu_work()
> > > >>        call_rcu(&rwork->rcu, rcu_work_rcufn);
> > > >>
> > > >> So after the grace period elapses, rcu_work_rcufn() queues it in
> > > >> the actual work queue.
> > > >>
> > > >> Now tcindex_destroy() is invoked via tcf_proto_destroy(), which
> > > >> can be invoked from preemptible context. Now assume the following:
> > > >>
> > > >>    CPU0
> > > >>      tcf_queue_work()
> > > >>        tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
> > > >>
> > > >>    -> Migration
> > > >>
> > > >>    CPU1
> > > >>      tcf_queue_work(&p->rwork, tcindex_destroy_work);
> > > >>
> > > >> So your RCU callbacks can be placed on different CPUs, which
> > > >> obviously has no ordering guarantee at all. See also:
> > > >
> > > > Good catch!
> > > >
> > > > I thought about this when I added this ordered workqueue, but it
> > > > seems I misinterpreted max_active, so even though we have
> > > > max_active==1, more than one work item could still be queued from
> > > > different CPUs here.
> > >
> > > The workqueue is not the problem. It works perfectly fine. The way
> > > the work gets queued is the issue.
> > >
> > > > I don't know how to fix this properly; I think essentially RCU
> > > > work should be guaranteed the same ordering as regular work. But
> > > > this seems impossible unless RCU offers some API to achieve that.
> > >
> > > I don't think that's possible w/o putting constraints on the
> > > flexibility of RCU (Paul of course might disagree).
> >
> > It is possible, but it does not come for free.
> >
> > From an RCU/workqueues perspective, if I understand the scenario,
> > you can do the following:
> >
> >     tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
> >
> >     rcu_barrier();    // Wait for the RCU callback.
> >     flush_work(...);  // Wait for the workqueue handler.
> >                       // But maybe for quite a few of them...
> >
> >     // All the earlier handlers have completed.
> >     tcf_queue_work(&p->rwork, tcindex_destroy_work);
> >
> > This of course introduces overhead and latency. Maybe that is not a
> > problem at teardown time, or maybe the final tcf_queue_work() can
> > itself be dumped into a workqueue in order to get it off of the
> > critical path.
>
> I personally agree, but nowadays NIC vendors care about tc filter
> slow path performance as well. :-/

I bet that they do! ;-)

But have you actually tried the rcu_barrier()/flush_work() approach?
It is reasonably simple to implement, even if you do have to use
multiple flush_work() calls, and who knows?  Maybe it is fast enough.
So why not try it?

Hmmm...  Another approach would be to associate a counter with this
group of requests, incrementing this count in tcf_queue_work() and
capturing the prior value of the count, then in tcindex_destroy_work()
waiting for the count to reach the captured value.  This would avoid
needless waiting in tcindex_destroy_rexts_work().  Such waiting would
be hard to avoid within the confines of either RCU or workqueues, and
it might well slow down this slowpath, which again might make the NIC
vendors unhappy.
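In code, that counter scheme might look roughly like the sketch below.
All names are invented for illustration (they are not the actual
tcindex fields), and the wait_queue_head_t would need an
init_waitqueue_head() at setup time:

#include <linux/atomic.h>
#include <linux/wait.h>

struct tcindex_data_sketch {            /* hypothetical */
        atomic_t work_queued;           /* per-result works queued so far */
        atomic_t work_done;             /* handlers completed so far */
        int destroy_snap;               /* work_queued seen at destroy time */
        wait_queue_head_t destroy_wait;
};

/* Each time a per-result work is queued: */
static void sketch_queue_rexts(struct tcindex_data_sketch *td)
{
        atomic_inc(&td->work_queued);
        /* tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work); */
}

/* At the end of each tcindex_destroy_rexts_work() handler: */
static void sketch_rexts_done(struct tcindex_data_sketch *td)
{
        atomic_inc(&td->work_done);
        wake_up(&td->destroy_wait);
}

/* When queueing the final destroy work, capture the prior count... */
static void sketch_queue_destroy(struct tcindex_data_sketch *td)
{
        td->destroy_snap = atomic_read(&td->work_queued);
        /* tcf_queue_work(&p->rwork, tcindex_destroy_work); */
}

/* ...and in tcindex_destroy_work(), wait for handlers to catch up: */
static void sketch_destroy_wait(struct tcindex_data_sketch *td)
{
        wait_event(td->destroy_wait,
                   atomic_read(&td->work_done) >= td->destroy_snap);
}

The wait_event() is fine because work handlers run in process context,
and only the final destroy work ever waits, so the per-result fast path
stays a pair of atomic increments.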
> > However, depending on your constraints ...
> >
> > > I assume that the filters which hang off tcindex_data::perfect and
> > > tcindex_data::p must be freed before tcindex_data, right?
> > >
> > > Refcounting of tcindex_data should do the trick. I.e., any element
> > > which you add to a tcindex_data instance takes a refcount, and when
> > > that element is destroyed, the rcu/work callback drops a reference,
> > > which once it reaches 0 triggers tcindex_data to be freed.
> >
> > ... reference counts might work much better for you.
>
> I need to think about how much work is needed for refcounting, given
> other filters have the same assumption.

Perhaps you can create some common code to handle this situation, which
can then be shared among those various filters.
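For whatever it is worth, the refcounting scheme Thomas describes might
look roughly like this (again, invented names, just to make the
lifetime rule concrete; the count would start at 1 via refcount_set()
when the tcindex_data is allocated):

#include <linux/refcount.h>
#include <linux/slab.h>

struct tcindex_data_sketch {            /* hypothetical */
        refcount_t refcnt;              /* one reference per live element,
                                         * plus one for tcindex itself */
        /* ... perfect/imperfect hash state ... */
};

/* Taken whenever an element is hung off this tcindex_data: */
static void sketch_data_get(struct tcindex_data_sketch *td)
{
        refcount_inc(&td->refcnt);
}

/* Dropped at the end of every rcu/work callback; the last one out
 * frees the tcindex_data, so callback ordering no longer matters: */
static void sketch_data_put(struct tcindex_data_sketch *td)
{
        if (refcount_dec_and_test(&td->refcnt))
                kfree(td);
}

Both tcindex_destroy_rexts_work() and tcindex_destroy_work() would end
with sketch_data_put(), so tcindex_data outlives every element no
matter which CPU ran which callback first.

							Thanx, Paul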