Received: by 2002:a05:6a10:6d10:0:0:0:0 with SMTP id gq16csp967127pxb; Fri, 22 Apr 2022 15:32:07 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyvqKoSzxmOduVjAJw5DxODCsHlUWzzooQbylS5WVFRnZkZhleGhN6nl+4FDouDcRXaYMhA X-Received: by 2002:a65:418b:0:b0:382:250b:4dda with SMTP id a11-20020a65418b000000b00382250b4ddamr5799134pgq.428.1650666726897; Fri, 22 Apr 2022 15:32:06 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1650666726; cv=none; d=google.com; s=arc-20160816; b=TDnevmCEtvVLJ5+6CKSsjB5AMdbLHXSirqG372kZv4NuecbWHASuCRX9alMKMpE6Dc SFvDLiyRTWabhYoEXZ1Q7wqUTyrqzoHK4qmA9DsCSttqOKEafBKUzIMgyC03kni3LdLh Dk2irp/95ShnBGV8pALwE5ZlzLR7HGKjEk06ASw0kIrFng6HamkTMkMDMPmhHzQ+5XT+ Udpr/p96RDgNjDmym5XbrWf4uh4YN3wcFa1tC6tuPXVrGyIjfXSiSTRC6agdDWo0YCJo yzIYqufE1R7B1tuZCQ65IPzwB+kzG2fNUocvhuLnU4/KXAO2JXSAGfm0WbgLun8mX1yR lKIw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:content-language:subject:user-agent:mime-version :date:message-id:dkim-signature; bh=v8LWDDcHCeryoL0yiBGiT7o390KlBFvaYjqfzBcj75E=; b=aneimVDVPa5NIxo8Jdao9xX2+WvA0brfkko4PGL/BEqBS226hjTJfF7QPT7T0O6sbe Hcd9WOSEFhnFqVzGczBkWhN4NrLRsD9MNbq02cgF0Tw35xqA4aU9hEwnXSYOhXBsTTqy BsVxZvDkFqUv6QIhVB0GqlVC877D3Ia+q2AGJoBqEo5QGWfga8LVABL5DJrfNzg2TzU9 Fpt/GU7h+w/8FhBZ5v+EnxJ0gWGfVVj/9wez0rbVH62zSeWaEfyI74uCz7+VbA+fhhdV Dlh3/3Fw80gAeXbUFHMbIZM1FsmmED1Q7PyakomHUzWVqqnZcXskRm0sBg7TvS2GV1u/ vFxQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass (test mode) header.i=@axis.com header.s=axis-central1 header.b=i1MsQnct; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=axis.com Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [23.128.96.19]) by mx.google.com with ESMTPS id f21-20020a056a0022d500b00505911788d7si10174219pfj.326.2022.04.22.15.32.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Apr 2022 15:32:06 -0700 (PDT) Received-SPF: softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) client-ip=23.128.96.19; Authentication-Results: mx.google.com; dkim=pass (test mode) header.i=@axis.com header.s=axis-central1 header.b=i1MsQnct; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=axis.com Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 19500414982; Fri, 22 Apr 2022 13:45:24 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1376731AbiDTIbI (ORCPT + 99 others); Wed, 20 Apr 2022 04:31:08 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51082 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1356317AbiDTIa7 (ORCPT ); Wed, 20 Apr 2022 04:30:59 -0400 Received: from smtp2.axis.com (smtp2.axis.com [195.60.68.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6164935267 for ; Wed, 20 Apr 2022 01:28:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=axis.com; q=dns/txt; s=axis-central1; t=1650443293; x=1681979293; h=message-id:date:mime-version:subject:to:cc:references: from:in-reply-to:content-transfer-encoding; bh=v8LWDDcHCeryoL0yiBGiT7o390KlBFvaYjqfzBcj75E=; b=i1MsQnctOEFDVMfTUkQsN+avotk5hkUEhiwjBNPy6CVVlWsvZlBG9+q5 KxQK1TtYtf3cdFdqRtdxjAOw8yMyaVAx3YFVDbK6DjgZsqXa1uGf6RhvQ 10hZuh2P0bB22qug7k2IwhErSmP/APnIKSLPZmTtw40sAjMmpT+j/Gq75 AcoOhnk2qmGU+GBJHlEkRf2BvTbZn1Ue+0aSA1B8xPM05wZ7nir/EAVzO 2mxoudG83uyl1/J6UF2GAIS4DLqO9ZIlr6pEBWRjh5NKPkNCdzPVwcWdK Ll64TIMDsu13ga05SVkAHTgQ+4cL6jD6kwltNLCT60xtR/BJ6WdiOQhw0 Q==; Message-ID: <487bbd00-d763-0a86-9984-1dfd957fbb38@axis.com> Date: Wed, 20 Apr 2022 10:28:00 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.6.2 Subject: Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock Content-Language: en-US To: Jassi Brar CC: kernel , Linux Kernel Mailing List References: <20220331070115.29421-1-bjorn.ardo@axis.com> From: Bjorn Ardo In-Reply-To: Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 8bit X-Originating-IP: [10.0.5.60] X-ClientProxiedBy: se-mail02w.axis.com (10.20.40.8) To se-mail04w.axis.com (10.20.40.10) X-Spam-Status: No, score=-3.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,NICE_REPLY_A,RDNS_NONE,SPF_HELO_NONE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, On 4/19/22 16:10, Jassi Brar wrote: > I don't have access to your client driver, but if it submits another > message from rx_callback() that is the problem. > > Please have a look at how other platforms do, for example > imx_dsp_rproc_rx_tx_callback() > > /** > * mbox_chan_received_data - A way for controller driver to push data > * received from remote to the upper layer. > * @chan: Pointer to the mailbox channel on which RX happened. > * @mssg: Client specific message typecasted as void * > * > * After startup and before shutdown any data received on the chan > * is passed on to the API via atomic mbox_chan_received_data(). > * The controller should ACK the RX only after this call returns. > */ > void mbox_chan_received_data(struct mbox_chan *chan, void *mssg) > > If not this, please share your client code as well. > > thanks. In rx_callback() we call tasklet_hi_schedule() to schedule a tasklet and this tasklet calls mbox_send_message(). The mailbox is setup with .tx_block = false. I do not quite understand how calling mbox_send_message() from another contex (like a work thread as in imx_dsp_rproc_rx_tx_callback()) will resolve the race, could you explain this? Or do you mean that it should not be called from any interrupt context at all? Looking at the documentation of mbox_send_message() it states:  * This function could be called from atomic context as it simply  * queues the data and returns a token against the request. If I look at the code in msg_submit() the part that calls hrtimer_active() and hrtimer_start() is completely without a lock. Also the code in txdone_hrtimer() that calls hrtimer_forward_now() is without a lock. So I cannot see anything preventing these two functions to be called concurrently on two different processors (as they do in my trace). And looking at the hr_timer code then hrtimer_active() will return true if the hrtimer is currently executing txdone_hrtimer(). The way I see the race is if the timer has called txdone_hrtimer() and it has passed the part with the for-loop looking at the individual channels when the msg_submit() runs. Then txdone_hrtimer() has decided to not restart the timer, but hrtimer_active() will still return true for a while (until txdone_hrtimer() return completely and the hrtimer code can change its status). In this time there is nothing that prevents msg_submit() from running, adding a new message that the timer should monitor. But if it manages to call hrtimer_active() in the time period before the hrtimer code updates it then the current code will never start the timer. Also the poll_hrt is shared between all channels in the mailbox, so it does not have to be the same mailbox channel that hrtimer is currently waiting for that is calling msg_submit(). So not calling mbox_send_message() from rx_callback() will only alter timing for that channel. There could still be a race between two different mailbox channels. This is my understanding of the current code, please tell me if I have missed or misunderstood any part of it? Our current solution are using 4 different mailbox channels asynchronously. The code is part of a larger system, but I can put down some time and try and extract the relevant parts if you still think this is a client issue? But with my current understanding of the code, the race between msg_submit() and txdone_hrtimer() is quite clear, and with my proposed patch that removes this race we have be able to run for very long time without any problems (that is several days). Without the fix we get the race after 5-10 min. Best Regards, Björn