Date: Thu, 1 Dec 2022 13:37:52 +0100
From: Jan Kara
To: Pierre Gondois
Cc: Sebastian Andrzej Siewior, Will Deacon, Jan Kara, Waiman Long, LKML,
 Thomas Gleixner, Steven Rostedt, Mel Gorman, Peter Zijlstra, Ingo Molnar,
 Catalin Marinas
Subject: Re: Crash with PREEMPT_RT on aarch64 machine
Message-ID: <20221201123752.po5z2qpmitafuzhn@quack3>
References: <20221107135636.biouna36osqc4rik@quack3>
 <359cc93a-fce0-5af2-0fd5-81999fad186b@redhat.com>
 <20221109125756.GA24388@willie-the-truck>
 <20221109154023.cx2d4y3e7zqnuo35@quack3>
 <20221111142742.rh677sdwu55aeeno@quack3>
 <20221114124147.GA30263@willie-the-truck>
 <207c79df-79e8-e6c9-d042-b69dea87a355@arm.com>
In-Reply-To: <207c79df-79e8-e6c9-d042-b69dea87a355@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 30-11-22 18:20:27, Pierre Gondois wrote:
>
> On 11/28/22 16:58, Sebastian Andrzej Siewior wrote:
> > How about this?
> >
> > - The fast path is easy…
> >
> > - The slow path first sets the WAITER bits (mark_rt_mutex_waiters()) so
> >   I made that one _acquire so that it is visible by the unlocker forcing
> >   everyone into slow path.
> >
> > - If the lock is acquired, then the owner is written via
> >   rt_mutex_set_owner(). This happens under wait_lock so it is
> >   serialized and so a WRITE_ONCE() is used to write the final owner. I
> >   replaced it with a cmpxchg_acquire() to have the owner there.
> >   Not sure if I shouldn't make this as you put it:
> >   | e.g. by making use of dependency ordering where it already exists.
> >   The other (locking) CPU needs to see the owner not only the WAITER
> >   bit. I'm not sure if this could be replaced with smp_store_release().
> >
> > - After the whole operation completes, fixup_rt_mutex_waiters() cleans
> >   the WAITER bit and I kept the _acquire semantic here. Now looking at
> >   it again, I don't think that needs to be done since that shouldn't be
> >   set here.
> >
> > - There is rtmutex_spin_on_owner() which (as the name implies) spins on
> >   the owner as long as it active. It obtains it via READ_ONCE() and I'm
> >   not sure if any memory barrier is needed. Worst case is that it will
> >   spin while owner isn't set if it observers a stale value.
> >
> > - The unlock path first clears the waiter bit if there are no waiters
> >   recorded (via simple assignments under the wait_lock (every locker
> >   will fail with the cmpxchg_acquire() and go for the wait_lock)) and
> >   then finally drop it via rt_mutex_cmpxchg_release(,, NULL).
> >   Should there be a wait, it will just store the WAITER bit with
> >   smp_store_release() (setting the owner is NULL but the WAITER bit
> >   forces everyone into the slow path).
> >
> > - Added rt_mutex_set_owner_pi() which does simple assignment. This is
> >   used from the futex code and here everything happens under a lock.
> >
> > - I added a smp_load_acquire() to rt_mutex_base_is_locked() since I
> >   *think* want to observe a real waiter and not something stale.
> >
> > Signed-off-by: Sebastian Andrzej Siewior
> >
>
> Hello,
> Just to share some debug attempts, I tried Sebastian's patch and could not
> reproduce the error. While trying to understand the solution, I could not
> reproduce the error if I only took the changes made to
> mark_rt_mutex_waiters(), or to rt_mutex_set_owner_pi(). I am not sure I
> understand why this would be a rt-mutex issue.
>
> Without Sebastian's patch, to try adding some synchronization around the
> 'i_wb_list', I did the following:
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 443f83382b9b..42ce1f7f8aef 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1271,10 +1271,10 @@ void sb_clear_inode_writeback(struct inode *inode)
>         struct super_block *sb = inode->i_sb;
>         unsigned long flags;
> -       if (!list_empty(&inode->i_wb_list)) {
> +       if (!list_empty_careful(&inode->i_wb_list)) {

In my debug attempts I've actually completely removed this unlocked check
and the corruption still triggered.
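
To make the locked-only variant concrete, here is a small userspace sketch.
It is only an illustration under stated assumptions, not the actual kernel
change: every demo_* name is hypothetical, a C11 atomic_flag spinlock stands
in for s_inode_wblist_lock, and a hand-rolled list stands in for list_head.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Circular doubly linked list node, loosely modelled on struct list_head. */
struct demo_node {
	struct demo_node *next, *prev;
};

static void demo_node_init(struct demo_node *n)
{
	n->next = n;
	n->prev = n;
}

/* True if the node points at itself, i.e. it is on no list (or is an empty head). */
static bool demo_node_unlinked(const struct demo_node *n)
{
	return n->next == n;
}

static void demo_node_add(struct demo_node *n, struct demo_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void demo_node_del_init(struct demo_node *n)
{
	/* Consistency check: the neighbours must still point back at us. */
	assert(n->next->prev == n && n->prev->next == n);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	demo_node_init(n);
}

/* Hypothetical stand-ins for the super block and the inode. */
struct demo_sb {
	atomic_flag wblist_lock;	/* role of s_inode_wblist_lock */
	struct demo_node wblist;
};

struct demo_inode {
	struct demo_node wb_entry;	/* role of i_wb_list */
};

static void demo_sb_init(struct demo_sb *sb)
{
	atomic_flag_clear(&sb->wblist_lock);
	demo_node_init(&sb->wblist);
}

static void demo_inode_init(struct demo_inode *inode)
{
	demo_node_init(&inode->wb_entry);
}

/* Taking the lock is an acquire operation ... */
static void demo_lock(struct demo_sb *sb)
{
	while (atomic_flag_test_and_set_explicit(&sb->wblist_lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

/* ... and dropping it is a release operation. */
static void demo_unlock(struct demo_sb *sb)
{
	atomic_flag_clear_explicit(&sb->wblist_lock, memory_order_release);
}

static void demo_mark_writeback(struct demo_sb *sb, struct demo_inode *inode)
{
	demo_lock(sb);
	if (demo_node_unlinked(&inode->wb_entry))
		demo_node_add(&inode->wb_entry, &sb->wblist);
	demo_unlock(sb);
}

/* No unlocked pre-check: the list is only ever inspected under the lock. */
static bool demo_clear_writeback(struct demo_sb *sb, struct demo_inode *inode)
{
	bool removed = false;

	demo_lock(sb);
	if (!demo_node_unlinked(&inode->wb_entry)) {
		demo_node_del_init(&inode->wb_entry);
		removed = true;
	}
	demo_unlock(sb);
	return removed;
}
```

With this shape, every read and write of the list happens between an
acquire and a release on the same lock word, so the list can only stay
consistent if the lock itself provides that ordering.
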
>                 spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
> -               if (!list_empty(&inode->i_wb_list)) {
> -                       list_del_init(&inode->i_wb_list);
> +               if (!list_empty_careful(&inode->i_wb_list)) {
> +                       list_del_init_careful(&inode->i_wb_list);

This shouldn't be needed, at least once unlocked checks are removed. Also
even if they stay, the list should never get corrupted because all the
modifications are protected by the spinlock. This is why we eventually
pointed to the rt_mutex as the problem. It may be possible that your change
adds enough memory ordering so that the missing ordering in rt_mutex does
not matter anymore.

> diff --git a/fs/inode.c b/fs/inode.c
> index b608528efd3a..fbe6b4fe5831 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -621,7 +621,7 @@ void clear_inode(struct inode *inode)
>         BUG_ON(!list_empty(&inode->i_data.private_list));
>         BUG_ON(!(inode->i_state & I_FREEING));
>         BUG_ON(inode->i_state & I_CLEAR);
> -       BUG_ON(!list_empty(&inode->i_wb_list));
> +       BUG_ON(!list_empty_careful(&inode->i_wb_list));
>         /* don't need i_lock here, no concurrent mods to i_state */
>         inode->i_state = I_FREEING | I_CLEAR;
>  }
>
> I never stepped on the:
>   BUG_ON(!list_empty_careful(&inode->i_wb_list))
> statement again, but got the dump at [2]. I also regularly end-up
> with the following endless logs when trying other things, when rebooting:
>
> EXT4-fs (nvme0n1p3): sb orphan head is 2840597
> sb_info orphan list:
>   inode nvme0n1p3:3958579 at 00000000b5934dff: mode 100664, nlink 1, next 0
>   inode nvme0n1p3:3958579 at 00000000b5934dff: mode 100664, nlink 1, next 0
>   ...

Looks like another list corruption - in this case ext4 list of inodes that
are undergoing truncate.

> Also, Jan said that the issue was reproducible on 'two different aarch64
> machines', cf [1]. Would it be possible to know which one ?

Both had the same Ampere CPU. Full cpuinfo is here:

https://lore.kernel.org/all/20221107135636.biouna36osqc4rik@quack3/

								Honza
-- 
Jan Kara
SUSE Labs, CR