Received: by 2002:a05:6a10:413:0:0:0:0 with SMTP id 19csp2626943pxp; Mon, 14 Mar 2022 00:58:03 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyXqcMRB2oIRNN9nh0l9MyIuebeBRNZOglH0mEmPkfb0RaKUB5nFsbJwJ5QQjyXav2ke99Z X-Received: by 2002:a17:902:c745:b0:151:e8fa:629b with SMTP id q5-20020a170902c74500b00151e8fa629bmr22814203plq.90.1647244683335; Mon, 14 Mar 2022 00:58:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1647244683; cv=none; d=google.com; s=arc-20160816; b=axXWZpz/MSVSJ2GRRIAWI7dZ2gqS71xKSPtvKKLUIwFZAhaEG74a7rJYuSHGt+yKwD wg2Fsp+t5z4rvSAfW+mFfjw+0Y3ZAsThapY1ykqOvsdFR5mb5Mu3N5FUn4B0bzdi9fwm eGrRqU2eCuXg7AqhP41hxyJmNp3idkITVKv1boJMTDGd2rydtrMyAlpbgDrD+6Z+mQDz JhtB3NaQsJHzhl8dWddxrG7gqYQYpzslRzONXNXi1lGx/TKN6K2blf5akcSxdKVh2906 6RaUrsUs1PcUOvDnVShv1Luz9Och2sjdDSyUYiI7e9FZ4jgTRU4bJo05wLn1oMH+N9cO t64A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:message-id:date:references:in-reply-to:subject :cc:to:from:mime-version:content-transfer-encoding:dkim-signature :dkim-signature; bh=dla9KEMSVko15AJLZY8yIKkEagzAWjlWUm7tze1A3Ro=; b=NoVi4mFdCFGLKpZyrEUnQv2YrW8qkbwb3kRcuP/I3iU3PJCBi1thrTULuOGTzni2Qe LF+vyRK2nh+Qm04AmNddcvDJ6WRldQORoIhXRt4e+HbVJOLOn1WbmsEw3q1GiNKuzDCg X2h7LaQgDLSxRKDyMxFWE1fPaVl/VUZ9yhkdqJ0qQO8hFDxyGc5xCq0a+TdghW7nro40 QLTEGC/zEknalseU7iW52SlBBWDly7KRMXyJXNABOEKK7uR2nvjjuLqEsIo9Os92UJCq jSpX+N/ctB8TkKqOv1mUUEY1LCkxkuyzuRvgjWSONTnskBglvlQ34HsXmz4gmY56649u /Uvg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b=jVz9eF8u; dkim=neutral (no key) header.i=@suse.de header.s=susede2_ed25519 header.b=nlz9zvaR; spf=pass (google.com: domain of linux-nfs-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-nfs-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=suse.de Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id f10-20020a62380a000000b004f7d1cc9e66si1991098pfa.196.2022.03.14.00.57.49; Mon, 14 Mar 2022 00:58:03 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-nfs-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b=jVz9eF8u; dkim=neutral (no key) header.i=@suse.de header.s=susede2_ed25519 header.b=nlz9zvaR; spf=pass (google.com: domain of linux-nfs-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-nfs-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=suse.de Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235671AbiCMXi4 (ORCPT + 99 others); Sun, 13 Mar 2022 19:38:56 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42062 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234564AbiCMXi4 (ORCPT ); Sun, 13 Mar 2022 19:38:56 -0400 Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0845F13EB9 for ; Sun, 13 Mar 2022 16:37:45 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id 999F121102; Sun, 13 Mar 2022 23:37:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1647214664; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=dla9KEMSVko15AJLZY8yIKkEagzAWjlWUm7tze1A3Ro=; b=jVz9eF8uqZVZO4/SmYibTn9/Sc0lqGwYBLRhzBTXuNCerusiklXxUKR+9Pg0ZgE2WO8RZa 9rph5fN83qxZXHfHU21rsEHCAlh/cWUyVP95zR3xrG0wlNrAXrNMSH0WYpWzZ8u1hb/7xn wmgI3HTcFInDaQWwP0h55LNTOejlzCI= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1647214664; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=dla9KEMSVko15AJLZY8yIKkEagzAWjlWUm7tze1A3Ro=; b=nlz9zvaR2RP6Lx6IT3anP3qzcg3HLgWznqvvHEwqU2qxzOZTFAPJs59Yws2IqGkQs72E2l +ePVOQQ/rlCdQ9Aw== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 7990C13B18; Sun, 13 Mar 2022 23:37:43 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id pqmzDUeALmKJTQAAMHmgww (envelope-from ); Sun, 13 Mar 2022 23:37:43 +0000 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 From: "NeilBrown" To: "Trond Myklebust" Cc: "anna@kernel.org" , "linux-nfs@vger.kernel.org" Subject: Re: [PATCH] SUNRPC: avoid race between mod_timer() and del_timer_sync() In-reply-to: References: <164670733789.31932.14711754930977072270@noble.neil.brown.name>, <8af5700311c3ca5ea4931072f0717bb133c79dfb.camel@hammerspace.com>, <164720902908.7120.4478916458622123003@noble.neil.brown.name>, Date: Mon, 14 Mar 2022 10:37:40 +1100 Message-id: <164721466019.11933.16099062109185493938@noble.neil.brown.name> X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org On Mon, 14 Mar 2022, Trond Myklebust wrote: > On Mon, 2022-03-14 at 09:03 +1100, NeilBrown wrote: > > On Sun, 13 Mar 2022, Trond Myklebust wrote: > > > On Tue, 2022-03-08 at 13:42 +1100, NeilBrown wrote: > > > >=20 > > > > xprt_destory() claims XPRT_LOCKED and then calls > > > > del_timer_sync(). > > > > Both xprt_unlock_connect() and xprt_release() call > > > > =C2=A0->release_xprt() > > > > which drops XPRT_LOCKED and *then* xprt_schedule_autodisconnect() > > > > which calls mod_timer(). > > > >=20 > > > > This may result in mod_timer() being called *after* > > > > del_timer_sync(). > > > > When this happens, the timer may fire long after the xprt has > > > > been > > > > freed, > > > > and run_timer_softirq() will probably crash. > > > >=20 > > > > The pairing of ->release_xprt() and > > > > xprt_schedule_autodisconnect() is > > > > always called under ->transport_lock.=C2=A0 So if we take - > > > > >transport_lock > > > > to > > > > call del_timer_sync(), we can be sure that mod_timer() will run > > > > first > > > > (if it runs at all). > > >=20 > > > xprt_destroy() never releases XPRT_LOCKED, so how could the above > > > race > > > happen? > >=20 > > the race would happen as xprt_destroy() is taking the lock. > > One core is in xprt_destroy(), the other in xprt_unlock_connect() > >=20 > > =C2=A0=C2=A0 Core 1=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 Core 2 > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 spin_lock(tra= nsport_lock) > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 xprt->ops->re= lease_xprt() > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 aka xprt_release_xprt() > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 This clears XPRT_LOCKED > > =C2=A0 wait_on_bit_lock(XPRT_LOCKED) > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 This doesn't need to wait > > =C2=A0 del_timer_sync(->timer) > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 xprt_schedule= _autodisconnect() > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 c= alls mod_timer(->timer) > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 spin_unlock(t= ransport_lock) > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 wake_up_bit(X= PRT_LOCKED) > >=20 > > The mod_timer() call being after del_timer_sync() is the problem. > >=20 > > If the wait_on_bit_lock() was just a little earlier and had to wait, > > it > > wouldn't be woken until it was safe, so it is a small race window to > > hit. > >=20 > > An alternate fix might be to move the xprt_schedule_autodisconnect() > > call before xprt->ops->release_xprt(), but as a spinlock was already > > used to group these operations, I thought it simpler to use the > > spinlock > > to remove the race. > >=20 > > The only evidence I have that it is possible at all is two crash > > dumps > > with run_timer_softirq() tripping over a corrupt timer. > > The timer function is xprt_init_autodisconnect() > > The ->next pointer is LIST_POISON2, so detach_timer() must have been > > called on the timer, but it was still found in the list of pending > > timers. > >=20 > > It looks like the xprt was freed just after above race, so timer was > > still active, then allocated again so timer was reinitialised, then > > freed again, so del_timer_sync() as called and ->next became > > LIST_POISON2. At the time of the crash, the memory was unallocated. > >=20 > > I cannot be 100% certain this race is happening, but I cannot find > > any > > other path by which it is even vaguely possible. >=20 > The above looks plausible, and I think this patch is the correct one.=C2=A0 Thanks. > The one additional change I suggest is that we also move the > wake_up_bit() in xprt_connect_unlock() inside the spinlock, so that we > avoid a potential use-after-free. I had a look and I don't think this is needed. wake_up_bit() doesn't dereference the pointer. It hashes the address together with the bit to choose a wake_queue to wake up. The wake_bit_function() which might be called as part of that wakeup will test_bit() the given bit at the given address, but only after the address has been found to match an address that something is waiting on. So that something must hold a reference to stop the memory being freed. However, as I was looking, I was reminded of the use of XPRT_LOCKED in sysfs.c which isn't safe. Two sysfs functions use wait_on_bit_lock() to wait for XPRT_LOCKED. However most places that clear XPRT_LOCKED don't generate a wakeup, so these can wait indefinitely. This can easily be reproduced by writing "online" to "xprt_state" in a tight loop while there is concurrent NFS traffic.=20 If we wanted to add a wake_up_bit() whenever XPRT_LOCKED is cleared, that would be remove that race and it would trivially put wake_up_bit() inside the spinlock anyway. I'm not sure all those wakeups are really warranted, but it isn't clear to me what the sysfs functions should be waiting for, so I'm not certain. Thanks, NeilBrown