Date: Mon, 30 May 2022 20:25:22 +1200
From: Michael Cree
To: Yu Zhao
Cc: Linux-MM, linux-kernel, Hillf Danton, Joonsoo Kim
Subject: Re: Alpha: rare random memory corruption/segfault in user space bisected
References: <20220507015646.5377-1-hdanton@sina.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, May 23, 2022 at 02:56:12PM -0600, Yu Zhao wrote:
> On Wed, May 11, 2022 at 2:37 PM Michael Cree wrote:
> >
> > On Sat, May 07, 2022 at 11:27:15AM -0700, Yu Zhao wrote:
> > > On Fri, May 6, 2022 at 6:57 PM Hillf Danton wrote:
> > > >
> > > > On Sat, 7 May 2022 09:21:25 +1200 Michael Cree wrote:
> > > > > Alpha kernel has been exhibiting rare and random memory
> > > > > corruptions/segfaults in user space since the 5.9.y kernel. First seen
> > > > > on the Debian Ports build daemon when running 5.10.y kernel resulting
> > > > > in the occasional (one or two a day) build failures with gcc ICEs either
> > > > > due to self detected corrupt memory structures or segfaults. Have been
> > > > > running 5.8.y kernel without such problems for over six months.
> > > > >
> > > > > Tried bisecting last year but went off track with incorrect good/bad
> > > > > determinations due to rare nature of bug. After trying a 5.16.y kernel
> > > > > early this year and seen the bug is still present retried the bisection
> > > > > and have got to:
> > > > >
> > > > > aae466b0052e1888edd1d7f473d4310d64936196 is the first bad commit
> > > > > commit aae466b0052e1888edd1d7f473d4310d64936196
> > > > > Author: Joonsoo Kim
> > > > > Date: Tue Aug 11 18:30:50 2020 -0700
> > > > >
> > > > > mm/swap: implement workingset detection for anonymous LRU
> > >
> > > This commit seems innocent to me. While not ruling out anything, i.e.,
> > > this commit, compiler, qemu, userspace itself, etc., my wild guess is
> > > the problem is memory barrier related. Two lock/unlock pairs, which
> > > imply two full barriers, were removed. This is not a small deal on
> > > Alpha, since it imposes no constraints on cache coherency, AFAIK.
> > >
> > > Can you please try the attached patch on top of this commit? Thanks!
> >
> > Thanks, I have that running now for a day without any problem showing
> > up, but that's not long enough to be sure it has fixed the problem. Will
> > get back to you after another day or two of testing.
>
> Any luck? Thanks!

Sorry for the delay in replying. Testing has taken longer due to an
unexpected hitch. The patch proved to be good, but as a double check I
retested the above commit without the patch and it now won't fail, which
calls into question whether aae466b0052e188 is truly the bad commit.

I have gone back to the prior bad commit in the bisection (25788738eb9c)
and it failed again, confirming it is bad. So it looks like the first
bad commit is somewhere between aae466b0052e188 and 25788738eb9c (a total
of five commits inclusive, four if we take aae466b0052e188 as good). I am
now building 471e78cc7687337abd1 and will test that.

Cheers,
Michael.
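
A rough sketch of why a clean day or two proves little when bisecting a bug this rare. It assumes that failures on a truly bad kernel arrive independently at the reported rate of one or two a day (a Poisson model). That model is an assumption for illustration, not something measured in this thread, and the function names are hypothetical.

```python
import math


def prob_false_good(failures_per_day: float, clean_days: float) -> float:
    """Chance that a truly bad kernel shows zero failures in clean_days.

    Assumes a Poisson failure process (an illustrative assumption).
    """
    return math.exp(-failures_per_day * clean_days)


def days_needed(failures_per_day: float, max_false_good: float) -> float:
    """Failure-free days required so the false-"good" risk is at most max_false_good."""
    return -math.log(max_false_good) / failures_per_day


# Print the false-"good" risk for the two reported rates and a few test lengths.
for rate in (1.0, 2.0):
    for days in (1, 2, 3):
        risk = prob_false_good(rate, days)
        print(f"rate={rate}/day, {days} clean day(s): false-good risk {risk:.3f}")
```

Under these assumptions, at one failure a day a single clean day still leaves roughly a 37% chance that the kernel is bad, and about three clean days are needed to bring that below 5%. If the real failure rate depends on the workload, the required time could be considerably longer.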