From: Yu Zhao
Date: Tue, 7 Jun 2022 18:20:22 -0600
Subject: Re: Alpha: rare random memory corruption/segfault in user space bisected
To: Michael Cree
Cc: Linux-MM, linux-kernel, Hillf Danton, Joonsoo Kim
References: <20220507015646.5377-1-hdanton@sina.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, May 30, 2022 at 2:25 AM Michael Cree wrote:
>
> On Mon, May 23, 2022 at 02:56:12PM -0600, Yu Zhao wrote:
> > On Wed, May 11, 2022 at 2:37 PM Michael Cree wrote:
> > >
> > > On Sat, May 07, 2022 at
11:27:15AM -0700, Yu Zhao wrote:
> > > > On Fri, May 6, 2022 at 6:57 PM Hillf Danton wrote:
> > > > >
> > > > > On Sat, 7 May 2022 09:21:25 +1200 Michael Cree wrote:
> > > > > > Alpha kernel has been exhibiting rare and random memory
> > > > > > corruptions/segfaults in user space since the 5.9.y kernel. First seen
> > > > > > on the Debian Ports build daemon when running 5.10.y kernel resulting
> > > > > > in the occasional (one or two a day) build failures with gcc ICEs either
> > > > > > due to self detected corrupt memory structures or segfaults. Have been
> > > > > > running 5.8.y kernel without such problems for over six months.
> > > > > >
> > > > > > Tried bisecting last year but went off track with incorrect good/bad
> > > > > > determinations due to rare nature of bug. After trying a 5.16.y kernel
> > > > > > early this year and seen the bug is still present retried the bisection
> > > > > > and have got to:
> > > > > >
> > > > > > aae466b0052e1888edd1d7f473d4310d64936196 is the first bad commit
> > > > > > commit aae466b0052e1888edd1d7f473d4310d64936196
> > > > > > Author: Joonsoo Kim
> > > > > > Date: Tue Aug 11 18:30:50 2020 -0700
> > > > > >
> > > > > >     mm/swap: implement workingset detection for anonymous LRU
> > > >
> > > > This commit seems innocent to me. While not ruling out anything, i.e.,
> > > > this commit, compiler, qemu, userspace itself, etc., my wild guess is
> > > > the problem is memory barrier related. Two lock/unlock pairs, which
> > > > imply two full barriers, were removed. This is not a small deal on
> > > > Alpha, since it imposes no constraints on cache coherency, AFAIK.
> > > >
> > > > Can you please try the attached patch on top of this commit? Thanks!
> > >
> > > Thanks, I have that running now for a day without any problem showing
> > > up, but that's not long enough to be sure it has fixed the problem. Will
> > > get back to you after another day or two of testing.
> >
> > Any luck? Thanks!
>
> Sorry for the delay in replying. Testing has taken longer due to an
> unexpected hitch. The patch proved to be good but for a double check I
> retested the above commit without the patch but it now won't fail which
> calls into question whether aae466b0052e188 is truly the bad commit. I
> have gone back to the prior bad commit in the bisection (25788738eb9c)
> and it failed again confirming it is bad. So it looks like the first
> bad commit is somewhere between aae466b0052e188 and 25788738eb9c (a
> total of five commits inclusive, four if we take aae466b0052e188 as
> good) and I am now building 471e78cc7687337abd1 and will test that.

No worries. Thanks for the update.

Were swap devices used when the ICEs happened? If so,
1) What kind of swap devices, e.g., zram, block device, etc.?
2) aae466b0052e188 might have made the kernel swap more frequently and
   thus the problem easier to reproduce. Assuming this is the case,
   then setting swappiness to 200 might help reproduce the problem.
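
The two questions above can be checked from user space. What follows is only a minimal sketch, assuming a kernel that exposes /proc/swaps and /proc/sys/vm/swappiness and accepts swappiness values up to 200 (5.8 or later, as far as I know). The helper names (parse_swaps, classify, set_swappiness) are made up for illustration, and the zram check is a plain name heuristic, not anything the kernel reports.

```python
from pathlib import Path

SWAPS = Path("/proc/swaps")
SWAPPINESS = Path("/proc/sys/vm/swappiness")


def parse_swaps(text):
    """Parse the contents of /proc/swaps into one dict per active swap area."""
    devices = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        name, kind, size_kib, used_kib, priority = fields[:5]
        devices.append({
            "name": name,
            "type": kind,  # "partition" or "file" as reported by the kernel
            "size_kib": int(size_kib),
            "used_kib": int(used_kib),
            "priority": int(priority),
        })
    return devices


def classify(device):
    """Guess the kind of swap device. Heuristic: zram is detected by name only."""
    if device["name"].startswith("/dev/zram"):
        return "zram"
    if device["type"] == "file":
        return "file"
    return "block device"


def set_swappiness(value, path=SWAPPINESS):
    """Write vm.swappiness (needs root). The 0..200 range is an assumption for >= 5.8."""
    if not 0 <= value <= 200:
        raise ValueError("swappiness must be between 0 and 200")
    Path(path).write_text(f"{value}\n")
```

On the affected machine, something like `[(d["name"], classify(d), d["used_kib"]) for d in parse_swaps(SWAPS.read_text())]` would show whether swap was in use and on what kind of device, and `set_swappiness(200)` run as root would apply the setting suggested in 2) until the next reboot.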