From: Zhouyi Zhou
Date: Sun, 23 Apr 2023 13:45:20 +0800
Subject: Re: BUG : PowerPC RCU: torture test failed with __stack_chk_fail
To: Joel Fernandes
Cc: linuxppc-dev, rcu, linux-kernel, lance@osuosl.org, "Paul E. McKenney", Michael Ellerman
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Apr 23, 2023 at 9:37 AM Zhouyi Zhou wrote:
>
> On Sun, Apr 23, 2023 at 3:19 AM Joel Fernandes wrote:
> >
> > Hi Zhouyi,
> Thank you, Joel, for your quick response ;-)
> I will gradually provide all the necessary information to help us
> chase this down. Please do not hesitate to email me
> if I have missed anything ;-)
> >
> > On Sat, Apr 22, 2023 at 2:47 PM Zhouyi Zhou wrote:
> > >
> > > Dear PowerPC and RCU developers:
> > > During the RCU torture test on mainline (on the VM of Opensource Lab
> > > of Oregon State University), SRCU-P failed with __stack_chk_fail:
> > > [ 264.381952][ T99] [c000000006c7bab0] [c0000000010c67c0] dump_stack_lvl+0x94/0xd8 (unreliable)
> > > [ 264.383786][ T99] [c000000006c7bae0] [c00000000014fc94] panic+0x19c/0x468
> > > [ 264.385128][ T99] [c000000006c7bb80] [c0000000010fca24] __stack_chk_fail+0x24/0x30
> > > [ 264.386610][ T99] [c000000006c7bbe0] [c0000000002293b4] srcu_gp_start_if_needed+0x5c4/0x5d0
> > > [ 264.388188][ T99] [c000000006c7bc70] [c00000000022f7f4] srcu_torture_call+0x34/0x50
> > > [ 264.389611][ T99] [c000000006c7bc90] [c00000000022b5e8] rcu_torture_fwd_prog+0x8c8/0xa60
> > > [ 264.391439][ T99] [c000000006c7be00] [c00000000018e37c] kthread+0x15c/0x170
> > > [ 264.392792][ T99] [c000000006c7be50] [c00000000000df94] ret_from_kernel_thread+0x5c/0x64
> > > The kernel config file can be found in [1].
> > > And I wrote a bash script to speed up reproducing the bug [2].
> > > After a week's debugging, I found that the cause of the bug is that
> > > register r10, which is used to check for stack overflow, is not
> > > constant across context switches.
> > > The assembly code for srcu_gp_start_if_needed is located at [3]:
> > > c000000000226eb4: 78 6b aa 7d mr r10,r13
> > > c000000000226eb8: 14 42 29 7d add r9,r9,r8
> > > c000000000226ebc: ac 04 00 7c hwsync
> > > c000000000226ec0: 10 00 7b 3b addi r27,r27,16
> > > c000000000226ec4: 14 da 29 7d add r9,r9,r27
> > > c000000000226ec8: a8 48 00 7d ldarx r8,0,r9
> > > c000000000226ecc: 01 00 08 31 addic r8,r8,1
> > > c000000000226ed0: ad 49 00 7d stdcx. r8,0,r9
> > > c000000000226ed4: f4 ff c2 40 bne- c000000000226ec8
> > >
> > > c000000000226ed8: 28 00 21 e9 ld r9,40(r1)
> > > c000000000226edc: 78 0c 4a e9 ld r10,3192(r10)
> > > c000000000226ee0: 79 52 29 7d xor. r9,r9,r10
> > > c000000000226ee4: 00 00 40 39 li r10,0
> > > c000000000226ee8: b8 03 82 40 bne c0000000002272a0
> > >
> > > By debugging, I see that r10 is assigned from r13 at c000000000226eb4,
> > > but if there is a context switch before c000000000226edc, a false
> > > positive will be reported.
> > >
> > > [1] http://154.220.3.115/logs/0422/configformainline.txt
> > > [2] 154.220.3.115/logs/0422/whilebash.sh
> > > [3] http://154.220.3.115/logs/0422/srcu_gp_start_if_needed.txt
> > >
> > > My analysis and debugging may not be correct, but the bug is easily
> > > reproducible.
> >
> > Could you provide the full kernel log? It is not clear exactly from
> > your attachments, but I think this is a stack overflow issue as
> > implied by the mention of __stack_chk_fail. One trick might be to turn
> > on any available stack debug kernel config options, or check the
> > kernel logs for any messages related to shortage of remaining stack
> > space.
> The full kernel log is [1]
> [1] http://154.220.3.115/logs/0422/console.log
> >
> > Additionally, you could also find out where the kernel crash happened
> > in C code following the below notes [1] I wrote (see section "Figuring
> > out where kernel crashes happen in C code"). The notes are
> > x86-specific but should be generally applicable (In the off chance
> > you'd like to improve the notes, feel free to share them ;-)).
> Fantastic article!!! I benefited a lot from reading it. Because we can
> reproduce the bug so easily on the powerpc VM, I can even use gdb to
> debug it. The following is my debug process on
> 2e83b879fb91dafe995967b46a1d38a5b0889242 (srcu: Create an
> srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe()).
>
> [2] http://154.220.3.115/logs/0422/gdb.txt
> >
> > Lastly, is it a specific kernel release from which you start seeing
> > this issue? You should try git bisect if it is easily reproducible in
> > a newer release, but goes away in an older one.
> I did a bisect on the powerpc VM; the problem begins to appear at
> 2e83b879fb91dafe995967b46a1d38a5b0889242 (srcu: Create an
> srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe()).
>
> The kernel is good at 5d0f5953b60f5f7a278085b55ddc73e2932f4c33 (srcu:
> Convert ->srcu_lock_count and ->srcu_unlock_count to atomic)
>
> But if I apply the following patch [3] to
> 5d0f5953b60f5f7a278085b55ddc73e2932f4c33, the bug appears again.
> [3] http://154.220.3.115/logs/0422/bug.patch
>
> Both the native gcc on the PPC VM (gcc version 9.4.0) and the gcc cross
> compiler on my x86 laptop (gcc version 10.4.0) reproduce the bug.
Update: stress tested on the x86 platform for 6 hours, no bug reported
(while we can reproduce it with x86-based cross-platform powerpc gcc and
x86-based cross-platform powerpc qemu in less than 3 minutes).
> >
> > I will also join you in your debug efforts soon though I am currently
> > in between conferences.
> Exciting!! Thank you very much!
> I can give you ssh access (based on an RSA public key) to the PPC VM at
> Oregon State University if you like.
>
> Thanks again
> Zhouyi
> >
> > [1] https://gist.github.com/joelagnel/ae15c404facee0eb3ebb8aff0e996a68
> >
> > thanks,
> >
> > - Joel
> >
> > >
> > > Thanks
> > > Zhouyi
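
The false positive described in the original report (r10 copied from r13,
then a context switch before the canary load) can be modelled with a small
simulation. This is only a sketch under stated assumptions, not kernel code:
it assumes that r13 points at a per-CPU area holding the running task's
canary, that the context switch updates that area, and that the task resumes
on a different CPU. The names Paca, Task, switch_to, epilogue_check and run
are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Paca:
    """Per-CPU area (assumption: this is what r13 points at)."""
    canary: int = 0


@dataclass
class Task:
    name: str
    canary: int


def switch_to(paca: Paca, task: Task) -> None:
    # Assumed context-switch behaviour: the incoming task's canary is
    # published in the per-CPU area of the CPU it is about to run on.
    paca.canary = task.canary


def epilogue_check(saved_on_stack: int, base: Paca) -> bool:
    # Models "ld r9,40(r1); ld r10,3192(rX); xor. r9,r9,r10":
    # True means the canaries match, False would reach __stack_chk_fail.
    return saved_on_stack == base.canary


def run(cache_r13_in_r10: bool) -> bool:
    cpu0, cpu1 = Paca(), Paca()
    victim = Task("rcu_torture_fwd_prog", canary=0x1111)
    other = Task("other", canary=0x2222)

    switch_to(cpu0, victim)
    r13 = cpu0
    saved_on_stack = r13.canary  # prologue saves the canary in the frame

    r10 = r13  # "mr r10,r13": the per-CPU pointer is copied

    # Context switch between the copy and the load (hypothetical scenario):
    # CPU 0 picks up another task and the victim resumes on CPU 1.
    switch_to(cpu0, other)
    switch_to(cpu1, victim)
    r13 = cpu1

    base = r10 if cache_r13_in_r10 else r13
    return epilogue_check(saved_on_stack, base)
```

In this model run(True) fails the check because the stale copy in r10 still
refers to CPU 0, whose area now holds another task's canary, while
run(False) passes because the load goes through the current r13. The stack
itself is never overwritten in either case.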