From: Nick Desaulniers
Date: Thu, 9 Jan 2020 09:16:56 -0800
Subject: Re: INFO: rcu detected stall in sys_kill
To: Alexander Potapenko
Cc: Dmitry Vyukov, Casey Schaufler, Daniel Axtens, clang-built-linux,
 Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
References: <00000000000036decf0598c8762e@google.com>
 <87a787ekd0.fsf@dja-thinkpad.axtens.net>
 <87h81zax74.fsf@dja-thinkpad.axtens.net>
 <0b60c93e-a967-ecac-07e7-67aea1a0208e@I-love.SAKURA.ne.jp>
 <6d009462-74d9-96e9-ab3f-396842a58011@schaufler-ca.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jan 9, 2020 at 8:23 AM 'Alexander Potapenko' via Clang Built Linux wrote:
>
> On Thu, Jan 9, 2020 at 11:39 AM Dmitry Vyukov wrote:
> >
> > On Thu, Jan 9, 2020 at 11:05 AM Dmitry Vyukov wrote:
> > > > > > > On 1/8/2020 2:25 AM, Tetsuo Handa wrote:
> > > > > > > > On 2020/01/08 15:20, Dmitry Vyukov wrote:
> > > > > > > >> I temporarily re-enabled smack instance and it produced another 50
> > > > > > > >> stalls all over the kernel, and now keeps spewing a dozen every hour.
> > > > > > >
> > > > > > > Do I have to be using clang to test this? I'm setting up to work on this,
> > > > > > > and don't want to waste time using my current tool chain if the problem
> > > > > > > is clang specific.
> > > > > >
> > > > > > Humm, interesting. Initially I was going to say that most likely it's
> > > > > > not clang-related. But smack instance is actually the only one that
> > > > > > uses clang as well (except for KMSAN of course). So maybe it's indeed
> > > > > > clang-related rather than smack-related. Let me try to build a kernel
> > > > > > with clang.
> > > > >
> > > > > +clang-built-linux, glider
> > > > >
> > > > > [clang-built linux is severely broken since early Dec]

Is there automated reporting? Consider adding our mailing list for
Clang specific failures:
clang-built-linux
Our CI looks green, but there's a very long tail of combinations of
configs that we don't have coverage of, so bug reports are appreciated:
https://github.com/ClangBuiltLinux/linux/issues

> > > > > Building kernel with clang I can immediately reproduce this locally:
> > > > >
> > > > > $ syz-manager
> > > > > 2020/01/09 09:27:15 loading corpus...
> > > > > 2020/01/09 09:27:17 serving http on http://0.0.0.0:50001
> > > > > 2020/01/09 09:27:17 serving rpc on tcp://[::]:45851
> > > > > 2020/01/09 09:27:17 booting test machines...
> > > > > 2020/01/09 09:27:17 wait for the connection from test machine...
> > > > > 2020/01/09 09:29:23 machine check:
> > > > > 2020/01/09 09:29:23 syscalls : 2961/3195
> > > > > 2020/01/09 09:29:23 code coverage : enabled
> > > > > 2020/01/09 09:29:23 comparison tracing : enabled
> > > > > 2020/01/09 09:29:23 extra coverage : enabled
> > > > > 2020/01/09 09:29:23 setuid sandbox : enabled
> > > > > 2020/01/09 09:29:23 namespace sandbox : enabled
> > > > > 2020/01/09 09:29:23 Android sandbox : /sys/fs/selinux/policy
> > > > > does not exist
> > > > > 2020/01/09 09:29:23 fault injection : enabled
> > > > > 2020/01/09 09:29:23 leak checking : CONFIG_DEBUG_KMEMLEAK is
> > > > > not enabled
> > > > > 2020/01/09 09:29:23 net packet injection : enabled
> > > > > 2020/01/09 09:29:23 net device setup : enabled
> > > > > 2020/01/09 09:29:23 concurrency sanitizer : /sys/kernel/debug/kcsan
> > > > > does not exist
> > > > > 2020/01/09 09:29:23 devlink PCI setup : PCI device 0000:00:10.0
> > > > > is not available
> > > > > 2020/01/09 09:29:27 corpus : 50226 (0 deleted)
> > > > > 2020/01/09 09:29:27 VMs 20, executed 0, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:29:37 VMs 20, executed 45, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:29:47 VMs 20, executed 74, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:29:57 VMs 20, executed 80, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:30:07 VMs 20, executed 80, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:30:17 VMs 20, executed 80, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:30:27 VMs 20, executed 80, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:30:37 VMs 20, executed 80, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:30:47 VMs 20, executed 80, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:30:57 VMs 20, executed 80, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:31:07 VMs 20, executed 80, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:31:17 VMs 20, executed 80, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:31:26 vm-10: crash: INFO: rcu detected stall in do_idle
> > > > > 2020/01/09 09:31:27 VMs 13, executed 80, cover 0, crashes 0, repro 0
> > > > > 2020/01/09 09:31:28 vm-1: crash: INFO: rcu detected stall in sys_futex
> > > > > 2020/01/09 09:31:29 vm-4: crash: INFO: rcu detected stall in sys_futex
> > > > > 2020/01/09 09:31:31 vm-0: crash: INFO: rcu detected stall in sys_getsockopt
> > > > > 2020/01/09 09:31:33 vm-18: crash: INFO: rcu detected stall in sys_clone3
> > > > > 2020/01/09 09:31:35 vm-3: crash: INFO: rcu detected stall in sys_futex
> > > > > 2020/01/09 09:31:36 vm-8: crash: INFO: rcu detected stall in do_idle
> > > > > 2020/01/09 09:31:37 VMs 7, executed 80, cover 0, crashes 6, repro 0
> > > > > 2020/01/09 09:31:38 vm-19: crash: INFO: rcu detected stall in schedule_tail
> > > > > 2020/01/09 09:31:40 vm-6: crash: INFO: rcu detected stall in schedule_tail
> > > > > 2020/01/09 09:31:42 vm-2: crash: INFO: rcu detected stall in schedule_tail
> > > > > 2020/01/09 09:31:44 vm-12: crash: INFO: rcu detected stall in sys_futex
> > > > > 2020/01/09 09:31:46 vm-15: crash: INFO: rcu detected stall in sys_nanosleep
> > > > > 2020/01/09 09:31:47 VMs 1, executed 80, cover 0, crashes 11, repro 0
> > > > > 2020/01/09 09:31:48 vm-16: crash: INFO: rcu detected stall in sys_futex
> > > > > 2020/01/09 09:31:50 vm-9: crash: INFO: rcu detected stall in schedule
> > > > > 2020/01/09 09:31:52 vm-13: crash: INFO: rcu detected stall in schedule_tail
> > > > > 2020/01/09 09:31:54 vm-11: crash: INFO: rcu detected stall in schedule_tail
> > > > > 2020/01/09 09:31:56 vm-17: crash: INFO: rcu detected stall in sys_futex
> > > > > 2020/01/09 09:31:57 VMs 0, executed 80, cover 0, crashes 16, repro 0
> > > > > 2020/01/09 09:31:58 vm-7: crash: INFO: rcu detected stall in sys_futex
> > > > > 2020/01/09 09:32:00 vm-5: crash: INFO: rcu detected stall in dput
> > > > > 2020/01/09 09:32:02 vm-14: crash: INFO: rcu detected stall in sys_nanosleep
> > > > >
> > > > > Then I switched LSM to selinux and I _still_ can reproduce this. So,
> > > > > Casey, you may relax, this is not smack-specific :)
> > > > >
> > > > > Then I disabled CONFIG_KASAN_VMALLOC and CONFIG_VMAP_STACK and it
> > > > > started working normally.
> > > > >
> > > > > So this is somehow related to both clang and KASAN/VMAP_STACK.
> > > > >
> > > > > The clang I used is:
> > > > > https://storage.googleapis.com/syzkaller/clang-kmsan-362913.tar.gz
> > > > > (the one we use on syzbot).
> > > >
> > > > Clustering hangs, they all happen within very limited section of the code:
> > > >
> > > > 1 free_thread_stack+0x124/0x590 kernel/fork.c:284
> > > > 5 free_thread_stack+0x12e/0x590 kernel/fork.c:280
> > > > 39 free_thread_stack+0x12e/0x590 kernel/fork.c:284
> > > > 6 free_thread_stack+0x133/0x590 kernel/fork.c:280
> > > > 5 free_thread_stack+0x13d/0x590 kernel/fork.c:280
> > > > 2 free_thread_stack+0x141/0x590 kernel/fork.c:280
> > > > 6 free_thread_stack+0x14c/0x590 kernel/fork.c:280
> > > > 9 free_thread_stack+0x151/0x590 kernel/fork.c:280
> > > > 3 free_thread_stack+0x15b/0x590 kernel/fork.c:280
> > > > 67 free_thread_stack+0x168/0x590 kernel/fork.c:280
> > > > 6 free_thread_stack+0x16d/0x590 kernel/fork.c:284
> > > > 2 free_thread_stack+0x177/0x590 kernel/fork.c:284
> > > > 1 free_thread_stack+0x182/0x590 kernel/fork.c:284
> > > > 1 free_thread_stack+0x186/0x590 kernel/fork.c:284
> > > > 16 free_thread_stack+0x18b/0x590 kernel/fork.c:284
> > > > 4 free_thread_stack+0x195/0x590 kernel/fork.c:284
> > > >
> > > > Here is disass of the function:
> > > > https://gist.githubusercontent.com/dvyukov/a283d1aaf2ef7874001d56525279ccbd/raw/ac2478bff6472bc473f57f91a75f827cd72bb6bf/gistfile1.txt
> > > >
> > > > But if I am not mistaken, the function only ever jumps down. So how
> > > > can it loop?...
> > >
> > > This is a miscompilation related to static branches.
> > >
> > > objdump shows:
> > >
> > > ffffffff814878f8: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
> > > ./arch/x86/include/asm/jump_label.h:25
> > > asm_volatile_goto("1:"
> > >
> > > However, the actual instruction in memory at the time is:
> > >
> > > 0xffffffff814878f8 <+408>: jmpq 0xffffffff8148787f
> > >
> > > Which jumps to a wrong location in free_thread_stack and makes it loop.
> > >
> > > The static branch is this:
> > >
> > > static inline bool memcg_kmem_enabled(void)
> > > {
> > >         return static_branch_unlikely(&memcg_kmem_enabled_key);
> > > }
> > >
> > > static inline void memcg_kmem_uncharge(struct page *page, int order)
> > > {
> > >         if (memcg_kmem_enabled())
> > >                 __memcg_kmem_uncharge(page, order);
> > > }
> > >
> > > I suspect it may have something to do with loop unrolling. It may jump
> > > to the right location, but in the wrong unrolled iteration.

I disabled loop unrolling and loop unswitching in LLVM when the loop
contained asm goto in:
https://github.com/llvm/llvm-project/commit/c4f245b40aad7e8627b37a8bf1bdcdbcd541e665
I have a fix for loop unrolling in https://reviews.llvm.org/D64101 that
I should dust off. I haven't looked into loop unswitching yet.

> >
> > Kernel built with clang version 10.0.0
> > (https://github.com/llvm/llvm-project.git
> > c2443155a0fb245c8f17f2c1c72b6ea391e86e81) works fine.
> >
> > Alex, please update clang on syzbot machines.
>
> Done ~3 hours ago, guess we'll see the results within a day.

Please let me know if you otherwise encounter any miscompiles with
Clang, particularly `asm goto`, which I treat as P0.

--
Thanks,
~Nick Desaulniers
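P.S. For readers who want to poke at the construct outside the kernel, here is a
minimal user-space sketch of the jump-label pattern under discussion. It is
illustrative only: the names (my_static_branch, my_jump_table) and the section
layout are assumptions of this sketch, not the kernel's actual
arch/x86/include/asm/jump_label.h. The idea it shows is the one that matters for
the bug above: the `asm goto` emits a 5-byte NOP and records both the NOP's
address and the branch target in a side table, so a runtime patcher can later
turn that NOP into a jump. If an optimization such as loop unrolling duplicates
the `asm goto` without keeping those recorded addresses consistent, a patched
jump can land in the wrong copy of the loop body, which matches the looping seen
in free_thread_stack.

/*
 * Minimal user-space sketch of a jump-label-style static branch.
 * Illustrative only; simplified, not the kernel implementation.
 * x86-64, needs a compiler with asm goto support (gcc >= 4.5, clang >= 9):
 *   cc -O2 -o jump_label_sketch jump_label_sketch.c
 */
#include <stdbool.h>
#include <stdio.h>

static inline bool my_static_branch(void)
{
	asm goto("1:\n\t"
		 /* 5-byte NOP: the patch site a runtime patcher could turn into a jmp. */
		 ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t"
		 /* Record (patch site, branch target) in a side table. */
		 ".pushsection my_jump_table, \"aw\"\n\t"
		 ".quad 1b, %l[l_yes]\n\t"
		 ".popsection\n\t"
		 : : : : l_yes);
	return false;		/* default: branch disabled, the NOP falls through */
l_yes:
	return true;		/* reached only if the NOP were patched into a jmp */
}

int main(void)
{
	/* Nothing patches the NOP in this sketch, so this always prints "disabled". */
	printf("static branch is %s\n", my_static_branch() ? "enabled" : "disabled");
	return 0;
}

The real kernel implementation records more information (including the static
key address) and rewrites these sites at runtime; the sketch only shows the
code-generation shape that an optimizer duplicating the asm goto has to keep
consistent.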