Date: Thu, 28 Jan 2021 17:36:40 +0100
From: Petr Mladek
To: John Ogness
Cc: Sergey Senozhatsky, kernel test robot, LKML, lkp@lists.01.org,
    lkp@intel.com, zhengjun.xing@linux.intel.com
Subject: Re: [printk] b031a684bf: INFO:rcu_tasks_detected_stalls_on_tasks
References: <20210122081311.GA12834@xsang-OptiPlex-9020>
    <87bldaaxcc.fsf@jogness.linutronix.de>
    <87czxpmhe1.fsf@jogness.linutronix.de>
    <877dnxm5ju.fsf@jogness.linutronix.de>
In-Reply-To: <877dnxm5ju.fsf@jogness.linutronix.de>

On Thu 2021-01-28 16:48:29, John Ogness wrote:
> On 2021-01-28, Petr Mladek wrote:
> > If you are able to reproduce this, it might be interesting to see
> > backtraces from all CPUs when RCU stall gets detected. Or even
> > printing all processes, even the sleeping ones.
>
> OK, I now have lkp running on my intel core i5 machine and it reliably
> reproduces this. First let me say a word about lkp. The instructions in
> the report email are a bit misleading. There are lots of steps that it
> doesn't mention.
>
> First I needed to install lkp and its dependencies. I did all of this as
> root because lkp seems to get confused when sudo is involved.
>
> # git clone https://github.com/intel/lkp-tests.git
> # cd lkp-tests
> # make install
> # lkp install
>
> Then I built the kernel and modules using a helper script to setup the
> environment for me:
>
> ----- BEGIN /tmp/mk.sh -----
> #!/bin/sh
>
> export INSTALL_MOD_PATH=`pwd`
> export HOSTCC=gcc-9
> export CC=gcc-9
> export ARCH=x86_64
>
> exec make "$@"
> ----- END /tmp/mk.sh -----
>
> # cd linux
> # git checkout -b lkp b031a684bfd01d633c79d281bd0cf11c2f834ada
> # cp /tmp/config-5.10.0-rc5-gb031a684bfd0 .config
> # /tmp/mk.sh -j `nproc` olddefconfig prepare modules_prepare bzImage modules modules_install
>
> Then I created the modules.cgz:
>
> # find lib/modules | cpio -H newc -o | gzip -9c > modules.cgz
>
> I put kernel and modules in /tmp/:
>
> # cp arch/x86/boot/bzImage modules.cgz /tmp/
>
> And ran lkp inside script(1) so I have a logfile:
>
> # script
> # lkp qemu -k /tmp/bzImage -m /tmp/modules.cgz /tmp/job-script
>
> And reliably I see the first stall at about 925 seconds:
>
> [ 926.386441][ T22] INFO: rcu_tasks detected stalls on tasks:
> [ 926.387310][ T22] 0000000041eb5240: .. nvcsw: 0/0 holdout: 1 idle_cpu: -1/1
> [ 926.388196][ T22] task:dmesg state:R running task stack: 0 pid: 1752 ppid: 511 flags:0x20020000
> [ 926.389509][ T22] Call Trace:
> [ 926.389962][ T22] __schedule+0xa12/0xab5
> [ 926.390534][ T22] ? firmware_map_remove+0xd1/0xd1
> [ 926.391190][ T22] ? ksys_read+0x116/0x150
> [ 926.391773][ T22] schedule+0x16c/0x1df
> [ 926.392317][ T22] exit_to_user_mode_loop+0x28/0x84
> [ 926.392977][ T22] exit_to_user_mode_prepare+0x1d/0x4a
> [ 926.393671][ T22] syscall_exit_to_user_mode+0x41/0x4f
> [ 926.394357][ T22] entry_INT80_compat+0x71/0x76
> [ 926.394982][ T22] RIP: 0023:0xf7f0fa02
> [ 926.395528][ T22] RSP: 002b:00000000ff850364 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> [ 926.396590][ T22] RAX: ffffffffffffffe0 RBX: 0000000000000004 RCX: 00000000565c8234
> [ 926.397607][ T22] RDX: 0000000000001fff RSI: 00000000000001e0 RDI: 00000000565c8234
> [ 926.398617][ T22] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
> [ 926.399629][ T22] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 926.400643][ T22] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>
> But actually it is not the rcu stall that interests me so much. I am
> more interested in some bizarre console output I am seeing.
Here is a > snippet: > > [ 903.008486][ T356] [ 778.732883] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.008496][ T356] > [ 903.014733][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.030232][ T356] [ 778.735197] [ 655.203106] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.030242][ T356] > [ 903.040809][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.044327][ T356] [ 778.746898] [ 655.214214] [ 531.584847] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.044337][ T356] > [ 903.048732][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.066039][ T356] [ 778.751162] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.066050][ T356] > [ 903.070123][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.077137][ T356] [ 778.768744] [ 655.227204] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.077147][ T356] > [ 903.092792][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.095313][ T356] [ 778.773836] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.095323][ T356] > [ 903.106788][ T356] [ 778.781780] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.106798][ T356] > [ 903.121986][ T356] [ 778.791183] [ 655.230287] [ 531.589973] [ 407.116952] [ 283.068815] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.121997][ T356] > [ 903.134120][ T356] [ 778.797950] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.134130][ T356] > [ 903.148741][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.151626][ T356] [ 778.805164] [ 655.241173] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.151637][ T356] > [ 903.160126][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.166936][ T356] [ 778.818740] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.166946][ T356] > [ 903.175785][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.185955][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5 > [ 903.189448][ T356] [ 778.825864] [ 655.250559] [ 531.607066] [ 407.120936] tasks-torture:torture_onoff task: online 0 failed: errno -5 > > These multi-timestamp lines are odd. And they are almost exactly 124 > seconds apart. Yup, this looks suspicious. Here are my few ideas. Feel free to ignore them and do your own debugging. I do not want to distract you. The test produces the same messages over and over again. It is possible that something has overflown after some amount of messages. And the regular intervals are just a coincidence. Are all messages broken this way? Or does it start later? Are this exact messages printed correctly at some points? Is this console output? Or dmesg? Are you able to reproduce it with the current Linus's master. I wonder if it is something that we have already fixed. Best Regards, Petr