From: John Ogness
To: kernel test robot
Cc: Petr Mladek, Sergey Senozhatsky, LKML, lkp@lists.01.org,
    lkp@intel.com, zhengjun.xing@linux.intel.com
Subject: Re: [printk] b031a684bf: INFO:rcu_tasks_detected_stalls_on_tasks
In-Reply-To: <20210122081311.GA12834@xsang-OptiPlex-9020>
References: <20210122081311.GA12834@xsang-OptiPlex-9020>
Date: Tue, 02 Feb 2021 18:19:40 +0106
Message-ID: <8735yev1dn.fsf@jogness.linutronix.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2021-01-22, kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-9):
>
> commit: b031a684bfd01d633c79d281bd0cf11c2f834ada ("printk: remove logbuf_lock writer-protection of ringbuffer")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

I am still investigating this in my spare cycles. I have tried to reduce
the changes in order to determine what is causing the problem.
For example, instead of applying the problematic commit, I can reproduce
the problem with:

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 150bfde41ba1..17a92a76cae2 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1943,7 +1943,10 @@ int vprintk_store(int facility, int level,
 	 * prb_reserve_in_last() and prb_reserve() purposely invalidate the
 	 * structure when they fail.
 	 */
-	prb_rec_init_wr(&r, text_len);
+	if (text_len < 200)
+		prb_rec_init_wr(&r, 200);
+	else
+		prb_rec_init_wr(&r, text_len);
 	if (!prb_reserve(&e, prb, &r)) {
 		/* truncate the message if it is too long for empty buffer */
 		truncate_msg(&text_len, &trunc_msg_len);

which simply makes sure that datablocks of at least 200 bytes are
allocated.
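(For anyone not familiar with the ringbuffer writer API, the reservation
that the diff above is enlarging works roughly like this. This is a
simplified sketch from memory, not the actual vprintk_store() code; the
prb_reserve_in_last() continuation path and the truncation fallback are
left out.)

	struct prb_reserved_entry e;
	struct printk_record r;

	/* ask for a datablock with room for @text_len bytes of text */
	prb_rec_init_wr(&r, text_len);

	/* reserve the datablock within the ringbuffer */
	if (!prb_reserve(&e, prb, &r))
		return 0;

	/* on success, r.text_buf points into the reserved datablock */
	memcpy(&r.text_buf[0], text, text_len);
	r.info->text_len = text_len;

	/* commit/finalize so that readers can see the record */
	prb_final_commit(&e);

So the hack only changes the size passed to prb_rec_init_wr(), i.e. the
minimum datablock size that prb_reserve() hands out. The commit path and
the reader side are untouched.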
This totally different change causes the _exact_ same rcu stall
behavior/pattern as the problematic commit that removes the spinlock
usage. That pattern is:

1. printk() is called about 65 times per second with:

   "tasks-torture:torture_onoff task: online 0 failed: errno -5"

2. About 60 seconds after "rmmod rcutorture" is called, the following
   rcu warning appears:

[ 390.440235] ??? Writer stall state RTWS_SYNC(7) g0 f0x0 ->state 0x2 cpu 1
[ 390.441250] task:rcu_torture_wri state:D stack: 0 pid: 505 ppid: 2 flags:0x00004000
[ 390.442473] Call Trace:
[ 390.442938]  __schedule+0xa12/0xab5
[ 390.443495]  ? firmware_map_remove+0xd1/0xd1
[ 390.444151]  ? ttwu_do_wakeup+0x1d/0x2dd
[ 390.444764]  schedule+0x16c/0x1df
[ 390.445311]  schedule_timeout+0x77/0x1b4
[ 390.445911]  ? usleep_range+0x130/0x130
[ 390.446522]  ? do_raw_spin_lock+0x74/0xbf
[ 390.447180]  ? queued_read_unlock+0x19/0x19
[ 390.447825]  ? __prepare_to_swait+0xa4/0x141
[ 390.448491]  __wait_for_common+0x22d/0x337
[ 390.449139]  ? usleep_range+0x130/0x130
[ 390.450860]  ? _raw_spin_lock_irqsave+0x81/0xd1
[ 390.451543]  ? bit_wait_io_timeout+0x8a/0x8a
[ 390.452196]  __wait_rcu_gp+0x20d/0x229
[ 390.452788]  synchronize_rcu_tasks_generic.isra.0+0x7e/0xb2
[ 390.453604]  ? rcu_tasks_rude_wait_gp+0x3d/0x3d
[ 390.454334]  ? call_rcu_tasks_generic+0x134/0x134
[ 390.455055]  ? rcu_torture_alloc+0x50/0xbc [rcutorture]
[ 390.455847]  ? rcu_unexpedite_gp+0x19/0x19
[ 390.456488]  ? delay_tsc+0x41/0x7c
[ 390.457053]  rcu_torture_writer+0x815/0xaea [rcutorture]
[ 390.457889]  ? rcu_torture_pipe_update+0x100/0x100 [rcutorture]
[ 390.458801]  ? test_bit+0x15/0x1f
[ 390.459349]  ? __kthread_parkme+0xb2/0x142
[ 390.459968]  kthread+0x2ff/0x30e
[ 390.460539]  ? rcu_torture_pipe_update+0x100/0x100 [rcutorture]
[ 390.461408]  ? kthread_create_worker_on_cpu+0xce/0xce
[ 390.462171]  ret_from_fork+0x22/0x30

3. About 9 minutes later, two stalls are reported:

[ 926.880657] INFO: rcu_tasks detected stalls on tasks:
[ 926.881387] 000000003a288f24: .. nvcsw: 0/0 holdout: 1 idle_cpu: -1/1
[ 926.882313] task:dmesg state:R running task stack: 0 pid: 1717 ppid: 497 flags:0x20020000
[ 926.883727] Call Trace:
[ 926.884148]  __schedule+0xa12/0xab5
[ 926.884706]  ? firmware_map_remove+0xd1/0xd1
[ 926.885355]  ? ksys_read+0x116/0x150
[ 926.885930]  schedule+0x16c/0x1df
[ 926.886458]  exit_to_user_mode_loop+0x28/0x84
[ 926.887123]  exit_to_user_mode_prepare+0x1d/0x4a
[ 926.887823]  syscall_exit_to_user_mode+0x41/0x4f
[ 926.888513]  entry_INT80_compat+0x71/0x76
[ 926.889131] RIP: 0023:0xf7fcca02
[ 926.889658] RSP: 002b:00000000ffd230c4 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 926.890780] RAX: 0000000000000050 RBX: 0000000000000004 RCX: 0000000056601234
[ 926.891720] RDX: 0000000000001fff RSI: 00000000000001e0 RDI: 0000000056601234
[ 926.892724] RBP: 00000000ffd231d8 R08: 0000000000000000 R09: 0000000000000000
[ 926.893733] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 926.894745] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

and

[ 926.931950] 00000000a4109e4e: .. nvcsw: 21326/21326 holdout: 1 idle_cpu: -1/1
[ 926.932883] task:dmesg state:R running task stack: 0 pid: 447 ppid: 442 flags:0x20020000
[ 926.934286] Call Trace:
[ 926.934714]  __schedule+0xa12/0xab5
[ 926.935268]  ? firmware_map_remove+0xd1/0xd1
[ 926.935919]  ? ksys_read+0x116/0x150
[ 926.936478]  schedule+0x16c/0x1df
[ 926.937012]  exit_to_user_mode_loop+0x28/0x84
[ 926.937674]  exit_to_user_mode_prepare+0x1d/0x4a
[ 926.938363]  syscall_exit_to_user_mode+0x41/0x4f
[ 926.939059]  entry_INT80_compat+0x71/0x76
[ 926.939680] RIP: 0023:0xf7f9da02
[ 926.940193] RSP: 002b:00000000ffdb2864 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 926.941301] RAX: ffffffffffffffe0 RBX: 0000000000000003 RCX: 0000000056659234
[ 926.942307] RDX: 0000000000001fff RSI: 00000000000001e0 RDI: 0000000056659234
[ 926.943312] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
[ 926.944313] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 926.945314] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

This pattern is _always_ the same, whether I use my simple change or the
problematic commit. Obviously the removal of the spinlock usage is not
the issue. But I am concerned that the ringbuffer is somehow involved. I
have tried to reproduce the problem with non-ringbuffer activity, but
without success.

Also, the problem disappears if a newer kernel is used. So perhaps
something was fixed in rcu or in an rcu user. Still, it is very odd that
the ringbuffer is triggering it. I will continue investigating.

Also, I plan to send a patch to lkp so that the test script is not doing:

    dmesg > /dev/kmsg

Although this may be a great test for printk, for rcutorture it would be
more appropriate to do something like:

    dmesg > /tmpfile
    cat /tmpfile > /dev/kmsg

to avoid the endless read/feed cycle.

John Ogness