Date: Fri, 24 Nov 2017 02:23:44 -0500
From: baiyaowei <baiyaowei@cmss.chinamobile.com>
To: tj@kernel.org, jiangshanlai@gmail.com
Cc: akpm@linux-foundation.org, baiyaowei@cmss.chinamobile.com, linux-kernel@vger.kernel.org
Subject: NULL pointer dereference in process_one_work
Message-ID: <20171124072344.GA313@byw>

Hi tj and jiangshan,

I built a ceph storage pool to run some benchmarks with a 3.10 kernel.
Occasionally, when the CPU load is very high, some nodes crash with the
message below.

[292273.612014] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[292273.612057] IP: [] process_one_work+0x31/0x470
[292273.612087] PGD 0
[292273.612099] Oops: 0000 [#1] SMP
[292273.612117] Modules linked in: rbd(OE) bcache(OE) ip_vs xfs xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bonding intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mxm_wmi iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf pcspkr ipmi_ssif mei_me sg lpc_ich mei sb_edac ipmi_si mfd_core edac_core ipmi_msghandler shpchp wmi acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper
[292273.612495] crct10dif_pclmul crct10dif_common ttm crc32c_intel drm ahci nvme bnx2x libahci i2c_core libata mdio libcrc32c megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
[292273.612580] CPU: 16 PID: 353223 Comm: kworker/16:2 Tainted: G OE ------------ 3.10.0-327.el7.x86_64 #1
[292273.612620] Hardware name: Dell Inc. PowerEdge R730xd/0WCJNT, BIOS 2.4.3 01/17/2017
[292273.612655] task: ffff8801f55e6780 ti: ffff882a199b0000 task.ti: ffff882a199b0000
[292273.612685] RIP: 0010:[] [] process_one_work+0x31/0x470
[292273.612721] RSP: 0018:ffff882a199b3e28 EFLAGS: 00010046
[292273.612743] RAX: 0000000000000000 RBX: ffff88088b273028 RCX: ffff882a199b3fd8
[292273.612771] RDX: 0000000000000000 RSI: ffff88088b273028 RDI: ffff88088b273000
[292273.612799] RBP: ffff882a199b3e60 R08: 0000000000000000 R09: 0000000000000770
[292273.612827] R10: ffff8822a3bb1f80 R11: ffff8822a3bb1f80 R12: ffff88088b273000
[292273.612855] R13: ffff881fff313fc0 R14: 0000000000000000 R15: ffff881fff313fc0
[292273.612883] FS: 0000000000000000(0000) GS:ffff881fff300000(0000) knlGS:0000000000000000
[292273.612914] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[292273.612937] CR2: 00000000000000b8 CR3: 000000000194a000 CR4: 00000000003407e0
[292273.612965] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[292273.612994] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[292273.613021] Stack:
[292273.613031] 00000000ff313fd8 0000000000000000 ffff881fff313fd8 000188088b273030
[292273.613069] ffff8801f55e6780 ffff88088b273000 ffff881fff313fc0 ffff882a199b3ec0
[292273.613108] ffffffff8109e4cc ffff882a199b3fd8 ffff882a199b3fd8 ffff8801f55e6780
[292273.613146] Call Trace:
[292273.613160] [] worker_thread+0x21c/0x400
[292273.613185] [] ? rescuer_thread+0x400/0x400
[292273.613212] [] kthread+0xcf/0xe0
[292273.613234] [] ? kthread_create_on_node+0x140/0x140
[292273.613263] [] ret_from_fork+0x58/0x90
[292273.613287] [] ? kthread_create_on_node+0x140/0x140
[292273.614303] Code: 48 89 e5 41 57 41 56 45 31 f6 41 55 41 54 49 89 fc 53 48 89 f3 48 83 ec 10 48 8b 06 4c 8b 6f 48 48 89 c2 30 d2 a8 04 4c 0f 45 f2 <49> 8b 46 08 44 8b b8 00 01 00 00 41 c1 ef 05 44 89 f8 83 e0 01
[292273.617971] RIP [] process_one_work+0x31/0x470
[292273.620011] RSP
[292273.621940] CR2: 0000000000000008
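
If I read it right, the fault address decodes cleanly against the top of
process_one_work() in 3.10 (the same lines my patch at the end of this
mail touches). Here are those first statements, annotated with how I
think they map onto the disassembly shown further below:

static void process_one_work(struct worker *worker, struct work_struct *work)
{
	/* get_work_pwq() returns NULL when work->data does not carry
	 * WORK_STRUCT_PWQ (0x4) -- the "test $0x4,%al; cmovne %rdx,%r14"
	 * pair in the disassembly below, which leaves %r14 == 0 here. */
	struct pool_workqueue *pwq = get_work_pwq(work);
	struct worker_pool *pool = worker->pool;

	/* pwq->wq is the faulting "mov 0x8(%r14),%rax": with pwq == NULL
	 * this reads address 0x8, exactly the reported fault address. */
	bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
	...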

Some crash messages:

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/3.10.0-327.el7.x86_64/vmlinux
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 32
        DATE: Wed Oct 18 05:21:14 2017
      UPTIME: 3 days, 09:07:25
LOAD AVERAGE: 221.70, 222.22, 224.96
       TASKS: 3115
    NODENAME: node121
     RELEASE: 3.10.0-327.el7.x86_64
     VERSION: #1 SMP Thu Nov 19 22:10:57 UTC 2015
     MACHINE: x86_64 (2099 Mhz)
      MEMORY: 255.9 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000008"

crash> bt
PID: 353223  TASK: ffff8801f55e6780  CPU: 16  COMMAND: "kworker/16:2"
 #0 [ffff882a199b3af0] machine_kexec at ffffffff81051beb
 #1 [ffff882a199b3b50] crash_kexec at ffffffff810f2542
 #2 [ffff882a199b3c20] oops_end at ffffffff8163e1a8
 #3 [ffff882a199b3c48] no_context at ffffffff8162e2b8
 #4 [ffff882a199b3c98] __bad_area_nosemaphore at ffffffff8162e34e
 #5 [ffff882a199b3ce0] bad_area_nosemaphore at ffffffff8162e4b8
 #6 [ffff882a199b3cf0] __do_page_fault at ffffffff81640fce
 #7 [ffff882a199b3d48] do_page_fault at ffffffff81641113
 #8 [ffff882a199b3d70] page_fault at ffffffff8163d408
    [exception RIP: process_one_work+49]
    RIP: ffffffff8109d4b1  RSP: ffff882a199b3e28  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ffff88088b273028  RCX: ffff882a199b3fd8
    RDX: 0000000000000000  RSI: ffff88088b273028  RDI: ffff88088b273000
    RBP: ffff882a199b3e60   R8: 0000000000000000   R9: 0000000000000770
    R10: ffff8822a3bb1f80  R11: ffff8822a3bb1f80  R12: ffff88088b273000
    R13: ffff881fff313fc0  R14: 0000000000000000  R15: ffff881fff313fc0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff882a199b3e68] worker_thread at ffffffff8109e4cc
#10 [ffff882a199b3ec8] kthread at ffffffff810a5aef
#11 [ffff882a199b3f50] ret_from_fork at ffffffff81645858

crash> dis process_one_work
0xffffffff8109d480 <process_one_work>:     nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff8109d485 <process_one_work+5>:   push %rbp
0xffffffff8109d486 <process_one_work+6>:   mov %rsp,%rbp
0xffffffff8109d489 <process_one_work+9>:   push %r15
0xffffffff8109d48b <process_one_work+11>:  push %r14
0xffffffff8109d48d <process_one_work+13>:  xor %r14d,%r14d
0xffffffff8109d490 <process_one_work+16>:  push %r13
0xffffffff8109d492 <process_one_work+18>:  push %r12
0xffffffff8109d494 <process_one_work+20>:  mov %rdi,%r12
0xffffffff8109d497 <process_one_work+23>:  push %rbx
0xffffffff8109d498 <process_one_work+24>:  mov %rsi,%rbx
0xffffffff8109d49b <process_one_work+27>:  sub $0x10,%rsp
0xffffffff8109d49f <process_one_work+31>:  mov (%rsi),%rax
0xffffffff8109d4a2 <process_one_work+34>:  mov 0x48(%rdi),%r13
0xffffffff8109d4a6 <process_one_work+38>:  mov %rax,%rdx
0xffffffff8109d4a9 <process_one_work+41>:  xor %dl,%dl
0xffffffff8109d4ab <process_one_work+43>:  test $0x4,%al
0xffffffff8109d4ad <process_one_work+45>:  cmovne %rdx,%r14
0xffffffff8109d4b1 <process_one_work+49>:  mov 0x8(%r14),%rax
0xffffffff8109d4b5 <process_one_work+53>:  mov 0x100(%rax),%r15d
0xffffffff8109d4bc <process_one_work+60>:  shr $0x5,%r15d
0xffffffff8109d4c0 <process_one_work+64>:  mov %r15d,%eax
0xffffffff8109d4c3 <process_one_work+67>:  and $0x1,%eax
0xffffffff8109d4c6 <process_one_work+70>:  testb $0x80,0x58(%rdi)
0xffffffff8109d4ca <process_one_work+74>:  mov %al,-0x30(%rbp)
0xffffffff8109d4cd <process_one_work+77>:  jne 0xffffffff8109d4da <process_one_work+90>

crash> *work_struct ffff88088b273028
struct work_struct {
  data = {
    counter = 0
  },
  entry = {
    next = 0xffff88088b273030,
    prev = 0xffff88088b273030
  },
  func = 0xffff8801f55e6780
}

crash> *worker ffff88088b273000
struct worker {
  {
    entry = {
      next = 0x0,
      prev = 0x0
    },
    hentry = {
      next = 0x0,
      pprev = 0x0
    }
  },
  current_work = 0x0,
  current_func = 0x0,
  current_pwq = 0x0,
  desc_valid = false,
  scheduled = {
    next = 0xffff88088b273030,
    prev = 0xffff88088b273030
  },
  task = 0xffff8801f55e6780,
  pool = 0xffff881fff313fc0,
  last_active = 4586712874,
  flags = 1,
  id = 2,
  desc = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
  rescue_wq = 0x0
}
crash>

It looks like the work parameter is mistakenly pointing into the worker
parameter. I noticed one similar case reported before, though it is not
exactly the same:

https://lists.gt.net/linux/kernel/2022989
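
The pointer arithmetic seems consistent with list_first_entry() having
been run on an empty worker->scheduled list. Paraphrasing 3.10's
process_scheduled_works() (it does not show in the backtrace, so
presumably it was inlined into worker_thread()), with the offsets I
read out of the dumps above:

	while (!list_empty(&worker->scheduled)) {
		struct work_struct *work =
			list_first_entry(&worker->scheduled,
					 struct work_struct, entry);
		process_one_work(worker, work);
	}

	/* On an empty list, head->next points back at the head, so
	 * list_first_entry() degenerates to:
	 *
	 *	work = (struct work_struct *)((char *)&worker->scheduled
	 *			- offsetof(struct work_struct, entry));
	 *
	 * With this vmlinux:
	 *	&worker->scheduled = worker + 0x30 = 0xffff88088b273030
	 *	offsetof(struct work_struct, entry) = 0x8
	 *	=> work = worker + 0x28 = 0xffff88088b273028
	 *
	 * The fake work then overlays the tail of struct worker:
	 *	work->data  -> worker->desc_valid (0, so WORK_STRUCT_PWQ
	 *	               is clear and get_work_pwq() returns NULL)
	 *	work->entry -> worker->scheduled (self-pointing)
	 *	work->func  -> worker->task (0xffff8801f55e6780)
	 *
	 * which matches every field of the bogus work_struct above.
	 */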

I wonder how this situation could happen. When process_one_work() is
called, the work being processed should be on worker_pool->worklist,
but here that list is empty as well:

crash> *worker_pool 0xffff881fff313fc0
struct worker_pool {
  lock = {
    {
      rlock = {
        raw_lock = {
          {
            head_tail = 4142331618,
            tickets = {
              head = 63202,
              tail = 63206
            }
          }
        }
      }
    }
  },
  cpu = 16,
  node = 0,
  id = 32,
  flags = 0,
  worklist = {
    next = 0xffff881fff313fd8,
    prev = 0xffff881fff313fd8
  },
  nr_workers = 3,
  nr_idle = 2,
  idle_list = {
    next = 0xffff880245a18280,
    prev = 0xffff88078d3c5780
  },
  ...

Maybe we should bail out of process_one_work() when get_work_pwq(work)
returns NULL, or add a BUG_ON() like this:

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dde6298..82a92e0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2012,11 +2012,15 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 __releases(&pool->lock)
 __acquires(&pool->lock)
 {
-	struct pool_workqueue *pwq = get_work_pwq(work);
+	struct pool_workqueue *pwq;
 	struct worker_pool *pool = worker->pool;
-	bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
+	bool cpu_intensive;
 	int work_color;
 	struct worker *collision;
+
+	BUG_ON(!(pwq = get_work_pwq(work)));
+	cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
+
 #ifdef CONFIG_LOCKDEP

I'd really appreciate any ideas, and please let me know if you want any
more information from the crashed system.

Thanks,
Yaowei Bai