From: Dan Williams
Date: Thu, 27 Jun 2019 12:09:29 -0700
Subject: Re: [PATCH] filesystem-dax: Disable PMD support
To: Matthew Wilcox
Cc: linux-nvdimm, Jan Kara, stable, Robert Barror, Seema Pandit,
    linux-fsdevel, Linux Kernel Mailing List
References: <156159454541.2964018.7466991316059381921.stgit@dwillia2-desk3.amr.corp.intel.com>
    <20190627123415.GA4286@bombadil.infradead.org>

On Thu, Jun 27, 2019 at 11:58 AM Dan Williams wrote:
>
> On Thu, Jun 27, 2019 at 11:29 AM Dan Williams wrote:
> >
> > On Thu, Jun 27, 2019 at 9:06 AM Dan Williams wrote:
> > >
> > > On Thu, Jun 27, 2019 at 5:34 AM Matthew Wilcox wrote:
> > > >
> > > > On Wed, Jun 26, 2019 at 05:15:45PM -0700, Dan Williams wrote:
> > > > > Ever since the conversion of DAX to the Xarray a RocksDB benchmark has
> > > > > been encountering intermittent lockups. The backtraces always include
> > > > > the filesystem-DAX PMD path, multi-order entries have been a source of
> > > > > bugs in the past, and disabling the PMD path allows a test that fails in
> > > > > minutes to run for an hour.
> > > >
> > > > On May 4th, I asked you:
> > > >
> > > >   Since this is provoked by a fatal signal, it must have something to do
> > > >   with a killable or interruptible sleep. There's only one of those in the
> > > >   DAX code; fatal_signal_pending() in dax_iomap_actor(). Does rocksdb do
> > > >   I/O with write() or through a writable mmap()? I'd like to know before
> > > >   I chase too far down this fault tree analysis.
> > >
> > > RocksDB in this case is using write() for writes and mmap() for reads.
> >
> > It's not clear to me that a fatal signal is a component of the failure
> > as much as it's the way to detect that the benchmark has indeed locked
> > up.
>
> Even though db_bench is run with the mmap_read=1 option:
>
> cmd="${rocksdb_dir}/db_bench $params_r --benchmarks=readwhilewriting \
>     --use_existing_db=1 \
>     --mmap_read=1 \
>     --num=$num_keys \
>     --threads=$num_read_threads \
>
> When the lockup occurs there are db_bench processes in the write fault path:
>
> [ 1666.635212] db_bench        D    0  2492   2435 0x00000000
> [ 1666.641339] Call Trace:
> [ 1666.644072]  ? __schedule+0x24f/0x680
> [ 1666.648162]  ? __switch_to_asm+0x34/0x70
> [ 1666.652545]  schedule+0x29/0x90
> [ 1666.656054]  get_unlocked_entry+0xcd/0x120
> [ 1666.660629]  ? dax_iomap_actor+0x270/0x270
> [ 1666.665206]  grab_mapping_entry+0x14f/0x230
> [ 1666.669878]  dax_iomap_pmd_fault.isra.42+0x14d/0x950
> [ 1666.675425]  ? futex_wait+0x122/0x230
> [ 1666.679518]  ext4_dax_huge_fault+0x16f/0x1f0
> [ 1666.684288]  __handle_mm_fault+0x411/0x1350
> [ 1666.688961]  ? do_futex+0xca/0xbb0
> [ 1666.692760]  ? __switch_to_asm+0x34/0x70
> [ 1666.697144]  handle_mm_fault+0xbe/0x1e0
> [ 1666.701429]  __do_page_fault+0x249/0x4f0
> [ 1666.705811]  do_page_fault+0x32/0x110
> [ 1666.709903]  ? page_fault+0x8/0x30
> [ 1666.713702]  page_fault+0x1e/0x30
>
> ...where __handle_mm_fault+0x411 is in wp_huge_pmd():
>
> (gdb) li *(__handle_mm_fault+0x411)
> 0xffffffff812713d1 is in __handle_mm_fault (mm/memory.c:3800).
> 3795    static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
> 3796    {
> 3797            if (vma_is_anonymous(vmf->vma))
> 3798                    return do_huge_pmd_wp_page(vmf, orig_pmd);
> 3799            if (vmf->vma->vm_ops->huge_fault)
> 3800                    return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
> 3801
> 3802            /* COW handled on pte level: split pmd */
> 3803            VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
> 3804            __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>
> This bug feels like we failed to unlock, or unlocked the wrong entry,
> and this hunk in the bisected commit looks suspect to me. Why do we
> still need to drop the lock now that the radix_tree_preload() calls
> are gone?

Nevermind, unmap_mapping_pages() takes a sleeping lock, but then I
wonder why we don't restart the lookup like the old implementation.
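For readers following along, the "restart the lookup" pattern being alluded
to, i.e. drop the xa_lock around the sleeping unmap_mapping_pages() call and
then re-walk the mapping instead of trusting the entry loaded before the lock
was released, looks roughly like the sketch below. This is a simplified,
hypothetical illustration and not the actual fs/dax.c code (old or new):
grab_entry_sketch(), sketch_is_pmd_entry() and sketch_needs_pte_downgrade()
are made-up stand-ins; only the xarray primitives (xas_lock_irq(),
xas_unlock_irq(), xas_load(), xas_reset()) and unmap_mapping_pages() are real
kernel interfaces.

/*
 * Hypothetical sketch of drop-the-lock-then-retry-the-lookup: if a sleeping
 * call such as unmap_mapping_pages() forces the xa_lock to be dropped while
 * downgrading a PMD entry, re-walk the xarray afterwards rather than reusing
 * the entry pointer that was loaded before the lock was released.
 */
#include <linux/mm.h>
#include <linux/xarray.h>

/* Placeholder predicates standing in for the fs/dax.c entry helpers. */
static inline bool sketch_is_pmd_entry(void *entry) { return false; }
static inline bool sketch_needs_pte_downgrade(void *entry) { return false; }

static void *grab_entry_sketch(struct xa_state *xas,
			       struct address_space *mapping, pgoff_t index)
{
	void *entry;

retry:
	xas_lock_irq(xas);
	entry = xas_load(xas);

	if (entry && sketch_is_pmd_entry(entry) &&
	    sketch_needs_pte_downgrade(entry)) {
		/* unmap_mapping_pages() may sleep, so drop the lock... */
		xas_unlock_irq(xas);
		unmap_mapping_pages(mapping,
				    index & ~((PMD_SIZE >> PAGE_SHIFT) - 1),
				    PMD_SIZE >> PAGE_SHIFT, false);
		/*
		 * ...and once the lock has been dropped another thread may
		 * have locked, freed or replaced the entry, so restart the
		 * walk instead of reusing the now-stale pointer.
		 */
		xas_reset(xas);
		goto retry;
	}

	/* lock the entry / install a new one here (elided) */
	xas_unlock_irq(xas);
	return entry;
}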