From: Dan Williams
Date: Thu, 27 Jun 2019 11:58:12 -0700
Subject: Re: [PATCH] filesystem-dax: Disable PMD support
To: Matthew Wilcox
Cc: linux-nvdimm, Jan Kara, stable, Robert Barror, Seema Pandit,
    linux-fsdevel, Linux Kernel Mailing List
References: <156159454541.2964018.7466991316059381921.stgit@dwillia2-desk3.amr.corp.intel.com>
            <20190627123415.GA4286@bombadil.infradead.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jun 27, 2019 at 11:29 AM Dan Williams wrote:
>
> On Thu, Jun 27, 2019 at 9:06 AM Dan Williams wrote:
> >
> > On Thu, Jun 27, 2019 at 5:34 AM Matthew Wilcox wrote:
> > >
> > > On Wed, Jun 26, 2019 at 05:15:45PM -0700, Dan Williams wrote:
> > > > Ever since the conversion of DAX to the Xarray a RocksDB benchmark has
> > > > been encountering intermittent lockups.
> > > > The backtraces always include the filesystem-DAX PMD path, multi-order
> > > > entries have been a source of bugs in the past, and disabling the PMD
> > > > path allows a test that fails in minutes to run for an hour.
> > >
> > > On May 4th, I asked you:
> > >
> > >   Since this is provoked by a fatal signal, it must have something to do
> > >   with a killable or interruptible sleep. There's only one of those in
> > >   the DAX code; fatal_signal_pending() in dax_iomap_actor(). Does rocksdb
> > >   do I/O with write() or through a writable mmap()? I'd like to know
> > >   before I chase too far down this fault tree analysis.
> >
> > RocksDB in this case is using write() for writes and mmap() for reads.
>
> It's not clear to me that a fatal signal is a component of the failure
> as much as it's the way to detect that the benchmark has indeed locked
> up.

Even though db_bench is run with the mmap_read=1 option:

cmd="${rocksdb_dir}/db_bench $params_r --benchmarks=readwhilewriting \
    --use_existing_db=1 \
    --mmap_read=1 \
    --num=$num_keys \
    --threads=$num_read_threads \

...when the lockup occurs there are db_bench processes stuck in the write
fault path:

[ 1666.635212] db_bench        D    0  2492   2435 0x00000000
[ 1666.641339] Call Trace:
[ 1666.644072]  ? __schedule+0x24f/0x680
[ 1666.648162]  ? __switch_to_asm+0x34/0x70
[ 1666.652545]  schedule+0x29/0x90
[ 1666.656054]  get_unlocked_entry+0xcd/0x120
[ 1666.660629]  ? dax_iomap_actor+0x270/0x270
[ 1666.665206]  grab_mapping_entry+0x14f/0x230
[ 1666.669878]  dax_iomap_pmd_fault.isra.42+0x14d/0x950
[ 1666.675425]  ? futex_wait+0x122/0x230
[ 1666.679518]  ext4_dax_huge_fault+0x16f/0x1f0
[ 1666.684288]  __handle_mm_fault+0x411/0x1350
[ 1666.688961]  ? do_futex+0xca/0xbb0
[ 1666.692760]  ? __switch_to_asm+0x34/0x70
[ 1666.697144]  handle_mm_fault+0xbe/0x1e0
[ 1666.701429]  __do_page_fault+0x249/0x4f0
[ 1666.705811]  do_page_fault+0x32/0x110
[ 1666.709903]  ? page_fault+0x8/0x30
[ 1666.713702]  page_fault+0x1e/0x30

...where __handle_mm_fault+0x411 is in wp_huge_pmd():

(gdb) li *(__handle_mm_fault+0x411)
0xffffffff812713d1 is in __handle_mm_fault (mm/memory.c:3800).
3795    static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
3796    {
3797            if (vma_is_anonymous(vmf->vma))
3798                    return do_huge_pmd_wp_page(vmf, orig_pmd);
3799            if (vmf->vma->vm_ops->huge_fault)
3800                    return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
3801
3802            /* COW handled on pte level: split pmd */
3803            VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
3804            __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);

This bug feels like we either failed to unlock, or unlocked the wrong
entry, and this hunk in the bisected commit looks suspect to me. Why do we
still need to drop the lock now that the radix_tree_preload() calls are
gone?

        /*
         * Besides huge zero pages the only other thing that gets
         * downgraded are empty entries which don't need to be
         * unmapped.
         */
-       if (pmd_downgrade && dax_is_zero_entry(entry))
-               unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
-                               PG_PMD_NR, false);
-
-       err = radix_tree_preload(
-                       mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
-       if (err) {
-               if (pmd_downgrade)
-                       put_locked_mapping_entry(mapping, index);
-               return ERR_PTR(err);
-       }
-       xa_lock_irq(&mapping->i_pages);
-
-       if (!entry) {
-               /*
-                * We needed to drop the i_pages lock while calling
-                * radix_tree_preload() and we didn't have an entry to
-                * lock. See if another thread inserted an entry at
-                * our index during this time.
-                */
-               entry = __radix_tree_lookup(&mapping->i_pages, index,
-                               NULL, &slot);
-               if (entry) {
-                       radix_tree_preload_end();
-                       xa_unlock_irq(&mapping->i_pages);
-                       goto restart;
-               }
+       if (dax_is_zero_entry(entry)) {
+               xas_unlock_irq(xas);
+               unmap_mapping_pages(mapping,
+                               xas->xa_index & ~PG_PMD_COLOUR,
+                               PG_PMD_NR, false);
+               xas_reset(xas);
+               xas_lock_irq(xas);
        }

-       if (pmd_downgrade) {
-               dax_disassociate_entry(entry, mapping, false);
-               radix_tree_delete(&mapping->i_pages, index);
-               mapping->nrexceptional--;
-               dax_wake_mapping_entry_waiter(&mapping->i_pages,
-                               index, entry, true);
-       }
+       dax_disassociate_entry(entry, mapping, false);
+       xas_store(xas, NULL);   /* undo the PMD join */
+       dax_wake_entry(xas, entry, true);
+       mapping->nrexceptional--;
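
To make the "unlocked the wrong entry" suspicion concrete, here is a toy
userspace model (plain pthreads, none of this is kernel code, and the
queue-selection rule below is purely hypothetical) of the interleaving I am
worried about: the fault path keys its sleep off the entry it finds, the
entry is replaced while the lock is dropped, and the eventual wakeup is
keyed off the new value, so the sleeper in get_unlocked_entry() never hears
it:

/* wrong_wakeup.c: build with gcc -O2 -pthread wrong_wakeup.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define ENTRY_LOCKED   0x1UL
#define ENTRY_PMD      0x2UL

static pthread_mutex_t xa_lock = PTHREAD_MUTEX_INITIALIZER;
/* two wait queues, stand-ins for hashed entry waitqueues */
static pthread_cond_t waitq[2] = { PTHREAD_COND_INITIALIZER,
                                   PTHREAD_COND_INITIALIZER };
static unsigned long entry = ENTRY_LOCKED | ENTRY_PMD; /* locked PMD entry */
static int fault_done;

/* hypothetical rule: PMD and PTE entries sleep on different queues */
static int entry_waitqueue(unsigned long e)
{
        return (e & ENTRY_PMD) ? 1 : 0;
}

/* the "get_unlocked_entry()" side: sleep until the entry is unlocked */
static void *fault_path(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&xa_lock);
        while (entry & ENTRY_LOCKED) {
                int q = entry_waitqueue(entry); /* keyed off the entry we saw */
                printf("fault:     sleeping on waitq[%d]\n", q);
                pthread_cond_wait(&waitq[q], &xa_lock);
        }
        fault_done = 1;
        printf("fault:     entry unlocked, proceeding\n");
        pthread_mutex_unlock(&xa_lock);
        return NULL;
}

/* the "downgrade" side: replace the entry, then wake waiters */
static void *downgrade_path(void *arg)
{
        (void)arg;
        sleep(1); /* let the fault path queue itself first */
        pthread_mutex_lock(&xa_lock);
        entry = 0;                      /* store the new (unlocked, PTE) value */
        int q = entry_waitqueue(entry); /* keyed off the *new* value: queue 0 */
        printf("downgrade: waking waitq[%d]\n", q);
        pthread_cond_broadcast(&waitq[q]);
        pthread_mutex_unlock(&xa_lock);
        return NULL;
}

int main(void)
{
        pthread_t f, d;

        pthread_create(&f, NULL, fault_path, NULL);
        pthread_create(&d, NULL, downgrade_path, NULL);
        pthread_join(d, NULL);
        sleep(2);

        pthread_mutex_lock(&xa_lock);
        printf("main:      entry=%#lx, fault path %s\n", entry,
               fault_done ? "completed" : "is still asleep: lost wakeup");
        pthread_mutex_unlock(&xa_lock);
        return 0; /* exit reaps the stranded fault thread */
}

If something like that is what is happening here, the question above
stands: either the unmap should not require dropping the lock, or whatever
gets woken after the lock is retaken needs to match the key the sleeping
fault path used.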