Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932251AbdHWP2o (ORCPT ); Wed, 23 Aug 2017 11:28:44 -0400 Received: from mga11.intel.com ([192.55.52.93]:15079 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932090AbdHWP2l (ORCPT ); Wed, 23 Aug 2017 11:28:41 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.41,417,1498546800"; d="scan'208";a="127423749" Date: Wed, 23 Aug 2017 09:28:39 -0600 From: Ross Zwisler To: Jan Kara Cc: Ross Zwisler , Andrew Morton , linux-kernel@vger.kernel.org, Alexander Viro , Christoph Hellwig , Dan Williams , Dave Chinner , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org, "Slusarz, Marcin" , stable@vger.kernel.org Subject: Re: [PATCH v2 1/2] dax: fix deadlock due to misaligned PMD faults Message-ID: <20170823152839.GA25999@linux.intel.com> Mail-Followup-To: Ross Zwisler , Jan Kara , Andrew Morton , linux-kernel@vger.kernel.org, Alexander Viro , Christoph Hellwig , Dan Williams , Dave Chinner , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org, "Slusarz, Marcin" , stable@vger.kernel.org References: <20170822220926.13799-1-ross.zwisler@linux.intel.com> <20170822222436.18926-1-ross.zwisler@linux.intel.com> <20170823095733.GA2100@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170823095733.GA2100@quack2.suse.cz> User-Agent: Mutt/1.8.3 (2017-05-23) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5736 Lines: 125 On Wed, Aug 23, 2017 at 11:57:33AM +0200, Jan Kara wrote: > On Tue 22-08-17 16:24:35, Ross Zwisler wrote: > > In DAX there are two separate places where the 2MiB range of a PMD is > > defined. > > > > The first is in the page tables, where a PMD mapping inserted for a given > > address spans from (vmf->address & PMD_MASK) to > > ((vmf->address & PMD_MASK) + PMD_SIZE - 1). That is, from the 2MiB > > boundary below the address to the 2MiB boundary above the address. > > > > So, for example, a fault at address 3MiB (0x30 0000) falls within the PMD > > that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000). > > > > The second PMD range is in the mapping->page_tree, where a given file > > offset is covered by a radix tree entry that spans from one 2MiB aligned > > file offset to another 2MiB aligned file offset. > > > > So, for example, the file offset for 3MiB (pgoff 768) falls within the PMD > > range for the order 9 radix tree entry that ranges from 2MiB (pgoff 512) to > > 4MiB (pgoff 1024). > > > > This system works so long as the addresses and file offsets for a given > > mapping both have the same offsets relative to the start of each PMD. > > > > Consider the case where the starting address for a given file isn't 2MiB > > aligned - say our faulting address is 3 MiB (0x30 0000), but that > > corresponds to the beginning of our file (pgoff 0). Now all the PMDs in > > the mapping are misaligned so that the 2MiB range defined in the page > > tables never matches up with the 2MiB range defined in the radix tree. > > > > The current code notices this case for DAX faults to storage with the > > following test in dax_pmd_insert_mapping(): > > > > if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR) > > goto unlock_fallback; > > > > This test makes sure that the pfn we get from the driver is 2MiB aligned, > > and relies on the assumption that the 2MiB alignment of the pfn we get back > > from the driver matches the 2MiB alignment of the faulting address. > > > > However, faults to holes were not checked and we could hit the problem > > described above. > > > > This was reported in response to the NVML nvml/src/test/pmempool_sync > > TEST5: > > > > $ cd nvml/src/test/pmempool_sync > > $ make TEST5 > > > > You can grab NVML here: > > > > https://github.com/pmem/nvml/ > > > > The dmesg warning you see when you hit this error is: > > > > WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310 > > > > Where we notice in dax_insert_mapping_entry() that the radix tree entry we > > are about to replace doesn't match the locked entry that we had previously > > inserted into the tree. This happens because the initial insertion was > > done in grab_mapping_entry() using a pgoff calculated from the faulting > > address (vmf->address), and the replacement in > > dax_pmd_load_hole() => dax_insert_mapping_entry() is done using vmf->pgoff. > > > > In our failure case those two page offsets (one calculated from > > vmf->address, one using vmf->pgoff) point to different order 9 radix tree > > entries. > > > > This failure case can result in a deadlock because the radix tree unlock > > also happens on the pgoff calculated from vmf->address. This means that > > the locked radix tree entry that we swapped in to the tree in > > dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all > > future faults to that 2MiB range will block forever. > > > > Fix this by validating that the faulting address's PMD offset matches the > > PMD offset from the start of the file. This check is done at the very > > beginning of the fault and covers faults that would have mapped to storage > > as well as faults to holes. I left the COLOUR check in > > dax_pmd_insert_mapping() in place in case we ever hit the insanity > > condition where the alignment of the pfn we get from the driver doesn't > > match the alignment of the userspace address. > > Hum, I'm confused with all these offsets and their alignment requirements. > So let me try to get this straight. We have three different things here: > > 1) virtual address A where we fault > 2) file offset FA corresponding to the virtual address - computed as > linear_page_index(vma, A) = (A - vma->start) >> PAGE_SHIFT + vma->pgoff > - and stored in vmf->pgoff > 3) physical address (or sector in filesystem terminology) the file offset > maps to. > > and then we have the aligned virtual address B = (A & PMD_MASK) and > corresponding file offset FB we've got from linear_page_index(vma, B). Now > if I've correctly understood what you've written above, the problem is that > although B is aligned, FB doesn't have to be. That can happen either when > vma->start is not aligned (which is the example you give above?) or when > vma->pgoff is non aligned. And as a result FA >> 9 != FB >> 9 leading to > issues. > > OK, makes sense. You can add: > > Reviewed-by: Jan Kara Yep, your understanding matches mine. What was happening in one specific failure in the NVML test was: vmf->vm_pgoff = 0x1 /* the vma's page offset from the start of the file */ file offset FA = vmf->pgoff = 0x1200 address A = 0x7f7a8edff000 So, as you say the 0x1200 pgoff is calculated via vmf->pgoff = (A - vma->start) >> PAGE_SHIFT + vma->vm_pgoff 0x1200 = (0x7f7a8edff000 - 0x7f7a8dc00000) >> PAGE_SHIFT + 1 So, when we get the aligned virtual address B = (A & PMD_MASK) in dax_iomap_pmd_fault() and use that to get the aligned page offset: aligned pgoff FB = (B - vma->start) >> PAGE_SHIFT + vma->vm_pgoff 0x1001 = (0x7f7a8ec00000 - 0x7f7a8dc00000) >> PAGE_SHIFT + 1 The aligned pgoff FB of 0x1001 is a different PMD in the radix tree than we get when we use the vmf->pgoff FA of 0x1200. - Ross