In DAX there are two separate places where the 2MiB range of a PMD is
defined.
The first is in the page tables, where a PMD mapping inserted for a given
address spans from (vmf->address & PMD_MASK) to
((vmf->address & PMD_MASK) + PMD_SIZE - 1). That is, from the 2MiB
boundary below the address to the 2MiB boundary above the address.
So, for example, a fault at address 3MiB (0x30 0000) falls within the PMD
that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).
The second PMD range is in the mapping->page_tree, where a given file
offset is covered by a radix tree entry that spans from one 2MiB aligned
file offset to another 2MiB aligned file offset.
So, for example, the file offset for 3MiB (pgoff 768) falls within the PMD
range for the order 9 radix tree entry that ranges from 2MiB (pgoff 512) to
4MiB (pgoff 1024).
This system works so long as the addresses and file offsets for a given
mapping both have the same offsets relative to the start of each PMD.
Consider the case where the starting address for a given file isn't 2MiB
aligned - say our faulting address is 3 MiB (0x30 0000), but that
corresponds to the beginning of our file (pgoff 0). Now all the PMDs in
the mapping are misaligned so that the 2MiB range defined in the page
tables never matches up with the 2MiB range defined in the radix tree.
The current code notices this case for DAX faults to storage with the
following test in dax_pmd_insert_mapping():
if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
goto unlock_fallback;
This test makes sure that the pfn we get from the driver is 2MiB aligned,
and relies on the assumption that the 2MiB alignment of the pfn we get back
from the driver matches the 2MiB alignment of the faulting address.
However, faults to holes were not checked and we could hit the problem
described above.
This was reported in response to the NVML nvml/src/test/pmempool_sync
TEST5:
$ cd nvml/src/test/pmempool_sync
$ make TEST5
You can grab NVML here:
https://github.com/pmem/nvml/
The dmesg warning you see when you hit this error is:
WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310
Where we notice in dax_insert_mapping_entry() that the radix tree entry we
are about to replace doesn't match the locked entry that we had previously
inserted into the tree. This happens because the initial insertion was
done in grab_mapping_entry() using a pgoff calculated from the faulting
address (vmf->address), and the replacement in
dax_pmd_load_hole() => dax_insert_mapping_entry() is done using vmf->pgoff.
In our failure case those two page offsets (one calculated from
vmf->address, one using vmf->pgoff) point to different order 9 radix tree
entries.
Fix this by validating that the faulting address's PMD offset matches the
PMD offset from the start of the file. This check is done at the very
beginning of the fault and covers faults that would have mapped to storage
as well as faults to holes. I left the COLOUR check in
dax_pmd_insert_mapping() in place in case we ever hit the insanity
condition where the alignment of the pfn we get from the driver doesn't
match the alignment of the userspace address.
Because faults to actual storage (which are the only radix tree entries
that could possibly be dirty) correctly detect this misalignment and would
fall back to 4k entries, I don't *think* that this situation can result in
data corruption, but the fix is simple and unlikely to have a negative
impact so I think it's worth applying to stable.
Signed-off-by: Ross Zwisler <[email protected]>
Reported-by: "Slusarz, Marcin" <[email protected]>
Cc: <[email protected]>
---
This applies cleanly to the current v4.13-rc6 based linux/master.
This fix is simple and I think we really don't want to have zero page PMDs
where the radix tree entry doesn't match the page table entry, so in my
opinion this should be be merged for v4.13 (pending reviews, of course).
Patch 2 is just a cleanup and can wait for v4.14 if anyone is worried about
it.
This series has passed my regression testing using xfstests and the NVML
test that was used to initially find the problem.
---
fs/dax.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/fs/dax.c b/fs/dax.c
index 306c2b6..865d42c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1383,6 +1383,16 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
trace_dax_pmd_fault(inode, vmf, max_pgoff, 0);
+ /*
+ * Make sure that the faulting address's PMD offset (color) matches
+ * the PMD offset from the start of the file. This is necessary so
+ * that a PMD range in the page table overlaps exactly with a PMD
+ * range in the radix tree.
+ */
+ if ((vmf->pgoff & PG_PMD_COLOUR) !=
+ ((vmf->address >> PAGE_SHIFT) & PG_PMD_COLOUR))
+ goto fallback;
+
/* Fall back to PTEs if we're going to COW */
if (write && !(vma->vm_flags & VM_SHARED))
goto fallback;
--
2.9.5
Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
equivalent page offset mask.
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/dax.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 865d42c..0e2a1fd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -42,6 +42,9 @@
#define DAX_WAIT_TABLE_BITS 12
#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
+/* The 'colour' (ie low bits) within a PMD of a page offset. */
+#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
+
static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
static int __init init_dax_wait_table(void)
@@ -98,7 +101,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
* the range covered by the PMD map to the same bit lock.
*/
if (dax_is_pmd_entry(entry))
- index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
+ index &= ~PG_PMD_COLOUR;
key->mapping = mapping;
key->entry_start = index;
@@ -1262,12 +1265,6 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
}
#ifdef CONFIG_FS_DAX_PMD
-/*
- * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
- * more often than one might expect in the below functions.
- */
-#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
-
static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
loff_t pos, void **entryp)
{
--
2.9.5
Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
equivalent page offset mask.
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/dax.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 865d42c..0e2a1fd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -42,6 +42,9 @@
#define DAX_WAIT_TABLE_BITS 12
#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
+/* The 'colour' (ie low bits) within a PMD of a page offset. */
+#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
+
static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
static int __init init_dax_wait_table(void)
@@ -98,7 +101,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
* the range covered by the PMD map to the same bit lock.
*/
if (dax_is_pmd_entry(entry))
- index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
+ index &= ~PG_PMD_COLOUR;
key->mapping = mapping;
key->entry_start = index;
@@ -1262,12 +1265,6 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
}
#ifdef CONFIG_FS_DAX_PMD
-/*
- * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
- * more often than one might expect in the below functions.
- */
-#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
-
static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
loff_t pos, void **entryp)
{
--
2.9.5
In DAX there are two separate places where the 2MiB range of a PMD is
defined.
The first is in the page tables, where a PMD mapping inserted for a given
address spans from (vmf->address & PMD_MASK) to
((vmf->address & PMD_MASK) + PMD_SIZE - 1). That is, from the 2MiB
boundary below the address to the 2MiB boundary above the address.
So, for example, a fault at address 3MiB (0x30 0000) falls within the PMD
that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).
The second PMD range is in the mapping->page_tree, where a given file
offset is covered by a radix tree entry that spans from one 2MiB aligned
file offset to another 2MiB aligned file offset.
So, for example, the file offset for 3MiB (pgoff 768) falls within the PMD
range for the order 9 radix tree entry that ranges from 2MiB (pgoff 512) to
4MiB (pgoff 1024).
This system works so long as the addresses and file offsets for a given
mapping both have the same offsets relative to the start of each PMD.
Consider the case where the starting address for a given file isn't 2MiB
aligned - say our faulting address is 3 MiB (0x30 0000), but that
corresponds to the beginning of our file (pgoff 0). Now all the PMDs in
the mapping are misaligned so that the 2MiB range defined in the page
tables never matches up with the 2MiB range defined in the radix tree.
The current code notices this case for DAX faults to storage with the
following test in dax_pmd_insert_mapping():
if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
goto unlock_fallback;
This test makes sure that the pfn we get from the driver is 2MiB aligned,
and relies on the assumption that the 2MiB alignment of the pfn we get back
from the driver matches the 2MiB alignment of the faulting address.
However, faults to holes were not checked and we could hit the problem
described above.
This was reported in response to the NVML nvml/src/test/pmempool_sync
TEST5:
$ cd nvml/src/test/pmempool_sync
$ make TEST5
You can grab NVML here:
https://github.com/pmem/nvml/
The dmesg warning you see when you hit this error is:
WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310
Where we notice in dax_insert_mapping_entry() that the radix tree entry we
are about to replace doesn't match the locked entry that we had previously
inserted into the tree. This happens because the initial insertion was
done in grab_mapping_entry() using a pgoff calculated from the faulting
address (vmf->address), and the replacement in
dax_pmd_load_hole() => dax_insert_mapping_entry() is done using vmf->pgoff.
In our failure case those two page offsets (one calculated from
vmf->address, one using vmf->pgoff) point to different order 9 radix tree
entries.
This failure case can result in a deadlock because the radix tree unlock
also happens on the pgoff calculated from vmf->address. This means that
the locked radix tree entry that we swapped in to the tree in
dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
future faults to that 2MiB range will block forever.
Fix this by validating that the faulting address's PMD offset matches the
PMD offset from the start of the file. This check is done at the very
beginning of the fault and covers faults that would have mapped to storage
as well as faults to holes. I left the COLOUR check in
dax_pmd_insert_mapping() in place in case we ever hit the insanity
condition where the alignment of the pfn we get from the driver doesn't
match the alignment of the userspace address.
Signed-off-by: Ross Zwisler <[email protected]>
Reported-by: "Slusarz, Marcin" <[email protected]>
Cc: <[email protected]>
---
Sorry for the quick v2, I realized that I didn't talk about the deadlock
that results from this issue. Patch title and changelog have been updated,
the code is the same.
This applies cleanly to the current v4.13-rc6 based linux/master.
In my opinion this should be be merged for v4.13 (pending reviews, of course).
Patch 2 is just a cleanup and can wait for v4.14 if anyone is worried about
it.
This series has passed my regression testing using xfstests and the NVML
test that was used to initially find the problem.
---
fs/dax.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/fs/dax.c b/fs/dax.c
index 306c2b6..865d42c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1383,6 +1383,16 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
trace_dax_pmd_fault(inode, vmf, max_pgoff, 0);
+ /*
+ * Make sure that the faulting address's PMD offset (color) matches
+ * the PMD offset from the start of the file. This is necessary so
+ * that a PMD range in the page table overlaps exactly with a PMD
+ * range in the radix tree.
+ */
+ if ((vmf->pgoff & PG_PMD_COLOUR) !=
+ ((vmf->address >> PAGE_SHIFT) & PG_PMD_COLOUR))
+ goto fallback;
+
/* Fall back to PTEs if we're going to COW */
if (write && !(vma->vm_flags & VM_SHARED))
goto fallback;
--
2.9.5
On Tue 22-08-17 16:24:35, Ross Zwisler wrote:
> In DAX there are two separate places where the 2MiB range of a PMD is
> defined.
>
> The first is in the page tables, where a PMD mapping inserted for a given
> address spans from (vmf->address & PMD_MASK) to
> ((vmf->address & PMD_MASK) + PMD_SIZE - 1). That is, from the 2MiB
> boundary below the address to the 2MiB boundary above the address.
>
> So, for example, a fault at address 3MiB (0x30 0000) falls within the PMD
> that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).
>
> The second PMD range is in the mapping->page_tree, where a given file
> offset is covered by a radix tree entry that spans from one 2MiB aligned
> file offset to another 2MiB aligned file offset.
>
> So, for example, the file offset for 3MiB (pgoff 768) falls within the PMD
> range for the order 9 radix tree entry that ranges from 2MiB (pgoff 512) to
> 4MiB (pgoff 1024).
>
> This system works so long as the addresses and file offsets for a given
> mapping both have the same offsets relative to the start of each PMD.
>
> Consider the case where the starting address for a given file isn't 2MiB
> aligned - say our faulting address is 3 MiB (0x30 0000), but that
> corresponds to the beginning of our file (pgoff 0). Now all the PMDs in
> the mapping are misaligned so that the 2MiB range defined in the page
> tables never matches up with the 2MiB range defined in the radix tree.
>
> The current code notices this case for DAX faults to storage with the
> following test in dax_pmd_insert_mapping():
>
> if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
> goto unlock_fallback;
>
> This test makes sure that the pfn we get from the driver is 2MiB aligned,
> and relies on the assumption that the 2MiB alignment of the pfn we get back
> from the driver matches the 2MiB alignment of the faulting address.
>
> However, faults to holes were not checked and we could hit the problem
> described above.
>
> This was reported in response to the NVML nvml/src/test/pmempool_sync
> TEST5:
>
> $ cd nvml/src/test/pmempool_sync
> $ make TEST5
>
> You can grab NVML here:
>
> https://github.com/pmem/nvml/
>
> The dmesg warning you see when you hit this error is:
>
> WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310
>
> Where we notice in dax_insert_mapping_entry() that the radix tree entry we
> are about to replace doesn't match the locked entry that we had previously
> inserted into the tree. This happens because the initial insertion was
> done in grab_mapping_entry() using a pgoff calculated from the faulting
> address (vmf->address), and the replacement in
> dax_pmd_load_hole() => dax_insert_mapping_entry() is done using vmf->pgoff.
>
> In our failure case those two page offsets (one calculated from
> vmf->address, one using vmf->pgoff) point to different order 9 radix tree
> entries.
>
> This failure case can result in a deadlock because the radix tree unlock
> also happens on the pgoff calculated from vmf->address. This means that
> the locked radix tree entry that we swapped in to the tree in
> dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
> future faults to that 2MiB range will block forever.
>
> Fix this by validating that the faulting address's PMD offset matches the
> PMD offset from the start of the file. This check is done at the very
> beginning of the fault and covers faults that would have mapped to storage
> as well as faults to holes. I left the COLOUR check in
> dax_pmd_insert_mapping() in place in case we ever hit the insanity
> condition where the alignment of the pfn we get from the driver doesn't
> match the alignment of the userspace address.
Hum, I'm confused with all these offsets and their alignment requirements.
So let me try to get this straight. We have three different things here:
1) virtual address A where we fault
2) file offset FA corresponding to the virtual address - computed as
linear_page_index(vma, A) = (A - vma->start) >> PAGE_SHIFT + vma->pgoff
- and stored in vmf->pgoff
3) physical address (or sector in filesystem terminology) the file offset
maps to.
and then we have the aligned virtual address B = (A & PMD_MASK) and
corresponding file offset FB we've got from linear_page_index(vma, B). Now
if I've correctly understood what you've written above, the problem is that
although B is aligned, FB doesn't have to be. That can happen either when
vma->start is not aligned (which is the example you give above?) or when
vma->pgoff is non aligned. And as a result FA >> 9 != FB >> 9 leading to
issues.
OK, makes sense. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
>
> Signed-off-by: Ross Zwisler <[email protected]>
> Reported-by: "Slusarz, Marcin" <[email protected]>
> Cc: <[email protected]>
>
> ---
>
> Sorry for the quick v2, I realized that I didn't talk about the deadlock
> that results from this issue. Patch title and changelog have been updated,
> the code is the same.
>
> This applies cleanly to the current v4.13-rc6 based linux/master.
>
> In my opinion this should be be merged for v4.13 (pending reviews, of course).
>
> Patch 2 is just a cleanup and can wait for v4.14 if anyone is worried about
> it.
>
> This series has passed my regression testing using xfstests and the NVML
> test that was used to initially find the problem.
>
> ---
> fs/dax.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 306c2b6..865d42c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1383,6 +1383,16 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>
> trace_dax_pmd_fault(inode, vmf, max_pgoff, 0);
>
> + /*
> + * Make sure that the faulting address's PMD offset (color) matches
> + * the PMD offset from the start of the file. This is necessary so
> + * that a PMD range in the page table overlaps exactly with a PMD
> + * range in the radix tree.
> + */
> + if ((vmf->pgoff & PG_PMD_COLOUR) !=
> + ((vmf->address >> PAGE_SHIFT) & PG_PMD_COLOUR))
> + goto fallback;
> +
> /* Fall back to PTEs if we're going to COW */
> if (write && !(vma->vm_flags & VM_SHARED))
> goto fallback;
> --
> 2.9.5
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 22-08-17 16:24:36, Ross Zwisler wrote:
> Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
> equivalent page offset mask.
>
> Signed-off-by: Ross Zwisler <[email protected]>
Looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/dax.c | 11 ++++-------
> 1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 865d42c..0e2a1fd 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -42,6 +42,9 @@
> #define DAX_WAIT_TABLE_BITS 12
> #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
>
> +/* The 'colour' (ie low bits) within a PMD of a page offset. */
> +#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
> +
> static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
>
> static int __init init_dax_wait_table(void)
> @@ -98,7 +101,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> * the range covered by the PMD map to the same bit lock.
> */
> if (dax_is_pmd_entry(entry))
> - index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> + index &= ~PG_PMD_COLOUR;
>
> key->mapping = mapping;
> key->entry_start = index;
> @@ -1262,12 +1265,6 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
> }
>
> #ifdef CONFIG_FS_DAX_PMD
> -/*
> - * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
> - * more often than one might expect in the below functions.
> - */
> -#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
> -
> static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
> loff_t pos, void **entryp)
> {
> --
> 2.9.5
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Aug 23, 2017 at 11:57:33AM +0200, Jan Kara wrote:
> On Tue 22-08-17 16:24:35, Ross Zwisler wrote:
> > In DAX there are two separate places where the 2MiB range of a PMD is
> > defined.
> >
> > The first is in the page tables, where a PMD mapping inserted for a given
> > address spans from (vmf->address & PMD_MASK) to
> > ((vmf->address & PMD_MASK) + PMD_SIZE - 1). That is, from the 2MiB
> > boundary below the address to the 2MiB boundary above the address.
> >
> > So, for example, a fault at address 3MiB (0x30 0000) falls within the PMD
> > that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).
> >
> > The second PMD range is in the mapping->page_tree, where a given file
> > offset is covered by a radix tree entry that spans from one 2MiB aligned
> > file offset to another 2MiB aligned file offset.
> >
> > So, for example, the file offset for 3MiB (pgoff 768) falls within the PMD
> > range for the order 9 radix tree entry that ranges from 2MiB (pgoff 512) to
> > 4MiB (pgoff 1024).
> >
> > This system works so long as the addresses and file offsets for a given
> > mapping both have the same offsets relative to the start of each PMD.
> >
> > Consider the case where the starting address for a given file isn't 2MiB
> > aligned - say our faulting address is 3 MiB (0x30 0000), but that
> > corresponds to the beginning of our file (pgoff 0). Now all the PMDs in
> > the mapping are misaligned so that the 2MiB range defined in the page
> > tables never matches up with the 2MiB range defined in the radix tree.
> >
> > The current code notices this case for DAX faults to storage with the
> > following test in dax_pmd_insert_mapping():
> >
> > if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
> > goto unlock_fallback;
> >
> > This test makes sure that the pfn we get from the driver is 2MiB aligned,
> > and relies on the assumption that the 2MiB alignment of the pfn we get back
> > from the driver matches the 2MiB alignment of the faulting address.
> >
> > However, faults to holes were not checked and we could hit the problem
> > described above.
> >
> > This was reported in response to the NVML nvml/src/test/pmempool_sync
> > TEST5:
> >
> > $ cd nvml/src/test/pmempool_sync
> > $ make TEST5
> >
> > You can grab NVML here:
> >
> > https://github.com/pmem/nvml/
> >
> > The dmesg warning you see when you hit this error is:
> >
> > WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310
> >
> > Where we notice in dax_insert_mapping_entry() that the radix tree entry we
> > are about to replace doesn't match the locked entry that we had previously
> > inserted into the tree. This happens because the initial insertion was
> > done in grab_mapping_entry() using a pgoff calculated from the faulting
> > address (vmf->address), and the replacement in
> > dax_pmd_load_hole() => dax_insert_mapping_entry() is done using vmf->pgoff.
> >
> > In our failure case those two page offsets (one calculated from
> > vmf->address, one using vmf->pgoff) point to different order 9 radix tree
> > entries.
> >
> > This failure case can result in a deadlock because the radix tree unlock
> > also happens on the pgoff calculated from vmf->address. This means that
> > the locked radix tree entry that we swapped in to the tree in
> > dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
> > future faults to that 2MiB range will block forever.
> >
> > Fix this by validating that the faulting address's PMD offset matches the
> > PMD offset from the start of the file. This check is done at the very
> > beginning of the fault and covers faults that would have mapped to storage
> > as well as faults to holes. I left the COLOUR check in
> > dax_pmd_insert_mapping() in place in case we ever hit the insanity
> > condition where the alignment of the pfn we get from the driver doesn't
> > match the alignment of the userspace address.
>
> Hum, I'm confused with all these offsets and their alignment requirements.
> So let me try to get this straight. We have three different things here:
>
> 1) virtual address A where we fault
> 2) file offset FA corresponding to the virtual address - computed as
> linear_page_index(vma, A) = (A - vma->start) >> PAGE_SHIFT + vma->pgoff
> - and stored in vmf->pgoff
> 3) physical address (or sector in filesystem terminology) the file offset
> maps to.
>
> and then we have the aligned virtual address B = (A & PMD_MASK) and
> corresponding file offset FB we've got from linear_page_index(vma, B). Now
> if I've correctly understood what you've written above, the problem is that
> although B is aligned, FB doesn't have to be. That can happen either when
> vma->start is not aligned (which is the example you give above?) or when
> vma->pgoff is non aligned. And as a result FA >> 9 != FB >> 9 leading to
> issues.
>
> OK, makes sense. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>
Yep, your understanding matches mine. What was happening in one specific
failure in the NVML test was:
vmf->vm_pgoff = 0x1 /* the vma's page offset from the start of the file */
file offset FA = vmf->pgoff = 0x1200
address A = 0x7f7a8edff000
So, as you say the 0x1200 pgoff is calculated via
vmf->pgoff = (A - vma->start) >> PAGE_SHIFT + vma->vm_pgoff
0x1200 = (0x7f7a8edff000 - 0x7f7a8dc00000) >> PAGE_SHIFT + 1
So, when we get the aligned virtual address B = (A & PMD_MASK) in
dax_iomap_pmd_fault() and use that to get the aligned page offset:
aligned pgoff FB = (B - vma->start) >> PAGE_SHIFT + vma->vm_pgoff
0x1001 = (0x7f7a8ec00000 - 0x7f7a8dc00000) >> PAGE_SHIFT + 1
The aligned pgoff FB of 0x1001 is a different PMD in the radix tree than we
get when we use the vmf->pgoff FA of 0x1200.
- Ross