If mmap() maps a file, it can be passed an offset into the file at
which the mapping is to start. Offset could be a negative value when
represented as a loff_t. The offset plus length will be used to
update the file size (i_size) which is also a loff_t. Validate the
value of offset and offset + length to make sure they do not overflow
and appear as negative.
Found by syzcaller with commit ff8c0c53c475 ("mm/hugetlb.c: don't call
region_abort if region_chg fails") applied. Prior to this commit, the
overflow would still occur but we would luckily return ENOMEM.
To reproduce:
mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);
Resulted in,
kernel BUG at mm/hugetlb.c:742!
Call Trace:
hugetlbfs_evict_inode+0x80/0xa0
? hugetlbfs_setattr+0x3c0/0x3c0
evict+0x24a/0x620
iput+0x48f/0x8c0
dentry_unlink_inode+0x31f/0x4d0
__dentry_kill+0x292/0x5e0
dput+0x730/0x830
__fput+0x438/0x720
____fput+0x1a/0x20
task_work_run+0xfe/0x180
exit_to_usermode_loop+0x133/0x150
syscall_return_slowpath+0x184/0x1c0
entry_SYSCALL_64_fastpath+0xab/0xad
Reported-by: Vegard Nossum <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7163fe0..dde8613 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -136,17 +136,26 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
vma->vm_ops = &hugetlb_vm_ops;
+ /*
+ * Offset passed to mmap (before page shift) could have been
+ * negative when represented as a (l)off_t.
+ */
+ if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
+ return -EINVAL;
+
if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
return -EINVAL;
vma_len = (loff_t)(vma->vm_end - vma->vm_start);
+ len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+ /* check for overflow */
+ if (len < vma_len)
+ return -EINVAL;
inode_lock(inode);
file_accessed(file);
ret = -ENOMEM;
- len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
-
if (hugetlb_reserve_pages(inode,
vma->vm_pgoff >> huge_page_order(h),
len >> huge_page_shift(h), vma,
@@ -155,7 +164,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
ret = 0;
if (vma->vm_flags & VM_WRITE && inode->i_size < len)
- inode->i_size = len;
+ i_size_write(inode, len);
out:
inode_unlock(inode);
--
2.7.4
On April 12, 2017 6:52 AM Mike Kravetz wrote:
>
> If mmap() maps a file, it can be passed an offset into the file at
> which the mapping is to start. Offset could be a negative value when
> represented as a loff_t. The offset plus length will be used to
> update the file size (i_size) which is also a loff_t. Validate the
> value of offset and offset + length to make sure they do not overflow
> and appear as negative.
>
> Found by syzcaller with commit ff8c0c53c475 ("mm/hugetlb.c: don't call
> region_abort if region_chg fails") applied. Prior to this commit, the
> overflow would still occur but we would luckily return ENOMEM.
> To reproduce:
> mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);
>
> Resulted in,
> kernel BUG at mm/hugetlb.c:742!
> Call Trace:
> hugetlbfs_evict_inode+0x80/0xa0
> ? hugetlbfs_setattr+0x3c0/0x3c0
> evict+0x24a/0x620
> iput+0x48f/0x8c0
> dentry_unlink_inode+0x31f/0x4d0
> __dentry_kill+0x292/0x5e0
> dput+0x730/0x830
> __fput+0x438/0x720
> ____fput+0x1a/0x20
> task_work_run+0xfe/0x180
> exit_to_usermode_loop+0x133/0x150
> syscall_return_slowpath+0x184/0x1c0
> entry_SYSCALL_64_fastpath+0xab/0xad
>
> Reported-by: Vegard Nossum <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>
On 12 April 2017 at 00:51, Mike Kravetz <[email protected]> wrote:
> If mmap() maps a file, it can be passed an offset into the file at
> which the mapping is to start. Offset could be a negative value when
> represented as a loff_t. The offset plus length will be used to
> update the file size (i_size) which is also a loff_t. Validate the
> value of offset and offset + length to make sure they do not overflow
> and appear as negative.
>
> Found by syzcaller with commit ff8c0c53c475 ("mm/hugetlb.c: don't call
> region_abort if region_chg fails") applied. Prior to this commit, the
> overflow would still occur but we would luckily return ENOMEM.
> To reproduce:
> mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);
>
> Resulted in,
> kernel BUG at mm/hugetlb.c:742!
> Call Trace:
> hugetlbfs_evict_inode+0x80/0xa0
> ? hugetlbfs_setattr+0x3c0/0x3c0
> evict+0x24a/0x620
> iput+0x48f/0x8c0
> dentry_unlink_inode+0x31f/0x4d0
> __dentry_kill+0x292/0x5e0
> dput+0x730/0x830
> __fput+0x438/0x720
> ____fput+0x1a/0x20
> task_work_run+0xfe/0x180
> exit_to_usermode_loop+0x133/0x150
> syscall_return_slowpath+0x184/0x1c0
> entry_SYSCALL_64_fastpath+0xab/0xad
>
> Reported-by: Vegard Nossum <[email protected]>
Please use <[email protected]> if possible :-)
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 7163fe0..dde8613 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -136,17 +136,26 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
> vma->vm_ops = &hugetlb_vm_ops;
>
> + /*
> + * Offset passed to mmap (before page shift) could have been
> + * negative when represented as a (l)off_t.
> + */
> + if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
> + return -EINVAL;
> +
This is strictly speaking undefined behaviour in C and would get
flagged by e.g. UBSAN. The kernel does compile with
-fno-strict-overflow when supported, though, so maybe it's more of a
theoretical issue.
Another thing: wouldn't we want to detect all truncations, not just
the ones that happen to end up negative?
For example (with -fno-strict-overflow), (0x12345678 << 12) ==
0x45678000, which is still a positive integer, but obviously
truncated.
We can easily avoid the UB by moving the cast out (since ->vm_pgoff is
unsigned and unsigned shifts are always defined IIRC), but that still
doesn't reliably detect the positive-result truncation/overflow.
> if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
> return -EINVAL;
>
> vma_len = (loff_t)(vma->vm_end - vma->vm_start);
> + len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> + /* check for overflow */
> + if (len < vma_len)
> + return -EINVAL;
Also strictly speaking UB. You can avoid it by casting vma_len to
unsigned and dropping the loff_t cast, but it's admittedly somewhat
verbose. There also isn't an "unsigned loff_t" AFAIK, but don't we
have some helpers to safely check for overflows? Surely this isn't the
only place that does loff_t arithmetic.
>
> inode_lock(inode);
> file_accessed(file);
>
> ret = -ENOMEM;
> - len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> -
> if (hugetlb_reserve_pages(inode,
> vma->vm_pgoff >> huge_page_order(h),
> len >> huge_page_shift(h), vma,
> @@ -155,7 +164,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>
> ret = 0;
> if (vma->vm_flags & VM_WRITE && inode->i_size < len)
> - inode->i_size = len;
> + i_size_write(inode, len);
> out:
> inode_unlock(inode);
This hunk seems a bit out of place in the sense that I don't see how
it relates to the overflow checking. I think this either belongs in a
separate patch or it deserves a mention in the changelog.
Vegard
On 04/12/2017 01:58 AM, Vegard Nossum wrote:
> On 12 April 2017 at 00:51, Mike Kravetz <[email protected]> wrote:
>> If mmap() maps a file, it can be passed an offset into the file at
>> which the mapping is to start. Offset could be a negative value when
>> represented as a loff_t. The offset plus length will be used to
>> update the file size (i_size) which is also a loff_t. Validate the
>> value of offset and offset + length to make sure they do not overflow
>> and appear as negative.
>>
>> Found by syzcaller with commit ff8c0c53c475 ("mm/hugetlb.c: don't call
>> region_abort if region_chg fails") applied. Prior to this commit, the
>> overflow would still occur but we would luckily return ENOMEM.
>> To reproduce:
>> mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);
>>
>> Resulted in,
>> kernel BUG at mm/hugetlb.c:742!
>> Call Trace:
>> hugetlbfs_evict_inode+0x80/0xa0
>> ? hugetlbfs_setattr+0x3c0/0x3c0
>> evict+0x24a/0x620
>> iput+0x48f/0x8c0
>> dentry_unlink_inode+0x31f/0x4d0
>> __dentry_kill+0x292/0x5e0
>> dput+0x730/0x830
>> __fput+0x438/0x720
>> ____fput+0x1a/0x20
>> task_work_run+0xfe/0x180
>> exit_to_usermode_loop+0x133/0x150
>> syscall_return_slowpath+0x184/0x1c0
>> entry_SYSCALL_64_fastpath+0xab/0xad
>>
>> Reported-by: Vegard Nossum <[email protected]>
>
> Please use <[email protected]> if possible :-)
>
>> Signed-off-by: Mike Kravetz <[email protected]>
>> ---
>> fs/hugetlbfs/inode.c | 15 ++++++++++++---
>> 1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 7163fe0..dde8613 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -136,17 +136,26 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>> vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
>> vma->vm_ops = &hugetlb_vm_ops;
>>
>> + /*
>> + * Offset passed to mmap (before page shift) could have been
>> + * negative when represented as a (l)off_t.
>> + */
>> + if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
>> + return -EINVAL;
>> +
>
> This is strictly speaking undefined behaviour in C and would get
> flagged by e.g. UBSAN. The kernel does compile with
> -fno-strict-overflow when supported, though, so maybe it's more of a
> theoretical issue.
>
> Another thing: wouldn't we want to detect all truncations, not just
> the ones that happen to end up negative?
>
> For example (with -fno-strict-overflow), (0x12345678 << 12) ==
> 0x45678000, which is still a positive integer, but obviously
> truncated.
>
> We can easily avoid the UB by moving the cast out (since ->vm_pgoff is
> unsigned and unsigned shifts are always defined IIRC), but that still
> doesn't reliably detect the positive-result truncation/overflow.
The value in vm_pgoff was indirectly provided by the user. This is the
'off_t offset' value provided in the mmap system call. Before, getting
to this hugetlbfs mmap routine the value is shifted right (>> PAGE_SHIFT)
so that the value in bytes provided by the user is converted to a page
offset. This shift right is done with an unsigned type so there is no
sign extension. As a result, I do not think we have to worry about
anything but the negative check here. Let me know if my thinking is
not valid.
As for the undefined behavior, I guess we can do the shift on the unsigned
type and then cast to loff_t. The reason for using loff_t is that this
value may eventually be used in the calculation of a value assigned to
i_size which is of this type.
In the patch, I just used the same cast/assignment that previously existed
in the code. It could be changed to,
if (((loff_t)(vma->vm_pgoff << PAGE_SHIFT)) < 0)
>> if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
>> return -EINVAL;
>>
>> vma_len = (loff_t)(vma->vm_end - vma->vm_start);
>> + len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>> + /* check for overflow */
>> + if (len < vma_len)
>> + return -EINVAL;
>
> Also strictly speaking UB. You can avoid it by casting vma_len to
> unsigned and dropping the loff_t cast, but it's admittedly somewhat
> verbose. There also isn't an "unsigned loff_t" AFAIK, but don't we
> have some helpers to safely check for overflows? Surely this isn't the
> only place that does loff_t arithmetic.
I came up empty in my search for helpers. Actually, I spent more time
trying to figure out how this was handled in other filesystems. Then I
quickly discovered that hugetlbfs is 'special' and appears to be the
only one which has to deal with this situation.
The only 'similar code' is in the vfs layer when the offset argument to
fallocate is validated. There it is a simple check for negative value.
I am not sure if it make much sense to eliminate the shifting of signed
values in this patch. Certainly, this is strictly UB as you say. But,
after calculating these values the loff_t values are once again shifted
only a few lines later. I'm afraid this happens is several places in
this code.
>>
>> inode_lock(inode);
>> file_accessed(file);
>>
>> ret = -ENOMEM;
>> - len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>> -
>> if (hugetlb_reserve_pages(inode,
>> vma->vm_pgoff >> huge_page_order(h),
>> len >> huge_page_shift(h), vma,
>> @@ -155,7 +164,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>>
>> ret = 0;
>> if (vma->vm_flags & VM_WRITE && inode->i_size < len)
>> - inode->i_size = len;
>> + i_size_write(inode, len);
>> out:
>> inode_unlock(inode);
>
> This hunk seems a bit out of place in the sense that I don't see how
> it relates to the overflow checking. I think this either belongs in a
> separate patch or it deserves a mention in the changelog.
When looking at this bug, I was really concerned that i_size might be set
to a negative value. It was my hope that the helper routine inode_newsize_ok
would validate the value, but it does not. I left in the use of this helper,
since the code was already being changed. I can change as people see fit.
--
Mike Kravetz
On Tue, Apr 11, 2017 at 03:51:58PM -0700, Mike Kravetz wrote:
...
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 7163fe0..dde8613 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -136,17 +136,26 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
> vma->vm_ops = &hugetlb_vm_ops;
>
> + /*
> + * Offset passed to mmap (before page shift) could have been
> + * negative when represented as a (l)off_t.
> + */
> + if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
> + return -EINVAL;
> +
> if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
> return -EINVAL;
>
> vma_len = (loff_t)(vma->vm_end - vma->vm_start);
> + len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> + /* check for overflow */
> + if (len < vma_len)
> + return -EINVAL;
Andrew sent this patch to Linus today, so I know it's a little too late, but
I think that getting len directly from vma like below might be a simpler fix.
len = (loff_t)(vma->vm_end - vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT));
This shouldn't overflow because vma->vm_{end|start|pgoff} are unsigned long,
but if worried you can add VM_BUG_ON_VMA(len < 0, vma).
Thanks,
Naoya Horiguchi
>
> inode_lock(inode);
> file_accessed(file);
>
> ret = -ENOMEM;
> - len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> -
> if (hugetlb_reserve_pages(inode,
> vma->vm_pgoff >> huge_page_order(h),
> len >> huge_page_shift(h), vma,
> @@ -155,7 +164,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>
> ret = 0;
> if (vma->vm_flags & VM_WRITE && inode->i_size < len)
> - inode->i_size = len;
> + i_size_write(inode, len);
> out:
> inode_unlock(inode);
>
> --
> 2.7.4
>
>
On 04/13/2017 08:32 PM, Naoya Horiguchi wrote:
> On Tue, Apr 11, 2017 at 03:51:58PM -0700, Mike Kravetz wrote:
> ...
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 7163fe0..dde8613 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -136,17 +136,26 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>> vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
>> vma->vm_ops = &hugetlb_vm_ops;
>>
>> + /*
>> + * Offset passed to mmap (before page shift) could have been
>> + * negative when represented as a (l)off_t.
>> + */
>> + if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
>> + return -EINVAL;
>> +
>> if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
>> return -EINVAL;
>>
>> vma_len = (loff_t)(vma->vm_end - vma->vm_start);
>> + len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>> + /* check for overflow */
>> + if (len < vma_len)
>> + return -EINVAL;
>
> Andrew sent this patch to Linus today, so I know it's a little too late, but
> I think that getting len directly from vma like below might be a simpler fix.
>
> len = (loff_t)(vma->vm_end - vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT));
>
> This shouldn't overflow because vma->vm_{end|start|pgoff} are unsigned long,
> but if worried you can add VM_BUG_ON_VMA(len < 0, vma).
Thanks Naoya,
I am pretty sure the checks are necessary. You are correct in that
vma->vm_{end|start|pgoff} are unsigned long. However, pgoff can be
a REALLY big value that becomes negative when shifted.
Note that pgoff is simply the off_t offset value passed from the user cast
to unsigned long and shifted right by PAGE_SHIFT. There is nothing to
prevent a user from passing a 'signed' negative value. In the reproducer
provided, the value passed from user space is 0x8000000000000000ULL.
--
Mike Kravetz
On Sat, Apr 15, 2017 at 03:58:59PM -0700, Mike Kravetz wrote:
> On 04/13/2017 08:32 PM, Naoya Horiguchi wrote:
> > On Tue, Apr 11, 2017 at 03:51:58PM -0700, Mike Kravetz wrote:
> > ...
> >> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> >> index 7163fe0..dde8613 100644
> >> --- a/fs/hugetlbfs/inode.c
> >> +++ b/fs/hugetlbfs/inode.c
> >> @@ -136,17 +136,26 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> >> vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
> >> vma->vm_ops = &hugetlb_vm_ops;
> >>
> >> + /*
> >> + * Offset passed to mmap (before page shift) could have been
> >> + * negative when represented as a (l)off_t.
> >> + */
> >> + if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
> >> + return -EINVAL;
> >> +
> >> if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
> >> return -EINVAL;
> >>
> >> vma_len = (loff_t)(vma->vm_end - vma->vm_start);
> >> + len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> >> + /* check for overflow */
> >> + if (len < vma_len)
> >> + return -EINVAL;
> >
> > Andrew sent this patch to Linus today, so I know it's a little too late, but
> > I think that getting len directly from vma like below might be a simpler fix.
> >
> > len = (loff_t)(vma->vm_end - vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT));
> >
> > This shouldn't overflow because vma->vm_{end|start|pgoff} are unsigned long,
> > but if worried you can add VM_BUG_ON_VMA(len < 0, vma).
>
> Thanks Naoya,
>
> I am pretty sure the checks are necessary. You are correct in that
> vma->vm_{end|start|pgoff} are unsigned long. However, pgoff can be
> a REALLY big value that becomes negative when shifted.
>
> Note that pgoff is simply the off_t offset value passed from the user cast
> to unsigned long and shifted right by PAGE_SHIFT. There is nothing to
> prevent a user from passing a 'signed' negative value. In the reproducer
> provided, the value passed from user space is 0x8000000000000000ULL.
OK, thank you for explanation. You're right.
- Naoya