2008-12-05 14:48:10

by Duane Griffin

Subject: Checking link targets are NULL-terminated

Hi folks,

I am looking at a report of an intermittent BUG caused by an
intentionally corrupted ext2 filesystem:
http://bugzilla.kernel.org/show_bug.cgi?id=11412

What I think is happening is generic_readlink gets the name via
i_ops->follow_link and passes it into vfs_readlink, without it
necessarily being validated anywhere. If the name is not
NULL-terminated the strlen call in vfs_readlink may run off past the end
of the page. I think this is potentially happening in
page_follow_link_light, as well as ext2_follow_link, so it isn't just
ext* that is affected.

Does this sound correct, or have I missed something?

Assuming this is a real problem, does anyone have a better solution than
scanning the name for a \0 (in ext2_follow_link and
page_follow_link_light) and returning -ENAMETOOLONG if we can't find
one? I.e. something like this:

diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c
index 4e2426e..9b01af2 100644
--- a/fs/ext2/symlink.c
+++ b/fs/ext2/symlink.c
@@ -24,8 +24,14 @@
 static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
 	struct ext2_inode_info *ei = EXT2_I(dentry->d_inode);
-	nd_set_link(nd, (char *)ei->i_data);
-	return NULL;
+	void *err = NULL;
+
+	if (memchr(ei->i_data, 0, sizeof(ei->i_data)) == NULL)
+		err = ERR_PTR(-ENAMETOOLONG);
+	else
+		nd_set_link(nd, (char *)ei->i_data);
+
+	return err;
 }

const struct inode_operations ext2_symlink_inode_operations = {
diff --git a/fs/namei.c b/fs/namei.c
index d34e0f9..f20e94b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2750,29 +2750,49 @@ static char *page_getlink(struct dentry * dentry, struct page **ppage)
 {
 	struct page * page;
 	struct address_space *mapping = dentry->d_inode->i_mapping;
+	char *kaddr;
+
 	page = read_mapping_page(mapping, 0, NULL);
 	if (IS_ERR(page))
 		return (char*)page;
+
+	kaddr = kmap(page);
+	if (memchr(kaddr, 0, PAGE_SIZE) == NULL) {
+		kunmap(page);
+		page_cache_release(page);
+		return ERR_PTR(-ENAMETOOLONG);
+	}
+
 	*ppage = page;
-	return kmap(page);
+	return kaddr;
 }

 int page_readlink(struct dentry *dentry, char __user *buffer, int buflen)
 {
+	int res;
 	struct page *page = NULL;
 	char *s = page_getlink(dentry, &page);
-	int res = vfs_readlink(dentry,buffer,buflen,s);
+
+	if (IS_ERR(s))
+		return PTR_ERR(s);
+
+	res = vfs_readlink(dentry, buffer, buflen, s);
 	if (page) {
 		kunmap(page);
 		page_cache_release(page);
 	}
+
 	return res;
 }

 void *page_follow_link_light(struct dentry *dentry, struct nameidata *nd)
 {
 	struct page *page = NULL;
-	nd_set_link(nd, page_getlink(dentry, &page));
+	char *name = page_getlink(dentry, &page);
+	if (IS_ERR(name))
+		return name;
+
+	nd_set_link(nd, name);
 	return page;
 }
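
For anyone who wants to try the scan-and-reject idea outside the kernel, here is a minimal userspace sketch. The name check_link_terminated and the 60-byte LINK_BUF_SIZE are assumptions made for illustration and are not kernel identifiers; only the memchr() scan and the -ENAMETOOLONG return mirror the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Assumed size of an inline link buffer; purely illustrative. */
#define LINK_BUF_SIZE 60

/*
 * Hypothetical helper: succeed only if a terminator exists within the
 * first `size` bytes of `buf`. On success the target length is stored
 * in *len. The scan never reads past buf + size.
 */
static int check_link_terminated(const char *buf, size_t size, size_t *len)
{
    const char *nul;

    assert(buf != NULL && len != NULL);

    nul = memchr(buf, '\0', size);
    if (nul == NULL)
        return -ENAMETOOLONG;

    *len = (size_t)(nul - buf);
    return 0;
}
```

Unlike strlen(), the scan is bounded by the buffer size, so a buffer with no terminator is reported as an error instead of being read past its end.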

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan


2008-12-08 22:30:03

by Andrew Morton

Subject: Re: Checking link targets are NULL-terminated

On Fri, 5 Dec 2008 14:48:10 +0000
"Duane Griffin" <[email protected]> wrote:

> Hi folks,
>
> I am looking at a report of an intermittent BUG caused by an
> intentionally corrupted ext2 filesystem:
> http://bugzilla.kernel.org/show_bug.cgi?id=11412
>
> What I think is happening is generic_readlink gets the name via
> i_ops->follow_link and passes it into vfs_readlink, without it
> necessarily being validated anywhere. If the name is not
> NULL-terminated the strlen call in vfs_readlink may run off past the end
> of the page. I think this is potentially happening in
> page_follow_link_light, as well as ext2_follow_link, so it isn't just
> ext* that is affected.
>
> Does this sound correct, or have I missed something?
>
> Assuming this is a real problem, does anyone have a better solution than
> scanning the name for a \0 (in ext2_follow_link and
> page_follow_link_light) and returning -ENAMETOOLONG if we can't find
> one? I.e. something like this:

It would be nice to fix this in a single place, for all filesystems,
for all time. But how to do that?


> diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c
> index 4e2426e..9b01af2 100644
> --- a/fs/ext2/symlink.c
> +++ b/fs/ext2/symlink.c
> @@ -24,8 +24,14 @@
> static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
> {
> struct ext2_inode_info *ei = EXT2_I(dentry->d_inode);
> - nd_set_link(nd, (char *)ei->i_data);
> - return NULL;
> + void *err = NULL;
> +
> + if (memchr(ei->i_data, 0, sizeof(ei->i_data)) == NULL)
> + err = ERR_PTR(-ENAMETOOLONG);
> + else
> + nd_set_link(nd, (char *)ei->i_data);
> +
> + return err;
> }

Perhaps nd_set_link() is a suitable place? Change that function so
that it is passed a third argument (max_len) and then check that within
nd_set_link(). Change nd_set_link() to return a __must_check-marked
errno, change callers to handle errors appropriately.

Or something totally different ;) But along those lines?
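
A rough userspace sketch of what such a checked variant might look like. Everything here is hypothetical: struct nameidata_sketch is a one-field stand-in for the real struct nameidata, nd_set_link_checked is an invented name, and must_check_sketch uses the GCC/Clang warn_unused_result attribute as a stand-in for __must_check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct nameidata (assumption). */
struct nameidata_sketch {
    const char *saved_name;
};

/* GCC/Clang attribute that warns when the return value is ignored. */
#define must_check_sketch __attribute__((warn_unused_result))

/*
 * Hypothetical checked setter: records `name` only if a terminator is
 * found within max_len bytes, otherwise returns -ENAMETOOLONG and
 * leaves nd untouched.
 */
static must_check_sketch int
nd_set_link_checked(struct nameidata_sketch *nd, const char *name,
                    size_t max_len)
{
    assert(nd != NULL && name != NULL);

    if (memchr(name, '\0', max_len) == NULL)
        return -ENAMETOOLONG;

    nd->saved_name = name;
    return 0;
}
```

With this shape every caller has to supply a bound and handle the error, which is the audit effect the proposal is after.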



2008-12-09 15:30:57

by Boaz Harrosh

Subject: Re: Checking link targets are NULL-terminated

Duane Griffin wrote:
> Hi folks,
>
> I am looking at a report of an intermittent BUG caused by an
> intentionally corrupted ext2 filesystem:
> http://bugzilla.kernel.org/show_bug.cgi?id=11412
>
> What I think is happening is generic_readlink gets the name via
> i_ops->follow_link and passes it into vfs_readlink, without it
> necessarily being validated anywhere. If the name is not
> NULL-terminated the strlen call in vfs_readlink may run off past the end
> of the page. I think this is potentially happening in
> page_follow_link_light, as well as ext2_follow_link, so it isn't just
> ext* that is affected.
>
> Does this sound correct, or have I missed something?
>
> Assuming this is a real problem, does anyone have a better solution than
> scanning the name for a \0 (in ext2_follow_link and
> page_follow_link_light) and returning -ENAMETOOLONG if we can't find
> one? I.e. something like this:
>
> diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c
> index 4e2426e..9b01af2 100644
> --- a/fs/ext2/symlink.c
> +++ b/fs/ext2/symlink.c
> @@ -24,8 +24,14 @@
> static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
> {
> struct ext2_inode_info *ei = EXT2_I(dentry->d_inode);
> - nd_set_link(nd, (char *)ei->i_data);
> - return NULL;
> + void *err = NULL;
> +
> + if (memchr(ei->i_data, 0, sizeof(ei->i_data)) == NULL)
> + err = ERR_PTR(-ENAMETOOLONG);
> + else
> + nd_set_link(nd, (char *)ei->i_data);
> +
> + return err;
> }

Here (as below), just zero the very last byte in the buffer.
When the buffer was first filled by strcpy, it received the string together with
its null terminator, and was then written to the inode on disk. When read back, the
string can be at most as long as the space allocated in the inode (including the null).
If it was intentionally damaged, the symlink will be corrupted but the kernel is safe.

>
> const struct inode_operations ext2_symlink_inode_operations = {
> diff --git a/fs/namei.c b/fs/namei.c
> index d34e0f9..f20e94b 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2750,29 +2750,49 @@ static char *page_getlink(struct dentry * dentry, struct page **ppage)
> {
> struct page * page;
> struct address_space *mapping = dentry->d_inode->i_mapping;
> + char *kaddr;
> +
> page = read_mapping_page(mapping, 0, NULL);
> if (IS_ERR(page))
> return (char*)page;
> +
> + kaddr = kmap(page);
> + if (memchr(kaddr, 0, PAGE_SIZE) == NULL) {
> + kunmap(page);
> + page_cache_release(page);
> + return ERR_PTR(-ENAMETOOLONG);
> + }
> +

You don't need to search and fail here. All you need is to NULL-terminate at
i_size_read() + 1 of the inode. The length of the written string was set at write
time, and the buffer is big enough since it is a full page and I think symlinks are
limited to less than that.

If a person damaged the symlink inode on disk (set a different size) then
the data is corrupted but the kernel is still safe.

> *ppage = page;
> - return kmap(page);
> + return kaddr;
> }
>
> int page_readlink(struct dentry *dentry, char __user *buffer, int buflen)
> {
> + int res;
> struct page *page = NULL;
> char *s = page_getlink(dentry, &page);
> - int res = vfs_readlink(dentry,buffer,buflen,s);
> +
> + if (IS_ERR(s))
> + return PTR_ERR(s);
> +

The above will not fail, so this change is not needed.

> + res = vfs_readlink(dentry, buffer, buflen, s);
> if (page) {
> kunmap(page);
> page_cache_release(page);
> }
> +
> return res;
> }
>
> void *page_follow_link_light(struct dentry *dentry, struct nameidata *nd)
> {
> struct page *page = NULL;
> - nd_set_link(nd, page_getlink(dentry, &page));
> + char *name = page_getlink(dentry, &page);
> + if (IS_ERR(name))
> + return name;
> +

Same here

> + nd_set_link(nd, name);
> return page;
> }
>
> Cheers,
> Duane.
>

I hit this problem too, while developing a filesystem that was based
on ext2. The reason it works is that the remainder of a page is always
zeroed out on writes. Then, when read, you receive back your zero-terminated link.
(Which means that if you have a symlink of exactly 4k it will BUG, but I guess
that is not possible.)

The solution is to use the i_size information for the string length, and
zero-terminate at i_size + 1.

The way I fixed it is that I zero out the last page's remainder on read, and not
on write like ext2 and others do. (A symlink is less than 4k, right?)
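
A minimal userspace sketch of terminating by size, under two assumptions that the thread does not spell out: "i_size + 1" is read as the byte just past the name, which is offset i_size when counting from zero, and the offset is clamped so that a corrupted size cannot push the write outside the page. SKETCH_PAGE_SIZE and terminate_at_size are invented names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for the sketch. */
#define SKETCH_PAGE_SIZE 4096

/*
 * Write a terminator at offset i_size, clamped to the last byte of the
 * page. Returns the offset that was actually written, which is also
 * the resulting maximum string length.
 */
static size_t terminate_at_size(char *page, size_t i_size)
{
    size_t off;

    assert(page != NULL);

    off = i_size < SKETCH_PAGE_SIZE - 1 ? i_size : SKETCH_PAGE_SIZE - 1;
    page[off] = '\0';
    return off;
}
```

No scan is needed: one store bounds any later strlen() to the page, whatever the rest of the page contains.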

Boaz


2008-12-09 16:20:43

by Duane Griffin

Subject: Re: Checking link targets are NULL-terminated

Hi Boaz, thanks for your review and comments...

2008/12/9 Boaz Harrosh:
>> diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c
>> index 4e2426e..9b01af2 100644
>> --- a/fs/ext2/symlink.c
>> +++ b/fs/ext2/symlink.c
>> @@ -24,8 +24,14 @@
>> static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
>> {
>> struct ext2_inode_info *ei = EXT2_I(dentry->d_inode);
>> - nd_set_link(nd, (char *)ei->i_data);
>> - return NULL;
>> + void *err = NULL;
>> +
>> + if (memchr(ei->i_data, 0, sizeof(ei->i_data)) == NULL)
>> + err = ERR_PTR(-ENAMETOOLONG);
>> + else
>> + nd_set_link(nd, (char *)ei->i_data);
>> +
>> + return err;
>> }
>
> Here (as below), just zero the very last byte in the buffer.
> When the buffer was first filled by strcpy, it received the string together with
> its null terminator, and was then written to the inode on disk. When read back, the
> string can be at most as long as the space allocated in the inode (including the null).
> If it was intentionally damaged, the symlink will be corrupted but the kernel is safe.

I considered this approach. Filesystems that allocate buffers for the
name (e.g. XFS) already tend to unconditionally NULL-terminate it, so
this is a non-issue for them. However others (including ext2) do not
allocate a buffer, instead pointing to the in-memory data representing
the on-disk data. If we NULL-terminate in those cases the in-memory
and on-disk data would differ. If the kernel writes out the data for
some other reason (say after updating atime) then we may
unintentionally modify the link target. That may not be a serious
problem in practice, but it doesn't feel right.

However, if the FS maintainers don't have a problem with it, it will
certainly be cleaner and easier to implement than scanning. Opinions?
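
To illustrate the distinction, here is a small userspace sketch of the copy-then-terminate approach, which is roughly what the buffer-allocating filesystems mentioned above would do. The helper name dup_link_terminated is hypothetical. The point is that the terminator is written to a private copy, so the bytes that mirror the on-disk data are never modified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: copy `size` bytes of raw link data into a fresh
 * buffer and terminate the copy. The source is only read. Returns NULL
 * on allocation failure; the caller frees the result.
 */
static char *dup_link_terminated(const char *raw, size_t size)
{
    char *copy;

    assert(raw != NULL);

    copy = malloc(size + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, raw, size);
    copy[size] = '\0';
    return copy;
}
```

Terminating in place avoids the allocation, at the cost of possibly changing a byte that could later be written back.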

[snip]

> I hit this problem too, while developing a filesystem that was based
> on ext2. The reason it works is that the remainder of a page is always
> zeroed out on writes. Then, when read, you receive back your zero-terminated link.
> (Which means that if you have a symlink of exactly 4k it will BUG, but I guess
> that is not possible.)

It is not possible for an uncorrupted symlink :)

> The solution is to use the i_size information for the string length, and
> zero-terminate at i_size + 1.
>
> The way I fixed it is that I zero out the last page's remainder on read, and not
> on write like ext2 and others do. (A symlink is less than 4k, right?)

Right. If PATH_MAX is larger than PAGE_SIZE no doubt all sorts of
things would start going horribly wrong.

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan

2008-12-09 16:46:28

by Boaz Harrosh

Subject: Re: Checking link targets are NULL-terminated

Duane Griffin wrote:
> Hi Boaz, thanks for your review and comments...
>
> 2008/12/9 Boaz Harrosh:
>>> diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c
>>> index 4e2426e..9b01af2 100644
>>> --- a/fs/ext2/symlink.c
>>> +++ b/fs/ext2/symlink.c
>>> @@ -24,8 +24,14 @@
>>> static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
>>> {
>>> struct ext2_inode_info *ei = EXT2_I(dentry->d_inode);
>>> - nd_set_link(nd, (char *)ei->i_data);
>>> - return NULL;
>>> + void *err = NULL;
>>> +
>>> + if (memchr(ei->i_data, 0, sizeof(ei->i_data)) == NULL)
>>> + err = ERR_PTR(-ENAMETOOLONG);
>>> + else
>>> + nd_set_link(nd, (char *)ei->i_data);
>>> +
>>> + return err;
>>> }
>> Here (as below), just zero the very last byte in the buffer.
>> When the buffer was first filled by strcpy, it received the string together with
>> its null terminator, and was then written to the inode on disk. When read back, the
>> string can be at most as long as the space allocated in the inode (including the null).
>> If it was intentionally damaged, the symlink will be corrupted but the kernel is safe.
>
> I considered this approach. Filesystems that allocate buffers for the
> name (e.g. XFS) already tend to unconditionally NULL-terminate it, so
> this is a non-issue for them. However others (including ext2) do not
> allocate a buffer, instead pointing to the in-memory data representing
> the on-disk data. If we NULL-terminate in those cases the in-memory
> and on-disk data would differ. If the kernel writes out the data for
> some other reason (say after updating atime) then we may
> unintentionally modify the link target. That may not be a serious
> problem in practice, but it doesn't feel right.
>

I just want to make sure that you understand the code above, and to convince
you that this can/should be done and will damage nothing.

The code you see above is only for links that are shorter than some constant.
ext2 (and other filesystems) will cache this case and write the symlink directly
into the inode, which will then have zero data blocks. The space allocated
in the inode is constant and is chosen for good inode packing on disk. The inode
starts empty; then, if a symlink is short, the string is copied with strcpy into the above buffer.
So even if intentional damage was done to the on-disk data, putting another null
at the end will never hurt. At most it is redundant, since there is another
one preceding it. But in the case of damage, the damage is fixed. No information
can ever be lost.

For symlinks that are longer than the above constant, one data block is allocated
and the symlink is written, padded by zeros. This is taken care of by the
generic layer in the code you patched at fs/namei.c. Terminating at
i_size + 1 will never reach the disk, since only i_size bytes are ever written.
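
The short-link case can be modelled in userspace roughly as follows. This is only a sketch: struct inode_sketch, the helper names and the 60-byte INLINE_CAP are assumptions for illustration (the text above says only "some constant"), and the data-block path for longer links is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed inline capacity in bytes; illustrative only. */
#define INLINE_CAP 60

/* Toy inode: an inline area that starts zeroed, plus the link length. */
struct inode_sketch {
    char   i_data[INLINE_CAP];
    size_t i_size;
    bool   inline_link;   /* true when no data block would be needed */
};

/* Store the target inline when it fits together with its terminator. */
static void symlink_store(struct inode_sketch *inode, const char *target)
{
    size_t len;

    assert(inode != NULL && target != NULL);

    len = strlen(target);
    memset(inode, 0, sizeof(*inode));
    inode->i_size = len;
    inode->inline_link = (len + 1 <= INLINE_CAP);
    if (inode->inline_link)
        memcpy(inode->i_data, target, len + 1);
    /* Longer targets would go to a data block (not modelled here). */
}

/* Zero the last inline byte; a valid inline link already ends earlier. */
static void inline_link_terminate(struct inode_sketch *inode)
{
    assert(inode != NULL);
    inode->i_data[INLINE_CAP - 1] = '\0';
}
```

For a well-formed inline link the extra store is redundant, because the terminator copied with the string sits at or before the last byte. For a damaged buffer it caps the string at INLINE_CAP - 1 bytes.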

> However, if the FS maintainers don't have a problem with it, it will
> certainly be cleaner and easier to implement than scanning. Opinions?
>
> [snip]
>
>> I hit this problem too, while developing a filesystem that was based
>> on ext2. The reason it works is that the remainder of a page is always
>> zeroed out on writes. Then, when read, you receive back your zero-terminated link.
>> (Which means that if you have a symlink of exactly 4k it will BUG, but I guess
>> that is not possible.)
>
> It is not possible for an uncorrupted symlink :)
>
>> The solution is to use the i_size information for the string length, and
>> zero-terminate at i_size + 1.
>>
>> The way I fixed it is that I zero out the last page's remainder on read, and not
>> on write like ext2 and others do. (A symlink is less than 4k, right?)
>
> Right. If PATH_MAX is larger than PAGE_SIZE no doubt all sorts of
> things would start going horribly wrong.

Right, that's what I thought. So my approach should be safe: zero out at
i_size + 1.

>
> Cheers,
> Duane.
>

Thanks
Boaz

2008-12-09 17:18:16

by Christoph Hellwig

Subject: Re: Checking link targets are NULL-terminated

On Mon, Dec 08, 2008 at 02:30:03PM -0800, Andrew Morton wrote:
> Perhaps nd_set_link() is a suitable place? Change that function so
> that it is passed a third argument (max_len) and then check that within
> nd_set_link(). Change nd_set_link() to return a __must_check-marked
> errno, change callers to handle errors appropriately.
>
> Or something totally different ;) But along those lines?

Note that XFS and possibly other filesystems don't store the NULL
termination on disk. So having a follow_link interface that uses a
counted string would be a nice little optimization for the XFS
follow_link / readlink implementation. But I'm not really sure it's
worth complicating the VFS for that little gem.
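
As a sketch of what a counted-string interface could look like in userspace terms: struct counted_link and counted_readlink are hypothetical names, not an existing or proposed VFS API. The length travels with the pointer, so nothing ever calls strlen() on the link data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted link name; no terminator is assumed or needed. */
struct counted_link {
    const char *name;
    size_t      len;
};

/*
 * Copy up to buflen bytes of the link into dst and return the number
 * of bytes copied. Like readlink(2), no terminator is appended.
 */
static size_t counted_readlink(const struct counted_link *link,
                               char *dst, size_t buflen)
{
    size_t n;

    assert(link != NULL && dst != NULL);

    n = link->len < buflen ? link->len : buflen;
    memcpy(dst, link->name, n);
    return n;
}
```

A filesystem that stores only the name bytes could fill in such a structure directly from its on-disk length, without copying the name just to append a terminator.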


2008-12-09 18:00:12

by Boaz Harrosh

Subject: Re: Checking link targets are NULL-terminated

Christoph Hellwig wrote:
> On Mon, Dec 08, 2008 at 02:30:03PM -0800, Andrew Morton wrote:
>> Perhaps nd_set_link() is a suitable place? Change that function so
>> that it is passed a third argument (max_len) and then check that within
>> nd_set_link(). Change nd_set_link() to return a __must_check-marked
>> errno, change callers to handle errors appropriately.
>>
>> Or something totally different ;) But along those lines?
>
> Note that XFS and possibly other filesystems don't store the NULL
> termination on disk.

Note that ext2, for example, also only writes the string bytes without
any NULLs. It only happens to be zero-padded because any last page is zero-padded
from i_size to the end of the page.

> So having a follow_link interface that uses a
> counted string would be a nice little optimization for the XFS
> follow_link / readlink implementation. But I'm not really sure it's
> worth complicating the VFS for that little gem.
>

The inode's i_size already holds the string count, so at the higher level
we have that information. But I'm convinced: nd_set_link() should receive
a new max_len, and all users should be changed as a matter of code audit.
nd_set_link() should then proceed to truncate the string at that length
unconditionally, with no need for error returns.
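
A userspace sketch of the truncating variant, for contrast with a checked, error-returning one. The names struct nd_sketch and nd_set_link_trunc are hypothetical, and the meaning of max_len is an assumption: here it is the longest permitted string length, so the caller must supply a writable buffer of at least max_len + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct nameidata (assumption). */
struct nd_sketch {
    char *saved_name;
};

/*
 * Hypothetical truncating setter: unconditionally writes a terminator
 * at name[max_len] and records the name. There is no failure path, so
 * nothing is returned.
 */
static void nd_set_link_trunc(struct nd_sketch *nd, char *name,
                              size_t max_len)
{
    assert(nd != NULL && name != NULL);

    name[max_len] = '\0';
    nd->saved_name = name;
}
```

Callers need no error handling with this shape, but each one still has to state a bound, and the store does modify the caller's buffer.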

My $0.017
Boaz

2008-12-09 18:04:58

by Duane Griffin

Subject: Re: Checking link targets are NULL-terminated

2008/12/9 Boaz Harrosh:
> I just want to make sure that you understand the code above and convince
> you that this can/should be done and will damage nothing.

No problem, it never hurts to spell things out in detail :)

> The code you see above is only for links that are shorter than some constant.
> ext2 (and other filesystems) will cache this case and write the symlink directly
> into the inode, which will then have zero data blocks. The space allocated
> in the inode is constant and is chosen for good inode packing on disk. The inode
> starts empty; then, if a symlink is short, the string is copied with strcpy into the above buffer.

Sure, I understand this.

> So even if intentional damage was done to the on-disk data, putting another null
> at the end will never hurt. At most it is redundant, since there is another
> one preceding it. But in the case of damage, the damage is fixed. No information
> can ever be lost.

If the link name on disk is corrupted and the last character is not a
zero then it may be changed. That is information lost, albeit probably
not *useful* information. I wouldn't like to make such a change
without an OK from the FS maintainers, but I'll happily do it with
one. So far we seem to have one vote in favour and none against. :)

Note that the corruption doesn't have to be intentional, of course,
although it was in the particular bug report I was looking at
originally.

> For symlinks that are longer than the above constant, one data block is allocated
> and the symlink is written, padded by zeros. This is taken care of by the
> generic layer in the code you patched at fs/namei.c. Terminating at
> i_size + 1 will never reach the disk, since only i_size bytes are ever written.

If the data on disk is corrupted such that i_size == PAGE_SIZE then
again we would have different data in-memory and on-disk. However in
this case the page would only contain the link name and so wouldn't be
dirtied and written out unless the name was changed anyway. So I
agree, it should be safe in this case, regardless of my concerns
above.

Thanks,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan

2008-12-11 16:06:42

by Duane Griffin

Subject: Re: Checking link targets are NULL-terminated

2008/12/9 Boaz Harrosh:
> Christoph Hellwig wrote:
>> On Mon, Dec 08, 2008 at 02:30:03PM -0800, Andrew Morton wrote:
>>> Perhaps nd_set_link() is a suitable place? Change that function so
>>> that it is passed a third argument (max_len) and then check that within
>>> nd_set_link(). Change nd_set_link() to return a __must_check-marked
>>> errno, change callers to handle errors appropriately.
>>>
>>> Or something totally different ;) But along those lines?
>>
>> Note that XFS and possibly other filesystems don't store the NULL
>> termination on disk.
>
> Note that ext2, for example, also only writes the string bytes without
> any NULLs. It only happens to be zero-padded because any last page is zero-padded
> from i_size to the end of the page.
>
>> So having a follow_link interface that uses a
>> counted string would be a nice little optimization for the XFS
>> follow_link / readlink implementation. But I'm not really sure it's
>> worth complicating the VFS for that little gem.
>
> The inode's i_size already holds the string count, so at the higher level
> we have that information. But I'm convinced: nd_set_link() should receive
> a new max_len, and all users should be changed as a matter of code audit.
> nd_set_link() should then proceed to truncate the string at that length
> unconditionally, with no need for error returns.

I've looked at a few alternative options: scanning for NULLs,
NULL-terminating in nd_set_link, NULL-terminating in the FS code
(where it is necessary and not already being done), and passing the
length around explicitly.

NULL-terminating is definitely cleaner and easier than scanning.
Unfortunately, as Christoph indicated, passing the length around
explicitly does rather complicate the code. So the question is whether
to NULL-terminate in nd_set_link or earlier in the FS code.

Having tried both options, I'm inclined to do it in the FS code and
leave nd_set_link as it is. Many of the filesystems already take pains
to ensure the links are NULL-terminated and the minimal change of
fixing the others seems the safest option. However, this way we won't
solve things for all filesystems for all time, as Andrew wanted. I'll
post my preferred patches shortly, but if anyone would like to see
what the full nd_set_link change would look like let me know and I'll
post them for comparison.

FYI, here are the diffstats for the two options:

Terminating in FS code:
fs/9p/vfs_inode.c        |    5 +++--
fs/befs/linuxvfs.c       |    5 ++++-
fs/ecryptfs/inode.c      |    3 ++-
fs/ext2/symlink.c        |    4 +++-
fs/ext3/symlink.c        |    4 +++-
fs/ext4/symlink.c        |    4 +++-
fs/freevxfs/vxfs_immed.c |    1 +
fs/jfs/symlink.c         |    2 ++
fs/namei.c               |    8 ++++++--
fs/sysv/symlink.c        |    4 +++-
fs/ufs/symlink.c         |    4 +++-
11 files changed, 33 insertions(+), 11 deletions(-)

Adding length param and terminating in nd_set_link (but not removing
all the existing FS termination code):
fs/9p/vfs_inode.c           |   10 +++++-----
fs/autofs/symlink.c         |    2 +-
fs/autofs4/symlink.c        |    2 +-
fs/befs/linuxvfs.c          |   14 ++++++++++++--
fs/cifs/link.c              |    8 ++------
fs/configfs/symlink.c       |    4 ++--
fs/debugfs/file.c           |    2 +-
fs/ecryptfs/inode.c         |   20 ++++++++++----------
fs/ext2/symlink.c           |    2 +-
fs/ext3/symlink.c           |    2 +-
fs/ext4/symlink.c           |    2 +-
fs/freevxfs/vxfs_immed.c    |    2 +-
fs/fuse/dir.c               |    2 +-
fs/jffs2/symlink.c          |    2 +-
fs/jfs/symlink.c            |    3 ++-
fs/namei.c                  |   11 +++++++++--
fs/nfs/symlink.c            |    4 ++--
fs/proc/generic.c           |    2 +-
fs/smbfs/symlink.c          |    8 ++++----
fs/sysfs/symlink.c          |    2 +-
fs/sysv/symlink.c           |    3 ++-
fs/ubifs/file.c             |    2 +-
fs/ufs/symlink.c            |    2 +-
fs/xfs/linux-2.6/xfs_iops.c |    4 ++--
include/linux/namei.h       |    4 +++-
mm/shmem.c                  |    4 ++--
26 files changed, 70 insertions(+), 53 deletions(-)

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan