2009-09-14 03:29:13

by Fengguang Wu

[permalink] [raw]
Subject: [PATCH] devmem: handle partial kmem write/read

Current vwrite silently ignores vwrite() to vmap holes.
Better behavior should be stop the write and return
on such holes.

Also return on partial read, which may happen in future
(eg. hit hwpoison pages).

CC: Andi Kleen <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
drivers/char/mem.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)

--- linux-mm.orig/drivers/char/mem.c 2009-09-14 10:59:50.000000000 +0800
+++ linux-mm/drivers/char/mem.c 2009-09-14 11:06:25.000000000 +0800
@@ -444,18 +444,22 @@ static ssize_t read_kmem(struct file *fi
if (!kbuf)
return -ENOMEM;
while (count > 0) {
+ unsigned long n;
+
sz = size_inside_page(p, count);
- sz = vread(kbuf, (char *)p, sz);
- if (!sz)
+ n = vread(kbuf, (char *)p, sz);
+ if (!n)
break;
- if (copy_to_user(buf, kbuf, sz)) {
+ if (copy_to_user(buf, kbuf, n)) {
free_page((unsigned long)kbuf);
return -EFAULT;
}
- count -= sz;
- buf += sz;
- read += sz;
- p += sz;
+ count -= n;
+ buf += n;
+ read += n;
+ p += n;
+ if (n < sz)
+ break;
}
free_page((unsigned long)kbuf);
}
@@ -551,11 +555,13 @@ static ssize_t write_kmem(struct file *
free_page((unsigned long)kbuf);
return -EFAULT;
}
- sz = vwrite(kbuf, (char *)p, sz);
- count -= sz;
- buf += sz;
- virtr += sz;
- p += sz;
+ n = vwrite(kbuf, (char *)p, sz);
+ count -= n;
+ buf += n;
+ virtr += n;
+ p += n;
+ if (n < sz)
+ break;
}
free_page((unsigned long)kbuf);
}


2009-09-14 04:34:58

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] devmem: handle partial kmem write/read

Hi Kame,

This patch needs more work. I first intent to fix a bug:

sz = vwrite(kbuf, (char *)p, sz);
p += sz;
}

So if the returned len is 0, the kbuf/p pointers will mismatch.

Then I realize it changed the write behavior. The current vwrite()
behavior is strange, it returns 0 if _whole range_ is hole, otherwise
ignores the hole silently. So holes can be treated differently even in
the original code.

I'm not really sure about the right behavior. KAME-san, do you have
any suggestions?

Thanks,
Fengguang

On Mon, Sep 14, 2009 at 11:29:01AM +0800, Wu Fengguang wrote:
> Current vwrite silently ignores vwrite() to vmap holes.
> Better behavior should be stop the write and return
> on such holes.
>
> Also return on partial read, which may happen in future
> (eg. hit hwpoison pages).
>
> CC: Andi Kleen <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> drivers/char/mem.c | 30 ++++++++++++++++++------------
> 1 file changed, 18 insertions(+), 12 deletions(-)
>
> --- linux-mm.orig/drivers/char/mem.c 2009-09-14 10:59:50.000000000 +0800
> +++ linux-mm/drivers/char/mem.c 2009-09-14 11:06:25.000000000 +0800
> @@ -444,18 +444,22 @@ static ssize_t read_kmem(struct file *fi
> if (!kbuf)
> return -ENOMEM;
> while (count > 0) {
> + unsigned long n;
> +
> sz = size_inside_page(p, count);
> - sz = vread(kbuf, (char *)p, sz);
> - if (!sz)
> + n = vread(kbuf, (char *)p, sz);
> + if (!n)
> break;
> - if (copy_to_user(buf, kbuf, sz)) {
> + if (copy_to_user(buf, kbuf, n)) {
> free_page((unsigned long)kbuf);
> return -EFAULT;
> }
> - count -= sz;
> - buf += sz;
> - read += sz;
> - p += sz;
> + count -= n;
> + buf += n;
> + read += n;
> + p += n;
> + if (n < sz)
> + break;
> }
> free_page((unsigned long)kbuf);
> }
> @@ -551,11 +555,13 @@ static ssize_t write_kmem(struct file *
> free_page((unsigned long)kbuf);
> return -EFAULT;
> }
> - sz = vwrite(kbuf, (char *)p, sz);
> - count -= sz;
> - buf += sz;
> - virtr += sz;
> - p += sz;
> + n = vwrite(kbuf, (char *)p, sz);
> + count -= n;
> + buf += n;
> + virtr += n;
> + p += n;
> + if (n < sz)
> + break;
> }
> free_page((unsigned long)kbuf);
> }

2009-09-15 00:27:00

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] devmem: handle partial kmem write/read

On Mon, 14 Sep 2009 12:34:44 +0800
Wu Fengguang <[email protected]> wrote:

> Hi Kame,
>
> This patch needs more work. I first intent to fix a bug:
>
> sz = vwrite(kbuf, (char *)p, sz);
> p += sz;
> }
>
> So if the returned len is 0, the kbuf/p pointers will mismatch.
>
> Then I realize it changed the write behavior. The current vwrite()
> behavior is strange, it returns 0 if _whole range_ is hole, otherwise
> ignores the hole silently. So holes can be treated differently even in
> the original code.
>
Ah, ok...

> I'm not really sure about the right behavior. KAME-san, do you have
> any suggestions?
>
maybe following make sense.
=
written = vwrite(kbuf, (char *p), sz);
if (!written) // whole vmem was a hole
written = sz;
==
needs fix.

Anyway, I wonder kmem is broken. It's should be totally rewritten.

For example, this doesn't check anything.
==
if (p < (unsigned long) high_memory) {

==

But....are there users ?
If necessary, I'll write some...

Thanks,
-Kame


> Thanks,
> Fengguang
>
> On Mon, Sep 14, 2009 at 11:29:01AM +0800, Wu Fengguang wrote:
> > Current vwrite silently ignores vwrite() to vmap holes.
> > Better behavior should be stop the write and return
> > on such holes.
> >
> > Also return on partial read, which may happen in future
> > (eg. hit hwpoison pages).
> >
> > CC: Andi Kleen <[email protected]>
> > Signed-off-by: Wu Fengguang <[email protected]>
> > ---
> > drivers/char/mem.c | 30 ++++++++++++++++++------------
> > 1 file changed, 18 insertions(+), 12 deletions(-)
> >
> > --- linux-mm.orig/drivers/char/mem.c 2009-09-14 10:59:50.000000000 +0800
> > +++ linux-mm/drivers/char/mem.c 2009-09-14 11:06:25.000000000 +0800
> > @@ -444,18 +444,22 @@ static ssize_t read_kmem(struct file *fi
> > if (!kbuf)
> > return -ENOMEM;
> > while (count > 0) {
> > + unsigned long n;
> > +
> > sz = size_inside_page(p, count);
> > - sz = vread(kbuf, (char *)p, sz);
> > - if (!sz)
> > + n = vread(kbuf, (char *)p, sz);
> > + if (!n)
> > break;
> > - if (copy_to_user(buf, kbuf, sz)) {
> > + if (copy_to_user(buf, kbuf, n)) {
> > free_page((unsigned long)kbuf);
> > return -EFAULT;
> > }
> > - count -= sz;
> > - buf += sz;
> > - read += sz;
> > - p += sz;
> > + count -= n;
> > + buf += n;
> > + read += n;
> > + p += n;
> > + if (n < sz)
> > + break;
> > }
> > free_page((unsigned long)kbuf);
> > }
> > @@ -551,11 +555,13 @@ static ssize_t write_kmem(struct file *
> > free_page((unsigned long)kbuf);
> > return -EFAULT;
> > }
> > - sz = vwrite(kbuf, (char *)p, sz);
> > - count -= sz;
> > - buf += sz;
> > - virtr += sz;
> > - p += sz;
> > + n = vwrite(kbuf, (char *)p, sz);
> > + count -= n;
> > + buf += n;
> > + virtr += n;
> > + p += n;
> > + if (n < sz)
> > + break;
> > }
> > free_page((unsigned long)kbuf);
> > }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-09-15 01:55:08

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] devmem: handle partial kmem write/read

On Tue, 15 Sep 2009 09:24:48 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Mon, 14 Sep 2009 12:34:44 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > Hi Kame,
> >
> > This patch needs more work. I first intent to fix a bug:
> >
> > sz = vwrite(kbuf, (char *)p, sz);
> > p += sz;
> > }
> >
> > So if the returned len is 0, the kbuf/p pointers will mismatch.
> >
> > Then I realize it changed the write behavior. The current vwrite()
> > behavior is strange, it returns 0 if _whole range_ is hole, otherwise
> > ignores the hole silently. So holes can be treated differently even in
> > the original code.
> >
> Ah, ok...
>

I'd like to post a fix...but it seems you are woriking.
Hmm, I'll post a patch including your fix (and your sign.)
Is it ok ?

Thanks,
-Kame


> > I'm not really sure about the right behavior. KAME-san, do you have
> > any suggestions?
> >
> maybe following make sense.
> =
> written = vwrite(kbuf, (char *p), sz);
> if (!written) // whole vmem was a hole
> written = sz;
> ==
> needs fix.
>
> Anyway, I wonder kmem is broken. It's should be totally rewritten.
>
> For example, this doesn't check anything.
> ==
> if (p < (unsigned long) high_memory) {
>
> ==
>
> But....are there users ?
> If necessary, I'll write some...
>
> Thanks,
> -Kame
>
>
> > Thanks,
> > Fengguang
> >
> > On Mon, Sep 14, 2009 at 11:29:01AM +0800, Wu Fengguang wrote:
> > > Current vwrite silently ignores vwrite() to vmap holes.
> > > Better behavior should be stop the write and return
> > > on such holes.
> > >
> > > Also return on partial read, which may happen in future
> > > (eg. hit hwpoison pages).
> > >
> > > CC: Andi Kleen <[email protected]>
> > > Signed-off-by: Wu Fengguang <[email protected]>
> > > ---
> > > drivers/char/mem.c | 30 ++++++++++++++++++------------
> > > 1 file changed, 18 insertions(+), 12 deletions(-)
> > >
> > > --- linux-mm.orig/drivers/char/mem.c 2009-09-14 10:59:50.000000000 +0800
> > > +++ linux-mm/drivers/char/mem.c 2009-09-14 11:06:25.000000000 +0800
> > > @@ -444,18 +444,22 @@ static ssize_t read_kmem(struct file *fi
> > > if (!kbuf)
> > > return -ENOMEM;
> > > while (count > 0) {
> > > + unsigned long n;
> > > +
> > > sz = size_inside_page(p, count);
> > > - sz = vread(kbuf, (char *)p, sz);
> > > - if (!sz)
> > > + n = vread(kbuf, (char *)p, sz);
> > > + if (!n)
> > > break;
> > > - if (copy_to_user(buf, kbuf, sz)) {
> > > + if (copy_to_user(buf, kbuf, n)) {
> > > free_page((unsigned long)kbuf);
> > > return -EFAULT;
> > > }
> > > - count -= sz;
> > > - buf += sz;
> > > - read += sz;
> > > - p += sz;
> > > + count -= n;
> > > + buf += n;
> > > + read += n;
> > > + p += n;
> > > + if (n < sz)
> > > + break;
> > > }
> > > free_page((unsigned long)kbuf);
> > > }
> > > @@ -551,11 +555,13 @@ static ssize_t write_kmem(struct file *
> > > free_page((unsigned long)kbuf);
> > > return -EFAULT;
> > > }
> > > - sz = vwrite(kbuf, (char *)p, sz);
> > > - count -= sz;
> > > - buf += sz;
> > > - virtr += sz;
> > > - p += sz;
> > > + n = vwrite(kbuf, (char *)p, sz);
> > > + count -= n;
> > > + buf += n;
> > > + virtr += n;
> > > + p += n;
> > > + if (n < sz)
> > > + break;
> > > }
> > > free_page((unsigned long)kbuf);
> > > }
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-09-15 02:02:28

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] devmem: handle partial kmem write/read

On Tue, Sep 15, 2009 at 08:24:48AM +0800, KAMEZAWA Hiroyuki wrote:
> On Mon, 14 Sep 2009 12:34:44 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > Hi Kame,
> >
> > This patch needs more work. I first intent to fix a bug:
> >
> > sz = vwrite(kbuf, (char *)p, sz);
> > p += sz;
> > }
> >
> > So if the returned len is 0, the kbuf/p pointers will mismatch.
> >
> > Then I realize it changed the write behavior. The current vwrite()
> > behavior is strange, it returns 0 if _whole range_ is hole, otherwise
> > ignores the hole silently. So holes can be treated differently even in
> > the original code.
> >
> Ah, ok...
>
> > I'm not really sure about the right behavior. KAME-san, do you have
> > any suggestions?
> >
> maybe following make sense.
> =
> written = vwrite(kbuf, (char *p), sz);
> if (!written) // whole vmem was a hole
> written = sz;

Since the 0 return value won't be used at all, it would be simpler to
tell vwrite() return the untouched buflen/sz, like this. It will ignore
_all_ the holes silently. Need to update comments too.

--- linux-mm.orig/mm/vmalloc.c 2009-09-15 09:40:08.000000000 +0800
+++ linux-mm/mm/vmalloc.c 2009-09-15 09:43:33.000000000 +0800
@@ -1834,7 +1834,6 @@ long vwrite(char *buf, char *addr, unsig
struct vm_struct *tmp;
char *vaddr;
unsigned long n, buflen;
- int copied = 0;

/* Don't allow overflow */
if ((unsigned long) addr + count < count)
@@ -1858,7 +1857,6 @@ long vwrite(char *buf, char *addr, unsig
n = count;
if (!(tmp->flags & VM_IOREMAP)) {
aligned_vwrite(buf, addr, n);
- copied++;
}
buf += n;
addr += n;
@@ -1866,8 +1864,6 @@ long vwrite(char *buf, char *addr, unsig
}
finished:
read_unlock(&vmlist_lock);
- if (!copied)
- return 0;
return buflen;
}

> ==
> needs fix.
>
> Anyway, I wonder kmem is broken. It's should be totally rewritten.
>
> For example, this doesn't check anything.
> ==
> if (p < (unsigned long) high_memory) {
>
> ==
>
> But....are there users ?
> If necessary, I'll write some...

I'm trying to stop possible mem/kmem users to access hwpoison pages..
I'm not the user, but rather a tester ;)

Thanks,
Fengguang

> Thanks,
> -Kame
>
>
> > Thanks,
> > Fengguang
> >
> > On Mon, Sep 14, 2009 at 11:29:01AM +0800, Wu Fengguang wrote:
> > > Current vwrite silently ignores vwrite() to vmap holes.
> > > Better behavior should be stop the write and return
> > > on such holes.
> > >
> > > Also return on partial read, which may happen in future
> > > (eg. hit hwpoison pages).
> > >
> > > CC: Andi Kleen <[email protected]>
> > > Signed-off-by: Wu Fengguang <[email protected]>
> > > ---
> > > drivers/char/mem.c | 30 ++++++++++++++++++------------
> > > 1 file changed, 18 insertions(+), 12 deletions(-)
> > >
> > > --- linux-mm.orig/drivers/char/mem.c 2009-09-14 10:59:50.000000000 +0800
> > > +++ linux-mm/drivers/char/mem.c 2009-09-14 11:06:25.000000000 +0800
> > > @@ -444,18 +444,22 @@ static ssize_t read_kmem(struct file *fi
> > > if (!kbuf)
> > > return -ENOMEM;
> > > while (count > 0) {
> > > + unsigned long n;
> > > +
> > > sz = size_inside_page(p, count);
> > > - sz = vread(kbuf, (char *)p, sz);
> > > - if (!sz)
> > > + n = vread(kbuf, (char *)p, sz);
> > > + if (!n)
> > > break;
> > > - if (copy_to_user(buf, kbuf, sz)) {
> > > + if (copy_to_user(buf, kbuf, n)) {
> > > free_page((unsigned long)kbuf);
> > > return -EFAULT;
> > > }
> > > - count -= sz;
> > > - buf += sz;
> > > - read += sz;
> > > - p += sz;
> > > + count -= n;
> > > + buf += n;
> > > + read += n;
> > > + p += n;
> > > + if (n < sz)
> > > + break;
> > > }
> > > free_page((unsigned long)kbuf);
> > > }
> > > @@ -551,11 +555,13 @@ static ssize_t write_kmem(struct file *
> > > free_page((unsigned long)kbuf);
> > > return -EFAULT;
> > > }
> > > - sz = vwrite(kbuf, (char *)p, sz);
> > > - count -= sz;
> > > - buf += sz;
> > > - virtr += sz;
> > > - p += sz;
> > > + n = vwrite(kbuf, (char *)p, sz);
> > > + count -= n;
> > > + buf += n;
> > > + virtr += n;
> > > + p += n;
> > > + if (n < sz)
> > > + break;
> > > }
> > > free_page((unsigned long)kbuf);
> > > }
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >

2009-09-15 02:05:35

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] devmem: handle partial kmem write/read

On Tue, Sep 15, 2009 at 09:52:55AM +0800, KAMEZAWA Hiroyuki wrote:
> On Tue, 15 Sep 2009 09:24:48 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Mon, 14 Sep 2009 12:34:44 +0800
> > Wu Fengguang <[email protected]> wrote:
> >
> > > Hi Kame,
> > >
> > > This patch needs more work. I first intent to fix a bug:
> > >
> > > sz = vwrite(kbuf, (char *)p, sz);
> > > p += sz;
> > > }
> > >
> > > So if the returned len is 0, the kbuf/p pointers will mismatch.
> > >
> > > Then I realize it changed the write behavior. The current vwrite()
> > > behavior is strange, it returns 0 if _whole range_ is hole, otherwise
> > > ignores the hole silently. So holes can be treated differently even in
> > > the original code.
> > >
> > Ah, ok...
> >
>
> I'd like to post a fix...but it seems you are woriking.
> Hmm, I'll post a patch including your fix (and your sign.)
> Is it ok ?

Sure OK. You can work based on my patches. Many of them are in -mm now
and there are three more. I'll show you in a series.

Thanks,
Fengguang

> > > I'm not really sure about the right behavior. KAME-san, do you have
> > > any suggestions?
> > >
> > maybe following make sense.
> > =
> > written = vwrite(kbuf, (char *p), sz);
> > if (!written) // whole vmem was a hole
> > written = sz;
> > ==
> > needs fix.
> >
> > Anyway, I wonder kmem is broken. It's should be totally rewritten.
> >
> > For example, this doesn't check anything.
> > ==
> > if (p < (unsigned long) high_memory) {
> >
> > ==
> >
> > But....are there users ?
> > If necessary, I'll write some...
> >
> > Thanks,
> > -Kame
> >
> >
> > > Thanks,
> > > Fengguang
> > >
> > > On Mon, Sep 14, 2009 at 11:29:01AM +0800, Wu Fengguang wrote:
> > > > Current vwrite silently ignores vwrite() to vmap holes.
> > > > Better behavior should be stop the write and return
> > > > on such holes.
> > > >
> > > > Also return on partial read, which may happen in future
> > > > (eg. hit hwpoison pages).
> > > >
> > > > CC: Andi Kleen <[email protected]>
> > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > > ---
> > > > drivers/char/mem.c | 30 ++++++++++++++++++------------
> > > > 1 file changed, 18 insertions(+), 12 deletions(-)
> > > >
> > > > --- linux-mm.orig/drivers/char/mem.c 2009-09-14 10:59:50.000000000 +0800
> > > > +++ linux-mm/drivers/char/mem.c 2009-09-14 11:06:25.000000000 +0800
> > > > @@ -444,18 +444,22 @@ static ssize_t read_kmem(struct file *fi
> > > > if (!kbuf)
> > > > return -ENOMEM;
> > > > while (count > 0) {
> > > > + unsigned long n;
> > > > +
> > > > sz = size_inside_page(p, count);
> > > > - sz = vread(kbuf, (char *)p, sz);
> > > > - if (!sz)
> > > > + n = vread(kbuf, (char *)p, sz);
> > > > + if (!n)
> > > > break;
> > > > - if (copy_to_user(buf, kbuf, sz)) {
> > > > + if (copy_to_user(buf, kbuf, n)) {
> > > > free_page((unsigned long)kbuf);
> > > > return -EFAULT;
> > > > }
> > > > - count -= sz;
> > > > - buf += sz;
> > > > - read += sz;
> > > > - p += sz;
> > > > + count -= n;
> > > > + buf += n;
> > > > + read += n;
> > > > + p += n;
> > > > + if (n < sz)
> > > > + break;
> > > > }
> > > > free_page((unsigned long)kbuf);
> > > > }
> > > > @@ -551,11 +555,13 @@ static ssize_t write_kmem(struct file *
> > > > free_page((unsigned long)kbuf);
> > > > return -EFAULT;
> > > > }
> > > > - sz = vwrite(kbuf, (char *)p, sz);
> > > > - count -= sz;
> > > > - buf += sz;
> > > > - virtr += sz;
> > > > - p += sz;
> > > > + n = vwrite(kbuf, (char *)p, sz);
> > > > + count -= n;
> > > > + buf += n;
> > > > + virtr += n;
> > > > + p += n;
> > > > + if (n < sz)
> > > > + break;
> > > > }
> > > > free_page((unsigned long)kbuf);
> > > > }
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > the body of a message to [email protected]
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > Please read the FAQ at http://www.tux.org/lkml/
> > >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >

2009-09-15 02:33:47

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] devmem: handle partial kmem write/read

On Tue, 15 Sep 2009 10:02:08 +0800
Wu Fengguang <[email protected]> wrote:

> On Tue, Sep 15, 2009 at 08:24:48AM +0800, KAMEZAWA Hiroyuki wrote:
> > On Mon, 14 Sep 2009 12:34:44 +0800
> > Wu Fengguang <[email protected]> wrote:
> >
> > > Hi Kame,
> > >
> > > This patch needs more work. I first intent to fix a bug:
> > >
> > > sz = vwrite(kbuf, (char *)p, sz);
> > > p += sz;
> > > }
> > >
> > > So if the returned len is 0, the kbuf/p pointers will mismatch.
> > >
> > > Then I realize it changed the write behavior. The current vwrite()
> > > behavior is strange, it returns 0 if _whole range_ is hole, otherwise
> > > ignores the hole silently. So holes can be treated differently even in
> > > the original code.
> > >
> > Ah, ok...
> >
> > > I'm not really sure about the right behavior. KAME-san, do you have
> > > any suggestions?
> > >
> > maybe following make sense.
> > =
> > written = vwrite(kbuf, (char *p), sz);
> > if (!written) // whole vmem was a hole
> > written = sz;
>
> Since the 0 return value won't be used at all, it would be simpler to
> tell vwrite() return the untouched buflen/sz, like this. It will ignore
> _all_ the holes silently. Need to update comments too.
>

Hmm. IIUC the original kmem code returns immediately if vread/vwrite returns 0.
But it seems this depends on wrong assumption that vmap area is continuous,
at the first look.

In man(4) mem,kmem
==
Byte addresses in mem are interpreted as physical memory addresses.
References to nonexistent locations cause errors to be returned.
.....
The file kmem is the same as mem, except that the kernel virtual memory
rather than physical memory is accessed.

==

Then, we have to return error for accesses to "nonexistent locations".

memory-hole in vmap area is ....."nonexistent" ?
I think it's nonexistent if there are no overlaps between requested [pos, pos+len)
and registerred vmalloc area.

But, hmm, there are no way for users to know "existing vmalloc area".
Then, my above idea may be wrong.

Then, I'd like to modify as following,

- If is_vmalloc_or_module_addr(requested address) is false, return -EFAULT.
- If is_vmalloc_or_module_addr(requested address) is true, return no error.
Even if specified range doesn't include no exsiting vmalloc area.

How do you think ?

Thanks,
-Kame
p.s. I wonder current /dev/kmem cannot read text area of kernel if it's not
directly mapped.



> --- linux-mm.orig/mm/vmalloc.c 2009-09-15 09:40:08.000000000 +0800
> +++ linux-mm/mm/vmalloc.c 2009-09-15 09:43:33.000000000 +0800
> @@ -1834,7 +1834,6 @@ long vwrite(char *buf, char *addr, unsig
> struct vm_struct *tmp;
> char *vaddr;
> unsigned long n, buflen;
> - int copied = 0;
>
> /* Don't allow overflow */
> if ((unsigned long) addr + count < count)
> @@ -1858,7 +1857,6 @@ long vwrite(char *buf, char *addr, unsig
> n = count;
> if (!(tmp->flags & VM_IOREMAP)) {
> aligned_vwrite(buf, addr, n);
> - copied++;
> }
> buf += n;
> addr += n;
> @@ -1866,8 +1864,6 @@ long vwrite(char *buf, char *addr, unsig
> }
> finished:
> read_unlock(&vmlist_lock);
> - if (!copied)
> - return 0;
> return buflen;
> }
>
> > ==
> > needs fix.
> >
> > Anyway, I wonder kmem is broken. It's should be totally rewritten.
> >
> > For example, this doesn't check anything.
> > ==
> > if (p < (unsigned long) high_memory) {
> >
> > ==
> >
> > But....are there users ?
> > If necessary, I'll write some...
>
> I'm trying to stop possible mem/kmem users to access hwpoison pages..
> I'm not the user, but rather a tester ;)
>
> Thanks,
> Fengguang
>
> > Thanks,
> > -Kame
> >
> >
> > > Thanks,
> > > Fengguang
> > >
> > > On Mon, Sep 14, 2009 at 11:29:01AM +0800, Wu Fengguang wrote:
> > > > Current vwrite silently ignores vwrite() to vmap holes.
> > > > Better behavior should be stop the write and return
> > > > on such holes.
> > > >
> > > > Also return on partial read, which may happen in future
> > > > (eg. hit hwpoison pages).
> > > >
> > > > CC: Andi Kleen <[email protected]>
> > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > > ---
> > > > drivers/char/mem.c | 30 ++++++++++++++++++------------
> > > > 1 file changed, 18 insertions(+), 12 deletions(-)
> > > >
> > > > --- linux-mm.orig/drivers/char/mem.c 2009-09-14 10:59:50.000000000 +0800
> > > > +++ linux-mm/drivers/char/mem.c 2009-09-14 11:06:25.000000000 +0800
> > > > @@ -444,18 +444,22 @@ static ssize_t read_kmem(struct file *fi
> > > > if (!kbuf)
> > > > return -ENOMEM;
> > > > while (count > 0) {
> > > > + unsigned long n;
> > > > +
> > > > sz = size_inside_page(p, count);
> > > > - sz = vread(kbuf, (char *)p, sz);
> > > > - if (!sz)
> > > > + n = vread(kbuf, (char *)p, sz);
> > > > + if (!n)
> > > > break;
> > > > - if (copy_to_user(buf, kbuf, sz)) {
> > > > + if (copy_to_user(buf, kbuf, n)) {
> > > > free_page((unsigned long)kbuf);
> > > > return -EFAULT;
> > > > }
> > > > - count -= sz;
> > > > - buf += sz;
> > > > - read += sz;
> > > > - p += sz;
> > > > + count -= n;
> > > > + buf += n;
> > > > + read += n;
> > > > + p += n;
> > > > + if (n < sz)
> > > > + break;
> > > > }
> > > > free_page((unsigned long)kbuf);
> > > > }
> > > > @@ -551,11 +555,13 @@ static ssize_t write_kmem(struct file *
> > > > free_page((unsigned long)kbuf);
> > > > return -EFAULT;
> > > > }
> > > > - sz = vwrite(kbuf, (char *)p, sz);
> > > > - count -= sz;
> > > > - buf += sz;
> > > > - virtr += sz;
> > > > - p += sz;
> > > > + n = vwrite(kbuf, (char *)p, sz);
> > > > + count -= n;
> > > > + buf += n;
> > > > + virtr += n;
> > > > + p += n;
> > > > + if (n < sz)
> > > > + break;
> > > > }
> > > > free_page((unsigned long)kbuf);
> > > > }
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > the body of a message to [email protected]
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > Please read the FAQ at http://www.tux.org/lkml/
> > >
>

2009-09-15 02:57:30

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] devmem: handle partial kmem write/read

On Tue, Sep 15, 2009 at 10:31:13AM +0800, KAMEZAWA Hiroyuki wrote:
> On Tue, 15 Sep 2009 10:02:08 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > On Tue, Sep 15, 2009 at 08:24:48AM +0800, KAMEZAWA Hiroyuki wrote:
> > > On Mon, 14 Sep 2009 12:34:44 +0800
> > > Wu Fengguang <[email protected]> wrote:
> > >
> > > > Hi Kame,
> > > >
> > > > This patch needs more work. I first intent to fix a bug:
> > > >
> > > > sz = vwrite(kbuf, (char *)p, sz);
> > > > p += sz;
> > > > }
> > > >
> > > > So if the returned len is 0, the kbuf/p pointers will mismatch.
> > > >
> > > > Then I realize it changed the write behavior. The current vwrite()
> > > > behavior is strange, it returns 0 if _whole range_ is hole, otherwise
> > > > ignores the hole silently. So holes can be treated differently even in
> > > > the original code.
> > > >
> > > Ah, ok...
> > >
> > > > I'm not really sure about the right behavior. KAME-san, do you have
> > > > any suggestions?
> > > >
> > > maybe following make sense.
> > > =
> > > written = vwrite(kbuf, (char *p), sz);
> > > if (!written) // whole vmem was a hole
> > > written = sz;
> >
> > Since the 0 return value won't be used at all, it would be simpler to
> > tell vwrite() return the untouched buflen/sz, like this. It will ignore
> > _all_ the holes silently. Need to update comments too.
> >
>
> Hmm. IIUC the original kmem code returns immediately if vread/vwrite returns 0.

Agreed for vread. It seems vwrite don't do that for both kmem and
vmalloc part.

> But it seems this depends on wrong assumption that vmap area is continuous,
> at the first look.

You mean it is possible that not all pages are mapped in a vmlist addr,size range?
Is vmalloc areas guaranteed to be page aligned? Sorry for the dumb questions.

> In man(4) mem,kmem
> ==
> Byte addresses in mem are interpreted as physical memory addresses.
> References to nonexistent locations cause errors to be returned.
> .....
> The file kmem is the same as mem, except that the kernel virtual memory
> rather than physical memory is accessed.
>
> ==
>
> Then, we have to return error for accesses to "nonexistent locations".

That's one important factor to consider. On the other hand, the
original kmem read/write implementation don't return error code for
holes. Instead kmem read returns early, while kmem write ignores holes
but is buggy..

> memory-hole in vmap area is ....."nonexistent" ?
> I think it's nonexistent if there are no overlaps between requested [pos, pos+len)
> and registerred vmalloc area.
> But, hmm, there are no way for users to know "existing vmalloc area".
> Then, my above idea may be wrong.
>
> Then, I'd like to modify as following,
>
> - If is_vmalloc_or_module_addr(requested address) is false, return -EFAULT.
> - If is_vmalloc_or_module_addr(requested address) is true, return no error.
> Even if specified range doesn't include no exsiting vmalloc area.
>
> How do you think ?

Looks reasonable to me. But it's good to hear more wide opinions..

> Thanks,
> -Kame
> p.s. I wonder current /dev/kmem cannot read text area of kernel if it's not
> directly mapped.

Attached is a small tool (from LKML) for reading 'modprobe_path' from kmem,
it's not text, but is close..

Thanks,
Fengguang

>
>
> > --- linux-mm.orig/mm/vmalloc.c 2009-09-15 09:40:08.000000000 +0800
> > +++ linux-mm/mm/vmalloc.c 2009-09-15 09:43:33.000000000 +0800
> > @@ -1834,7 +1834,6 @@ long vwrite(char *buf, char *addr, unsig
> > struct vm_struct *tmp;
> > char *vaddr;
> > unsigned long n, buflen;
> > - int copied = 0;
> >
> > /* Don't allow overflow */
> > if ((unsigned long) addr + count < count)
> > @@ -1858,7 +1857,6 @@ long vwrite(char *buf, char *addr, unsig
> > n = count;
> > if (!(tmp->flags & VM_IOREMAP)) {
> > aligned_vwrite(buf, addr, n);
> > - copied++;
> > }
> > buf += n;
> > addr += n;
> > @@ -1866,8 +1864,6 @@ long vwrite(char *buf, char *addr, unsig
> > }
> > finished:
> > read_unlock(&vmlist_lock);
> > - if (!copied)
> > - return 0;
> > return buflen;
> > }
> >
> > > ==
> > > needs fix.
> > >
> > > Anyway, I wonder kmem is broken. It's should be totally rewritten.
> > >
> > > For example, this doesn't check anything.
> > > ==
> > > if (p < (unsigned long) high_memory) {
> > >
> > > ==
> > >
> > > But....are there users ?
> > > If necessary, I'll write some...
> >
> > I'm trying to stop possible mem/kmem users to access hwpoison pages..
> > I'm not the user, but rather a tester ;)
> >
> > Thanks,
> > Fengguang
> >
> > > Thanks,
> > > -Kame
> > >
> > >
> > > > Thanks,
> > > > Fengguang
> > > >
> > > > On Mon, Sep 14, 2009 at 11:29:01AM +0800, Wu Fengguang wrote:
> > > > > Current vwrite silently ignores vwrite() to vmap holes.
> > > > > Better behavior should be stop the write and return
> > > > > on such holes.
> > > > >
> > > > > Also return on partial read, which may happen in future
> > > > > (eg. hit hwpoison pages).
> > > > >
> > > > > CC: Andi Kleen <[email protected]>
> > > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > > > ---
> > > > > drivers/char/mem.c | 30 ++++++++++++++++++------------
> > > > > 1 file changed, 18 insertions(+), 12 deletions(-)
> > > > >
> > > > > --- linux-mm.orig/drivers/char/mem.c 2009-09-14 10:59:50.000000000 +0800
> > > > > +++ linux-mm/drivers/char/mem.c 2009-09-14 11:06:25.000000000 +0800
> > > > > @@ -444,18 +444,22 @@ static ssize_t read_kmem(struct file *fi
> > > > > if (!kbuf)
> > > > > return -ENOMEM;
> > > > > while (count > 0) {
> > > > > + unsigned long n;
> > > > > +
> > > > > sz = size_inside_page(p, count);
> > > > > - sz = vread(kbuf, (char *)p, sz);
> > > > > - if (!sz)
> > > > > + n = vread(kbuf, (char *)p, sz);
> > > > > + if (!n)
> > > > > break;
> > > > > - if (copy_to_user(buf, kbuf, sz)) {
> > > > > + if (copy_to_user(buf, kbuf, n)) {
> > > > > free_page((unsigned long)kbuf);
> > > > > return -EFAULT;
> > > > > }
> > > > > - count -= sz;
> > > > > - buf += sz;
> > > > > - read += sz;
> > > > > - p += sz;
> > > > > + count -= n;
> > > > > + buf += n;
> > > > > + read += n;
> > > > > + p += n;
> > > > > + if (n < sz)
> > > > > + break;
> > > > > }
> > > > > free_page((unsigned long)kbuf);
> > > > > }
> > > > > @@ -551,11 +555,13 @@ static ssize_t write_kmem(struct file *
> > > > > free_page((unsigned long)kbuf);
> > > > > return -EFAULT;
> > > > > }
> > > > > - sz = vwrite(kbuf, (char *)p, sz);
> > > > > - count -= sz;
> > > > > - buf += sz;
> > > > > - virtr += sz;
> > > > > - p += sz;
> > > > > + n = vwrite(kbuf, (char *)p, sz);
> > > > > + count -= n;
> > > > > + buf += n;
> > > > > + virtr += n;
> > > > > + p += n;
> > > > > + if (n < sz)
> > > > > + break;
> > > > > }
> > > > > free_page((unsigned long)kbuf);
> > > > > }
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > > the body of a message to [email protected]
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > Please read the FAQ at http://www.tux.org/lkml/
> > > >
> >


Attachments:
(No filename) (9.86 kB)
tmap.c (1.35 kB)
Download all attachments

2009-09-15 08:01:17

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Question: how to handle too big f_pos Re: [PATCH] devmem: handle partial kmem write/read

I'm writing a patch against /dev/kmem...I found a problem.

/dev/kmem (and /proc/<pid>/mem) puts virtual addres to f->f_pos.

And, f->f_pos is loff_t .... signed value (not unsigned)

Then, this check
rw_verify_area(READ, file, pos, count);
=>
pos = *ppos;
if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
return retval;

always returns -EINVAL when I read /dev/kmem's valid address.

Then, how should I do for read /dev/kmem ? Any idea ?
(Added Al Viro to CC:)

What I'm really afraid of is /proc/<pid>/mem. IIUC, it's used by gdb to snoop
memory, (for example, at gcore).
I think gdb uses ptrace if it fails to read /proc/<pid>/mem but...it's deadly
slow.

Thanks,
-Kame

p.s. no problems in /proc/kcore..its offset is not bare addresss.





On Mon, 14 Sep 2009 11:29:01 +0800
Wu Fengguang <[email protected]> wrote:

> Current vwrite silently ignores vwrite() to vmap holes.
> Better behavior should be stop the write and return
> on such holes.
>
> Also return on partial read, which may happen in future
> (eg. hit hwpoison pages).
>
> CC: Andi Kleen <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> drivers/char/mem.c | 30 ++++++++++++++++++------------
> 1 file changed, 18 insertions(+), 12 deletions(-)
>
> --- linux-mm.orig/drivers/char/mem.c 2009-09-14 10:59:50.000000000 +0800
> +++ linux-mm/drivers/char/mem.c 2009-09-14 11:06:25.000000000 +0800
> @@ -444,18 +444,22 @@ static ssize_t read_kmem(struct file *fi
> if (!kbuf)
> return -ENOMEM;
> while (count > 0) {
> + unsigned long n;
> +
> sz = size_inside_page(p, count);
> - sz = vread(kbuf, (char *)p, sz);
> - if (!sz)
> + n = vread(kbuf, (char *)p, sz);
> + if (!n)
> break;
> - if (copy_to_user(buf, kbuf, sz)) {
> + if (copy_to_user(buf, kbuf, n)) {
> free_page((unsigned long)kbuf);
> return -EFAULT;
> }
> - count -= sz;
> - buf += sz;
> - read += sz;
> - p += sz;
> + count -= n;
> + buf += n;
> + read += n;
> + p += n;
> + if (n < sz)
> + break;
> }
> free_page((unsigned long)kbuf);
> }
> @@ -551,11 +555,13 @@ static ssize_t write_kmem(struct file *
> free_page((unsigned long)kbuf);
> return -EFAULT;
> }
> - sz = vwrite(kbuf, (char *)p, sz);
> - count -= sz;
> - buf += sz;
> - virtr += sz;
> - p += sz;
> + n = vwrite(kbuf, (char *)p, sz);
> + count -= n;
> + buf += n;
> + virtr += n;
> + p += n;
> + if (n < sz)
> + break;
> }
> free_page((unsigned long)kbuf);
> }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-09-15 08:11:37

by Fengguang Wu

[permalink] [raw]
Subject: Re: Question: how to handle too big f_pos Re: [PATCH] devmem: handle partial kmem write/read

Good catch, Kame!

FYI mmap is OK. I succeeded in reading this address via kmem:

ffffffff81723880 D modprobe_path

Thanks,
Fengguang

On Tue, Sep 15, 2009 at 03:58:52PM +0800, KAMEZAWA Hiroyuki wrote:
> I'm writing a patch against /dev/kmem...I found a problem.
>
> /dev/kmem (and /proc/<pid>/mem) puts virtual addres to f->f_pos.
>
> And, f->f_pos is loff_t .... signed value (not unsigned)
>
> Then, this check
> rw_verify_area(READ, file, pos, count);
> =>
> pos = *ppos;
> if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
> return retval;
>
> always returns -EINVAL when I read /dev/kmem's valid address.
>
> Then, how should I do for read /dev/kmem ? Any idea ?
> (Added Al Viro to CC:)
>
> What I'm really afraid of is /proc/<pid>/mem. IIUC, it's used by gdb to snoop
> memory, (for example, at gcore).
> I think gdb uses ptrace if it fails to read /proc/<pid>/mem but...it's deadly
> slow.
>
> Thanks,
> -Kame
>
> p.s. no problems in /proc/kcore..its offset is not bare addresss.
>
>
>
>
>
> On Mon, 14 Sep 2009 11:29:01 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > Current vwrite silently ignores vwrite() to vmap holes.
> > Better behavior should be stop the write and return
> > on such holes.
> >
> > Also return on partial read, which may happen in future
> > (eg. hit hwpoison pages).
> >
> > CC: Andi Kleen <[email protected]>
> > Signed-off-by: Wu Fengguang <[email protected]>
> > ---
> > drivers/char/mem.c | 30 ++++++++++++++++++------------
> > 1 file changed, 18 insertions(+), 12 deletions(-)
> >
> > --- linux-mm.orig/drivers/char/mem.c 2009-09-14 10:59:50.000000000 +0800
> > +++ linux-mm/drivers/char/mem.c 2009-09-14 11:06:25.000000000 +0800
> > @@ -444,18 +444,22 @@ static ssize_t read_kmem(struct file *fi
> > if (!kbuf)
> > return -ENOMEM;
> > while (count > 0) {
> > + unsigned long n;
> > +
> > sz = size_inside_page(p, count);
> > - sz = vread(kbuf, (char *)p, sz);
> > - if (!sz)
> > + n = vread(kbuf, (char *)p, sz);
> > + if (!n)
> > break;
> > - if (copy_to_user(buf, kbuf, sz)) {
> > + if (copy_to_user(buf, kbuf, n)) {
> > free_page((unsigned long)kbuf);
> > return -EFAULT;
> > }
> > - count -= sz;
> > - buf += sz;
> > - read += sz;
> > - p += sz;
> > + count -= n;
> > + buf += n;
> > + read += n;
> > + p += n;
> > + if (n < sz)
> > + break;
> > }
> > free_page((unsigned long)kbuf);
> > }
> > @@ -551,11 +555,13 @@ static ssize_t write_kmem(struct file *
> > free_page((unsigned long)kbuf);
> > return -EFAULT;
> > }
> > - sz = vwrite(kbuf, (char *)p, sz);
> > - count -= sz;
> > - buf += sz;
> > - virtr += sz;
> > - p += sz;
> > + n = vwrite(kbuf, (char *)p, sz);
> > + count -= n;
> > + buf += n;
> > + virtr += n;
> > + p += n;
> > + if (n < sz)
> > + break;
> > }
> > free_page((unsigned long)kbuf);
> > }
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >

2009-09-15 09:53:00

by Hugh Dickins

[permalink] [raw]
Subject: Re: Question: how to handle too big f_pos Re: [PATCH] devmem: handle partial kmem write/read

On Tue, 15 Sep 2009, KAMEZAWA Hiroyuki wrote:

> I'm writing a patch against /dev/kmem...I found a problem.
>
> /dev/kmem (and /proc/<pid>/mem) puts virtual addres to f->f_pos.
>
> And, f->f_pos is loff_t .... signed value (not unsigned)
>
> Then, this check
> rw_verify_area(READ, file, pos, count);
> =>
> pos = *ppos;
> if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
> return retval;
>
> always returns -EINVAL when I read /dev/kmem's valid address.
>
> Then, how should I do for read /dev/kmem ? Any idea ?
> (Added Al Viro to CC:)

I believe what's supposed to happen is that special drivers like
/dev/mem and /dev/kmem provide their own .llseek method (they do).
But at some point along the way rw_verify_area() got "improved"
and now prevents them working on many configurations.

I noticed around 2.6.24-rc1, and have been building my debug kernels
with the patch below ever since then, as I sometimes like to peek and
poke into /dev/kmem to examine and try different things while running.

But whether it's the right patch is another matter. Though this should
be independent of that, I've also got a patch at the drivers/char/mem.c
end (I'll post that shortly in the main thread, not appropriate here),
and have never found time to consider whether these hacks amount to
anything more than "works for me".

IIRC, this patch below only covers one aspect of the negative offset
problem. Where the problems lie does change from time to time - back
in 2003 I used to use pread() and pwrite() to avoid llseek() issues;
but when I came back to it in 2007, I think I found those no longer
worked (on i386 or x86_64 or powerpc?), and lseek64() was the best bet.

Anyway, for what it's worth, ...

--- 2.6.31/fs/read_write.c 2009-09-09 23:13:59.000000000 +0100
+++ 2.6.31d/fs/read_write.c 2009-09-10 09:38:30.000000000 +0100
@@ -222,8 +222,13 @@ int rw_verify_area(int read_write, struc
if (unlikely((ssize_t) count < 0))
return retval;
pos = *ppos;
- if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
- return retval;
+ if (pos >= 0) {
+ if (unlikely((loff_t) (pos + count) < 0))
+ count = 1 + (size_t) LLONG_MAX - pos;
+ } else {
+ if (unlikely((loff_t) (pos + count) > 0))
+ count = - pos;
+ }

if (unlikely(inode->i_flock && mandatory_lock(inode))) {
retval = locks_mandatory_area(

2009-09-16 05:32:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH][bugfix] more checks for negative f_pos handling (Was Re: Question: how to handle too big f_pos


The problem:
> I'm writing a patch against /dev/kmem...I found a problem.
>
> /dev/kmem (and /proc/<pid>/mem) puts virtual addres to f->f_pos.
>
> but f->f_pos is always negative and rw_verify_ara() returns -EINVAL always.

Changed CC: List.

This is a trial to consider how to fix negative f_pos problem shown in above.

Hmm, even after this patch, x86's vsyscall area is not readable.
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
But maybe no problems. (now, it cannot be read, anyway.)

I tested /dev/kmem on x86-64 and this works fine. I added a fix for
/proc/<pid>/mem because I know ia64's hugetlbe area is not readable
via /proc/<pid>/mem. (But I'm not sure other 64bit arch has this
kind of problems in /proc/<pid>/mem)

==
From: KAMEZAWA Hiruyoki <[email protected]>

Modifying rw_verify_area()'s negative f_pos check.

Now, rw_verify_area() has this check
if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
return -EINVAL

And access to special files as /dev/mem,kmem, /proc/<pid>/mem
returns unexpected -EINVAL.
(For example, ia64 maps hugetlb at 0x8000000000000000- region)

This patch tries to make range check more precise by using
llseek ops defined per special files.

Signed-off-by: KAMEZAWA Hiruyoki <[email protected]>
---
fs/proc/base.c | 22 +++++++++++++++++-----
fs/read_write.c | 39 +++++++++++++++++++++++++++++++++++++--
2 files changed, 54 insertions(+), 7 deletions(-)

Index: mmotm-2.6.31-Sep14/fs/read_write.c
===================================================================
--- mmotm-2.6.31-Sep14.orig/fs/read_write.c
+++ mmotm-2.6.31-Sep14/fs/read_write.c
@@ -205,6 +205,37 @@ bad:
}
#endif

+static int
+__verify_negative_pos_range(struct file *file, loff_t pos, size_t count)
+{
+ unsigned long long upos, end;
+ loff_t ret;
+
+ /* disallow overflow */
+ upos = (unsigned long long)pos;
+ end = upos + count;
+ if (end < pos)
+ return -EOVERFLOW;
+ /*
+ * Sanity check...subsystem has to provide llseek for handle big pos.
+ * Subsystem's llseek should verify f_pos's value comaparing with its
+ * max file size.
+ * Note1: generic file ops' llseek cannot handle negative pos.
+ * Note2: should we take care of pos == -EINVAL ?
+ * Note3: we check flags and ops here for avoiding taking locks in.
+ * default_lseek.
+ */
+ ret = -EINVAL;
+ if ((file->f_mode & FMODE_LSEEK) &&
+ (file->f_op && file->f_op->llseek)) {
+ ret = vfs_llseek(file, 0, SEEK_CUR);
+ if (ret == pos)
+ return 0;
+ }
+
+ return (int)ret;
+}
+
/*
* rw_verify_area doesn't like huge counts. We limit
* them to something that fits in "int" so that others
@@ -222,8 +253,12 @@ int rw_verify_area(int read_write, struc
if (unlikely((ssize_t) count < 0))
return retval;
pos = *ppos;
- if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
- return retval;
+ if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
+ /* some files requires special care */
+ retval = __verify_negative_pos_range(file, pos, count);
+ if (retval)
+ return retval;
+ }

if (unlikely(inode->i_flock && mandatory_lock(inode))) {
retval = locks_mandatory_area(
Index: mmotm-2.6.31-Sep14/fs/proc/base.c
===================================================================
--- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
+++ mmotm-2.6.31-Sep14/fs/proc/base.c
@@ -903,18 +903,30 @@ out_no_task:

loff_t mem_lseek(struct file *file, loff_t offset, int orig)
{
+ struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+ unsigned long long new_offset = -EINVAL;
+
+ if (!task) /* lseek's spec doesn't allow -ESRCH but... */
+ return -ESRCH;
+
switch (orig) {
case 0:
- file->f_pos = offset;
+ new_offset = offset;
break;
case 1:
- file->f_pos += offset;
+ new_offset = (unsigned long long)f->f_pos + offset;
break;
default:
- return -EINVAL;
+ new_offset = -EINVAL;
+ break;
}
- force_successful_syscall_return();
- return file->f_pos;
+ if (new_offset < (unsigned long long)TASK_SIZE_OF(task)) {
+ file->f_pos = (loff_t)new_offset;
+ force_successful_syscall_return();
+ } else
+ new_offset = -EINVAL;
+ put_task_struct(task);
+ return (loff_t)new_offset;
}

static const struct file_operations proc_mem_operations = {

2009-09-16 08:20:34

by Cong Wang

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling (Was Re: Question: how to handle too big f_pos

On Wed, Sep 16, 2009 at 1:29 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> The problem:
>> I'm writing a patch against /dev/kmem...I found a problem.
>>
>> /dev/kmem (and /proc/<pid>/mem) puts virtual addres to f->f_pos.
>>
>> but f->f_pos is always negative and rw_verify_ara() returns -EINVAL always.
>
> Changed CC: List.
>
> This is a trial to consider how to fix negative f_pos problem shown in above.
>
> Hmm, even after this patch, x86's vsyscall area is not readable.
> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]
> But maybe no problems. (now, it cannot be read, anyway.)
>
> I tested /dev/kmem on x86-64 and this works fine. I added a fix for
> /proc/<pid>/mem because I know ia64's hugetlbe area is not readable
> via /proc/<pid>/mem. (But I'm not sure other 64bit arch has this
> kind of problems in /proc/<pid>/mem)
>
> ==
> From: KAMEZAWA Hiruyoki <[email protected]>
>
> Modifying rw_verify_area()'s negative f_pos check.
>
> Now, rw_verify_area() has this check
>   if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
>                return -EINVAL
>
> And access to special files as /dev/mem,kmem, /proc/<pid>/mem
> returns unexpected -EINVAL.
> (For example, ia64 maps hugetlb at 0x8000000000000000- region)
>
> This patch tries to make range check more precise by using
> llseek ops defined per special files.
>
> Signed-off-by: KAMEZAWA Hiruyoki <[email protected]>
> ---
>  fs/proc/base.c  |   22 +++++++++++++++++-----
>  fs/read_write.c |   39 +++++++++++++++++++++++++++++++++++++--
>  2 files changed, 54 insertions(+), 7 deletions(-)
>
> Index: mmotm-2.6.31-Sep14/fs/read_write.c
> ===================================================================
> --- mmotm-2.6.31-Sep14.orig/fs/read_write.c
> +++ mmotm-2.6.31-Sep14/fs/read_write.c
> @@ -205,6 +205,37 @@ bad:
>  }
>  #endif
>
> +static int
> +__verify_negative_pos_range(struct file *file, loff_t pos, size_t count)
> +{
> +       unsigned long long upos, end;
> +       loff_t ret;
> +
> +       /* disallow overflow */
> +       upos = (unsigned long long)pos;
> +       end = upos + count;
> +       if (end < pos)
> +               return -EOVERFLOW;
> +       /*
> +        * Sanity check...subsystem has to provide llseek for handle big pos.
> +        * Subsystem's llseek should verify f_pos's value comaparing with its
> +        * max file size.
> +        * Note1: generic file ops' llseek cannot handle negative pos.
> +        * Note2: should we take care of pos == -EINVAL ?
> +        * Note3: we check flags and ops here for avoiding taking locks in.
> +        * default_lseek.
> +        */
> +       ret = -EINVAL;
> +       if ((file->f_mode & FMODE_LSEEK) &&
> +           (file->f_op && file->f_op->llseek)) {
> +               ret = vfs_llseek(file, 0, SEEK_CUR);
> +               if (ret == pos)
> +                       return 0;
> +       }
> +
> +       return (int)ret;
> +}
> +
>  /*
>  * rw_verify_area doesn't like huge counts. We limit
>  * them to something that fits in "int" so that others
> @@ -222,8 +253,12 @@ int rw_verify_area(int read_write, struc
>        if (unlikely((ssize_t) count < 0))
>                return retval;
>        pos = *ppos;
> -       if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
> -               return retval;
> +       if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
> +               /* some files requires special care */
> +               retval = __verify_negative_pos_range(file, pos, count);
> +               if (retval)
> +                       return retval;
> +       }
>
>        if (unlikely(inode->i_flock && mandatory_lock(inode))) {
>                retval = locks_mandatory_area(
> Index: mmotm-2.6.31-Sep14/fs/proc/base.c
> ===================================================================
> --- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
> +++ mmotm-2.6.31-Sep14/fs/proc/base.c
> @@ -903,18 +903,30 @@ out_no_task:
>
>  loff_t mem_lseek(struct file *file, loff_t offset, int orig)
>  {
> +       struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
> +       unsigned long long new_offset = -EINVAL;


Why not make 'new_offset' as loff_t? This can make your code easier.

> +
> +       if (!task) /* lseek's spec doesn't allow -ESRCH but... */


No worry, we have many ESRCH for proc files.

> +               return -ESRCH;
> +
>        switch (orig) {
>        case 0:
> -               file->f_pos = offset;
> +               new_offset = offset;
>                break;
>        case 1:
> -               file->f_pos += offset;
> +               new_offset = (unsigned long long)f->f_pos + offset;
>                break;
>        default:
> -               return -EINVAL;
> +               new_offset = -EINVAL;
> +               break;
>        }
> -       force_successful_syscall_return();
> -       return file->f_pos;
> +       if (new_offset < (unsigned long long)TASK_SIZE_OF(task)) {


Hmm, why this check?

> +               file->f_pos = (loff_t)new_offset;
> +               force_successful_syscall_return();
> +       } else
> +               new_offset = -EINVAL;
> +       put_task_struct(task);
> +       return (loff_t)new_offset;
>  }
>
>  static const struct file_operations proc_mem_operations = {

Thanks.

2009-09-16 08:46:43

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling (Was Re: Question: how to handle too big f_pos

Ah, sorry. I should CC: you.

On Wed, 16 Sep 2009 16:20:32 +0800
Américo Wang <[email protected]> wrote:

> On Wed, Sep 16, 2009 at 1:29 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> >
> > The problem:
> >> I'm writing a patch against /dev/kmem...I found a problem.
> >>
> >> /dev/kmem (and /proc/<pid>/mem) puts virtual addres to f->f_pos.
> >>
> >> but f->f_pos is always negative and rw_verify_ara() returns -EINVAL always.
> >
> > Changed CC: List.
> >
> > This is a trial to consider how to fix negative f_pos problem shown in above.
> >
> > Hmm, even after this patch, x86's vsyscall area is not readable.
> > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]
> > But maybe no problems. (now, it cannot be read, anyway.)
> >
> > I tested /dev/kmem on x86-64 and this works fine. I added a fix for
> > /proc/<pid>/mem because I know ia64's hugetlbe area is not readable
> > via /proc/<pid>/mem. (But I'm not sure other 64bit arch has this
> > kind of problems in /proc/<pid>/mem)
> >
> > ==
> > From: KAMEZAWA Hiruyoki <[email protected]>
> >
> > Modifying rw_verify_area()'s negative f_pos check.
> >
> > Now, rw_verify_area() has this check
> >   if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
> >                return -EINVAL
> >
> > And access to special files as /dev/mem,kmem, /proc/<pid>/mem
> > returns unexpected -EINVAL.
> > (For example, ia64 maps hugetlb at 0x8000000000000000- region)
> >
> > This patch tries to make range check more precise by using
> > llseek ops defined per special files.
> >
> > Signed-off-by: KAMEZAWA Hiruyoki <[email protected]>
> > ---
> >  fs/proc/base.c  |   22 +++++++++++++++++-----
> >  fs/read_write.c |   39 +++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 54 insertions(+), 7 deletions(-)
> >
> > Index: mmotm-2.6.31-Sep14/fs/read_write.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep14.orig/fs/read_write.c
> > +++ mmotm-2.6.31-Sep14/fs/read_write.c
> > @@ -205,6 +205,37 @@ bad:
> >  }
> >  #endif
> >
> > +static int
> > +__verify_negative_pos_range(struct file *file, loff_t pos, size_t count)
> > +{
> > +       unsigned long long upos, end;
> > +       loff_t ret;
> > +
> > +       /* disallow overflow */
> > +       upos = (unsigned long long)pos;
> > +       end = upos + count;
> > +       if (end < pos)
> > +               return -EOVERFLOW;
> > +       /*
> > +        * Sanity check...subsystem has to provide llseek for handle big pos.
> > +        * Subsystem's llseek should verify f_pos's value comaparing with its
> > +        * max file size.
> > +        * Note1: generic file ops' llseek cannot handle negative pos.
> > +        * Note2: should we take care of pos == -EINVAL ?
> > +        * Note3: we check flags and ops here for avoiding taking locks in.
> > +        * default_lseek.
> > +        */
> > +       ret = -EINVAL;
> > +       if ((file->f_mode & FMODE_LSEEK) &&
> > +           (file->f_op && file->f_op->llseek)) {
> > +               ret = vfs_llseek(file, 0, SEEK_CUR);
> > +               if (ret == pos)
> > +                       return 0;
> > +       }
> > +
> > +       return (int)ret;
> > +}
> > +
> >  /*
> >  * rw_verify_area doesn't like huge counts. We limit
> >  * them to something that fits in "int" so that others
> > @@ -222,8 +253,12 @@ int rw_verify_area(int read_write, struc
> >        if (unlikely((ssize_t) count < 0))
> >                return retval;
> >        pos = *ppos;
> > -       if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
> > -               return retval;
> > +       if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
> > +               /* some files requires special care */
> > +               retval = __verify_negative_pos_range(file, pos, count);
> > +               if (retval)
> > +                       return retval;
> > +       }
> >
> >        if (unlikely(inode->i_flock && mandatory_lock(inode))) {
> >                retval = locks_mandatory_area(
> > Index: mmotm-2.6.31-Sep14/fs/proc/base.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
> > +++ mmotm-2.6.31-Sep14/fs/proc/base.c
> > @@ -903,18 +903,30 @@ out_no_task:
> >
> >  loff_t mem_lseek(struct file *file, loff_t offset, int orig)
> >  {
> > +       struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
> > +       unsigned long long new_offset = -EINVAL;
>
>
> Why not make 'new_offset' as loff_t? This can make your code easier.
>
loff_t is "long long", I wanted "unsigned long long" for showing
f_pos here is treated as "unsigned".



> > +
> > +       if (!task) /* lseek's spec doesn't allow -ESRCH but... */
>
>
> No worry, we have many ESRCH for proc files.
>
I know ;)

> > +               return -ESRCH;
> > +
> >        switch (orig) {
> >        case 0:
> > -               file->f_pos = offset;
> > +               new_offset = offset;
> >                break;
> >        case 1:
> > -               file->f_pos += offset;
> > +               new_offset = (unsigned long long)f->f_pos + offset;
> >                break;
> >        default:
> > -               return -EINVAL;
> > +               new_offset = -EINVAL;
> > +               break;
> >        }
> > -       force_successful_syscall_return();
> > -       return file->f_pos;
> > +       if (new_offset < (unsigned long long)TASK_SIZE_OF(task)) {
>
>
> Hmm, why this check?
>
2 reasons.

1. If this lseek has to check something, this is it.
2. On architecture where 32bit program can ran on 64bit,
moving f_pos above 4G is out-of-range, for example.

But mem_read() will catch any bad f_pos, anyway. So, just making
allow all f_pos here is maybe a choice. Considering lseek,
providing this range check here is not so bad.

Thanks.
-Kame

> > +               file->f_pos = (loff_t)new_offset;
> > +               force_successful_syscall_return();
> > +       } else
> > +               new_offset = -EINVAL;
> > +       put_task_struct(task);
> > +       return (loff_t)new_offset;
> >  }
> >
> >  static const struct file_operations proc_mem_operations = {
>
> Thanks.
>

2009-09-16 09:13:58

by Cong Wang

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling (Was Re: Question: how to handle too big f_pos

On Wed, Sep 16, 2009 at 4:44 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> Ah, sorry. I should CC: you.


No problem. :)

>> > --- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
>> > +++ mmotm-2.6.31-Sep14/fs/proc/base.c
>> > @@ -903,18 +903,30 @@ out_no_task:
>> >
>> >  loff_t mem_lseek(struct file *file, loff_t offset, int orig)
>> >  {
>> > +       struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
>> > +       unsigned long long new_offset = -EINVAL;
>>
>>
>> Why not make 'new_offset' as loff_t? This can make your code easier.
>>
> loff_t is "long long", I wanted "unsigned long long" for showing
> f_pos here is treated as "unsigned".
>


Yeah, the same as for __verify_negative_pos_range(), right...


<snip>

>> >        }
>> > -       force_successful_syscall_return();
>> > -       return file->f_pos;
>> > +       if (new_offset < (unsigned long long)TASK_SIZE_OF(task)) {
>>
>>
>> Hmm, why this check?
>>
> 2 reasons.
>
>  1. If this lseek has to check something, this is it.
>  2. On architecture where 32bit program can ran on 64bit,
>     moving f_pos above 4G is out-of-range, for example.
>
> But mem_read() will catch any bad f_pos, anyway. So, just making
> allow all f_pos here is maybe a choice. Considering lseek,
> providing this range check here is not so bad.

Ok, I misunderstood the macro 'TASK_SIZE_OF', then no problem.

Reviewed-by: WANG Cong <[email protected]>

Thanks.

2009-09-16 12:06:50

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling (Was Re: Question: how to handle too big f_pos

Am$(D+1(Brico_Wang wrote:
> On Wed, Sep 16, 2009 at 4:44 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>> Ah, sorry. I should CC: you.
>
>
> No problem. :)
>
>>> > --- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
>>> > +++ mmotm-2.6.31-Sep14/fs/proc/base.c
>>> > @@ -903,18 +903,30 @@ out_no_task:
>>> >
>>> > ?loff_t mem_lseek(struct file *file, loff_t offset, int orig)
>>> > ?{
>>> > + ? ? ? struct task_struct *task =
>>> get_proc_task(file->f_path.dentry->d_inode);
>>> > + ? ? ? unsigned long long new_offset = -EINVAL;
>>>
>>>
>>> Why not make 'new_offset' as loff_t? This can make your code easier.
>>>
>> loff_t is "long long", I wanted "unsigned long long" for showing
>> f_pos here is treated as "unsigned".
>>
>
>
> Yeah, the same as for __verify_negative_pos_range(), right...
>
>
> <snip>
>
>>> > ? ? ? ?}
>>> > - ? ? ? force_successful_syscall_return();
>>> > - ? ? ? return file->f_pos;
>>> > + ? ? ? if (new_offset < (unsigned long long)TASK_SIZE_OF(task)) {
>>>
>>>
>>> Hmm, why this check?
>>>
>> 2 reasons.
>>
>> ?1. If this lseek has to check something, this is it.
>> ?2. On architecture where 32bit program can ran on 64bit,
>> ? ? moving f_pos above 4G is out-of-range, for example.
>>
>> But mem_read() will catch any bad f_pos, anyway. So, just making
>> allow all f_pos here is maybe a choice. Considering lseek,
>> providing this range check here is not so bad.
>
> Ok, I misunderstood the macro 'TASK_SIZE_OF', then no problem.
>
> Reviewed-by: WANG Cong <[email protected]>
>
Ah, very sorry. I noticed I didn't handle pread/pwrite, splice, etc...
I'll do retry.

Sorry,
-Kame


> Thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-09-17 03:06:16

by Cong Wang

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling (Was Re: Question: how to handle too big f_pos

2009/9/16 KAMEZAWA Hiroyuki <[email protected]>:
> Am�+1rico_Wang wrote:
>> On Wed, Sep 16, 2009 at 4:44 PM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>>> Ah, sorry. I should CC: you.
>>
>>
>> No problem. :)
>>
>>>> > --- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
>>>> > +++ mmotm-2.6.31-Sep14/fs/proc/base.c
>>>> > @@ -903,18 +903,30 @@ out_no_task:
>>>> >
>>>> > ?loff_t mem_lseek(struct file *file, loff_t offset, int orig)
>>>> > ?{
>>>> > + ? ? ? struct task_struct *task =
>>>> get_proc_task(file->f_path.dentry->d_inode);
>>>> > + ? ? ? unsigned long long new_offset = -EINVAL;
>>>>
>>>>
>>>> Why not make 'new_offset' as loff_t? This can make your code easier.
>>>>
>>> loff_t is "long long", I wanted "unsigned long long" for showing
>>> f_pos here is treated as "unsigned".
>>>
>>
>>
>> Yeah, the same as for __verify_negative_pos_range(), right...
>>
>>
>> <snip>
>>
>>>> > ? ? ? ?}
>>>> > - ? ? ? force_successful_syscall_return();
>>>> > - ? ? ? return file->f_pos;
>>>> > + ? ? ? if (new_offset < (unsigned long long)TASK_SIZE_OF(task)) {
>>>>
>>>>
>>>> Hmm, why this check?
>>>>
>>> 2 reasons.
>>>
>>> ?1. If this lseek has to check something, this is it.
>>> ?2. On architecture where 32bit program can ran on 64bit,
>>> ? ? moving f_pos above 4G is out-of-range, for example.
>>>
>>> But mem_read() will catch any bad f_pos, anyway. So, just making
>>> allow all f_pos here is maybe a choice. Considering lseek,
>>> providing this range check here is not so bad.
>>
>> Ok, I misunderstood the macro 'TASK_SIZE_OF', then no problem.
>>
>> Reviewed-by: WANG Cong <[email protected]>
>>
> Ah, very sorry. I noticed I didn't handle pread/pwrite, splice, etc...
> I'll do retry.

Yes? Not necessary...

In man page, it said /proc/<pid>/mem can be accessed via open(),
read(), fseek(), no pread(), no splice()...

2009-09-17 05:55:33

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH][bugfix] more checks for negative f_pos handling v2

Considering several approaches, I think this is the simplest..
Any idea ?

Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <[email protected]>

Now, rw_verify_area() checsk f_pos is negative or not. And if
negative, returns -EINVAL.

But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
has negative offsets. And we can't do any access via read/write
to the file(device).

This patch introduce a flag S_VERYBIG and allow negative file
offsets.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
drivers/char/mem.c | 23 +++++++++++++----------
fs/proc/base.c | 2 ++
fs/read_write.c | 20 ++++++++++++++++++--
include/linux/fs.h | 2 ++
4 files changed, 35 insertions(+), 12 deletions(-)

Index: mmotm-2.6.31-Sep14/fs/read_write.c
===================================================================
--- mmotm-2.6.31-Sep14.orig/fs/read_write.c
+++ mmotm-2.6.31-Sep14/fs/read_write.c
@@ -205,6 +205,19 @@ bad:
}
#endif

+static bool
+__negative_fpos_check(struct inode *inode, loff_t pos, size_t count)
+{
+ struct inode *inode;
+ loff_t isize;
+
+ if (pos + count < pos)
+ return -EOVERFLOW;
+ if (IS_VERYBIG(inode))
+ return 0;
+ return -EINVAL;
+}
+
/*
* rw_verify_area doesn't like huge counts. We limit
* them to something that fits in "int" so that others
@@ -222,8 +235,11 @@ int rw_verify_area(int read_write, struc
if (unlikely((ssize_t) count < 0))
return retval;
pos = *ppos;
- if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
- return retval;
+ if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
+ retval = __negative_fpos_check(inode, pos, count))
+ if (retval)
+ return retval;
+ }

if (unlikely(inode->i_flock && mandatory_lock(inode))) {
retval = locks_mandatory_area(
Index: mmotm-2.6.31-Sep14/include/linux/fs.h
===================================================================
--- mmotm-2.6.31-Sep14.orig/include/linux/fs.h
+++ mmotm-2.6.31-Sep14/include/linux/fs.h
@@ -231,6 +231,7 @@ struct inodes_stat_t {
#define S_NOCMTIME 128 /* Do not update file c/mtime */
#define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
#define S_PRIVATE 512 /* Inode is fs-internal */
+#define S_VERYBIG 1024 /* Allow file's loff_t can be negative */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -265,6 +266,7 @@ struct inodes_stat_t {
#define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
#define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
+#define IS_VERYBIG(inode) ((inode)->i_flags & S_VERYBIG)

/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */
Index: mmotm-2.6.31-Sep14/drivers/char/mem.c
===================================================================
--- mmotm-2.6.31-Sep14.orig/drivers/char/mem.c
+++ mmotm-2.6.31-Sep14/drivers/char/mem.c
@@ -825,22 +825,23 @@ static const struct memdev {
const char *name;
const struct file_operations *fops;
struct backing_dev_info *dev_info;
+ bool verybig;
} devlist[] = {
- [ 1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi },
+ [1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi, true },
#ifdef CONFIG_DEVKMEM
- [ 2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi },
+ [2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi, true },
#endif
- [ 3] = {"null", &null_fops, NULL },
+ [3] = {"null", &null_fops, NULL, false },
#ifdef CONFIG_DEVPORT
- [ 4] = { "port", &port_fops, NULL },
+ [4] = { "port", &port_fops, NULL, false },
#endif
- [ 5] = { "zero", &zero_fops, &zero_bdi },
- [ 6] = { "full", &full_fops, NULL },
- [ 7] = { "random", &random_fops, NULL },
- [ 9] = { "urandom", &urandom_fops, NULL },
- [11] = { "kmsg", &kmsg_fops, NULL },
+ [5] = { "zero", &zero_fops, &zero_bdi, false },
+ [6] = { "full", &full_fops, NULL, false },
+ [7] = { "random", &random_fops, NULL, false },
+ [9] = { "urandom", &urandom_fops, NULL, false },
+ [11] = { "kmsg", &kmsg_fops, NULL, false },
#ifdef CONFIG_CRASH_DUMP
- [12] = { "oldmem", &oldmem_fops, NULL },
+ [12] = { "oldmem", &oldmem_fops, NULL, false },
#endif
};

@@ -868,6 +869,8 @@ static int memory_open(struct inode *ino
ret = dev->fops->open(inode, filp);
else
ret = 0;
+ if (dev->verybig)
+ inode->i_flags |= S_VERYBIG;
out:
unlock_kernel();
return ret;
Index: mmotm-2.6.31-Sep14/fs/proc/base.c
===================================================================
--- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
+++ mmotm-2.6.31-Sep14/fs/proc/base.c
@@ -779,6 +779,8 @@ static const struct file_operations proc
static int mem_open(struct inode* inode, struct file* file)
{
file->private_data = (void*)((long)current->self_exec_id);
+ /* this file is read only and we can catch out-pf-range */
+ inode->i_flags |= S_VERYBIG;
return 0;
}

2009-09-17 06:09:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH][bugfix] more checks for negative f_pos handling v3

Very sorry...reflesh miss.
==
From: KAMEZAWA Hiroyuki <[email protected]>

Now, rw_verify_area() checsk f_pos is negative or not. And if
negative, returns -EINVAL.

But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
has negative offsets. And we can't do any access via read/write
to the file(device).

This patch introduce a flag S_VERYBIG and allow negative file
offsets.

Changelog:
- fixed bug in rw_verify_area (it cannot be compiled)

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
drivers/char/mem.c | 23 +++++++++++++----------
fs/proc/base.c | 2 ++
fs/read_write.c | 20 ++++++++++++++++++--
include/linux/fs.h | 2 ++
4 files changed, 35 insertions(+), 12 deletions(-)

Index: mmotm-2.6.31-Sep14/fs/read_write.c
===================================================================
--- mmotm-2.6.31-Sep14.orig/fs/read_write.c
+++ mmotm-2.6.31-Sep14/fs/read_write.c
@@ -205,6 +205,16 @@ bad:
}
#endif

+static bool
+__negative_fpos_check(struct inode *inode, loff_t pos, size_t count)
+{
+ if (pos + count < pos)
+ return -EOVERFLOW;
+ if (IS_VERYBIG(inode))
+ return 0;
+ return -EINVAL;
+}
+
/*
* rw_verify_area doesn't like huge counts. We limit
* them to something that fits in "int" so that others
@@ -222,8 +232,11 @@ int rw_verify_area(int read_write, struc
if (unlikely((ssize_t) count < 0))
return retval;
pos = *ppos;
- if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
- return retval;
+ if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
+ retval = __negative_fpos_check(inode, pos, count);
+ if (retval)
+ return retval;
+ }

if (unlikely(inode->i_flock && mandatory_lock(inode))) {
retval = locks_mandatory_area(
Index: mmotm-2.6.31-Sep14/include/linux/fs.h
===================================================================
--- mmotm-2.6.31-Sep14.orig/include/linux/fs.h
+++ mmotm-2.6.31-Sep14/include/linux/fs.h
@@ -231,6 +231,7 @@ struct inodes_stat_t {
#define S_NOCMTIME 128 /* Do not update file c/mtime */
#define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
#define S_PRIVATE 512 /* Inode is fs-internal */
+#define S_VERYBIG 1024 /* Allow file's loff_t can be negative */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -265,6 +266,7 @@ struct inodes_stat_t {
#define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
#define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
+#define IS_VERYBIG(inode) ((inode)->i_flags & S_VERYBIG)

/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */
Index: mmotm-2.6.31-Sep14/drivers/char/mem.c
===================================================================
--- mmotm-2.6.31-Sep14.orig/drivers/char/mem.c
+++ mmotm-2.6.31-Sep14/drivers/char/mem.c
@@ -825,22 +825,23 @@ static const struct memdev {
const char *name;
const struct file_operations *fops;
struct backing_dev_info *dev_info;
+ bool verybig;
} devlist[] = {
- [ 1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi },
+ [1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi, true },
#ifdef CONFIG_DEVKMEM
- [ 2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi },
+ [2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi, true },
#endif
- [ 3] = {"null", &null_fops, NULL },
+ [3] = {"null", &null_fops, NULL, false },
#ifdef CONFIG_DEVPORT
- [ 4] = { "port", &port_fops, NULL },
+ [4] = { "port", &port_fops, NULL, false },
#endif
- [ 5] = { "zero", &zero_fops, &zero_bdi },
- [ 6] = { "full", &full_fops, NULL },
- [ 7] = { "random", &random_fops, NULL },
- [ 9] = { "urandom", &urandom_fops, NULL },
- [11] = { "kmsg", &kmsg_fops, NULL },
+ [5] = { "zero", &zero_fops, &zero_bdi, false },
+ [6] = { "full", &full_fops, NULL, false },
+ [7] = { "random", &random_fops, NULL, false },
+ [9] = { "urandom", &urandom_fops, NULL, false },
+ [11] = { "kmsg", &kmsg_fops, NULL, false },
#ifdef CONFIG_CRASH_DUMP
- [12] = { "oldmem", &oldmem_fops, NULL },
+ [12] = { "oldmem", &oldmem_fops, NULL, false },
#endif
};

@@ -868,6 +869,8 @@ static int memory_open(struct inode *ino
ret = dev->fops->open(inode, filp);
else
ret = 0;
+ if (dev->verybig)
+ inode->i_flags |= S_VERYBIG;
out:
unlock_kernel();
return ret;
Index: mmotm-2.6.31-Sep14/fs/proc/base.c
===================================================================
--- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
+++ mmotm-2.6.31-Sep14/fs/proc/base.c
@@ -779,6 +779,8 @@ static const struct file_operations proc
static int mem_open(struct inode* inode, struct file* file)
{
file->private_data = (void*)((long)current->self_exec_id);
+ /* this file is read only and we can catch out-pf-range */
+ inode->i_flags |= S_VERYBIG;
return 0;
}

2009-09-17 06:21:44

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v3

On Thu, Sep 17, 2009 at 02:07:26PM +0800, KAMEZAWA Hiroyuki wrote:
> Very sorry...reflesh miss.
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, rw_verify_area() checsk f_pos is negative or not. And if
> negative, returns -EINVAL.
>
> But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
> has negative offsets. And we can't do any access via read/write
> to the file(device).
>
> This patch introduce a flag S_VERYBIG and allow negative file
> offsets.
>
> Changelog:
> - fixed bug in rw_verify_area (it cannot be compiled)
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> drivers/char/mem.c | 23 +++++++++++++----------
> fs/proc/base.c | 2 ++
> fs/read_write.c | 20 ++++++++++++++++++--
> include/linux/fs.h | 2 ++
> 4 files changed, 35 insertions(+), 12 deletions(-)
>
> Index: mmotm-2.6.31-Sep14/fs/read_write.c
> ===================================================================
> --- mmotm-2.6.31-Sep14.orig/fs/read_write.c
> +++ mmotm-2.6.31-Sep14/fs/read_write.c
> @@ -205,6 +205,16 @@ bad:
> }
> #endif
>
> +static bool
> +__negative_fpos_check(struct inode *inode, loff_t pos, size_t count)
> +{
> + if (pos + count < pos)
> + return -EOVERFLOW;
> + if (IS_VERYBIG(inode))
> + return 0;
> + return -EINVAL;

This always return negative values for !IS_VERYBIG inodes,
looks weird at least.

> +}
> +
> /*
> * rw_verify_area doesn't like huge counts. We limit
> * them to something that fits in "int" so that others
> @@ -222,8 +232,11 @@ int rw_verify_area(int read_write, struc
> if (unlikely((ssize_t) count < 0))
> return retval;
> pos = *ppos;
> - if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
> - return retval;
> + if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
> + retval = __negative_fpos_check(inode, pos, count);
> + if (retval)
> + return retval;
> + }
>
> if (unlikely(inode->i_flock && mandatory_lock(inode))) {
> retval = locks_mandatory_area(
> Index: mmotm-2.6.31-Sep14/include/linux/fs.h
> ===================================================================
> --- mmotm-2.6.31-Sep14.orig/include/linux/fs.h
> +++ mmotm-2.6.31-Sep14/include/linux/fs.h
> @@ -231,6 +231,7 @@ struct inodes_stat_t {
> #define S_NOCMTIME 128 /* Do not update file c/mtime */
> #define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
> #define S_PRIVATE 512 /* Inode is fs-internal */
> +#define S_VERYBIG 1024 /* Allow file's loff_t can be negative */
>
> /*
> * Note that nosuid etc flags are inode-specific: setting some file-system
> @@ -265,6 +266,7 @@ struct inodes_stat_t {
> #define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
> #define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
> #define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
> +#define IS_VERYBIG(inode) ((inode)->i_flags & S_VERYBIG)
>
> /* the read-only stuff doesn't really belong here, but any other place is
> probably as bad and I don't want to create yet another include file. */
> Index: mmotm-2.6.31-Sep14/drivers/char/mem.c
> ===================================================================
> --- mmotm-2.6.31-Sep14.orig/drivers/char/mem.c
> +++ mmotm-2.6.31-Sep14/drivers/char/mem.c
> @@ -825,22 +825,23 @@ static const struct memdev {
> const char *name;
> const struct file_operations *fops;
> struct backing_dev_info *dev_info;
> + bool verybig;
> } devlist[] = {
> - [ 1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi },
> + [1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi, true },
> #ifdef CONFIG_DEVKMEM
> - [ 2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi },
> + [2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi, true },
> #endif
> - [ 3] = {"null", &null_fops, NULL },
> + [3] = {"null", &null_fops, NULL, false },
> #ifdef CONFIG_DEVPORT
> - [ 4] = { "port", &port_fops, NULL },
> + [4] = { "port", &port_fops, NULL, false },
> #endif
> - [ 5] = { "zero", &zero_fops, &zero_bdi },
> - [ 6] = { "full", &full_fops, NULL },
> - [ 7] = { "random", &random_fops, NULL },
> - [ 9] = { "urandom", &urandom_fops, NULL },
> - [11] = { "kmsg", &kmsg_fops, NULL },
> + [5] = { "zero", &zero_fops, &zero_bdi, false },
> + [6] = { "full", &full_fops, NULL, false },
> + [7] = { "random", &random_fops, NULL, false },
> + [9] = { "urandom", &urandom_fops, NULL, false },
> + [11] = { "kmsg", &kmsg_fops, NULL, false },
> #ifdef CONFIG_CRASH_DUMP
> - [12] = { "oldmem", &oldmem_fops, NULL },
> + [12] = { "oldmem", &oldmem_fops, NULL, false },
> #endif

How about make the fields aligned? That would look much nicer :)

Thanks,
Fengguang

> };
>
> @@ -868,6 +869,8 @@ static int memory_open(struct inode *ino
> ret = dev->fops->open(inode, filp);
> else
> ret = 0;
> + if (dev->verybig)
> + inode->i_flags |= S_VERYBIG;
> out:
> unlock_kernel();
> return ret;
> Index: mmotm-2.6.31-Sep14/fs/proc/base.c
> ===================================================================
> --- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
> +++ mmotm-2.6.31-Sep14/fs/proc/base.c
> @@ -779,6 +779,8 @@ static const struct file_operations proc
> static int mem_open(struct inode* inode, struct file* file)
> {
> file->private_data = (void*)((long)current->self_exec_id);
> + /* this file is read only and we can catch out-pf-range */
> + inode->i_flags |= S_VERYBIG;
> return 0;
> }
>

2009-09-17 06:33:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v3

On Thu, 17 Sep 2009 14:21:24 +0800
Wu Fengguang <[email protected]> wrote:

> On Thu, Sep 17, 2009 at 02:07:26PM +0800, KAMEZAWA Hiroyuki wrote:
> > Very sorry...reflesh miss.
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, rw_verify_area() checsk f_pos is negative or not. And if
> > negative, returns -EINVAL.
> >
> > But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
> > has negative offsets. And we can't do any access via read/write
> > to the file(device).
> >
> > This patch introduce a flag S_VERYBIG and allow negative file
> > offsets.
> >
> > Changelog:
> > - fixed bug in rw_verify_area (it cannot be compiled)
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > drivers/char/mem.c | 23 +++++++++++++----------
> > fs/proc/base.c | 2 ++
> > fs/read_write.c | 20 ++++++++++++++++++--
> > include/linux/fs.h | 2 ++
> > 4 files changed, 35 insertions(+), 12 deletions(-)
> >
> > Index: mmotm-2.6.31-Sep14/fs/read_write.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep14.orig/fs/read_write.c
> > +++ mmotm-2.6.31-Sep14/fs/read_write.c
> > @@ -205,6 +205,16 @@ bad:
> > }
> > #endif
> >
> > +static bool
> > +__negative_fpos_check(struct inode *inode, loff_t pos, size_t count)
> > +{
> > + if (pos + count < pos)
> > + return -EOVERFLOW;
> > + if (IS_VERYBIG(inode))
> > + return 0;
> > + return -EINVAL;
>
> This always return negative values for !IS_VERYBIG inodes,
> looks weird at least.
>
I need to refresh my eyes, too...
Thanks, will fix.. return value is not bool, but int...




> > +}
> > +
> > /*
> > * rw_verify_area doesn't like huge counts. We limit
> > * them to something that fits in "int" so that others
> > @@ -222,8 +232,11 @@ int rw_verify_area(int read_write, struc
> > if (unlikely((ssize_t) count < 0))
> > return retval;
> > pos = *ppos;
> > - if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
> > - return retval;
> > + if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
> > + retval = __negative_fpos_check(inode, pos, count);
> > + if (retval)
> > + return retval;
> > + }
> >
> > if (unlikely(inode->i_flock && mandatory_lock(inode))) {
> > retval = locks_mandatory_area(
> > Index: mmotm-2.6.31-Sep14/include/linux/fs.h
> > ===================================================================
> > --- mmotm-2.6.31-Sep14.orig/include/linux/fs.h
> > +++ mmotm-2.6.31-Sep14/include/linux/fs.h
> > @@ -231,6 +231,7 @@ struct inodes_stat_t {
> > #define S_NOCMTIME 128 /* Do not update file c/mtime */
> > #define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
> > #define S_PRIVATE 512 /* Inode is fs-internal */
> > +#define S_VERYBIG 1024 /* Allow file's loff_t can be negative */
> >
> > /*
> > * Note that nosuid etc flags are inode-specific: setting some file-system
> > @@ -265,6 +266,7 @@ struct inodes_stat_t {
> > #define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
> > #define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
> > #define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
> > +#define IS_VERYBIG(inode) ((inode)->i_flags & S_VERYBIG)
> >
> > /* the read-only stuff doesn't really belong here, but any other place is
> > probably as bad and I don't want to create yet another include file. */
> > Index: mmotm-2.6.31-Sep14/drivers/char/mem.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep14.orig/drivers/char/mem.c
> > +++ mmotm-2.6.31-Sep14/drivers/char/mem.c
> > @@ -825,22 +825,23 @@ static const struct memdev {
> > const char *name;
> > const struct file_operations *fops;
> > struct backing_dev_info *dev_info;
> > + bool verybig;
> > } devlist[] = {
> > - [ 1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi },
> > + [1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi, true },
> > #ifdef CONFIG_DEVKMEM
> > - [ 2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi },
> > + [2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi, true },
> > #endif
> > - [ 3] = {"null", &null_fops, NULL },
> > + [3] = {"null", &null_fops, NULL, false },
> > #ifdef CONFIG_DEVPORT
> > - [ 4] = { "port", &port_fops, NULL },
> > + [4] = { "port", &port_fops, NULL, false },
> > #endif
> > - [ 5] = { "zero", &zero_fops, &zero_bdi },
> > - [ 6] = { "full", &full_fops, NULL },
> > - [ 7] = { "random", &random_fops, NULL },
> > - [ 9] = { "urandom", &urandom_fops, NULL },
> > - [11] = { "kmsg", &kmsg_fops, NULL },
> > + [5] = { "zero", &zero_fops, &zero_bdi, false },
> > + [6] = { "full", &full_fops, NULL, false },
> > + [7] = { "random", &random_fops, NULL, false },
> > + [9] = { "urandom", &urandom_fops, NULL, false },
> > + [11] = { "kmsg", &kmsg_fops, NULL, false },
> > #ifdef CONFIG_CRASH_DUMP
> > - [12] = { "oldmem", &oldmem_fops, NULL },
> > + [12] = { "oldmem", &oldmem_fops, NULL, false },
> > #endif
>
> How about make the fields aligned? That would look much nicer :)
>
checkpatch.pl complains agaisnt [ 5]...but ok, ajust "=" placement.

Thanks,
-kame


> Thanks,
> Fengguang
>
> > };
> >
> > @@ -868,6 +869,8 @@ static int memory_open(struct inode *ino
> > ret = dev->fops->open(inode, filp);
> > else
> > ret = 0;
> > + if (dev->verybig)
> > + inode->i_flags |= S_VERYBIG;
> > out:
> > unlock_kernel();
> > return ret;
> > Index: mmotm-2.6.31-Sep14/fs/proc/base.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
> > +++ mmotm-2.6.31-Sep14/fs/proc/base.c
> > @@ -779,6 +779,8 @@ static const struct file_operations proc
> > static int mem_open(struct inode* inode, struct file* file)
> > {
> > file->private_data = (void*)((long)current->self_exec_id);
> > + /* this file is read only and we can catch out-pf-range */
> > + inode->i_flags |= S_VERYBIG;
> > return 0;
> > }
> >
>

2009-09-17 06:53:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH][bugfix] more checks for negative f_pos handling v4

From: KAMEZAWA Hiroyuki <[email protected]>

Now, rw_verify_area() checsk f_pos is negative or not. And if
negative, returns -EINVAL.

But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
has negative offsets. And we can't do any access via read/write
to the file(device).

This patch introduce a flag S_VERYBIG and allow negative file
offsets for big files. (usual files don't allow it.)

Changelog: v3->v4
- make changes in mem.c aligned.
- change __negative_fpos_check() to return int.
- fixed bug in "pos" check.
- added comments.

Changelog: v2->v3
- fixed bug in rw_verify_area (it cannot be compiled)

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
drivers/char/mem.c | 23 +++++++++++++----------
fs/proc/base.c | 2 ++
fs/read_write.c | 22 ++++++++++++++++++++--
include/linux/fs.h | 2 ++
4 files changed, 37 insertions(+), 12 deletions(-)

Index: mmotm-2.6.31-Sep14/fs/read_write.c
===================================================================
--- mmotm-2.6.31-Sep14.orig/fs/read_write.c
+++ mmotm-2.6.31-Sep14/fs/read_write.c
@@ -205,6 +205,21 @@ bad:
}
#endif

+static int
+__negative_fpos_check(struct inode *inode, loff_t pos, size_t count)
+{
+ /*
+ * pos or pos+count is negative here, check overflow.
+ * too big "count" will be caught in rw_verify_area().
+ */
+ if ((pos < 0) && (pos + count < pos))
+ return -EOVERFLOW;
+ /* If !VERYBIG inode, negative pos(pos+count) is not allowed */
+ if (!IS_VERYBIG(inode))
+ return -EINVAL;
+ return 0;
+}
+
/*
* rw_verify_area doesn't like huge counts. We limit
* them to something that fits in "int" so that others
@@ -222,8 +237,11 @@ int rw_verify_area(int read_write, struc
if (unlikely((ssize_t) count < 0))
return retval;
pos = *ppos;
- if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
- return retval;
+ if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
+ retval = __negative_fpos_check(inode, pos, count);
+ if (retval)
+ return retval;
+ }

if (unlikely(inode->i_flock && mandatory_lock(inode))) {
retval = locks_mandatory_area(
Index: mmotm-2.6.31-Sep14/include/linux/fs.h
===================================================================
--- mmotm-2.6.31-Sep14.orig/include/linux/fs.h
+++ mmotm-2.6.31-Sep14/include/linux/fs.h
@@ -231,6 +231,7 @@ struct inodes_stat_t {
#define S_NOCMTIME 128 /* Do not update file c/mtime */
#define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
#define S_PRIVATE 512 /* Inode is fs-internal */
+#define S_VERYBIG 1024 /* Allow file's loff_t can be negative */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -265,6 +266,7 @@ struct inodes_stat_t {
#define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
#define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
+#define IS_VERYBIG(inode) ((inode)->i_flags & S_VERYBIG)

/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */
Index: mmotm-2.6.31-Sep14/drivers/char/mem.c
===================================================================
--- mmotm-2.6.31-Sep14.orig/drivers/char/mem.c
+++ mmotm-2.6.31-Sep14/drivers/char/mem.c
@@ -825,22 +825,23 @@ static const struct memdev {
const char *name;
const struct file_operations *fops;
struct backing_dev_info *dev_info;
+ bool verybig;
} devlist[] = {
- [ 1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi },
+ [1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi, true },
#ifdef CONFIG_DEVKMEM
- [ 2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi },
+ [2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi, true },
#endif
- [ 3] = {"null", &null_fops, NULL },
+ [3] = {"null", &null_fops, NULL, false },
#ifdef CONFIG_DEVPORT
- [ 4] = { "port", &port_fops, NULL },
+ [4] = { "port", &port_fops, NULL, false },
#endif
- [ 5] = { "zero", &zero_fops, &zero_bdi },
- [ 6] = { "full", &full_fops, NULL },
- [ 7] = { "random", &random_fops, NULL },
- [ 9] = { "urandom", &urandom_fops, NULL },
- [11] = { "kmsg", &kmsg_fops, NULL },
+ [5] = { "zero", &zero_fops, &zero_bdi, false },
+ [6] = { "full", &full_fops, NULL, false },
+ [7] = { "random", &random_fops, NULL, false },
+ [9] = { "urandom", &urandom_fops, NULL, false },
+ [11] = { "kmsg", &kmsg_fops, NULL, false },
#ifdef CONFIG_CRASH_DUMP
- [12] = { "oldmem", &oldmem_fops, NULL },
+ [12] = { "oldmem", &oldmem_fops, NULL, false },
#endif
};

@@ -868,6 +869,8 @@ static int memory_open(struct inode *ino
ret = dev->fops->open(inode, filp);
else
ret = 0;
+ if (dev->verybig)
+ inode->i_flags |= S_VERYBIG;
out:
unlock_kernel();
return ret;
Index: mmotm-2.6.31-Sep14/fs/proc/base.c
===================================================================
--- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
+++ mmotm-2.6.31-Sep14/fs/proc/base.c
@@ -779,6 +779,8 @@ static const struct file_operations proc
static int mem_open(struct inode* inode, struct file* file)
{
file->private_data = (void*)((long)current->self_exec_id);
+ /* this file is read only and we can catch out-pf-range */
+ inode->i_flags |= S_VERYBIG;
return 0;
}

2009-09-17 06:53:31

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v3

On Thu, Sep 17, 2009 at 02:31:19PM +0800, KAMEZAWA Hiroyuki wrote:
> > How about make the fields aligned? That would look much nicer :)
> >
> checkpatch.pl complains agaisnt [ 5]...but ok, ajust "=" placement.

How about this? checkpatch won't complain on it :)

} devlist[] = {
[1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi, true },
#ifdef CONFIG_DEVKMEM
[2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi, true },
#endif
[3] = { "null", &null_fops, NULL, false },
#ifdef CONFIG_DEVPORT
[4] = { "port", &port_fops, NULL, false },
#endif
[5] = { "zero", &zero_fops, &zero_bdi, false },
[6] = { "full", &full_fops, NULL, false },
[7] = { "random", &random_fops, NULL, false },
[9] = { "urandom", &urandom_fops, NULL, false },
[11] = { "kmsg", &kmsg_fops, NULL, false },
#ifdef CONFIG_CRASH_DUMP
[12] = { "oldmem", &oldmem_fops, NULL, false },
#endif
};

Thanks,
Fengguang

2009-09-17 07:14:37

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v4

On Thu, Sep 17, 2009 at 02:51:00PM +0800, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, rw_verify_area() checsk f_pos is negative or not. And if
> negative, returns -EINVAL.
>
> But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
> has negative offsets. And we can't do any access via read/write
> to the file(device).
>
> This patch introduce a flag S_VERYBIG and allow negative file
> offsets for big files. (usual files don't allow it.)
>
> Changelog: v3->v4
> - make changes in mem.c aligned.
> - change __negative_fpos_check() to return int.
> - fixed bug in "pos" check.
> - added comments.
>
> Changelog: v2->v3
> - fixed bug in rw_verify_area (it cannot be compiled)
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> drivers/char/mem.c | 23 +++++++++++++----------
> fs/proc/base.c | 2 ++
> fs/read_write.c | 22 ++++++++++++++++++++--
> include/linux/fs.h | 2 ++
> 4 files changed, 37 insertions(+), 12 deletions(-)
>
> Index: mmotm-2.6.31-Sep14/fs/read_write.c
> ===================================================================
> --- mmotm-2.6.31-Sep14.orig/fs/read_write.c
> +++ mmotm-2.6.31-Sep14/fs/read_write.c
> @@ -205,6 +205,21 @@ bad:
> }
> #endif
>
> +static int
> +__negative_fpos_check(struct inode *inode, loff_t pos, size_t count)
> +{
> + /*
> + * pos or pos+count is negative here, check overflow.
> + * too big "count" will be caught in rw_verify_area().
> + */
> + if ((pos < 0) && (pos + count < pos))
> + return -EOVERFLOW;

This returns -EOVERFLOW when pos=-10 and count=1. What's the intention?
Just to return a different error code other than -EINVAL?

Also it seems you did two behavior changes at the same time: the above
-EOVERFLOW and the below IS_VERYBIG(). Are they tightly coupled?

> + /* If !VERYBIG inode, negative pos(pos+count) is not allowed */
> + if (!IS_VERYBIG(inode))
> + return -EINVAL;
> + return 0;
> +}
> +
> /*
> * rw_verify_area doesn't like huge counts. We limit
> * them to something that fits in "int" so that others
> @@ -222,8 +237,11 @@ int rw_verify_area(int read_write, struc
> if (unlikely((ssize_t) count < 0))
> return retval;
> pos = *ppos;
> - if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
> - return retval;
> + if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
> + retval = __negative_fpos_check(inode, pos, count);
> + if (retval)
> + return retval;
> + }

This could look more nicer:

retval = __negative_fpos_check(inode, pos, count);
if (retval)
return retval;

But they are all minor issues :)

Thanks,
Fengguang

>
> if (unlikely(inode->i_flock && mandatory_lock(inode))) {
> retval = locks_mandatory_area(
> Index: mmotm-2.6.31-Sep14/include/linux/fs.h
> ===================================================================
> --- mmotm-2.6.31-Sep14.orig/include/linux/fs.h
> +++ mmotm-2.6.31-Sep14/include/linux/fs.h
> @@ -231,6 +231,7 @@ struct inodes_stat_t {
> #define S_NOCMTIME 128 /* Do not update file c/mtime */
> #define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
> #define S_PRIVATE 512 /* Inode is fs-internal */
> +#define S_VERYBIG 1024 /* Allow file's loff_t can be negative */
>
> /*
> * Note that nosuid etc flags are inode-specific: setting some file-system
> @@ -265,6 +266,7 @@ struct inodes_stat_t {
> #define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
> #define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
> #define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
> +#define IS_VERYBIG(inode) ((inode)->i_flags & S_VERYBIG)
>
> /* the read-only stuff doesn't really belong here, but any other place is
> probably as bad and I don't want to create yet another include file. */
> Index: mmotm-2.6.31-Sep14/drivers/char/mem.c
> ===================================================================
> --- mmotm-2.6.31-Sep14.orig/drivers/char/mem.c
> +++ mmotm-2.6.31-Sep14/drivers/char/mem.c
> @@ -825,22 +825,23 @@ static const struct memdev {
> const char *name;
> const struct file_operations *fops;
> struct backing_dev_info *dev_info;
> + bool verybig;
> } devlist[] = {
> - [ 1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi },
> + [1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi, true },
> #ifdef CONFIG_DEVKMEM
> - [ 2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi },
> + [2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi, true },
> #endif
> - [ 3] = {"null", &null_fops, NULL },
> + [3] = {"null", &null_fops, NULL, false },
> #ifdef CONFIG_DEVPORT
> - [ 4] = { "port", &port_fops, NULL },
> + [4] = { "port", &port_fops, NULL, false },
> #endif
> - [ 5] = { "zero", &zero_fops, &zero_bdi },
> - [ 6] = { "full", &full_fops, NULL },
> - [ 7] = { "random", &random_fops, NULL },
> - [ 9] = { "urandom", &urandom_fops, NULL },
> - [11] = { "kmsg", &kmsg_fops, NULL },
> + [5] = { "zero", &zero_fops, &zero_bdi, false },
> + [6] = { "full", &full_fops, NULL, false },
> + [7] = { "random", &random_fops, NULL, false },
> + [9] = { "urandom", &urandom_fops, NULL, false },
> + [11] = { "kmsg", &kmsg_fops, NULL, false },
> #ifdef CONFIG_CRASH_DUMP
> - [12] = { "oldmem", &oldmem_fops, NULL },
> + [12] = { "oldmem", &oldmem_fops, NULL, false },
> #endif
> };
>
> @@ -868,6 +869,8 @@ static int memory_open(struct inode *ino
> ret = dev->fops->open(inode, filp);
> else
> ret = 0;
> + if (dev->verybig)
> + inode->i_flags |= S_VERYBIG;
> out:
> unlock_kernel();
> return ret;
> Index: mmotm-2.6.31-Sep14/fs/proc/base.c
> ===================================================================
> --- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
> +++ mmotm-2.6.31-Sep14/fs/proc/base.c
> @@ -779,6 +779,8 @@ static const struct file_operations proc
> static int mem_open(struct inode* inode, struct file* file)
> {
> file->private_data = (void*)((long)current->self_exec_id);
> + /* this file is read only and we can catch out-pf-range */
> + inode->i_flags |= S_VERYBIG;
> return 0;
> }
>

2009-09-17 07:25:31

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v4

On Thu, 17 Sep 2009 15:14:28 +0800
Wu Fengguang <[email protected]> wrote:

> On Thu, Sep 17, 2009 at 02:51:00PM +0800, KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, rw_verify_area() checsk f_pos is negative or not. And if
> > negative, returns -EINVAL.
> >
> > But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
> > has negative offsets. And we can't do any access via read/write
> > to the file(device).
> >
> > This patch introduce a flag S_VERYBIG and allow negative file
> > offsets for big files. (usual files don't allow it.)
> >
> > Changelog: v3->v4
> > - make changes in mem.c aligned.
> > - change __negative_fpos_check() to return int.
> > - fixed bug in "pos" check.
> > - added comments.
> >
> > Changelog: v2->v3
> > - fixed bug in rw_verify_area (it cannot be compiled)
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > drivers/char/mem.c | 23 +++++++++++++----------
> > fs/proc/base.c | 2 ++
> > fs/read_write.c | 22 ++++++++++++++++++++--
> > include/linux/fs.h | 2 ++
> > 4 files changed, 37 insertions(+), 12 deletions(-)
> >
> > Index: mmotm-2.6.31-Sep14/fs/read_write.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep14.orig/fs/read_write.c
> > +++ mmotm-2.6.31-Sep14/fs/read_write.c
> > @@ -205,6 +205,21 @@ bad:
> > }
> > #endif
> >
> > +static int
> > +__negative_fpos_check(struct inode *inode, loff_t pos, size_t count)
> > +{
> > + /*
> > + * pos or pos+count is negative here, check overflow.
> > + * too big "count" will be caught in rw_verify_area().
> > + */
> > + if ((pos < 0) && (pos + count < pos))
> > + return -EOVERFLOW;
>
> This returns -EOVERFLOW when pos=-10 and count=1. What's the intention?
Hmm ?

pos+count=-9 > -10 ? it's ok. no -EOVERFLOW

pos=-10, count=11,
pos+count=1 > -10, then overflow.

> Just to return a different error code other than -EINVAL?
>
For showing what this "if" checks. EINVAL is better ?

Thanks,
-Kame


> Also it seems you did two behavior changes at the same time: the above
> -EOVERFLOW and the below IS_VERYBIG(). Are they tightly coupled?
>
> > + /* If !VERYBIG inode, negative pos(pos+count) is not allowed */
> > + if (!IS_VERYBIG(inode))
> > + return -EINVAL;
> > + return 0;
> > +}
> > +
> > /*
> > * rw_verify_area doesn't like huge counts. We limit
> > * them to something that fits in "int" so that others
> > @@ -222,8 +237,11 @@ int rw_verify_area(int read_write, struc
> > if (unlikely((ssize_t) count < 0))
> > return retval;
> > pos = *ppos;
> > - if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
> > - return retval;
> > + if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
> > + retval = __negative_fpos_check(inode, pos, count);
> > + if (retval)
> > + return retval;
> > + }
>
> This could look more nicer:
>
> retval = __negative_fpos_check(inode, pos, count);
> if (retval)
> return retval;
>
> But they are all minor issues :)
>
> Thanks,
> Fengguang
>
> >
> > if (unlikely(inode->i_flock && mandatory_lock(inode))) {
> > retval = locks_mandatory_area(
> > Index: mmotm-2.6.31-Sep14/include/linux/fs.h
> > ===================================================================
> > --- mmotm-2.6.31-Sep14.orig/include/linux/fs.h
> > +++ mmotm-2.6.31-Sep14/include/linux/fs.h
> > @@ -231,6 +231,7 @@ struct inodes_stat_t {
> > #define S_NOCMTIME 128 /* Do not update file c/mtime */
> > #define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
> > #define S_PRIVATE 512 /* Inode is fs-internal */
> > +#define S_VERYBIG 1024 /* Allow file's loff_t can be negative */
> >
> > /*
> > * Note that nosuid etc flags are inode-specific: setting some file-system
> > @@ -265,6 +266,7 @@ struct inodes_stat_t {
> > #define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
> > #define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
> > #define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
> > +#define IS_VERYBIG(inode) ((inode)->i_flags & S_VERYBIG)
> >
> > /* the read-only stuff doesn't really belong here, but any other place is
> > probably as bad and I don't want to create yet another include file. */
> > Index: mmotm-2.6.31-Sep14/drivers/char/mem.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep14.orig/drivers/char/mem.c
> > +++ mmotm-2.6.31-Sep14/drivers/char/mem.c
> > @@ -825,22 +825,23 @@ static const struct memdev {
> > const char *name;
> > const struct file_operations *fops;
> > struct backing_dev_info *dev_info;
> > + bool verybig;
> > } devlist[] = {
> > - [ 1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi },
> > + [1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi, true },
> > #ifdef CONFIG_DEVKMEM
> > - [ 2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi },
> > + [2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi, true },
> > #endif
> > - [ 3] = {"null", &null_fops, NULL },
> > + [3] = {"null", &null_fops, NULL, false },
> > #ifdef CONFIG_DEVPORT
> > - [ 4] = { "port", &port_fops, NULL },
> > + [4] = { "port", &port_fops, NULL, false },
> > #endif
> > - [ 5] = { "zero", &zero_fops, &zero_bdi },
> > - [ 6] = { "full", &full_fops, NULL },
> > - [ 7] = { "random", &random_fops, NULL },
> > - [ 9] = { "urandom", &urandom_fops, NULL },
> > - [11] = { "kmsg", &kmsg_fops, NULL },
> > + [5] = { "zero", &zero_fops, &zero_bdi, false },
> > + [6] = { "full", &full_fops, NULL, false },
> > + [7] = { "random", &random_fops, NULL, false },
> > + [9] = { "urandom", &urandom_fops, NULL, false },
> > + [11] = { "kmsg", &kmsg_fops, NULL, false },
> > #ifdef CONFIG_CRASH_DUMP
> > - [12] = { "oldmem", &oldmem_fops, NULL },
> > + [12] = { "oldmem", &oldmem_fops, NULL, false },
> > #endif
> > };
> >
> > @@ -868,6 +869,8 @@ static int memory_open(struct inode *ino
> > ret = dev->fops->open(inode, filp);
> > else
> > ret = 0;
> > + if (dev->verybig)
> > + inode->i_flags |= S_VERYBIG;
> > out:
> > unlock_kernel();
> > return ret;
> > Index: mmotm-2.6.31-Sep14/fs/proc/base.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
> > +++ mmotm-2.6.31-Sep14/fs/proc/base.c
> > @@ -779,6 +779,8 @@ static const struct file_operations proc
> > static int mem_open(struct inode* inode, struct file* file)
> > {
> > file->private_data = (void*)((long)current->self_exec_id);
> > + /* this file is read only and we can catch out-pf-range */
> > + inode->i_flags |= S_VERYBIG;
> > return 0;
> > }
> >
>

2009-09-17 07:30:44

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v4

On Thu, Sep 17, 2009 at 03:23:24PM +0800, KAMEZAWA Hiroyuki wrote:
> On Thu, 17 Sep 2009 15:14:28 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > On Thu, Sep 17, 2009 at 02:51:00PM +0800, KAMEZAWA Hiroyuki wrote:
> > > From: KAMEZAWA Hiroyuki <[email protected]>
> > >
> > > Now, rw_verify_area() checsk f_pos is negative or not. And if
> > > negative, returns -EINVAL.
> > >
> > > But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
> > > has negative offsets. And we can't do any access via read/write
> > > to the file(device).
> > >
> > > This patch introduce a flag S_VERYBIG and allow negative file
> > > offsets for big files. (usual files don't allow it.)
> > >
> > > Changelog: v3->v4
> > > - make changes in mem.c aligned.
> > > - change __negative_fpos_check() to return int.
> > > - fixed bug in "pos" check.
> > > - added comments.
> > >
> > > Changelog: v2->v3
> > > - fixed bug in rw_verify_area (it cannot be compiled)
> > >
> > > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > > ---
> > > drivers/char/mem.c | 23 +++++++++++++----------
> > > fs/proc/base.c | 2 ++
> > > fs/read_write.c | 22 ++++++++++++++++++++--
> > > include/linux/fs.h | 2 ++
> > > 4 files changed, 37 insertions(+), 12 deletions(-)
> > >
> > > Index: mmotm-2.6.31-Sep14/fs/read_write.c
> > > ===================================================================
> > > --- mmotm-2.6.31-Sep14.orig/fs/read_write.c
> > > +++ mmotm-2.6.31-Sep14/fs/read_write.c
> > > @@ -205,6 +205,21 @@ bad:
> > > }
> > > #endif
> > >
> > > +static int
> > > +__negative_fpos_check(struct inode *inode, loff_t pos, size_t count)
> > > +{
> > > + /*
> > > + * pos or pos+count is negative here, check overflow.
> > > + * too big "count" will be caught in rw_verify_area().
> > > + */
> > > + if ((pos < 0) && (pos + count < pos))
> > > + return -EOVERFLOW;
> >
> > This returns -EOVERFLOW when pos=-10 and count=1. What's the intention?
> Hmm ?
>
> pos+count=-9 > -10 ? it's ok. no -EOVERFLOW
>
> pos=-10, count=11,
> pos+count=1 > -10, then overflow.

Ah yes, was confused..

> > Just to return a different error code other than -EINVAL?
> >
> For showing what this "if" checks. EINVAL is better ?

..so please ignore this question.

Thanks,
Fengguang

>
> > Also it seems you did two behavior changes at the same time: the above
> > -EOVERFLOW and the below IS_VERYBIG(). Are they tightly coupled?
> >
> > > + /* If !VERYBIG inode, negative pos(pos+count) is not allowed */
> > > + if (!IS_VERYBIG(inode))
> > > + return -EINVAL;
> > > + return 0;
> > > +}
> > > +
> > > /*
> > > * rw_verify_area doesn't like huge counts. We limit
> > > * them to something that fits in "int" so that others
> > > @@ -222,8 +237,11 @@ int rw_verify_area(int read_write, struc
> > > if (unlikely((ssize_t) count < 0))
> > > return retval;
> > > pos = *ppos;
> > > - if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
> > > - return retval;
> > > + if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
> > > + retval = __negative_fpos_check(inode, pos, count);
> > > + if (retval)
> > > + return retval;
> > > + }
> >
> > This could look more nicer:
> >
> > retval = __negative_fpos_check(inode, pos, count);
> > if (retval)
> > return retval;
> >
> > But they are all minor issues :)
> >
> > Thanks,
> > Fengguang
> >
> > >
> > > if (unlikely(inode->i_flock && mandatory_lock(inode))) {
> > > retval = locks_mandatory_area(
> > > Index: mmotm-2.6.31-Sep14/include/linux/fs.h
> > > ===================================================================
> > > --- mmotm-2.6.31-Sep14.orig/include/linux/fs.h
> > > +++ mmotm-2.6.31-Sep14/include/linux/fs.h
> > > @@ -231,6 +231,7 @@ struct inodes_stat_t {
> > > #define S_NOCMTIME 128 /* Do not update file c/mtime */
> > > #define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
> > > #define S_PRIVATE 512 /* Inode is fs-internal */
> > > +#define S_VERYBIG 1024 /* Allow file's loff_t can be negative */
> > >
> > > /*
> > > * Note that nosuid etc flags are inode-specific: setting some file-system
> > > @@ -265,6 +266,7 @@ struct inodes_stat_t {
> > > #define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
> > > #define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
> > > #define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
> > > +#define IS_VERYBIG(inode) ((inode)->i_flags & S_VERYBIG)
> > >
> > > /* the read-only stuff doesn't really belong here, but any other place is
> > > probably as bad and I don't want to create yet another include file. */
> > > Index: mmotm-2.6.31-Sep14/drivers/char/mem.c
> > > ===================================================================
> > > --- mmotm-2.6.31-Sep14.orig/drivers/char/mem.c
> > > +++ mmotm-2.6.31-Sep14/drivers/char/mem.c
> > > @@ -825,22 +825,23 @@ static const struct memdev {
> > > const char *name;
> > > const struct file_operations *fops;
> > > struct backing_dev_info *dev_info;
> > > + bool verybig;
> > > } devlist[] = {
> > > - [ 1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi },
> > > + [1] = { "mem", &mem_fops, &directly_mappable_cdev_bdi, true },
> > > #ifdef CONFIG_DEVKMEM
> > > - [ 2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi },
> > > + [2] = { "kmem", &kmem_fops, &directly_mappable_cdev_bdi, true },
> > > #endif
> > > - [ 3] = {"null", &null_fops, NULL },
> > > + [3] = {"null", &null_fops, NULL, false },
> > > #ifdef CONFIG_DEVPORT
> > > - [ 4] = { "port", &port_fops, NULL },
> > > + [4] = { "port", &port_fops, NULL, false },
> > > #endif
> > > - [ 5] = { "zero", &zero_fops, &zero_bdi },
> > > - [ 6] = { "full", &full_fops, NULL },
> > > - [ 7] = { "random", &random_fops, NULL },
> > > - [ 9] = { "urandom", &urandom_fops, NULL },
> > > - [11] = { "kmsg", &kmsg_fops, NULL },
> > > + [5] = { "zero", &zero_fops, &zero_bdi, false },
> > > + [6] = { "full", &full_fops, NULL, false },
> > > + [7] = { "random", &random_fops, NULL, false },
> > > + [9] = { "urandom", &urandom_fops, NULL, false },
> > > + [11] = { "kmsg", &kmsg_fops, NULL, false },
> > > #ifdef CONFIG_CRASH_DUMP
> > > - [12] = { "oldmem", &oldmem_fops, NULL },
> > > + [12] = { "oldmem", &oldmem_fops, NULL, false },
> > > #endif
> > > };
> > >
> > > @@ -868,6 +869,8 @@ static int memory_open(struct inode *ino
> > > ret = dev->fops->open(inode, filp);
> > > else
> > > ret = 0;
> > > + if (dev->verybig)
> > > + inode->i_flags |= S_VERYBIG;
> > > out:
> > > unlock_kernel();
> > > return ret;
> > > Index: mmotm-2.6.31-Sep14/fs/proc/base.c
> > > ===================================================================
> > > --- mmotm-2.6.31-Sep14.orig/fs/proc/base.c
> > > +++ mmotm-2.6.31-Sep14/fs/proc/base.c
> > > @@ -779,6 +779,8 @@ static const struct file_operations proc
> > > static int mem_open(struct inode* inode, struct file* file)
> > > {
> > > file->private_data = (void*)((long)current->self_exec_id);
> > > + /* this file is read only and we can catch out-pf-range */
> > > + inode->i_flags |= S_VERYBIG;
> > > return 0;
> > > }
> > >
> >

2009-09-17 09:42:12

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v4

On Thu, Sep 17, 2009 at 03:23:24PM +0800, KAMEZAWA Hiroyuki wrote:
> On Thu, 17 Sep 2009 15:14:28 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > On Thu, Sep 17, 2009 at 02:51:00PM +0800, KAMEZAWA Hiroyuki wrote:
> > > From: KAMEZAWA Hiroyuki <[email protected]>
> > >
> > > Now, rw_verify_area() checsk f_pos is negative or not. And if
> > > negative, returns -EINVAL.
> > >
> > > But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
> > > has negative offsets. And we can't do any access via read/write
> > > to the file(device).
> > >
> > > This patch introduce a flag S_VERYBIG and allow negative file
> > > offsets for big files. (usual files don't allow it.)
> > >
> > > Changelog: v3->v4
> > > - make changes in mem.c aligned.
> > > - change __negative_fpos_check() to return int.
> > > - fixed bug in "pos" check.
> > > - added comments.
> > >
> > > Changelog: v2->v3
> > > - fixed bug in rw_verify_area (it cannot be compiled)
> > >
> > > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > > ---
> > > drivers/char/mem.c | 23 +++++++++++++----------
> > > fs/proc/base.c | 2 ++
> > > fs/read_write.c | 22 ++++++++++++++++++++--
> > > include/linux/fs.h | 2 ++
> > > 4 files changed, 37 insertions(+), 12 deletions(-)
> > >
> > > Index: mmotm-2.6.31-Sep14/fs/read_write.c
> > > ===================================================================
> > > --- mmotm-2.6.31-Sep14.orig/fs/read_write.c
> > > +++ mmotm-2.6.31-Sep14/fs/read_write.c
> > > @@ -205,6 +205,21 @@ bad:
> > > }
> > > #endif
> > >
> > > +static int
> > > +__negative_fpos_check(struct inode *inode, loff_t pos, size_t count)
> > > +{
> > > + /*
> > > + * pos or pos+count is negative here, check overflow.
> > > + * too big "count" will be caught in rw_verify_area().
> > > + */
> > > + if ((pos < 0) && (pos + count < pos))
> > > + return -EOVERFLOW;
> >
> > This returns -EOVERFLOW when pos=-10 and count=1. What's the intention?
> Hmm ?
>
> pos+count=-9 > -10 ? it's ok. no -EOVERFLOW
>
> pos=-10, count=11,
> pos+count=1 > -10, then overflow.

Hmm, it seems less confusing to do

static int __negative_fpos_check(struct inode *inode,
unsigned long pos,
unsigned long count)
{
if (pos + count < pos)
return -EOVERFLOW;
...
}

Thanks,
Fengguang

2009-09-17 10:54:02

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v4

Wu Fengguang wrote:
> On Thu, Sep 17, 2009 at 03:23:24PM +0800, KAMEZAWA Hiroyuki wrote:
>> On Thu, 17 Sep 2009 15:14:28 +0800
>> Wu Fengguang <[email protected]> wrote:
>>
>> > On Thu, Sep 17, 2009 at 02:51:00PM +0800, KAMEZAWA Hiroyuki wrote:
>> > > From: KAMEZAWA Hiroyuki <[email protected]>
>> > >
>> > > Now, rw_verify_area() checsk f_pos is negative or not. And if
>> > > negative, returns -EINVAL.
>> > >
>> > > But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
>> > > has negative offsets. And we can't do any access via read/write
>> > > to the file(device).
>> > >
>> > > This patch introduce a flag S_VERYBIG and allow negative file
>> > > offsets for big files. (usual files don't allow it.)
>> > >
>> > > Changelog: v3->v4
>> > > - make changes in mem.c aligned.
>> > > - change __negative_fpos_check() to return int.
>> > > - fixed bug in "pos" check.
>> > > - added comments.
>> > >
>> > > Changelog: v2->v3
>> > > - fixed bug in rw_verify_area (it cannot be compiled)
>> > >
>> > > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>> > > ---
>> > > drivers/char/mem.c | 23 +++++++++++++----------
>> > > fs/proc/base.c | 2 ++
>> > > fs/read_write.c | 22 ++++++++++++++++++++--
>> > > include/linux/fs.h | 2 ++
>> > > 4 files changed, 37 insertions(+), 12 deletions(-)
>> > >
>> > > Index: mmotm-2.6.31-Sep14/fs/read_write.c
>> > > ===================================================================
>> > > --- mmotm-2.6.31-Sep14.orig/fs/read_write.c
>> > > +++ mmotm-2.6.31-Sep14/fs/read_write.c
>> > > @@ -205,6 +205,21 @@ bad:
>> > > }
>> > > #endif
>> > >
>> > > +static int
>> > > +__negative_fpos_check(struct inode *inode, loff_t pos, size_t
>> count)
>> > > +{
>> > > + /*
>> > > + * pos or pos+count is negative here, check overflow.
>> > > + * too big "count" will be caught in rw_verify_area().
>> > > + */
>> > > + if ((pos < 0) && (pos + count < pos))
>> > > + return -EOVERFLOW;
>> >
>> > This returns -EOVERFLOW when pos=-10 and count=1. What's the
>> intention?
>> Hmm ?
>>
>> pos+count=-9 > -10 ? it's ok. no -EOVERFLOW
>>
>> pos=-10, count=11,
>> pos+count=1 > -10, then overflow.
>
> Hmm, it seems less confusing to do
>
> static int __negative_fpos_check(struct inode *inode,
> unsigned long pos,
> unsigned long count)
> {
> if (pos + count < pos)
> return -EOVERFLOW;
> ...
> }
>
have to avoid pos == LONGLONGMAX case.

Thanks,
-Kame


> Thanks,
> Fengguang
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-09-17 10:58:07

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v4

KAMEZAWA Hiroyuki wrote:
> Wu Fengguang wrote:
>> On Thu, Sep 17, 2009 at 03:23:24PM +0800, KAMEZAWA Hiroyuki wrote:
>>> On Thu, 17 Sep 2009 15:14:28 +0800
>>> Wu Fengguang <[email protected]> wrote:
>>>
>>> > On Thu, Sep 17, 2009 at 02:51:00PM +0800, KAMEZAWA Hiroyuki wrote:
>>> > > From: KAMEZAWA Hiroyuki <[email protected]>
>>> > >
>>> > > Now, rw_verify_area() checsk f_pos is negative or not. And if
>>> > > negative, returns -EINVAL.
>>> > >
>>> > > But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
>>> > > has negative offsets. And we can't do any access via read/write
>>> > > to the file(device).
>>> > >
>>> > > This patch introduce a flag S_VERYBIG and allow negative file
>>> > > offsets for big files. (usual files don't allow it.)
>>> > >
>>> > > Changelog: v3->v4
>>> > > - make changes in mem.c aligned.
>>> > > - change __negative_fpos_check() to return int.
>>> > > - fixed bug in "pos" check.
>>> > > - added comments.
>>> > >
>>> > > Changelog: v2->v3
>>> > > - fixed bug in rw_verify_area (it cannot be compiled)
>>> > >
>>> > > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>>> > > ---
>>> > > drivers/char/mem.c | 23 +++++++++++++----------
>>> > > fs/proc/base.c | 2 ++
>>> > > fs/read_write.c | 22 ++++++++++++++++++++--
>>> > > include/linux/fs.h | 2 ++
>>> > > 4 files changed, 37 insertions(+), 12 deletions(-)
>>> > >
>>> > > Index: mmotm-2.6.31-Sep14/fs/read_write.c
>>> > > ===================================================================
>>> > > --- mmotm-2.6.31-Sep14.orig/fs/read_write.c
>>> > > +++ mmotm-2.6.31-Sep14/fs/read_write.c
>>> > > @@ -205,6 +205,21 @@ bad:
>>> > > }
>>> > > #endif
>>> > >
>>> > > +static int
>>> > > +__negative_fpos_check(struct inode *inode, loff_t pos, size_t
>>> count)
>>> > > +{
>>> > > + /*
>>> > > + * pos or pos+count is negative here, check overflow.
>>> > > + * too big "count" will be caught in rw_verify_area().
>>> > > + */
>>> > > + if ((pos < 0) && (pos + count < pos))
>>> > > + return -EOVERFLOW;
>>> >
>>> > This returns -EOVERFLOW when pos=-10 and count=1. What's the
>>> intention?
>>> Hmm ?
>>>
>>> pos+count=-9 > -10 ? it's ok. no -EOVERFLOW
>>>
>>> pos=-10, count=11,
>>> pos+count=1 > -10, then overflow.
>>
>> Hmm, it seems less confusing to do
>>
>> static int __negative_fpos_check(struct inode *inode,
>> unsigned long pos,
>> unsigned long count)
>> {
>> if (pos + count < pos)
>> return -EOVERFLOW;
>> ...
>> }
>>
> have to avoid pos == LONGLONGMAX case.
>
Ah, you ask me to do cast from loff_t to unsigned long long ?

Not making much difference, I think. This is usual math.
But ok, I don't want to explain again.
If I post v5, I'll do.

Thanks,
-Kame


> Thanks,
> -Kame
>
>
>> Thanks,
>> Fengguang
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>> in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-09-17 12:40:46

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v4

On Thu, Sep 17, 2009 at 06:54:00PM +0800, KAMEZAWA Hiroyuki wrote:
> Wu Fengguang wrote:
> > On Thu, Sep 17, 2009 at 03:23:24PM +0800, KAMEZAWA Hiroyuki wrote:
> >> On Thu, 17 Sep 2009 15:14:28 +0800
> >> Wu Fengguang <[email protected]> wrote:
> >>
> >> > On Thu, Sep 17, 2009 at 02:51:00PM +0800, KAMEZAWA Hiroyuki wrote:
> >> > > From: KAMEZAWA Hiroyuki <[email protected]>
> >> > >
> >> > > Now, rw_verify_area() checsk f_pos is negative or not. And if
> >> > > negative, returns -EINVAL.
> >> > >
> >> > > But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..
> >> > > has negative offsets. And we can't do any access via read/write
> >> > > to the file(device).
> >> > >
> >> > > This patch introduce a flag S_VERYBIG and allow negative file
> >> > > offsets for big files. (usual files don't allow it.)
> >> > >
> >> > > Changelog: v3->v4
> >> > > - make changes in mem.c aligned.
> >> > > - change __negative_fpos_check() to return int.
> >> > > - fixed bug in "pos" check.
> >> > > - added comments.
> >> > >
> >> > > Changelog: v2->v3
> >> > > - fixed bug in rw_verify_area (it cannot be compiled)
> >> > >
> >> > > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> >> > > ---
> >> > > drivers/char/mem.c | 23 +++++++++++++----------
> >> > > fs/proc/base.c | 2 ++
> >> > > fs/read_write.c | 22 ++++++++++++++++++++--
> >> > > include/linux/fs.h | 2 ++
> >> > > 4 files changed, 37 insertions(+), 12 deletions(-)
> >> > >
> >> > > Index: mmotm-2.6.31-Sep14/fs/read_write.c
> >> > > ===================================================================
> >> > > --- mmotm-2.6.31-Sep14.orig/fs/read_write.c
> >> > > +++ mmotm-2.6.31-Sep14/fs/read_write.c
> >> > > @@ -205,6 +205,21 @@ bad:
> >> > > }
> >> > > #endif
> >> > >
> >> > > +static int
> >> > > +__negative_fpos_check(struct inode *inode, loff_t pos, size_t
> >> count)
> >> > > +{
> >> > > + /*
> >> > > + * pos or pos+count is negative here, check overflow.
> >> > > + * too big "count" will be caught in rw_verify_area().
> >> > > + */
> >> > > + if ((pos < 0) && (pos + count < pos))
> >> > > + return -EOVERFLOW;
> >> >
> >> > This returns -EOVERFLOW when pos=-10 and count=1. What's the
> >> intention?
> >> Hmm ?
> >>
> >> pos+count=-9 > -10 ? it's ok. no -EOVERFLOW
> >>
> >> pos=-10, count=11,
> >> pos+count=1 > -10, then overflow.
> >
> > Hmm, it seems less confusing to do
> >
> > static int __negative_fpos_check(struct inode *inode,
> > unsigned long pos,
> > unsigned long count)
> > {
> > if (pos + count < pos)
> > return -EOVERFLOW;
> > ...
> > }
> >
> have to avoid pos == LONGLONGMAX case.

Ah sorry, didn't know loff_t could be long long..

Thanks,
Fengguang

2009-09-18 00:04:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v4

On Thu, 17 Sep 2009 20:40:39 +0800
Wu Fengguang <[email protected]> wrote:

> > > if (pos + count < pos)
> > > return -EOVERFLOW;
> > > ...
> > > }
> > >
> > have to avoid pos == LONGLONGMAX case.
>
> Ah sorry, didn't know loff_t could be long long..
>
Thank you. I'll restart this after merge window.
(And maybe I had to CC fsdevel...)

Thanks,
-Kame

2009-09-18 02:25:15

by Cong Wang

[permalink] [raw]
Subject: Re: [RFC][PATCH][bugfix] more checks for negative f_pos handling v4

On Fri, Sep 18, 2009 at 8:02 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 17 Sep 2009 20:40:39 +0800
> Wu Fengguang <[email protected]> wrote:
>
>> > >         if (pos + count < pos)
>> > >                 return -EOVERFLOW;
>> > >         ...
>> > > }
>> > >
>> > have to avoid pos == LONGLONGMAX case.
>>
>> Ah sorry, didn't know loff_t could be long long..
>>
> Thank you. I'll restart this after merge window.
> (And maybe I had to CC fsdevel...)

Yes, please do so. I would like to hear from Al..