2014-12-04 20:20:17

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCHES] iov_iter.c rewrite

First of all, I want to apologize for the nastiness of preprocessor
use in this series. Seeing that the whole "macros that look like new kinds
of C statements" thing (including list_for_each_...(), etc) is very much not
to my liking, I really don't trust my taste on finer details and I'd very
much like some feedback.

The reason for doing that kind of tricks is that iov_iter.c keeps
growing more and more boilerplate code. For iov_iter-net series we need
* csum_and_copy_from_iter()
* csum_and_copy_to_iter()
* copy_from_iter_nocache()
That's 3 new primitives, each in 2 variants (iovec and bvec).
* ITER_KVEC handled without going through uaccess.h stuff (and
independent of set_fs() state).
And *that* means 3 variants intstead of 2 for most of the existing primitives.
That's far too much, and the amount of copies of the same logics would pretty
much guarantee that it will be a breeding ground for hard-to-kill bugs.

The following series (also in vfs.git#iov_iter) actually manages to
do all of the above *and* shrink the damn thing quite a bit. The generated
code appears to be no worse than before. The price is a couple of iterator
macros - iterate_all_kinds() and iterate_and_advance(). They are given an
iov_iter, size (i.e. the amount of data in iov_iter beginning we want to go
through), name of the loop variable and 3 variants of loop body - for iovec,
bvec and kvec resp. Loop variable is declared *inside* the expansion of those
suckers according to the kind of iov_iter - it's struct iovec, struct bio_vec
or struct kvec, covering the current range to deal with.
The difference between those two is that iterate_and_advance() will
advance the iov_iter by the amount it has handled and iterate_all_kinds()
will leave iov_iter unchanged.

Unless I hear anybody yelling, it goes into vfs.git#for-next today,
so if you have objections, suggestions, etc., give those *now*.

Al Viro (13):
iov_iter.c: macros for iterating over iov_iter
iov_iter.c: iterate_and_advance
iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
iov_iter.c: convert iov_iter_zero() to iterate_and_advance
iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
iov_iter.c: convert copy_from_iter() to iterate_and_advance
iov_iter.c: convert copy_to_iter() to iterate_and_advance
iov_iter.c: handle ITER_KVEC directly
csum_and_copy_..._iter()
new helper: iov_iter_kvec()
copy_from_iter_nocache()

Diffstat:
include/linux/uio.h | 6 +
mm/iov_iter.c | 1077 +++++++++++++++++++++------------------------------
2 files changed, 445 insertions(+), 638 deletions(-)


2014-12-04 20:23:25

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 03/13] iov_iter.c: convert iov_iter_npages() to iterate_all_kinds

From: Al Viro <[email protected]>

Signed-off-by: Al Viro <[email protected]>
---
mm/iov_iter.c | 73 ++++++++++++++++-------------------------------------------
1 file changed, 19 insertions(+), 54 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index e91bf0a..bc666e7 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -493,32 +493,6 @@ static ssize_t get_pages_alloc_iovec(struct iov_iter *i,
return (res == n ? len : res * PAGE_SIZE) - *start;
}

-static int iov_iter_npages_iovec(const struct iov_iter *i, int maxpages)
-{
- size_t offset = i->iov_offset;
- size_t size = i->count;
- const struct iovec *iov = i->iov;
- int npages = 0;
- int n;
-
- for (n = 0; size && n < i->nr_segs; n++, iov++) {
- unsigned long addr = (unsigned long)iov->iov_base + offset;
- size_t len = iov->iov_len - offset;
- offset = 0;
- if (unlikely(!len)) /* empty segment */
- continue;
- if (len > size)
- len = size;
- npages += (addr + len + PAGE_SIZE - 1) / PAGE_SIZE
- - addr / PAGE_SIZE;
- if (npages >= maxpages) /* don't bother going further */
- return maxpages;
- size -= len;
- offset = 0;
- }
- return min(npages, maxpages);
-}
-
static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t len)
{
char *from = kmap_atomic(page);
@@ -715,30 +689,6 @@ static ssize_t get_pages_alloc_bvec(struct iov_iter *i,
return len;
}

-static int iov_iter_npages_bvec(const struct iov_iter *i, int maxpages)
-{
- size_t offset = i->iov_offset;
- size_t size = i->count;
- const struct bio_vec *bvec = i->bvec;
- int npages = 0;
- int n;
-
- for (n = 0; size && n < i->nr_segs; n++, bvec++) {
- size_t len = bvec->bv_len - offset;
- offset = 0;
- if (unlikely(!len)) /* empty segment */
- continue;
- if (len > size)
- len = size;
- npages++;
- if (npages >= maxpages) /* don't bother going further */
- return maxpages;
- size -= len;
- offset = 0;
- }
- return min(npages, maxpages);
-}
-
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
@@ -862,9 +812,24 @@ EXPORT_SYMBOL(iov_iter_get_pages_alloc);

int iov_iter_npages(const struct iov_iter *i, int maxpages)
{
- if (i->type & ITER_BVEC)
- return iov_iter_npages_bvec(i, maxpages);
- else
- return iov_iter_npages_iovec(i, maxpages);
+ size_t size = i->count;
+ int npages = 0;
+
+ if (!size)
+ return 0;
+
+ iterate_all_kinds(i, size, v, ({
+ unsigned long p = (unsigned long)v.iov_base;
+ npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
+ - p / PAGE_SIZE;
+ if (npages >= maxpages)
+ return maxpages;
+ 0;}),({
+ npages++;
+ if (npages >= maxpages)
+ return maxpages;
+ })
+ )
+ return npages;
}
EXPORT_SYMBOL(iov_iter_npages);
--
2.1.3

2014-12-04 20:23:29

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 09/13] iov_iter.c: convert copy_to_iter() to iterate_and_advance

From: Al Viro <[email protected]>

Signed-off-by: Al Viro <[email protected]>
---
mm/iov_iter.c | 91 ++++++-----------------------------------------------------
1 file changed, 9 insertions(+), 82 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 791429d..66665449 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -97,51 +97,6 @@
i->iov_offset = skip; \
}

-static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
-{
- size_t skip, copy, left, wanted;
- const struct iovec *iov;
- char __user *buf;
-
- if (unlikely(bytes > i->count))
- bytes = i->count;
-
- if (unlikely(!bytes))
- return 0;
-
- wanted = bytes;
- iov = i->iov;
- skip = i->iov_offset;
- buf = iov->iov_base + skip;
- copy = min(bytes, iov->iov_len - skip);
-
- left = __copy_to_user(buf, from, copy);
- copy -= left;
- skip += copy;
- from += copy;
- bytes -= copy;
- while (unlikely(!left && bytes)) {
- iov++;
- buf = iov->iov_base;
- copy = min(bytes, iov->iov_len);
- left = __copy_to_user(buf, from, copy);
- copy -= left;
- skip = copy;
- from += copy;
- bytes -= copy;
- }
-
- if (skip == iov->iov_len) {
- iov++;
- skip = 0;
- }
- i->count -= wanted - bytes;
- i->nr_segs -= iov - i->iov;
- i->iov = iov;
- i->iov_offset = skip;
- return wanted - bytes;
-}
-
static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
@@ -360,51 +315,23 @@ static void memzero_page(struct page *page, size_t offset, size_t len)
kunmap_atomic(addr);
}

-static size_t copy_to_iter_bvec(void *from, size_t bytes, struct iov_iter *i)
+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
{
- size_t skip, copy, wanted;
- const struct bio_vec *bvec;
-
+ char *from = addr;
if (unlikely(bytes > i->count))
bytes = i->count;

if (unlikely(!bytes))
return 0;

- wanted = bytes;
- bvec = i->bvec;
- skip = i->iov_offset;
- copy = min_t(size_t, bytes, bvec->bv_len - skip);
-
- memcpy_to_page(bvec->bv_page, skip + bvec->bv_offset, from, copy);
- skip += copy;
- from += copy;
- bytes -= copy;
- while (bytes) {
- bvec++;
- copy = min(bytes, (size_t)bvec->bv_len);
- memcpy_to_page(bvec->bv_page, bvec->bv_offset, from, copy);
- skip = copy;
- from += copy;
- bytes -= copy;
- }
- if (skip == bvec->bv_len) {
- bvec++;
- skip = 0;
- }
- i->count -= wanted - bytes;
- i->nr_segs -= bvec - i->bvec;
- i->bvec = bvec;
- i->iov_offset = skip;
- return wanted - bytes;
-}
+ iterate_and_advance(i, bytes, v,
+ __copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
+ v.iov_len),
+ memcpy_to_page(v.bv_page, v.bv_offset,
+ (from += v.bv_len) - v.bv_len, v.bv_len)
+ )

-size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
-{
- if (i->type & ITER_BVEC)
- return copy_to_iter_bvec(addr, bytes, i);
- else
- return copy_to_iter_iovec(addr, bytes, i);
+ return bytes;
}
EXPORT_SYMBOL(copy_to_iter);

--
2.1.3

2014-12-04 20:23:28

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 07/13] iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()

From: Al Viro <[email protected]>

Just have copy_page_{to,from}_iter() fall back to kmap_atomic +
copy_{to,from}_iter() + kunmap_atomic() in ITER_BVEC case. As
the matter of fact, that's what we want to do for any iov_iter
kind that isn't blocking - e.g. ITER_KVEC will also go that way
once we recognize it on iov_iter.c primitives level

Signed-off-by: Al Viro <[email protected]>
---
mm/iov_iter.c | 60 ++++++++++++++++++++++++-----------------------------------
1 file changed, 24 insertions(+), 36 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 39ad713..17b7144 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -486,30 +486,33 @@ static size_t copy_from_iter_bvec(void *to, size_t bytes, struct iov_iter *i)
return wanted;
}

-static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
- size_t bytes, struct iov_iter *i)
+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
{
- void *kaddr = kmap_atomic(page);
- size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);
- kunmap_atomic(kaddr);
- return wanted;
+ if (i->type & ITER_BVEC)
+ return copy_to_iter_bvec(addr, bytes, i);
+ else
+ return copy_to_iter_iovec(addr, bytes, i);
}
+EXPORT_SYMBOL(copy_to_iter);

-static size_t copy_page_from_iter_bvec(struct page *page, size_t offset,
- size_t bytes, struct iov_iter *i)
+size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
{
- void *kaddr = kmap_atomic(page);
- size_t wanted = copy_from_iter_bvec(kaddr + offset, bytes, i);
- kunmap_atomic(kaddr);
- return wanted;
+ if (i->type & ITER_BVEC)
+ return copy_from_iter_bvec(addr, bytes, i);
+ else
+ return copy_from_iter_iovec(addr, bytes, i);
}
+EXPORT_SYMBOL(copy_from_iter);

size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
- if (i->type & ITER_BVEC)
- return copy_page_to_iter_bvec(page, offset, bytes, i);
- else
+ if (i->type & (ITER_BVEC|ITER_KVEC)) {
+ void *kaddr = kmap_atomic(page);
+ size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
+ kunmap_atomic(kaddr);
+ return wanted;
+ } else
return copy_page_to_iter_iovec(page, offset, bytes, i);
}
EXPORT_SYMBOL(copy_page_to_iter);
@@ -517,31 +520,16 @@ EXPORT_SYMBOL(copy_page_to_iter);
size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
- if (i->type & ITER_BVEC)
- return copy_page_from_iter_bvec(page, offset, bytes, i);
- else
+ if (i->type & ITER_BVEC) {
+ void *kaddr = kmap_atomic(page);
+ size_t wanted = copy_from_iter(kaddr + offset, bytes, i);
+ kunmap_atomic(kaddr);
+ return wanted;
+ } else
return copy_page_from_iter_iovec(page, offset, bytes, i);
}
EXPORT_SYMBOL(copy_page_from_iter);

-size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
-{
- if (i->type & ITER_BVEC)
- return copy_to_iter_bvec(addr, bytes, i);
- else
- return copy_to_iter_iovec(addr, bytes, i);
-}
-EXPORT_SYMBOL(copy_to_iter);
-
-size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
-{
- if (i->type & ITER_BVEC)
- return copy_from_iter_bvec(addr, bytes, i);
- else
- return copy_from_iter_iovec(addr, bytes, i);
-}
-EXPORT_SYMBOL(copy_from_iter);
-
size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
{
if (unlikely(bytes > i->count))
--
2.1.3

2014-12-04 20:23:27

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 06/13] iov_iter.c: convert iov_iter_zero() to iterate_and_advance

From: Al Viro <[email protected]>

Signed-off-by: Al Viro <[email protected]>
---
mm/iov_iter.c | 98 ++++++++---------------------------------------------------
1 file changed, 12 insertions(+), 86 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 3214b9b..39ad713 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -349,50 +349,6 @@ done:
return wanted - bytes;
}

-static size_t zero_iovec(size_t bytes, struct iov_iter *i)
-{
- size_t skip, copy, left, wanted;
- const struct iovec *iov;
- char __user *buf;
-
- if (unlikely(bytes > i->count))
- bytes = i->count;
-
- if (unlikely(!bytes))
- return 0;
-
- wanted = bytes;
- iov = i->iov;
- skip = i->iov_offset;
- buf = iov->iov_base + skip;
- copy = min(bytes, iov->iov_len - skip);
-
- left = __clear_user(buf, copy);
- copy -= left;
- skip += copy;
- bytes -= copy;
-
- while (unlikely(!left && bytes)) {
- iov++;
- buf = iov->iov_base;
- copy = min(bytes, iov->iov_len);
- left = __clear_user(buf, copy);
- copy -= left;
- skip = copy;
- bytes -= copy;
- }
-
- if (skip == iov->iov_len) {
- iov++;
- skip = 0;
- }
- i->count -= wanted - bytes;
- i->nr_segs -= iov - i->iov;
- i->iov = iov;
- i->iov_offset = skip;
- return wanted - bytes;
-}
-
/*
* Fault in the first iovec of the given iov_iter, to a maximum length
* of bytes. Returns 0 on success, or non-zero if the memory could not be
@@ -548,43 +504,6 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset,
return wanted;
}

-static size_t zero_bvec(size_t bytes, struct iov_iter *i)
-{
- size_t skip, copy, wanted;
- const struct bio_vec *bvec;
-
- if (unlikely(bytes > i->count))
- bytes = i->count;
-
- if (unlikely(!bytes))
- return 0;
-
- wanted = bytes;
- bvec = i->bvec;
- skip = i->iov_offset;
- copy = min_t(size_t, bytes, bvec->bv_len - skip);
-
- memzero_page(bvec->bv_page, skip + bvec->bv_offset, copy);
- skip += copy;
- bytes -= copy;
- while (bytes) {
- bvec++;
- copy = min(bytes, (size_t)bvec->bv_len);
- memzero_page(bvec->bv_page, bvec->bv_offset, copy);
- skip = copy;
- bytes -= copy;
- }
- if (skip == bvec->bv_len) {
- bvec++;
- skip = 0;
- }
- i->count -= wanted - bytes;
- i->nr_segs -= bvec - i->bvec;
- i->bvec = bvec;
- i->iov_offset = skip;
- return wanted - bytes;
-}
-
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
@@ -625,11 +544,18 @@ EXPORT_SYMBOL(copy_from_iter);

size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
{
- if (i->type & ITER_BVEC) {
- return zero_bvec(bytes, i);
- } else {
- return zero_iovec(bytes, i);
- }
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ iterate_and_advance(i, bytes, v,
+ __clear_user(v.iov_base, v.iov_len),
+ memzero_page(v.bv_page, v.bv_offset, v.bv_len)
+ )
+
+ return bytes;
}
EXPORT_SYMBOL(iov_iter_zero);

--
2.1.3

2014-12-04 20:23:24

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 05/13] iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds

From: Al Viro <[email protected]>

Signed-off-by: Al Viro <[email protected]>
---
mm/iov_iter.c | 107 ++++++++++++++++++++++++----------------------------------
1 file changed, 45 insertions(+), 62 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 75e29ef..3214b9b 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -428,43 +428,6 @@ void iov_iter_init(struct iov_iter *i, int direction,
}
EXPORT_SYMBOL(iov_iter_init);

-static ssize_t get_pages_alloc_iovec(struct iov_iter *i,
- struct page ***pages, size_t maxsize,
- size_t *start)
-{
- size_t offset = i->iov_offset;
- const struct iovec *iov = i->iov;
- size_t len;
- unsigned long addr;
- void *p;
- int n;
- int res;
-
- len = iov->iov_len - offset;
- if (len > i->count)
- len = i->count;
- if (len > maxsize)
- len = maxsize;
- addr = (unsigned long)iov->iov_base + offset;
- len += *start = addr & (PAGE_SIZE - 1);
- addr &= ~(PAGE_SIZE - 1);
- n = (len + PAGE_SIZE - 1) / PAGE_SIZE;
-
- p = kmalloc(n * sizeof(struct page *), GFP_KERNEL);
- if (!p)
- p = vmalloc(n * sizeof(struct page *));
- if (!p)
- return -ENOMEM;
-
- res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, p);
- if (unlikely(res < 0)) {
- kvfree(p);
- return res;
- }
- *pages = p;
- return (res == n ? len : res * PAGE_SIZE) - *start;
-}
-
static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t len)
{
char *from = kmap_atomic(page);
@@ -622,27 +585,6 @@ static size_t zero_bvec(size_t bytes, struct iov_iter *i)
return wanted - bytes;
}

-static ssize_t get_pages_alloc_bvec(struct iov_iter *i,
- struct page ***pages, size_t maxsize,
- size_t *start)
-{
- const struct bio_vec *bvec = i->bvec;
- size_t len = bvec->bv_len - i->iov_offset;
- if (len > i->count)
- len = i->count;
- if (len > maxsize)
- len = maxsize;
- *start = bvec->bv_offset + i->iov_offset;
-
- *pages = kmalloc(sizeof(struct page *), GFP_KERNEL);
- if (!*pages)
- return -ENOMEM;
-
- get_page(**pages = bvec->bv_page);
-
- return len;
-}
-
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
@@ -777,14 +719,55 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
}
EXPORT_SYMBOL(iov_iter_get_pages);

+static struct page **get_pages_array(size_t n)
+{
+ struct page **p = kmalloc(n * sizeof(struct page *), GFP_KERNEL);
+ if (!p)
+ p = vmalloc(n * sizeof(struct page *));
+ return p;
+}
+
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
size_t *start)
{
- if (i->type & ITER_BVEC)
- return get_pages_alloc_bvec(i, pages, maxsize, start);
- else
- return get_pages_alloc_iovec(i, pages, maxsize, start);
+ struct page **p;
+
+ if (maxsize > i->count)
+ maxsize = i->count;
+
+ if (!maxsize)
+ return 0;
+
+ iterate_all_kinds(i, maxsize, v, ({
+ unsigned long addr = (unsigned long)v.iov_base;
+ size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+ int n;
+ int res;
+
+ addr &= ~(PAGE_SIZE - 1);
+ n = DIV_ROUND_UP(len, PAGE_SIZE);
+ p = get_pages_array(n);
+ if (!p)
+ return -ENOMEM;
+ res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, p);
+ if (unlikely(res < 0)) {
+ kvfree(p);
+ return res;
+ }
+ *pages = p;
+ return (res == n ? len : res * PAGE_SIZE) - *start;
+ 0;}),({
+ /* can't be more than PAGE_SIZE */
+ *start = v.bv_offset;
+ *pages = p = get_pages_array(1);
+ if (!p)
+ return -ENOMEM;
+ get_page(*p = v.bv_page);
+ return v.bv_len;
+ })
+ )
+ return 0;
}
EXPORT_SYMBOL(iov_iter_get_pages_alloc);

--
2.1.3

2014-12-04 20:23:23

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 04/13] iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds

From: Al Viro <[email protected]>

Signed-off-by: Al Viro <[email protected]>
---
mm/iov_iter.c | 78 +++++++++++++++++++++--------------------------------------
1 file changed, 28 insertions(+), 50 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index bc666e7..75e29ef 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -428,34 +428,6 @@ void iov_iter_init(struct iov_iter *i, int direction,
}
EXPORT_SYMBOL(iov_iter_init);

-static ssize_t get_pages_iovec(struct iov_iter *i,
- struct page **pages, size_t maxsize, unsigned maxpages,
- size_t *start)
-{
- size_t offset = i->iov_offset;
- const struct iovec *iov = i->iov;
- size_t len;
- unsigned long addr;
- int n;
- int res;
-
- len = iov->iov_len - offset;
- if (len > i->count)
- len = i->count;
- if (len > maxsize)
- len = maxsize;
- addr = (unsigned long)iov->iov_base + offset;
- len += *start = addr & (PAGE_SIZE - 1);
- if (len > maxpages * PAGE_SIZE)
- len = maxpages * PAGE_SIZE;
- addr &= ~(PAGE_SIZE - 1);
- n = (len + PAGE_SIZE - 1) / PAGE_SIZE;
- res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, pages);
- if (unlikely(res < 0))
- return res;
- return (res == n ? len : res * PAGE_SIZE) - *start;
-}
-
static ssize_t get_pages_alloc_iovec(struct iov_iter *i,
struct page ***pages, size_t maxsize,
size_t *start)
@@ -650,24 +622,6 @@ static size_t zero_bvec(size_t bytes, struct iov_iter *i)
return wanted - bytes;
}

-static ssize_t get_pages_bvec(struct iov_iter *i,
- struct page **pages, size_t maxsize, unsigned maxpages,
- size_t *start)
-{
- const struct bio_vec *bvec = i->bvec;
- size_t len = bvec->bv_len - i->iov_offset;
- if (len > i->count)
- len = i->count;
- if (len > maxsize)
- len = maxsize;
- /* can't be more than PAGE_SIZE */
- *start = bvec->bv_offset + i->iov_offset;
-
- get_page(*pages = bvec->bv_page);
-
- return len;
-}
-
static ssize_t get_pages_alloc_bvec(struct iov_iter *i,
struct page ***pages, size_t maxsize,
size_t *start)
@@ -792,10 +746,34 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
size_t *start)
{
- if (i->type & ITER_BVEC)
- return get_pages_bvec(i, pages, maxsize, maxpages, start);
- else
- return get_pages_iovec(i, pages, maxsize, maxpages, start);
+ if (maxsize > i->count)
+ maxsize = i->count;
+
+ if (!maxsize)
+ return 0;
+
+ iterate_all_kinds(i, maxsize, v, ({
+ unsigned long addr = (unsigned long)v.iov_base;
+ size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+ int n;
+ int res;
+
+ if (len > maxpages * PAGE_SIZE)
+ len = maxpages * PAGE_SIZE;
+ addr &= ~(PAGE_SIZE - 1);
+ n = DIV_ROUND_UP(len, PAGE_SIZE);
+ res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, pages);
+ if (unlikely(res < 0))
+ return res;
+ return (res == n ? len : res * PAGE_SIZE) - *start;
+ 0;}),({
+ /* can't be more than PAGE_SIZE */
+ *start = v.bv_offset;
+ get_page(*pages = v.bv_page);
+ return v.bv_len;
+ })
+ )
+ return 0;
}
EXPORT_SYMBOL(iov_iter_get_pages);

--
2.1.3

2014-12-04 20:23:22

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 01/13] iov_iter.c: macros for iterating over iov_iter

From: Al Viro <[email protected]>

iterate_all_kinds(iter, size, ident, step_iovec, step_bvec)
iterates through the ranges covered by iter (up to size bytes total),
repeating step_iovec or step_bvec for each of those. ident is
declared in expansion of that thing, either as struct iovec or
struct bvec, and it contains the range we are currently looking
at. step_bvec should be a void expression, step_iovec - a size_t
one, with non-zero meaning "stop here, that many bytes from this
range left". In the end, the amount actually handled is stored
in size.

iov_iter_copy_from_user_atomic() and iov_iter_alignment() converted
to it.

Signed-off-by: Al Viro <[email protected]>
---
mm/iov_iter.c | 212 ++++++++++++++++++++++++----------------------------------
1 file changed, 86 insertions(+), 126 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index e34a3cb..798fcb4 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -4,6 +4,72 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>

+#define iterate_iovec(i, n, __v, __p, skip, STEP) { \
+ size_t left; \
+ size_t wanted = n; \
+ __p = i->iov; \
+ __v.iov_len = min(n, __p->iov_len - skip); \
+ if (likely(__v.iov_len)) { \
+ __v.iov_base = __p->iov_base + skip; \
+ left = (STEP); \
+ __v.iov_len -= left; \
+ skip += __v.iov_len; \
+ n -= __v.iov_len; \
+ } else { \
+ left = 0; \
+ } \
+ while (unlikely(!left && n)) { \
+ __p++; \
+ __v.iov_len = min(n, __p->iov_len); \
+ if (unlikely(!__v.iov_len)) \
+ continue; \
+ __v.iov_base = __p->iov_base; \
+ left = (STEP); \
+ __v.iov_len -= left; \
+ skip = __v.iov_len; \
+ n -= __v.iov_len; \
+ } \
+ n = wanted - n; \
+}
+
+#define iterate_bvec(i, n, __v, __p, skip, STEP) { \
+ size_t wanted = n; \
+ __p = i->bvec; \
+ __v.bv_len = min_t(size_t, n, __p->bv_len - skip); \
+ if (likely(__v.bv_len)) { \
+ __v.bv_page = __p->bv_page; \
+ __v.bv_offset = __p->bv_offset + skip; \
+ (void)(STEP); \
+ skip += __v.bv_len; \
+ n -= __v.bv_len; \
+ } \
+ while (unlikely(n)) { \
+ __p++; \
+ __v.bv_len = min_t(size_t, n, __p->bv_len); \
+ if (unlikely(!__v.bv_len)) \
+ continue; \
+ __v.bv_page = __p->bv_page; \
+ __v.bv_offset = __p->bv_offset; \
+ (void)(STEP); \
+ skip = __v.bv_len; \
+ n -= __v.bv_len; \
+ } \
+ n = wanted; \
+}
+
+#define iterate_all_kinds(i, n, v, I, B) { \
+ size_t skip = i->iov_offset; \
+ if (unlikely(i->type & ITER_BVEC)) { \
+ const struct bio_vec *bvec; \
+ struct bio_vec v; \
+ iterate_bvec(i, n, v, bvec, skip, (B)) \
+ } else { \
+ const struct iovec *iov; \
+ struct iovec v; \
+ iterate_iovec(i, n, v, iov, skip, (I)) \
+ } \
+}
+
static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, left, wanted;
@@ -300,54 +366,6 @@ static size_t zero_iovec(size_t bytes, struct iov_iter *i)
return wanted - bytes;
}

-static size_t __iovec_copy_from_user_inatomic(char *vaddr,
- const struct iovec *iov, size_t base, size_t bytes)
-{
- size_t copied = 0, left = 0;
-
- while (bytes) {
- char __user *buf = iov->iov_base + base;
- int copy = min(bytes, iov->iov_len - base);
-
- base = 0;
- left = __copy_from_user_inatomic(vaddr, buf, copy);
- copied += copy;
- bytes -= copy;
- vaddr += copy;
- iov++;
-
- if (unlikely(left))
- break;
- }
- return copied - left;
-}
-
-/*
- * Copy as much as we can into the page and return the number of bytes which
- * were successfully copied. If a fault is encountered then return the number of
- * bytes which were copied.
- */
-static size_t copy_from_user_atomic_iovec(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes)
-{
- char *kaddr;
- size_t copied;
-
- kaddr = kmap_atomic(page);
- if (likely(i->nr_segs == 1)) {
- int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
- left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
- copied = bytes - left;
- } else {
- copied = __iovec_copy_from_user_inatomic(kaddr + offset,
- i->iov, i->iov_offset, bytes);
- }
- kunmap_atomic(kaddr);
-
- return copied;
-}
-
static void advance_iovec(struct iov_iter *i, size_t bytes)
{
BUG_ON(i->count < bytes);
@@ -404,30 +422,6 @@ int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
}
EXPORT_SYMBOL(iov_iter_fault_in_readable);

-static unsigned long alignment_iovec(const struct iov_iter *i)
-{
- const struct iovec *iov = i->iov;
- unsigned long res;
- size_t size = i->count;
- size_t n;
-
- if (!size)
- return 0;
-
- res = (unsigned long)iov->iov_base + i->iov_offset;
- n = iov->iov_len - i->iov_offset;
- if (n >= size)
- return res | size;
- size -= n;
- res |= n;
- while (size > (++iov)->iov_len) {
- res |= (unsigned long)iov->iov_base | iov->iov_len;
- size -= iov->iov_len;
- }
- res |= (unsigned long)iov->iov_base | size;
- return res;
-}
-
void iov_iter_init(struct iov_iter *i, int direction,
const struct iovec *iov, unsigned long nr_segs,
size_t count)
@@ -691,28 +685,6 @@ static size_t zero_bvec(size_t bytes, struct iov_iter *i)
return wanted - bytes;
}

-static size_t copy_from_user_bvec(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes)
-{
- char *kaddr;
- size_t left;
- const struct bio_vec *bvec;
- size_t base = i->iov_offset;
-
- kaddr = kmap_atomic(page);
- for (left = bytes, bvec = i->bvec; left; bvec++, base = 0) {
- size_t copy = min(left, bvec->bv_len - base);
- if (!bvec->bv_len)
- continue;
- memcpy_from_page(kaddr + offset, bvec->bv_page,
- bvec->bv_offset + base, copy);
- offset += copy;
- left -= copy;
- }
- kunmap_atomic(kaddr);
- return bytes;
-}
-
static void advance_bvec(struct iov_iter *i, size_t bytes)
{
BUG_ON(i->count < bytes);
@@ -749,30 +721,6 @@ static void advance_bvec(struct iov_iter *i, size_t bytes)
}
}

-static unsigned long alignment_bvec(const struct iov_iter *i)
-{
- const struct bio_vec *bvec = i->bvec;
- unsigned long res;
- size_t size = i->count;
- size_t n;
-
- if (!size)
- return 0;
-
- res = bvec->bv_offset + i->iov_offset;
- n = bvec->bv_len - i->iov_offset;
- if (n >= size)
- return res | size;
- size -= n;
- res |= n;
- while (size > (++bvec)->bv_len) {
- res |= bvec->bv_offset | bvec->bv_len;
- size -= bvec->bv_len;
- }
- res |= bvec->bv_offset | size;
- return res;
-}
-
static ssize_t get_pages_bvec(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
size_t *start)
@@ -887,10 +835,15 @@ EXPORT_SYMBOL(iov_iter_zero);
size_t iov_iter_copy_from_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
- if (i->type & ITER_BVEC)
- return copy_from_user_bvec(page, i, offset, bytes);
- else
- return copy_from_user_atomic_iovec(page, i, offset, bytes);
+ char *kaddr = kmap_atomic(page), *p = kaddr + offset;
+ iterate_all_kinds(i, bytes, v,
+ __copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
+ v.iov_base, v.iov_len),
+ memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
+ v.bv_offset, v.bv_len)
+ )
+ kunmap_atomic(kaddr);
+ return bytes;
}
EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);

@@ -919,10 +872,17 @@ EXPORT_SYMBOL(iov_iter_single_seg_count);

unsigned long iov_iter_alignment(const struct iov_iter *i)
{
- if (i->type & ITER_BVEC)
- return alignment_bvec(i);
- else
- return alignment_iovec(i);
+ unsigned long res = 0;
+ size_t size = i->count;
+
+ if (!size)
+ return 0;
+
+ iterate_all_kinds(i, size, v,
+ (res |= (unsigned long)v.iov_base | v.iov_len, 0),
+ res |= v.bv_offset | v.bv_len
+ )
+ return res;
}
EXPORT_SYMBOL(iov_iter_alignment);

--
2.1.3

2014-12-04 20:23:21

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 02/13] iov_iter.c: iterate_and_advance

From: Al Viro <[email protected]>

same as iterate_all_kinds, but iterator is moved to the position past
the last byte we'd handled.

iov_iter_advance() converted to it

Signed-off-by: Al Viro <[email protected]>
---
mm/iov_iter.c | 104 ++++++++++++++++------------------------------------------
1 file changed, 28 insertions(+), 76 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 798fcb4..e91bf0a 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -70,6 +70,33 @@
} \
}

+#define iterate_and_advance(i, n, v, I, B) { \
+ size_t skip = i->iov_offset; \
+ if (unlikely(i->type & ITER_BVEC)) { \
+ const struct bio_vec *bvec; \
+ struct bio_vec v; \
+ iterate_bvec(i, n, v, bvec, skip, (B)) \
+ if (skip == bvec->bv_len) { \
+ bvec++; \
+ skip = 0; \
+ } \
+ i->nr_segs -= bvec - i->bvec; \
+ i->bvec = bvec; \
+ } else { \
+ const struct iovec *iov; \
+ struct iovec v; \
+ iterate_iovec(i, n, v, iov, skip, (I)) \
+ if (skip == iov->iov_len) { \
+ iov++; \
+ skip = 0; \
+ } \
+ i->nr_segs -= iov - i->iov; \
+ i->iov = iov; \
+ } \
+ i->count -= n; \
+ i->iov_offset = skip; \
+}
+
static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, left, wanted;
@@ -366,42 +393,6 @@ static size_t zero_iovec(size_t bytes, struct iov_iter *i)
return wanted - bytes;
}

-static void advance_iovec(struct iov_iter *i, size_t bytes)
-{
- BUG_ON(i->count < bytes);
-
- if (likely(i->nr_segs == 1)) {
- i->iov_offset += bytes;
- i->count -= bytes;
- } else {
- const struct iovec *iov = i->iov;
- size_t base = i->iov_offset;
- unsigned long nr_segs = i->nr_segs;
-
- /*
- * The !iov->iov_len check ensures we skip over unlikely
- * zero-length segments (without overruning the iovec).
- */
- while (bytes || unlikely(i->count && !iov->iov_len)) {
- int copy;
-
- copy = min(bytes, iov->iov_len - base);
- BUG_ON(!i->count || i->count < copy);
- i->count -= copy;
- bytes -= copy;
- base += copy;
- if (iov->iov_len == base) {
- iov++;
- nr_segs--;
- base = 0;
- }
- }
- i->iov = iov;
- i->iov_offset = base;
- i->nr_segs = nr_segs;
- }
-}
-
/*
* Fault in the first iovec of the given iov_iter, to a maximum length
* of bytes. Returns 0 on success, or non-zero if the memory could not be
@@ -685,42 +676,6 @@ static size_t zero_bvec(size_t bytes, struct iov_iter *i)
return wanted - bytes;
}

-static void advance_bvec(struct iov_iter *i, size_t bytes)
-{
- BUG_ON(i->count < bytes);
-
- if (likely(i->nr_segs == 1)) {
- i->iov_offset += bytes;
- i->count -= bytes;
- } else {
- const struct bio_vec *bvec = i->bvec;
- size_t base = i->iov_offset;
- unsigned long nr_segs = i->nr_segs;
-
- /*
- * The !iov->iov_len check ensures we skip over unlikely
- * zero-length segments (without overruning the iovec).
- */
- while (bytes || unlikely(i->count && !bvec->bv_len)) {
- int copy;
-
- copy = min(bytes, bvec->bv_len - base);
- BUG_ON(!i->count || i->count < copy);
- i->count -= copy;
- bytes -= copy;
- base += copy;
- if (bvec->bv_len == base) {
- bvec++;
- nr_segs--;
- base = 0;
- }
- }
- i->bvec = bvec;
- i->iov_offset = base;
- i->nr_segs = nr_segs;
- }
-}
-
static ssize_t get_pages_bvec(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
size_t *start)
@@ -849,10 +804,7 @@ EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);

void iov_iter_advance(struct iov_iter *i, size_t size)
{
- if (i->type & ITER_BVEC)
- advance_bvec(i, size);
- else
- advance_iovec(i, size);
+ iterate_and_advance(i, size, v, 0, 0)
}
EXPORT_SYMBOL(iov_iter_advance);

--
2.1.3

2014-12-04 20:26:13

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 13/13] copy_from_iter_nocache()

From: Al Viro <[email protected]>

BTW, do we want memcpy_nocache()?

Signed-off-by: Al Viro <[email protected]>
---
include/linux/uio.h | 1 +
mm/iov_iter.c | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index c567655..bd8569a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -83,6 +83,7 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
size_t iov_iter_zero(size_t bytes, struct iov_iter *);
unsigned long iov_iter_alignment(const struct iov_iter *i);
void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 7c04051..e0605c1 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -399,6 +399,27 @@ size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
}
EXPORT_SYMBOL(copy_from_iter);

+size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
+{
+ char *to = addr;
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ iterate_and_advance(i, bytes, v,
+ __copy_from_user_nocache((to += v.iov_len) - v.iov_len,
+ v.iov_base, v.iov_len),
+ memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+ v.bv_offset, v.bv_len),
+ memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+ )
+
+ return bytes;
+}
+EXPORT_SYMBOL(copy_from_iter_nocache);
+
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
--
2.1.3

2014-12-04 20:26:11

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 12/13] new helper: iov_iter_kvec()

From: Al Viro <[email protected]>

initialization of kvec-backed iov_iter

Signed-off-by: Al Viro <[email protected]>
---
include/linux/uio.h | 2 ++
mm/iov_iter.c | 13 +++++++++++++
2 files changed, 15 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 28ed2d9..c567655 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -87,6 +87,8 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *);
unsigned long iov_iter_alignment(const struct iov_iter *i);
void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
unsigned long nr_segs, size_t count);
+void iov_iter_kvec(struct iov_iter *i, int direction, const struct kvec *iov,
+ unsigned long nr_segs, size_t count);
ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 5a3b4a8..7c04051 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -479,6 +479,19 @@ size_t iov_iter_single_seg_count(const struct iov_iter *i)
}
EXPORT_SYMBOL(iov_iter_single_seg_count);

+void iov_iter_kvec(struct iov_iter *i, int direction,
+ const struct kvec *iov, unsigned long nr_segs,
+ size_t count)
+{
+ BUG_ON(!(direction & ITER_KVEC));
+ i->type = direction;
+ i->kvec = (struct kvec *)iov;
+ i->nr_segs = nr_segs;
+ i->iov_offset = 0;
+ i->count = count;
+}
+EXPORT_SYMBOL(iov_iter_kvec);
+
unsigned long iov_iter_alignment(const struct iov_iter *i)
{
unsigned long res = 0;
--
2.1.3

2014-12-04 20:27:14

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 10/13] iov_iter.c: handle ITER_KVEC directly

From: Al Viro <[email protected]>

... without bothering with copy_..._user()

Signed-off-by: Al Viro <[email protected]>
---
include/linux/uio.h | 1 +
mm/iov_iter.c | 101 +++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 89 insertions(+), 13 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9b15814..6e16945 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -31,6 +31,7 @@ struct iov_iter {
size_t count;
union {
const struct iovec *iov;
+ const struct kvec *kvec;
const struct bio_vec *bvec;
};
unsigned long nr_segs;
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 66665449..d74de6d 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -32,6 +32,29 @@
n = wanted - n; \
}

+#define iterate_kvec(i, n, __v, __p, skip, STEP) { \
+ size_t wanted = n; \
+ __p = i->kvec; \
+ __v.iov_len = min(n, __p->iov_len - skip); \
+ if (likely(__v.iov_len)) { \
+ __v.iov_base = __p->iov_base + skip; \
+ (void)(STEP); \
+ skip += __v.iov_len; \
+ n -= __v.iov_len; \
+ } \
+ while (unlikely(n)) { \
+ __p++; \
+ __v.iov_len = min(n, __p->iov_len); \
+ if (unlikely(!__v.iov_len)) \
+ continue; \
+ __v.iov_base = __p->iov_base; \
+ (void)(STEP); \
+ skip = __v.iov_len; \
+ n -= __v.iov_len; \
+ } \
+ n = wanted; \
+}
+
#define iterate_bvec(i, n, __v, __p, skip, STEP) { \
size_t wanted = n; \
__p = i->bvec; \
@@ -57,12 +80,16 @@
n = wanted; \
}

-#define iterate_all_kinds(i, n, v, I, B) { \
+#define iterate_all_kinds(i, n, v, I, B, K) { \
size_t skip = i->iov_offset; \
if (unlikely(i->type & ITER_BVEC)) { \
const struct bio_vec *bvec; \
struct bio_vec v; \
iterate_bvec(i, n, v, bvec, skip, (B)) \
+ } else if (unlikely(i->type & ITER_KVEC)) { \
+ const struct kvec *kvec; \
+ struct kvec v; \
+ iterate_kvec(i, n, v, kvec, skip, (K)) \
} else { \
const struct iovec *iov; \
struct iovec v; \
@@ -70,7 +97,7 @@
} \
}

-#define iterate_and_advance(i, n, v, I, B) { \
+#define iterate_and_advance(i, n, v, I, B, K) { \
size_t skip = i->iov_offset; \
if (unlikely(i->type & ITER_BVEC)) { \
const struct bio_vec *bvec; \
@@ -82,6 +109,16 @@
} \
i->nr_segs -= bvec - i->bvec; \
i->bvec = bvec; \
+ } else if (unlikely(i->type & ITER_KVEC)) { \
+ const struct kvec *kvec; \
+ struct kvec v; \
+ iterate_kvec(i, n, v, kvec, skip, (K)) \
+ if (skip == kvec->iov_len) { \
+ kvec++; \
+ skip = 0; \
+ } \
+ i->nr_segs -= kvec - i->kvec; \
+ i->kvec = kvec; \
} else { \
const struct iovec *iov; \
struct iovec v; \
@@ -270,7 +307,7 @@ done:
*/
int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
{
- if (!(i->type & ITER_BVEC)) {
+ if (!(i->type & (ITER_BVEC|ITER_KVEC))) {
char __user *buf = i->iov->iov_base + i->iov_offset;
bytes = min(bytes, i->iov->iov_len - i->iov_offset);
return fault_in_pages_readable(buf, bytes);
@@ -284,10 +321,14 @@ void iov_iter_init(struct iov_iter *i, int direction,
size_t count)
{
/* It will get better. Eventually... */
- if (segment_eq(get_fs(), KERNEL_DS))
+ if (segment_eq(get_fs(), KERNEL_DS)) {
direction |= ITER_KVEC;
- i->type = direction;
- i->iov = iov;
+ i->type = direction;
+ i->kvec = (struct kvec *)iov;
+ } else {
+ i->type = direction;
+ i->iov = iov;
+ }
i->nr_segs = nr_segs;
i->iov_offset = 0;
i->count = count;
@@ -328,7 +369,8 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
v.iov_len),
memcpy_to_page(v.bv_page, v.bv_offset,
- (from += v.bv_len) - v.bv_len, v.bv_len)
+ (from += v.bv_len) - v.bv_len, v.bv_len),
+ memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
)

return bytes;
@@ -348,7 +390,8 @@ size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
v.iov_len),
memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
- v.bv_offset, v.bv_len)
+ v.bv_offset, v.bv_len),
+ memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
)

return bytes;
@@ -371,7 +414,7 @@ EXPORT_SYMBOL(copy_page_to_iter);
size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
- if (i->type & ITER_BVEC) {
+ if (i->type & (ITER_BVEC|ITER_KVEC)) {
void *kaddr = kmap_atomic(page);
size_t wanted = copy_from_iter(kaddr + offset, bytes, i);
kunmap_atomic(kaddr);
@@ -391,7 +434,8 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)

iterate_and_advance(i, bytes, v,
__clear_user(v.iov_base, v.iov_len),
- memzero_page(v.bv_page, v.bv_offset, v.bv_len)
+ memzero_page(v.bv_page, v.bv_offset, v.bv_len),
+ memset(v.iov_base, 0, v.iov_len)
)

return bytes;
@@ -406,7 +450,8 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
v.iov_base, v.iov_len),
memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
- v.bv_offset, v.bv_len)
+ v.bv_offset, v.bv_len),
+ memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
)
kunmap_atomic(kaddr);
return bytes;
@@ -415,7 +460,7 @@ EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);

void iov_iter_advance(struct iov_iter *i, size_t size)
{
- iterate_and_advance(i, size, v, 0, 0)
+ iterate_and_advance(i, size, v, 0, 0, 0)
}
EXPORT_SYMBOL(iov_iter_advance);

@@ -443,7 +488,8 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)

iterate_all_kinds(i, size, v,
(res |= (unsigned long)v.iov_base | v.iov_len, 0),
- res |= v.bv_offset | v.bv_len
+ res |= v.bv_offset | v.bv_len,
+ res |= (unsigned long)v.iov_base | v.iov_len
)
return res;
}
@@ -478,6 +524,16 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
*start = v.bv_offset;
get_page(*pages = v.bv_page);
return v.bv_len;
+ }),({
+ unsigned long addr = (unsigned long)v.iov_base, end;
+ size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+
+ if (len > maxpages * PAGE_SIZE)
+ len = maxpages * PAGE_SIZE;
+ addr &= ~(PAGE_SIZE - 1);
+ for (end = addr + len; addr < end; addr += PAGE_SIZE)
+ get_page(*pages++ = virt_to_page(addr));
+ return len - *start;
})
)
return 0;
@@ -530,6 +586,19 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return -ENOMEM;
get_page(*p = v.bv_page);
return v.bv_len;
+ }),({
+ unsigned long addr = (unsigned long)v.iov_base, end;
+ size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+ int n;
+
+ addr &= ~(PAGE_SIZE - 1);
+ n = DIV_ROUND_UP(len, PAGE_SIZE);
+ *pages = p = get_pages_array(n);
+ if (!p)
+ return -ENOMEM;
+ for (end = addr + len; addr < end; addr += PAGE_SIZE)
+ get_page(*p++ = virt_to_page(addr));
+ return len - *start;
})
)
return 0;
@@ -554,6 +623,12 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
npages++;
if (npages >= maxpages)
return maxpages;
+ }),({
+ unsigned long p = (unsigned long)v.iov_base;
+ npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
+ - p / PAGE_SIZE;
+ if (npages >= maxpages)
+ return maxpages;
})
)
return npages;
--
2.1.3

2014-12-04 20:27:40

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 11/13] csum_and_copy_..._iter()

From: Al Viro <[email protected]>

Signed-off-by: Al Viro <[email protected]>
---
include/linux/uio.h | 2 ++
mm/iov_iter.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 91 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 6e16945..28ed2d9 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -124,6 +124,8 @@ static inline void iov_iter_reexpand(struct iov_iter *i, size_t count)
{
i->count = count;
}
+size_t csum_and_copy_to_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
+size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);

int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len);
int memcpy_toiovec(struct iovec *iov, unsigned char *kdata, int len);
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index d74de6d..5a3b4a8 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -3,6 +3,7 @@
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
+#include <net/checksum.h>

#define iterate_iovec(i, n, __v, __p, skip, STEP) { \
size_t left; \
@@ -605,6 +606,94 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
}
EXPORT_SYMBOL(iov_iter_get_pages_alloc);

+size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
+ struct iov_iter *i)
+{
+ char *to = addr;
+ __wsum sum, next;
+ size_t off = 0;
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ sum = *csum;
+ iterate_and_advance(i, bytes, v, ({
+ int err = 0;
+ next = csum_and_copy_from_user(v.iov_base,
+ (to += v.iov_len) - v.iov_len,
+ v.iov_len, 0, &err);
+ if (!err) {
+ sum = csum_block_add(sum, next, off);
+ off += v.iov_len;
+ }
+ err ? v.iov_len : 0;
+ }), ({
+ char *p = kmap_atomic(v.bv_page);
+ next = csum_partial_copy_nocheck(p + v.bv_offset,
+ (to += v.bv_len) - v.bv_len,
+ v.bv_len, 0);
+ kunmap_atomic(p);
+ sum = csum_block_add(sum, next, off);
+ off += v.bv_len;
+ }),({
+ next = csum_partial_copy_nocheck(v.iov_base,
+ (to += v.iov_len) - v.iov_len,
+ v.iov_len, 0);
+ sum = csum_block_add(sum, next, off);
+ off += v.iov_len;
+ })
+ )
+ *csum = sum;
+ return bytes;
+}
+EXPORT_SYMBOL(csum_and_copy_from_iter);
+
+size_t csum_and_copy_to_iter(void *addr, size_t bytes, __wsum *csum,
+ struct iov_iter *i)
+{
+ char *from = addr;
+ __wsum sum, next;
+ size_t off = 0;
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ sum = *csum;
+ iterate_and_advance(i, bytes, v, ({
+ int err = 0;
+ next = csum_and_copy_to_user((from += v.iov_len) - v.iov_len,
+ v.iov_base,
+ v.iov_len, 0, &err);
+ if (!err) {
+ sum = csum_block_add(sum, next, off);
+ off += v.iov_len;
+ }
+ err ? v.iov_len : 0;
+ }), ({
+ char *p = kmap_atomic(v.bv_page);
+ next = csum_partial_copy_nocheck((from += v.bv_len) - v.bv_len,
+ p + v.bv_offset,
+ v.bv_len, 0);
+ kunmap_atomic(p);
+ sum = csum_block_add(sum, next, off);
+ off += v.bv_len;
+ }),({
+ next = csum_partial_copy_nocheck((from += v.iov_len) - v.iov_len,
+ v.iov_base,
+ v.iov_len, 0);
+ sum = csum_block_add(sum, next, off);
+ off += v.iov_len;
+ })
+ )
+ *csum = sum;
+ return bytes;
+}
+EXPORT_SYMBOL(csum_and_copy_to_iter);
+
int iov_iter_npages(const struct iov_iter *i, int maxpages)
{
size_t size = i->count;
--
2.1.3

2014-12-04 20:27:39

by Al Viro

[permalink] [raw]
Subject: [RFC][PATCH 08/13] iov_iter.c: convert copy_from_iter() to iterate_and_advance

From: Al Viro <[email protected]>

Signed-off-by: Al Viro <[email protected]>
---
mm/iov_iter.c | 106 +++++++++-------------------------------------------------
1 file changed, 15 insertions(+), 91 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 17b7144..791429d 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -142,51 +142,6 @@ static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
return wanted - bytes;
}

-static size_t copy_from_iter_iovec(void *to, size_t bytes, struct iov_iter *i)
-{
- size_t skip, copy, left, wanted;
- const struct iovec *iov;
- char __user *buf;
-
- if (unlikely(bytes > i->count))
- bytes = i->count;
-
- if (unlikely(!bytes))
- return 0;
-
- wanted = bytes;
- iov = i->iov;
- skip = i->iov_offset;
- buf = iov->iov_base + skip;
- copy = min(bytes, iov->iov_len - skip);
-
- left = __copy_from_user(to, buf, copy);
- copy -= left;
- skip += copy;
- to += copy;
- bytes -= copy;
- while (unlikely(!left && bytes)) {
- iov++;
- buf = iov->iov_base;
- copy = min(bytes, iov->iov_len);
- left = __copy_from_user(to, buf, copy);
- copy -= left;
- skip = copy;
- to += copy;
- bytes -= copy;
- }
-
- if (skip == iov->iov_len) {
- iov++;
- skip = 0;
- }
- i->count -= wanted - bytes;
- i->nr_segs -= iov - i->iov;
- i->iov = iov;
- i->iov_offset = skip;
- return wanted - bytes;
-}
-
static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
@@ -444,48 +399,6 @@ static size_t copy_to_iter_bvec(void *from, size_t bytes, struct iov_iter *i)
return wanted - bytes;
}

-static size_t copy_from_iter_bvec(void *to, size_t bytes, struct iov_iter *i)
-{
- size_t skip, copy, wanted;
- const struct bio_vec *bvec;
-
- if (unlikely(bytes > i->count))
- bytes = i->count;
-
- if (unlikely(!bytes))
- return 0;
-
- wanted = bytes;
- bvec = i->bvec;
- skip = i->iov_offset;
-
- copy = min(bytes, bvec->bv_len - skip);
-
- memcpy_from_page(to, bvec->bv_page, bvec->bv_offset + skip, copy);
-
- to += copy;
- skip += copy;
- bytes -= copy;
-
- while (bytes) {
- bvec++;
- copy = min(bytes, (size_t)bvec->bv_len);
- memcpy_from_page(to, bvec->bv_page, bvec->bv_offset, copy);
- skip = copy;
- to += copy;
- bytes -= copy;
- }
- if (skip == bvec->bv_len) {
- bvec++;
- skip = 0;
- }
- i->count -= wanted;
- i->nr_segs -= bvec - i->bvec;
- i->bvec = bvec;
- i->iov_offset = skip;
- return wanted;
-}
-
size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
{
if (i->type & ITER_BVEC)
@@ -497,10 +410,21 @@ EXPORT_SYMBOL(copy_to_iter);

size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
{
- if (i->type & ITER_BVEC)
- return copy_from_iter_bvec(addr, bytes, i);
- else
- return copy_from_iter_iovec(addr, bytes, i);
+ char *to = addr;
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ iterate_and_advance(i, bytes, v,
+ __copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
+ v.iov_len),
+ memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+ v.bv_offset, v.bv_len)
+ )
+
+ return bytes;
}
EXPORT_SYMBOL(copy_from_iter);

--
2.1.3

2014-12-05 12:28:19

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: [RFC][PATCH 07/13] iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()

Hello.

On 12/4/2014 11:23 PM, Al Viro wrote:

Looks like you've erred in the subject as you seem to be getting rid of
copy_page_{to|from}_iter_bvec() instead of bvec_copy_page_{to|from}_iter()...

> From: Al Viro <[email protected]>

> Just have copy_page_{to,from}_iter() fall back to kmap_atomic +
> copy_{to,from}_iter() + kunmap_atomic() in ITER_BVEC case. As
> the matter of fact, that's what we want to do for any iov_iter
> kind that isn't blocking - e.g. ITER_KVEC will also go that way
> once we recognize it on iov_iter.c primitives level

> Signed-off-by: Al Viro <[email protected]>
> ---
> mm/iov_iter.c | 60 ++++++++++++++++++++++++-----------------------------------
> 1 file changed, 24 insertions(+), 36 deletions(-)

> diff --git a/mm/iov_iter.c b/mm/iov_iter.c
> index 39ad713..17b7144 100644
> --- a/mm/iov_iter.c
> +++ b/mm/iov_iter.c
> @@ -486,30 +486,33 @@ static size_t copy_from_iter_bvec(void *to, size_t bytes, struct iov_iter *i)
> return wanted;
> }
>
> -static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
> - size_t bytes, struct iov_iter *i)
> +size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
> {
> - void *kaddr = kmap_atomic(page);
> - size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);
> - kunmap_atomic(kaddr);
> - return wanted;
> + if (i->type & ITER_BVEC)
> + return copy_to_iter_bvec(addr, bytes, i);
> + else
> + return copy_to_iter_iovec(addr, bytes, i);
> }
> +EXPORT_SYMBOL(copy_to_iter);
>
> -static size_t copy_page_from_iter_bvec(struct page *page, size_t offset,
> - size_t bytes, struct iov_iter *i)
> +size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
> {
> - void *kaddr = kmap_atomic(page);
> - size_t wanted = copy_from_iter_bvec(kaddr + offset, bytes, i);
> - kunmap_atomic(kaddr);
> - return wanted;
> + if (i->type & ITER_BVEC)
> + return copy_from_iter_bvec(addr, bytes, i);
> + else
> + return copy_from_iter_iovec(addr, bytes, i);
> }
> +EXPORT_SYMBOL(copy_from_iter);
[...]

WBR, Sergei

2014-12-08 16:46:59

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Thu, Dec 04, 2014 at 08:20:11PM +0000, Al Viro wrote:
> First of all, I want to apologize for the nastiness of preprocessor
> use in this series. Seeing that the whole "macros that look like new kinds
> of C statements" thing (including list_for_each_...(), etc) is very much not
> to my liking, I really don't trust my taste on finer details and I'd very
> much like some feedback.
>
> The reason for doing that kind of tricks is that iov_iter.c keeps
> growing more and more boilerplate code. For iov_iter-net series we need
> * csum_and_copy_from_iter()
> * csum_and_copy_to_iter()
> * copy_from_iter_nocache()
> That's 3 new primitives, each in 2 variants (iovec and bvec).
> * ITER_KVEC handled without going through uaccess.h stuff (and
> independent of set_fs() state).
> And *that* means 3 variants intstead of 2 for most of the existing primitives.
> That's far too much, and the amount of copies of the same logics would pretty
> much guarantee that it will be a breeding ground for hard-to-kill bugs.
>
> The following series (also in vfs.git#iov_iter) actually manages to
> do all of the above *and* shrink the damn thing quite a bit. The generated
> code appears to be no worse than before. The price is a couple of iterator
> macros - iterate_all_kinds() and iterate_and_advance(). They are given an
> iov_iter, size (i.e. the amount of data in iov_iter beginning we want to go
> through), name of the loop variable and 3 variants of loop body - for iovec,
> bvec and kvec resp. Loop variable is declared *inside* the expansion of those
> suckers according to the kind of iov_iter - it's struct iovec, struct bio_vec
> or struct kvec, covering the current range to deal with.
> The difference between those two is that iterate_and_advance() will
> advance the iov_iter by the amount it has handled and iterate_all_kinds()
> will leave iov_iter unchanged.
>
> Unless I hear anybody yelling, it goes into vfs.git#for-next today,
> so if you have objections, suggestions, etc., give those *now*.
>
> Al Viro (13):
> iov_iter.c: macros for iterating over iov_iter
> iov_iter.c: iterate_and_advance
> iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
> iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
> iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
> iov_iter.c: convert iov_iter_zero() to iterate_and_advance
> iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
> iov_iter.c: convert copy_from_iter() to iterate_and_advance
> iov_iter.c: convert copy_to_iter() to iterate_and_advance
> iov_iter.c: handle ITER_KVEC directly
> csum_and_copy_..._iter()
> new helper: iov_iter_kvec()
> copy_from_iter_nocache()

I guess this crash is related to the patchset.

[ 102.337742] ------------[ cut here ]------------
[ 102.338270] kernel BUG at /home/kas/git/public/linux-next/arch/x86/mm/physaddr.c:26!
[ 102.339043] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 102.339622] Modules linked in:
[ 102.339951] CPU: 2 PID: 6029 Comm: trinity-c23 Not tainted 3.18.0-next-20141208-00036-gc7edb4791544-dirty #269
[ 102.340011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 102.340011] task: ffff880041c51510 ti: ffff880049c70000 task.ti: ffff880049c70000
[ 102.340011] RIP: 0010:[<ffffffff8104d8c0>] [<ffffffff8104d8c0>] __phys_addr+0x40/0x60
[ 102.340011] RSP: 0018:ffff880049c73838 EFLAGS: 00010206
[ 102.340011] RAX: 00004100174b4000 RBX: ffff880049c73b08 RCX: 0000000000000028
[ 102.340011] RDX: 0000000000000041 RSI: ffff88015dc980a8 RDI: ffffc900174b4000
[ 102.340011] RBP: ffff880049c73838 R08: ffffc900174b4000 R09: 000000000000000c
[ 102.340011] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88015dc980a8
[ 102.340011] R13: ffffc900174f4000 R14: ffffea0000000000 R15: ffffc900174b4000
[ 102.340011] FS: 00007fcdd37fb700(0000) GS:ffff88017aa00000(0000) knlGS:0000000000000000
[ 102.340011] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 102.340011] CR2: 00000000006246a8 CR3: 0000000068dbb000 CR4: 00000000001406e0
[ 102.340011] Stack:
[ 102.340011] ffff880049c73888 ffffffff8117a96b 0000000000000002 0000000000040000
[ 102.340011] ffffffff81204b3f ffff88015dc98028 0000000000000000 ffff880049c73a78
[ 102.340011] ffff88015dc98000 ffff880049c73a10 ffff880049c73978 ffffffff81203ec9
[ 102.340011] Call Trace:
[ 102.340011] [<ffffffff8117a96b>] iov_iter_get_pages+0x17b/0x390
[ 102.340011] [<ffffffff81204b3f>] ? __blockdev_direct_IO+0x15f/0x16a0
[ 102.340011] [<ffffffff81203ec9>] do_direct_IO+0x10a9/0x1bc0
[ 102.340011] [<ffffffff810a92d8>] ? lockdep_init_map+0x68/0x5c0
[ 102.340011] [<ffffffff81204d6c>] __blockdev_direct_IO+0x38c/0x16a0
[ 102.340011] [<ffffffff810a9d27>] ? __lock_acquire+0x4f7/0xd40
[ 102.340011] [<ffffffff810a73f9>] ? mark_held_locks+0x79/0xb0
[ 102.340011] [<ffffffff8133b510>] ? xfs_get_blocks+0x20/0x20
[ 102.340011] [<ffffffff81338ff0>] xfs_vm_direct_IO+0x120/0x140
[ 102.340011] [<ffffffff8133b510>] ? xfs_get_blocks+0x20/0x20
[ 102.340011] [<ffffffff810aa773>] ? lock_release_non_nested+0x203/0x3d0
[ 102.340011] [<ffffffff8135b9a7>] ? xfs_ilock+0x167/0x2e0
[ 102.340011] [<ffffffff8114d957>] generic_file_read_iter+0x517/0x6a0
[ 102.340011] [<ffffffff810a73f9>] ? mark_held_locks+0x79/0xb0
[ 102.340011] [<ffffffff8185e192>] ? __mutex_unlock_slowpath+0xb2/0x190
[ 102.340011] [<ffffffff8134bc2f>] xfs_file_read_iter+0x12f/0x460
[ 102.340011] [<ffffffff811c237e>] new_sync_read+0x7e/0xb0
[ 102.340011] [<ffffffff811c3528>] __vfs_read+0x18/0x50
[ 102.340011] [<ffffffff811c35ed>] vfs_read+0x8d/0x150
[ 102.340011] [<ffffffff811c89e8>] kernel_read+0x48/0x60
[ 102.340011] [<ffffffff810f4a92>] copy_module_from_fd.isra.51+0x112/0x170
[ 102.340011] [<ffffffff810f7646>] SyS_finit_module+0x56/0x80
[ 102.340011] [<ffffffff81861f92>] system_call_fastpath+0x12/0x17
[ 102.340011] Code: 00 00 00 00 00 78 00 00 48 01 f8 48 39 c2 72 1b 0f b6 0d 9d 7a ec 00 48 89 c2 48 d3 ea 48 85 d2 75 09 5d c3 0f 1f 80 00 00 00 00 <0f> 0b 66 0f 1f 44 00 00 48 89 d0 48 03 05 3e 87 dc 00 48 81 fa
[ 102.340011] RIP [<ffffffff8104d8c0>] __phys_addr+0x40/0x60
[ 102.340011] RSP <ffff880049c73838>
[ 102.371939] ---[ end trace e07368268cd6b49c ]---


--
Kirill A. Shutemov

2014-12-08 17:58:11

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 06:46:50PM +0200, Kirill A. Shutemov wrote:

> I guess this crash is related to the patchset.

Might be. Do you have a reproducer for it?

It looks like the second VIRTUAL_BUG_ON() in __phys_addr(), most likely
from __pa(), from virt_to_page(), from
unsigned long addr = (unsigned long)v.iov_base, end;
size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));

if (len > maxpages * PAGE_SIZE)
len = maxpages * PAGE_SIZE;
addr &= ~(PAGE_SIZE - 1);
for (end = addr + len; addr < end; addr += PAGE_SIZE)
get_page(*pages++ = virt_to_page(addr));
return len - *start;
in iov_iter_get_pages(). And that's ITER_KVEC case there... Further
call chain looks like dio_refill_pages(), from dio_get_page(), from
do_direct_io(), eventually from kernel_read() and finit_module(),
Presumably called on O_DIRECT descriptor...

I'll try to reproduce it here, but if you have any reliable reproducer, it
would be very welcome (and would make a useful addition to LTP and/or
xfstests).

2014-12-08 18:08:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 8, 2014 at 8:46 AM, Kirill A. Shutemov <[email protected]> wrote:
>
> I guess this crash is related to the patchset.

Sounds likely.

> [ 102.338270] kernel BUG at /home/kas/git/public/linux-next/arch/x86/mm/physaddr.c:26!

So that's

VIRTUAL_BUG_ON((x > y) || !phys_addr_valid(x));

and the code disassembles to:

0: 48 01 f8 add %rdi,%rax
3: 48 39 c2 cmp %rax,%rdx
6: 72 1b jb 0x23
8: 0f b6 0d 9d 7a ec 00 movzbl 0xec7a9d(%rip),%ecx # 0xec7aac
f: 48 89 c2 mov %rax,%rdx
12: 48 d3 ea shr %cl,%rdx
15: 48 85 d2 test %rdx,%rdx
18: 75 09 jne 0x23
1a: 5d pop %rbp
1b: c3 retq
1c: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
23:* 0f 0b ud2 <-- trapping instruction

with thre relevant registers being

> [ 102.340011] RAX: 00004100174b4000 RBX: ffff880049c73b08 RCX: 0000000000000028
> [ 102.340011] RDX: 0000000000000041 RSI: ffff88015dc980a8 RDI: ffffc900174b4000

so we've taken the second case (the %rcx value is
"boot_cpu_data.x86_phys_bits", which is that "movzbl", and the %rdx
value is the shifted value of %rax).

So %rax seems to contain 'x' at that point, which means that 'y' should be

x - (__START_KERNEL_map - PAGE_OFFSET)

which means that the _original_ address should be that plus
__START_KERNEL_map, ie just x + PAGE_OFFSET.

So it smells like the original virtual address was that
ffffc900174b4000 that we still find in %rdi.

Which is in the vmalloc address space. So somebody used a vmalloc'ed
address and tried to convert it to a physical address in order to look
up the page.

Which is not a valid operation, and the BUG_ON() is definitely proper.

Now *why* something tried to do a virt_to_page() on a vmalloc'ed
address, that I leave to others.

Linus

2014-12-08 18:08:33

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 05:58:05PM +0000, Al Viro wrote:
> It looks like the second VIRTUAL_BUG_ON() in __phys_addr(), most likely
> from __pa(), from virt_to_page(), from
> unsigned long addr = (unsigned long)v.iov_base, end;
> size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
>
> if (len > maxpages * PAGE_SIZE)
> len = maxpages * PAGE_SIZE;
> addr &= ~(PAGE_SIZE - 1);
> for (end = addr + len; addr < end; addr += PAGE_SIZE)
> get_page(*pages++ = virt_to_page(addr));
> return len - *start;
> in iov_iter_get_pages(). And that's ITER_KVEC case there... Further
> call chain looks like dio_refill_pages(), from dio_get_page(), from
> do_direct_io(), eventually from kernel_read() and finit_module(),
> Presumably called on O_DIRECT descriptor...

FWIW, virt_to_page() is probably not OK to call on an address in the
middle of vmalloc'ed area, is it? Would
for (end = addr + len; addr < end; addr += PAGE_SIZE) {
if (is_vmalloc_addr(addr))
ACCESS_ONCE(*(char *)addr);
get_page(*pages++ = virt_to_page(addr));
}
be a safe replacement for the loop in the above?

2014-12-08 18:14:10

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 10:07:55AM -0800, Linus Torvalds wrote:

> Which is in the vmalloc address space. So somebody used a vmalloc'ed
> address and tried to convert it to a physical address in order to look
> up the page.
>
> Which is not a valid operation, and the BUG_ON() is definitely proper.
>
> Now *why* something tried to do a virt_to_page() on a vmalloc'ed
> address, that I leave to others.

iov_iter_get_pages() in ITER_KVEC case, trying to avoid get_user_pages_fast()
and getting it wrong. FWIW, the reproducer is finit_module(fd, ....)
where fd has been opened with O_DIRECT. In that case we get kernel_read()
on O_DIRECT and the buffer has just been vmalloc'ed.

What's the sane way to grab struct page * for a vmalloc'ed address?

2014-12-08 18:14:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 8, 2014 at 10:08 AM, Al Viro <[email protected]> wrote:
>
> FWIW, virt_to_page() is probably not OK to call on an address in the
> middle of vmalloc'ed area, is it?

See my email that crossed yours. No it is not.

> Would
> for (end = addr + len; addr < end; addr += PAGE_SIZE) {
> if (is_vmalloc_addr(addr))
> ACCESS_ONCE(*(char *)addr);
> get_page(*pages++ = virt_to_page(addr));
> }
> be a safe replacement for the loop in the above?

No. That "ACCESS_ONCE()" does nothing. It reads a byte from 'addr' in
the vmalloc space, and might cause a page fault to make sure it's
mapped, but that is still a no-op.

You can't do "virt_to_page()" on anything but the normal 1:1 kernel
mappings (and only for non-highmem pages at that).

For a vmalloc() address, you'd have to actually walk the page tables.
Which is a f*cking horrible idea. Don't do it. We do have a
"vmalloc_to_page()" that does it, but the basic issue is that you damn
well shouldn't do IO on vmalloc'ed addresses. vmalloc'ed addresses
only exist in the first place to give a linear *virtual* mapping, if
you want physical pages you shouldn't have mixed it up with vmalloc in
the first place!

Where the hell does this crop up, and who does this insane thing
anyway? It's wrong. How did it ever work before?

Linus

2014-12-08 18:20:21

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 10:14:13AM -0800, Linus Torvalds wrote:

> For a vmalloc() address, you'd have to actually walk the page tables.
> Which is a f*cking horrible idea. Don't do it. We do have a
> "vmalloc_to_page()" that does it, but the basic issue is that you damn
> well shouldn't do IO on vmalloc'ed addresses. vmalloc'ed addresses
> only exist in the first place to give a linear *virtual* mapping, if
> you want physical pages you shouldn't have mixed it up with vmalloc in
> the first place!
>
> Where the hell does this crop up, and who does this insane thing
> anyway? It's wrong. How did it ever work before?

finit_module() with O_DIRECT descriptor. And I suspect that "not well"
is the answer - it used to call get_user_pages_fast() in that case.

I certainly had missed that insanity during the analysis - we don't do
a lot of O_DIRECT IO to/from kernel addresses of any sort... This
codepath allows it ;-/ Ability to trigger it is equivalent to ability
to run any code in kernel mode, so it's not an additional security hole,
but...

2014-12-08 18:23:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 8, 2014 at 10:14 AM, Al Viro <[email protected]> wrote:
>
> iov_iter_get_pages() in ITER_KVEC case, trying to avoid get_user_pages_fast()
> and getting it wrong. FWIW, the reproducer is finit_module(fd, ....)
> where fd has been opened with O_DIRECT. In that case we get kernel_read()
> on O_DIRECT and the buffer has just been vmalloc'ed.

Ugh. That's horrid. Do we need to even support O_DIRECT in that case?
In general, we should *not* do IO on vmalloc'ed areas, although at
least the non-O_DIRECT case where we just memcpy() it as if it came
from user space is much better.

Did this actually use to work? Or is it an issue of "the new iov_iter
is so generic that something that used to just return an error now
'works' and triggers the problem"?

> What's the sane way to grab struct page * for a vmalloc'ed address?

So "vmalloc_to_page()" should work.

However, it's actually fundamentally racy unless you can guarantee
that the vmalloc()'ed area in question is stable (so you had better
have done that allocation yourself, and be in control of the freeing,
rather than "we look up random vmalloc'ed addresses).

In general, it's really a horrible thing to use, and tends to be a big
red sign that "somebody misdesigned this badly"

Linus

2014-12-08 18:35:55

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 10:23:26AM -0800, Linus Torvalds wrote:

> Did this actually use to work? Or is it an issue of "the new iov_iter
> is so generic that something that used to just return an error now
> 'works' and triggers the problem"?

Looks like it failed with EINVAL. Which might very well be the sane
reaction - if we run into a vmalloc/module address, act as if we failed
to get that page and exit the loop.

> > What's the sane way to grab struct page * for a vmalloc'ed address?
>
> So "vmalloc_to_page()" should work.
>
> However, it's actually fundamentally racy unless you can guarantee
> that the vmalloc()'ed area in question is stable (so you had better
> have done that allocation yourself, and be in control of the freeing,
> rather than "we look up random vmalloc'ed addresses).

If vfree(buffer) races with kernel_read() into buffer, we are so badly
fucked that stability of pointers to pages is the least of our concerns...

> In general, it's really a horrible thing to use, and tends to be a big
> red sign that "somebody misdesigned this badly"

More like "nobody has thought of that case", at a guess, but then I hadn't
been involved in finit_module() design - I don't even remember the discussions
around it. That would be what, something circa 3.7?

2014-12-08 18:37:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 8, 2014 at 10:20 AM, Al Viro <[email protected]> wrote:
>
> I certainly had missed that insanity during the analysis - we don't do
> a lot of O_DIRECT IO to/from kernel addresses of any sort... This
> codepath allows it ;-/ Ability to trigger it is equivalent to ability
> to run any code in kernel mode, so it's not an additional security hole,
> but...

Is there any chance we could just return EINVAL for this case?

Who does O_DIRECT on module load anyway? If this is only for
finit_module(), that uses "kernel_read()", and maybe we could just
make sure that the kernel_read() function never ever uses the
direct-IO paths?

[ Time passes, I look at the code ]

Oh crap. So the reason it triggers seems to be that we basically get a
random file descriptor that we didn't open, and then we have

vfs_read() ->
xfs_file_operations->read() ->
ew_sync_read() ->
xfs_file_operations->read_iter()
xfs_file_read_iter()

and we are stuck with this iterator that really just wants to do copies.

How about we make "kernel_read()" just clear O_DIRECT? Does that fix
it to just use copies?

Linus

2014-12-08 18:46:39

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 10:37:51AM -0800, Linus Torvalds wrote:

> How about we make "kernel_read()" just clear O_DIRECT? Does that fix
> it to just use copies?

Umm... clearing O_DIRECT on struct file that might have other references
to it isn't nice, to put it mildly...

Frankly, stopping iov_iter_get_pages() on the first page in vmalloc/module
space looks like the sanest strategy anyway. We'd get the same behaviour
as we used to, and as for finit_module(2), well... put "fails if given
an O_DIRECT descriptor" and be done with that.

Alternatively, we can really go for
page = is_vmalloc_or_module_addr(addr) ? vmalloc_to_page(addr)
: virt_to_page(addr)
*pages++ = get_page(page);
and have the fucker work. Stability is up to the caller, of course -
reading into buffer that might get freed (and reused) under you has much
worse problems...

2014-12-08 18:56:41

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 05:58:05PM +0000, Al Viro wrote:
> On Mon, Dec 08, 2014 at 06:46:50PM +0200, Kirill A. Shutemov wrote:
>
> > I guess this crash is related to the patchset.
>
> Might be. Do you have a reproducer for it?

trinity triggers it for me in few minutes. I will try find out more once
get some time.

--
Kirill A. Shutemov

2014-12-08 18:57:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 8, 2014 at 10:46 AM, Al Viro <[email protected]> wrote:
> On Mon, Dec 08, 2014 at 10:37:51AM -0800, Linus Torvalds wrote:
>
>> How about we make "kernel_read()" just clear O_DIRECT? Does that fix
>> it to just use copies?
>
> Umm... clearing O_DIRECT on struct file that might have other references
> to it isn't nice, to put it mildly...

Yeah.

> Frankly, stopping iov_iter_get_pages() on the first page in vmalloc/module
> space looks like the sanest strategy anyway. We'd get the same behaviour
> as we used to, and as for finit_module(2), well... put "fails if given
> an O_DIRECT descriptor" and be done with that.

If it used to fail, then by all means, just make this a failure in the
new model. I really don't want to make core infrastructure silently
just call vmalloc_to_page() to make things "work".

And if it used to do "get_user_pages_fast()" then the old code really
didn't work on vmalloc ranges anyway, since that one checks for not
just _PAGE_PRESENT but also _PAGE_USER, which won't be set on a
vmalloc() page. In fact, it should have failed on *all* kernel pages.

> Alternatively, we can really go for
> page = is_vmalloc_or_module_addr(addr) ? vmalloc_to_page(addr)
> : virt_to_page(addr)
> *pages++ = get_page(page);

actually, no we cannot. Thinking some more about it, that
"get_page(page)" is wrong in _all_ cases. It actually works better for
vmalloc pages than for normal 1:1 pages, since it's actually seriously
and *horrendously* wrong for the case of random kernel addresses which
may not even be refcounted to begin with.

So the whole "get_page()" thing is broken. Iterating over pages in a
KVEC is simply wrong, wrong, wrong. It needs to fail.

Iterating over a KVEC to *copy* data is ok. But no page lookup stuff
or page reference things.

The old code that apparently used "get_user_pages_fast()" was ok
almost by mistake, because it fails on all kernel pages.

Linus

2014-12-08 19:01:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 8, 2014 at 10:56 AM, Kirill A. Shutemov
<[email protected]> wrote:
>
> trinity triggers it for me in few minutes. I will try find out more once
> get some time.

You run trinity as *root*?

You're a brave man. Stupid, but brave ;)

I guess you're running it in a VM, but still.. Doing random system
calls as root sounds like a bad bad idea.

Linus

2014-12-08 19:16:09

by Dave Jones

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 11:01:41AM -0800, Linus Torvalds wrote:
> On Mon, Dec 8, 2014 at 10:56 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > trinity triggers it for me in few minutes. I will try find out more once
> > get some time.
>
> You run trinity as *root*?
>
> You're a brave man. Stupid, but brave ;)
>
> I guess you're running it in a VM, but still.. Doing random system
> calls as root sounds like a bad bad idea.

I've flip-flopped on this a few times. I used to be solidly in the same
position as your statement, but after seeing the things the secure-boot
crowd want to lock down, there are a ton of places in the kernel that
would need additional root-proofing to avoid scribbling over kernel
memory.

In short though, yeah, expect fireworks right now, especially on bare-metal.

At the same time, just to increase coverage testing of a lot of
root-required functionality (like various network sockets that can't be
opened as a regular user) I added a --drop-privs mode to trinity a while
ago, so after the socket creation, it can't do anything _too_ crazy.

Dave

2014-12-08 19:24:06

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 11:01:41AM -0800, Linus Torvalds wrote:
> On Mon, Dec 8, 2014 at 10:56 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > trinity triggers it for me in few minutes. I will try find out more once
> > get some time.
>
> You run trinity as *root*?
>
> You're a brave man. Stupid, but brave ;)
>
> I guess you're running it in a VM, but still.. Doing random system
> calls as root sounds like a bad bad idea.

I'm doing this for a long time and didn't see any problem bigger than qemu
crash[1]. ;)

[1] http://thread.gmane.org/gmane.comp.emulators.qemu/194845/

--
Kirill A. Shutemov

2014-12-08 19:28:39

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 10:57:31AM -0800, Linus Torvalds wrote:
> So the whole "get_page()" thing is broken. Iterating over pages in a
> KVEC is simply wrong, wrong, wrong. It needs to fail.

Well, _that_ is easy to do, of course... E.g. by replacing that thing in
iov_iter_get_pages()/iov_iter_get_pages_alloc() with ({ return -EFAULT; })
will do it.

> Iterating over a KVEC to *copy* data is ok. But no page lookup stuff
> or page reference things.
>
> The old code that apparently used "get_user_pages_fast()" was ok
> almost by mistake, because it fails on all kernel pages.

On x86 it does, but I don't see anything obvious in generic version in
mm/gup.c, so the old code might still have a problem on some architectures.
What am I missing here?

2014-12-08 19:48:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 8, 2014 at 11:28 AM, Al Viro <[email protected]> wrote:
>
> On x86 it does, but I don't see anything obvious in generic version in
> mm/gup.c, so the old code might still have a problem on some architectures.
> What am I missing here?

Hmm. You may be right. The "access_ok()" is supposed to protect
things, but for cases like finit_module() that has explicitly said
"kernel addresses are ok", that check doesn't work.

Maybe something like this..

diff --git a/mm/gup.c b/mm/gup.c
index cd62c8c90d4a..6234b1e6ced9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -951,6 +951,9 @@ int __get_user_pages_fast(unsigned long start,
int nr_pages, int write,
len = (unsigned long) nr_pages << PAGE_SHIFT;
end = start + len;

+ if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
+ return 0;
+
if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
start, len)))
return 0;

Completely untested, obviously. That code isn't even compiled on x86.
Adding linux-arch for more comments.

(Background: the generic non-x86 "get_user_pages_fast()" function
cannot check that the page tables are actually *user* page tables,
so..)

Linus

2014-12-08 22:14:14

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 09:23:58PM +0200, Kirill A. Shutemov wrote:
> > I guess you're running it in a VM, but still.. Doing random system
> > calls as root sounds like a bad bad idea.
>
> I'm doing this for a long time and didn't see any problem bigger than qemu
> crash[1]. ;)
>
> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/194845/

If you use something like this:

-drive file=root_fs.img,if=virtio,snapshot=on

running trinity as root should be quite safe in a VM. :-)

- Ted

2014-12-08 22:23:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 8, 2014 at 2:14 PM, Theodore Ts'o <[email protected]> wrote:
>
> running trinity as root should be quite safe in a VM. :-)

It's not so much the safety that I'd worry about, it's the "you can
legitimately just reboot it or cause kernel corruption as root". You
may not cause any problems outside of the VM, but any oopses inside
the VM might be due to trinity just doing bad things as root, rather
than kernel bugs..

Of course, it's probably hard to hit things like laoding random
modules etc, since even without signature requirements there are tons
of ELF sanity checks and other things. So it might be hard to actually
do those kinds of "corrupt kernel memory as root" things with trinity.

Linus

2014-12-08 22:31:31

by Dave Jones

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 02:23:06PM -0800, Linus Torvalds wrote:
> On Mon, Dec 8, 2014 at 2:14 PM, Theodore Ts'o <[email protected]> wrote:
> >
> > running trinity as root should be quite safe in a VM. :-)
>
> It's not so much the safety that I'd worry about, it's the "you can
> legitimately just reboot it or cause kernel corruption as root". You
> may not cause any problems outside of the VM, but any oopses inside
> the VM might be due to trinity just doing bad things as root, rather
> than kernel bugs..
>
> Of course, it's probably hard to hit things like laoding random
> modules etc, since even without signature requirements there are tons
> of ELF sanity checks and other things. So it might be hard to actually
> do those kinds of "corrupt kernel memory as root" things with trinity.

It also goes out of its way to avoid doing obviously stupid things,
like using /dev/mem as an fd, plus a whole bunch of similar sysfs/procfs
knobs. There are still likely a whole bunch of similar things that
might have horrible effects too. It's been on my todo for a while to
revisit that particular case and blacklist a bunch of other things.

And then there's obviously "don't do this syscall, ever" or "with these
args" type things, which could use expanding.. It's a big effort tbh.
I'm amazed that Sasha, Kirill and everyone else running it as root in
vm's aren't buried alive in false-positives.

Dave

2014-12-09 01:56:56

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Mon, Dec 08, 2014 at 07:28:17PM +0000, Al Viro wrote:
> On Mon, Dec 08, 2014 at 10:57:31AM -0800, Linus Torvalds wrote:
> > So the whole "get_page()" thing is broken. Iterating over pages in a
> > KVEC is simply wrong, wrong, wrong. It needs to fail.
>
> Well, _that_ is easy to do, of course... E.g. by replacing that thing in
> iov_iter_get_pages()/iov_iter_get_pages_alloc() with ({ return -EFAULT; })
> will do it.

OK, folded and pushed (i.e. in vfs.git#iov_iter and vfs.git#for-next).
Matches earlier behaviour; oops reproducer is easy -

dd if=/dev/zero of=foo.ko bs=4096 count=1
cat >a.c <<'EOF'
#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
main()
{
int fd = open("foo.ko", 00040000);
syscall(__NR_finit_module, fd, "", 0);
}
EOF
gcc a.c
strace ./a.out

Correct behaviour (both the mainline and this series with fixes folded):
open("foo.ko", O_RDONLY|O_DIRECT) = 3
finit_module(3, "", 0) = -1 EFAULT (Bad address)
Incorrect one:
open("foo.ko", O_RDONLY|O_DIRECT) = 3
finit_module(3, "", 0 <unfinished ...>
+++ killed by SIGKILL +++
Killed

with that oops reproduced. Not sure if it's a proper LTP or xfstests fodder,
but anyway, here it is.

Incremental from previous for-next is
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index e0605c1..a1599ca 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -560,15 +560,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
get_page(*pages = v.bv_page);
return v.bv_len;
}),({
- unsigned long addr = (unsigned long)v.iov_base, end;
- size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
-
- if (len > maxpages * PAGE_SIZE)
- len = maxpages * PAGE_SIZE;
- addr &= ~(PAGE_SIZE - 1);
- for (end = addr + len; addr < end; addr += PAGE_SIZE)
- get_page(*pages++ = virt_to_page(addr));
- return len - *start;
+ return -EFAULT;
})
)
return 0;
@@ -622,18 +614,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
get_page(*p = v.bv_page);
return v.bv_len;
}),({
- unsigned long addr = (unsigned long)v.iov_base, end;
- size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
- int n;
-
- addr &= ~(PAGE_SIZE - 1);
- n = DIV_ROUND_UP(len, PAGE_SIZE);
- *pages = p = get_pages_array(n);
- if (!p)
- return -ENOMEM;
- for (end = addr + len; addr < end; addr += PAGE_SIZE)
- get_page(*p++ = virt_to_page(addr));
- return len - *start;
+ return -EFAULT;
})
)
return 0;

2014-12-09 02:21:41

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC][PATCHES] iov_iter.c rewrite

On Tue, Dec 09, 2014 at 01:56:51AM +0000, Al Viro wrote:
> On Mon, Dec 08, 2014 at 07:28:17PM +0000, Al Viro wrote:
> > On Mon, Dec 08, 2014 at 10:57:31AM -0800, Linus Torvalds wrote:
> > > So the whole "get_page()" thing is broken. Iterating over pages in a
> > > KVEC is simply wrong, wrong, wrong. It needs to fail.
> >
> > Well, _that_ is easy to do, of course... E.g. by replacing that thing in
> > iov_iter_get_pages()/iov_iter_get_pages_alloc() with ({ return -EFAULT; })
> > will do it.
>
> OK, folded and pushed (i.e. in vfs.git#iov_iter and vfs.git#for-next).
> Matches earlier behaviour; oops reproducer is easy -
>
> dd if=/dev/zero of=foo.ko bs=4096 count=1
> cat >a.c <<'EOF'
> #define _GNU_SOURCE
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <sys/syscall.h>
> main()
> {
> int fd = open("foo.ko", 00040000);
> syscall(__NR_finit_module, fd, "", 0);
> }
> EOF
> gcc a.c
> strace ./a.out
>
> Correct behaviour (both the mainline and this series with fixes folded):
> open("foo.ko", O_RDONLY|O_DIRECT) = 3
> finit_module(3, "", 0) = -1 EFAULT (Bad address)
> Incorrect one:
> open("foo.ko", O_RDONLY|O_DIRECT) = 3
> finit_module(3, "", 0 <unfinished ...>
> +++ killed by SIGKILL +++
> Killed
>
> with that oops reproduced. Not sure if it's a proper LTP or xfstests fodder,
> but anyway, here it is.
>
> Incremental from previous for-next is

Works for me.

--
Kirill A. Shutemov