2017-09-22 05:18:43

by Vitaly Mayatskih

[permalink] [raw]
Subject: [PATCH] fix unbalanced page refcounting in bio_map_user_iov

bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if
IO vector has small consecutive buffers belonging to the same page.
bio_add_pc_page merges them into one, but the page reference is never
dropped.

Signed-off-by: Vitaly Mayatskikh <[email protected]>

diff --git a/block/bio.c b/block/bio.c
index b38e962fa83e..10cd3b6bed27 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1383,6 +1383,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
offset = offset_in_page(uaddr);
for (j = cur_page; j < page_limit; j++) {
unsigned int bytes = PAGE_SIZE - offset;
+ unsigned short prev_bi_vcnt = bio->bi_vcnt;

if (len <= 0)
break;
@@ -1397,6 +1398,13 @@ struct bio *bio_map_user_iov(struct request_queue *q,
bytes)
break;

+ /*
+ * check if vector was merged with previous
+ * drop page reference if needed
+ */
+ if (bio->bi_vcnt == prev_bi_vcnt)
+ put_page(pages[j]);
+
len -= bytes;
offset = 0;
}

--
wbr, Vitaly


2017-09-22 05:24:14

by Vitaly Mayatskih

[permalink] [raw]
Subject: Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov

Reproducer (needs SCSI disk):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <errno.h>
#include <malloc.h>
#include <string.h>
#include <scsi/sg.h>

#define NR_IOS 10000
#define NR_IOVECS 8
#define SG_IO 0x2285

int main(int argc, char *argv[])
{
int fd, i, j;
unsigned char *buf, *ptr, cdb[10];
sg_io_hdr_t io_hdr;
sg_iovec_t iovec[NR_IOVECS];

if (argc < 2) {
printf("Run: %s </dev/sdX>\n", argv[0]);
exit(1);
}

buf = ptr = memalign(4096, NR_IOS * NR_IOVECS * 512);
if (!buf) {
printf("can't alloc memory\n");
exit(1);
}

fd = open(argv[1], 0);
if (fd < 0) {
printf("open %s failed: %d (%s)\n", argv[1], errno, strerror(errno));
exit(1);
}

io_hdr.interface_id = 'S';
io_hdr.cmd_len = sizeof(cdb);
io_hdr.cmdp = cdb;
io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
io_hdr.dxfer_len = 512 * NR_IOVECS;
io_hdr.dxferp = iovec;
io_hdr.iovec_count = NR_IOVECS;

cdb[0] = 0x28; // READ10
cdb[8] = NR_IOVECS; // sectors

for (j = 0; j < NR_IOS; j++, ptr += 512) {
for (i = 0; i < NR_IOVECS; i++) {
iovec[i].iov_base = ptr;
iovec[i].iov_len = 512;
}
if (ioctl(fd, SG_IO, &io_hdr)) {
printf("IOCTL failed: %d (%s)\n", errno, strerror(errno));
exit(1);
}
}

free(buf);
close(fd);
return 0;
}


# free -m
total used free shared buff/cache available
Mem: 3827 46 3601 0 178 3568
Swap: 0 0 0
# ./sgio-leak /dev/sdd
# free -m
total used free shared buff/cache available
Mem: 3827 85 3562 0 178 3529
Swap: 0 0 0
[root@node-A ~]# free -m
total used free shared buff/cache available
Mem: 3827 85 3628 0 113 3561
Swap: 0 0 0
# ./sgio-leak /dev/sdd
# free -m
total used free shared buff/cache available
Mem: 3827 124 3589 0 113 3523
Swap: 0 0 0

--
wbr, Vitaly

2017-09-23 16:39:32

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov

On Fri, Sep 22, 2017 at 01:18:39AM -0400, Vitaly Mayatskikh wrote:
> bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if
> IO vector has small consecutive buffers belonging to the same page.
> bio_add_pc_page merges them into one, but the page reference is never
> dropped.
>
> Signed-off-by: Vitaly Mayatskikh <[email protected]>
>
> diff --git a/block/bio.c b/block/bio.c
> index b38e962fa83e..10cd3b6bed27 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1383,6 +1383,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
> offset = offset_in_page(uaddr);
> for (j = cur_page; j < page_limit; j++) {
> unsigned int bytes = PAGE_SIZE - offset;
> + unsigned short prev_bi_vcnt = bio->bi_vcnt;
>
> if (len <= 0)
> break;
> @@ -1397,6 +1398,13 @@ struct bio *bio_map_user_iov(struct request_queue *q,
> bytes)
> break;
>
> + /*
> + * check if vector was merged with previous
> + * drop page reference if needed
> + */
> + if (bio->bi_vcnt == prev_bi_vcnt)
> + put_page(pages[j]);
> +

Except that now you've got double-puts on failure exits ;-/

2017-09-23 16:55:40

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov

On Sat, Sep 23, 2017 at 05:39:28PM +0100, Al Viro wrote:
> On Fri, Sep 22, 2017 at 01:18:39AM -0400, Vitaly Mayatskikh wrote:
> > bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if
> > IO vector has small consecutive buffers belonging to the same page.
> > bio_add_pc_page merges them into one, but the page reference is never
> > dropped.
> >
> > Signed-off-by: Vitaly Mayatskikh <[email protected]>
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index b38e962fa83e..10cd3b6bed27 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1383,6 +1383,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
> > offset = offset_in_page(uaddr);
> > for (j = cur_page; j < page_limit; j++) {
> > unsigned int bytes = PAGE_SIZE - offset;
> > + unsigned short prev_bi_vcnt = bio->bi_vcnt;
> >
> > if (len <= 0)
> > break;
> > @@ -1397,6 +1398,13 @@ struct bio *bio_map_user_iov(struct request_queue *q,
> > bytes)
> > break;
> >
> > + /*
> > + * check if vector was merged with previous
> > + * drop page reference if needed
> > + */
> > + if (bio->bi_vcnt == prev_bi_vcnt)
> > + put_page(pages[j]);
> > +
>
> Except that now you've got double-puts on failure exits ;-/

IOW, the loop on failure exit should go through the bio, like __bio_unmap_user()
does. We *also* need to put everything left unused in pages[], but only from the
last iteration through iov_for_each().

Frankly, I would prefer to reuse the pages[], rather than append to it on each
iteration. Used iov_iter_get_pages_alloc(), actually.

2017-09-23 17:19:29

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov

On Sat, Sep 23, 2017 at 05:55:37PM +0100, Al Viro wrote:

> IOW, the loop on failure exit should go through the bio, like __bio_unmap_user()
> does. We *also* need to put everything left unused in pages[], but only from the
> last iteration through iov_for_each().
>
> Frankly, I would prefer to reuse the pages[], rather than append to it on each
> iteration. Used iov_iter_get_pages_alloc(), actually.

Something like completely untested diff below, perhaps...

diff --git a/block/bio.c b/block/bio.c
index b38e962fa83e..b5fe23597b41 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1323,94 +1323,60 @@ struct bio *bio_map_user_iov(struct request_queue *q,
const struct iov_iter *iter,
gfp_t gfp_mask)
{
- int j;
- int nr_pages = 0;
- struct page **pages;
struct bio *bio;
- int cur_page = 0;
- int ret, offset;
+ struct bio_vec *bvec;
struct iov_iter i;
- struct iovec iov;
-
- iov_for_each(iov, i, *iter) {
- unsigned long uaddr = (unsigned long) iov.iov_base;
- unsigned long len = iov.iov_len;
- unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
- unsigned long start = uaddr >> PAGE_SHIFT;
-
- /*
- * Overflow, abort
- */
- if (end < start)
- return ERR_PTR(-EINVAL);
-
- nr_pages += end - start;
- /*
- * buffer must be aligned to at least logical block size for now
- */
- if (uaddr & queue_dma_alignment(q))
- return ERR_PTR(-EINVAL);
- }
+ int ret, j;

- if (!nr_pages)
+ if (!iov_iter_count(iter))
return ERR_PTR(-EINVAL);

- bio = bio_kmalloc(gfp_mask, nr_pages);
+ bio = bio_kmalloc(gfp_mask, iov_iter_npages(iter, BIO_MAX_PAGES));
if (!bio)
return ERR_PTR(-ENOMEM);

- ret = -ENOMEM;
- pages = kcalloc(nr_pages, sizeof(struct page *), gfp_mask);
- if (!pages)
- goto out;
+ i = *iter;
+ while (iov_iter_count(&i)) {
+ struct page **pages;
+ size_t offs;
+ ssize_t bytes;
+ int npages, j;

- iov_for_each(iov, i, *iter) {
- unsigned long uaddr = (unsigned long) iov.iov_base;
- unsigned long len = iov.iov_len;
- unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
- unsigned long start = uaddr >> PAGE_SHIFT;
- const int local_nr_pages = end - start;
- const int page_limit = cur_page + local_nr_pages;
-
- ret = get_user_pages_fast(uaddr, local_nr_pages,
- (iter->type & WRITE) != WRITE,
- &pages[cur_page]);
- if (ret < local_nr_pages) {
- ret = -EFAULT;
+ bytes = iov_iter_get_pages_alloc(&i, &pages, LONG_MAX, &offs);
+ if (bytes <= 0) {
+ ret = bytes ? bytes : -EFAULT;
goto out_unmap;
}

- offset = offset_in_page(uaddr);
- for (j = cur_page; j < page_limit; j++) {
- unsigned int bytes = PAGE_SIZE - offset;
+ npages = DIV_ROUND_UP(offs + bytes, PAGE_SIZE);

- if (len <= 0)
- break;
+ if (unlikely(offs & queue_dma_alignment(q))) {
+ ret = -EINVAL;
+ j = 0;
+ } else {
+ for (j = 0; j < npages; j++) {
+ struct page *page = pages[j];
+ unsigned n = PAGE_SIZE - offs;
+ unsigned prev_bi_vcnt = bio->bi_vcnt;
+
+ if (!bio_add_pc_page(q, bio, page, n, offs)) {
+ iov_iter_truncate(&i, 0);
+ break;
+ }
+
+ if (bio->bi_vcnt == prev_bi_vcnt)
+ put_page(page);

- if (bytes > len)
- bytes = len;
-
- /*
- * sorry...
- */
- if (bio_add_pc_page(q, bio, pages[j], bytes, offset) <
- bytes)
- break;
-
- len -= bytes;
- offset = 0;
+ iov_iter_advance(&i, n);
+ bytes -= n;
+ offs = 0;
+ }
}
-
- cur_page = j;
- /*
- * release the pages we didn't map into the bio, if any
- */
- while (j < page_limit)
+ while (j < npages)
put_page(pages[j++]);
+ kvfree(pages);
}

- kfree(pages);
-
bio_set_flag(bio, BIO_USER_MAPPED);

/*
@@ -1423,13 +1389,9 @@ struct bio *bio_map_user_iov(struct request_queue *q,
return bio;

out_unmap:
- for (j = 0; j < nr_pages; j++) {
- if (!pages[j])
- break;
- put_page(pages[j]);
+ bio_for_each_segment_all(bvec, bio, j) {
+ put_page(bvec->bv_page);
}
- out:
- kfree(pages);
bio_put(bio);
return ERR_PTR(ret);
}

2017-09-23 20:33:26

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov

On Sat, Sep 23, 2017 at 06:19:26PM +0100, Al Viro wrote:
> On Sat, Sep 23, 2017 at 05:55:37PM +0100, Al Viro wrote:
>
> > IOW, the loop on failure exit should go through the bio, like __bio_unmap_user()
> > does. We *also* need to put everything left unused in pages[], but only from the
> > last iteration through iov_for_each().
> >
> > Frankly, I would prefer to reuse the pages[], rather than append to it on each
> > iteration. Used iov_iter_get_pages_alloc(), actually.
>
> Something like completely untested diff below, perhaps...

> + unsigned n = PAGE_SIZE - offs;
> + unsigned prev_bi_vcnt = bio->bi_vcnt;

Sorry, that should've been followed by
if (n > bytes)
n = bytes;

Anyway, a carved-up variant is in vfs.git#work.iov_iter. It still needs
review and testing; the patch Vitaly has posted in this thread plus 6
followups, hopefully more readable than aggregate diff.

Comments?

2017-09-24 14:27:42

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov

On Sat, Sep 23, 2017 at 09:33:23PM +0100, Al Viro wrote:
> On Sat, Sep 23, 2017 at 06:19:26PM +0100, Al Viro wrote:
> > On Sat, Sep 23, 2017 at 05:55:37PM +0100, Al Viro wrote:
> >
> > > IOW, the loop on failure exit should go through the bio, like __bio_unmap_user()
> > > does. We *also* need to put everything left unused in pages[], but only from the
> > > last iteration through iov_for_each().
> > >
> > > Frankly, I would prefer to reuse the pages[], rather than append to it on each
> > > iteration. Used iov_iter_get_pages_alloc(), actually.
> >
> > Something like completely untested diff below, perhaps...
>
> > + unsigned n = PAGE_SIZE - offs;
> > + unsigned prev_bi_vcnt = bio->bi_vcnt;
>
> Sorry, that should've been followed by
> if (n > bytes)
> n = bytes;
>
> Anyway, a carved-up variant is in vfs.git#work.iov_iter. It still needs
> review and testing; the patch Vitaly has posted in this thread plus 6
> followups, hopefully more readable than aggregate diff.
>
> Comments?

BTW, there's something fishy in bio_copy_user_iov(). If the area we'd asked for
had been too large for a single bio, we are going to create a bio and have
bio_add_pc_page() eventually fill it up to limit. Then we return into
__blk_rq_map_user_iov(), advance iter and call bio_copy_user_iov() again.
Fine, but... now we might have non-zero iter->iov_offset. And this
bmd->is_our_pages = map_data ? 0 : 1;
memcpy(bmd->iov, iter->iov, sizeof(struct iovec) * iter->nr_segs);
iov_iter_init(&bmd->iter, iter->type, bmd->iov,
iter->nr_segs, iter->count);
does not even look at iter->iov_offset. As the result, when it gets to
bio_uncopy_user(), we copy the data from each bio into the *beginning* of
the user area, overwriting that from the other bio.

At the very least, we need bmd->iter = *iter; bmd->iter.iov = bmd->iov;
instead of that iov_iter_init() in there. I'm not sure how far back does
it go; looks like "block: support large requests in blk_rq_map_user_iov"
is the earliest possible point, but it might need more digging to make
sure. v4.5+, if that's when the problems began...

Anyway, I'd added the obvious fix to #work.iov_iter, reordered it and
force-pushed the result.

2017-09-24 17:15:49

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov

On Sun, Sep 24, 2017 at 03:27:39PM +0100, Al Viro wrote:

> At the very least, we need bmd->iter = *iter; bmd->iter.iov = bmd->iov;
> instead of that iov_iter_init() in there. I'm not sure how far back does
> it go; looks like "block: support large requests in blk_rq_map_user_iov"
> is the earliest possible point, but it might need more digging to make
> sure. v4.5+, if that's when the problems began...
>
> Anyway, I'd added the obvious fix to #work.iov_iter, reordered it and
> force-pushed the result.

While we are at it, calculation of nr_pages in bio_copy_user_iov() is bloody
odd - why, in the name of everything unholy, does it care about the iovec
boundaries in there? We are copying data anyway; why does allocation of bio
care about the fragmentation of the other end of copying? Shouldn't it be
simply max(DIV_ROUND_UP(offset + len, PAGE_SIZE), BIO_MAX_PAGES)?

2017-09-25 01:48:34

by Vitaly Mayatskih

[permalink] [raw]
Subject: Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov

On Sun, 24 Sep 2017 10:27:39 -0400,
Al Viro wrote:

> BTW, there's something fishy in bio_copy_user_iov(). If the area we'd asked for
> had been too large for a single bio, we are going to create a bio and have
> bio_add_pc_page() eventually fill it up to limit. Then we return into
> __blk_rq_map_user_iov(), advance iter and call bio_copy_user_iov() again.
> Fine, but... now we might have non-zero iter->iov_offset. And this
> bmd->is_our_pages = map_data ? 0 : 1;
> memcpy(bmd->iov, iter->iov, sizeof(struct iovec) * iter->nr_segs);
> iov_iter_init(&bmd->iter, iter->type, bmd->iov,
> iter->nr_segs, iter->count);
> does not even look at iter->iov_offset. As the result, when it gets to
> bio_uncopy_user(), we copy the data from each bio into the *beginning* of
> the user area, overwriting that from the other bio.

Yeah, something is wrong with bio_copy_user_iov. Our datapath hangs when IO flows through unmodified SG (it forces bio_copy if iov_count is set). I did not look at details, but same IO pattern and memory layout work well with bio_map (module refcount problem).

> Anyway, I'd added the obvious fix to #work.iov_iter, reordered it and
> force-pushed the result.

I'll give it a try, thanks!
--
wbr, Vitaly