2017-06-28 16:39:12

by wenxiong

Subject: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

From: Wen Xiong <[email protected]>

With an NVMe device and T10 DIF enabled, on a system with 256GB of memory,
/proc/meminfo and /proc/slabinfo were logged every minute; within an hour
SUnreclaim grew by 15968128 kB (~15 GB), i.e. roughly 256 MB leaking per
minute.

/proc/meminfo | grep SUnreclaim...

SUnreclaim: 6752128 kB
SUnreclaim: 6874880 kB
SUnreclaim: 7238080 kB
....
SUnreclaim: 22307264 kB
SUnreclaim: 22485888 kB
SUnreclaim: 22720256 kB

When test cases with T10 DIF enabled call into __blkdev_direct_IO_simple(),
the code never frees the memory allocated by bio_integrity_alloc(). This
patch fixes the issue. With the fix applied, HTX has run for more than 60
hours without failure.

failing stack:

[36587.216329] [c000002ff60874a0] [c000000000bcac68]
dump_stack+0xb0/0xf0 (unreliable)
[36587.216349] [c000002ff60874e0] [c000000000bc8c94] panic+0x140/0x308
[36587.216407] [c000002ff6087570] [c000000000282530]
out_of_memory+0x4e0/0x650
[36587.216465] [c000002ff6087610] [c00000000028a154]
__alloc_pages_nodemask+0xf34/0x10b0
[36587.216534] [c000002ff6087810] [c00000000030b800]
alloc_pages_current+0xc0/0x1d0
[36587.216603] [c000002ff6087870] [c00000000031907c]
new_slab+0x46c/0x7d0
[36587.216661] [c000002ff6087950] [c00000000031bf00]
___slab_alloc+0x570/0x670
[36587.216718] [c000002ff6087a70] [c00000000031c05c]
__slab_alloc+0x5c/0x90
[36587.216776] [c000002ff6087ad0] [c00000000031c1f4]
kmem_cache_alloc+0x164/0x300
[36587.216845] [c000002ff6087b20] [c0000000002de120]
mmap_region+0x3e0/0x6e0
[36587.216903] [c000002ff6087c00] [c0000000002de7cc] do_mmap+0x3ac/0x480
[36587.216960] [c000002ff6087c80] [c0000000002af244]
vm_mmap_pgoff+0x114/0x160
[36587.217018] [c000002ff6087d60] [c0000000002db6b0]
SyS_mmap_pgoff+0x230/0x300
[36587.217087] [c000002ff6087de0] [c000000000014eac] sys_mmap+0x8c/0xd0
[36587.217145] [c000002ff6087e30] [c00000000000b184]
system_call+0x38/0xe0
[36587.217481] ---[ end Kernel panic - not syncing: Out of memory and no
killable processes...
[36587.217481]

Signed-off-by: Wen Xiong <[email protected]>
Reviewed-by: Brian King <[email protected]>
Tested-by: Murali Iyer <[email protected]>
---
fs/block_dev.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 519599d..e871444 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -264,6 +264,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)

if (unlikely(bio.bi_error))
return bio.bi_error;
+
+ if (bio_integrity(&bio))
+ bio_integrity_free(&bio);
+
return ret;
}

--
1.7.1
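
For context, __blkdev_direct_IO_simple() issues I/O with a bio that lives on
the caller's stack. The following is a condensed sketch of that path
(approximate; completion handling and error paths elided, not the verbatim
fs/block_dev.c source). Because the bio never goes through
bio_put()/bio_free(), the integrity payload that bio_integrity_prep() attaches
during submission is never handed to bio_integrity_free():

static ssize_t
__blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
			  int nr_pages)
{
	struct bio_vec *vecs;	/* inline array or kmalloc'ed vector, elided */
	struct bio bio;		/* on-stack bio: bypasses bio_alloc_bioset() */
	ssize_t ret;

	bio_init(&bio, vecs, nr_pages);	/* no bio_set, no refcounted free path */
	/* ... set bi_bdev/bi_iter, map the user pages from @iter ... */
	ret = bio.bi_iter.bi_size;

	submit_bio(&bio);	/* for a T10 DIF capable device the block layer
				 * calls bio_integrity_prep(), which allocates a
				 * payload with bio_integrity_alloc() */

	/* ... wait for completion, put the pages, kfree(vecs) ... */

	if (unlikely(bio.bi_error))
		return bio.bi_error;	/* integrity payload leaked here ...  */
	return ret;			/* ... and here: nothing ever frees it */
}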


2017-06-28 17:04:23

by Keith Busch

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On Wed, Jun 28, 2017 at 11:32:51AM -0500, [email protected] wrote:
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 519599d..e871444 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -264,6 +264,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>
> if (unlikely(bio.bi_error))
> return bio.bi_error;
> +
> + if (bio_integrity(&bio))
> + bio_integrity_free(&bio);
> +
> return ret;
> }

We don't want to leak the integrity payload in case of bi_error either.

2017-06-28 17:43:01

by Jens Axboe

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On 06/28/2017 10:32 AM, [email protected] wrote:
> From: Wen Xiong <[email protected]>
>
> With nvme devive + T10 enabled, On a system it has 256GB and started
> logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
> it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
> leaking.
>
> /proc/meminfo | grep SUnreclaim...
>
> SUnreclaim: 6752128 kB
> SUnreclaim: 6874880 kB
> SUnreclaim: 7238080 kB
> ....
> SUnreclaim: 22307264 kB
> SUnreclaim: 22485888 kB
> SUnreclaim: 22720256 kB
>
> When testcases with T10 enabled call into __blkdev_direct_IO_simple,
> code doesn't free memory allocated by bio_integrity_alloc. The patch
> fixes the issue. HTX has been run with +60 hours without failure.
>
> failing stack:
>
> [36587.216329] [c000002ff60874a0] [c000000000bcac68]
> dump_stack+0xb0/0xf0 (unreliable)
> [36587.216349] [c000002ff60874e0] [c000000000bc8c94] panic+0x140/0x308
> [36587.216407] [c000002ff6087570] [c000000000282530]
> out_of_memory+0x4e0/0x650
> [36587.216465] [c000002ff6087610] [c00000000028a154]
> __alloc_pages_nodemask+0xf34/0x10b0
> [36587.216534] [c000002ff6087810] [c00000000030b800]
> alloc_pages_current+0xc0/0x1d0
> [36587.216603] [c000002ff6087870] [c00000000031907c]
> new_slab+0x46c/0x7d0
> [36587.216661] [c000002ff6087950] [c00000000031bf00]
> ___slab_alloc+0x570/0x670
> [36587.216718] [c000002ff6087a70] [c00000000031c05c]
> __slab_alloc+0x5c/0x90
> [36587.216776] [c000002ff6087ad0] [c00000000031c1f4]
> kmem_cache_alloc+0x164/0x300
> [36587.216845] [c000002ff6087b20] [c0000000002de120]
> mmap_region+0x3e0/0x6e0
> [36587.216903] [c000002ff6087c00] [c0000000002de7cc] do_mmap+0x3ac/0x480
> [36587.216960] [c000002ff6087c80] [c0000000002af244]
> vm_mmap_pgoff+0x114/0x160
> [36587.217018] [c000002ff6087d60] [c0000000002db6b0]
> SyS_mmap_pgoff+0x230/0x300
> [36587.217087] [c000002ff6087de0] [c000000000014eac] sys_mmap+0x8c/0xd0
> [36587.217145] [c000002ff6087e30] [c00000000000b184]
> system_call+0x38/0xe0
> [36587.217481] ---[ end Kernel panic - not syncing: Out of memory and no
> killable processes...
> [36587.217481]
>
> Signed-off-by: Wen Xiong <[email protected]>
> Reviewed by: Brian King <[email protected]>
> Tested by: Murali Iyer <[email protected]>
> ---
> fs/block_dev.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 519599d..e871444 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -264,6 +264,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>
> if (unlikely(bio.bi_error))
> return bio.bi_error;
> +
> + if (bio_integrity(&bio))
> + bio_integrity_free(&bio);
> +
> return ret;
> }

If this went through blk-throttle, then we'd also need to drop
the task association. Should this just use __bio_free(&bio)?

--
Jens Axboe
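
To make the suggestion concrete: __bio_free() is, at this point, a static
helper in block/bio.c whose body amounts to the two teardown steps below
(reproduced approximately; see block/bio.c for the authoritative version).
That is why calling it covers both the integrity payload Keith mentioned and
the blk-throttle/cgroup task association:

static void __bio_free(struct bio *bio)
{
	bio_disassociate_task(bio);	/* drop the io_context and blkcg css
					 * references taken when the bio was
					 * associated (blk-throttle et al.) */

	if (bio_integrity(bio))
		bio_integrity_free(bio);	/* free the T10 DIF payload from
						 * bio_integrity_alloc() */
}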

2017-06-28 17:45:53

by Jens Axboe

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On 06/28/2017 11:42 AM, Jens Axboe wrote:
> On 06/28/2017 10:32 AM, [email protected] wrote:
>> From: Wen Xiong <[email protected]>
>>
>> With nvme devive + T10 enabled, On a system it has 256GB and started
>> logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
>> it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
>> leaking.
>>
>> /proc/meminfo | grep SUnreclaim...
>>
>> SUnreclaim: 6752128 kB
>> SUnreclaim: 6874880 kB
>> SUnreclaim: 7238080 kB
>> ....
>> SUnreclaim: 22307264 kB
>> SUnreclaim: 22485888 kB
>> SUnreclaim: 22720256 kB
>>
>> When testcases with T10 enabled call into __blkdev_direct_IO_simple,
>> code doesn't free memory allocated by bio_integrity_alloc. The patch
>> fixes the issue. HTX has been run with +60 hours without failure.
>>
>> failing stack:
>>
>> [36587.216329] [c000002ff60874a0] [c000000000bcac68]
>> dump_stack+0xb0/0xf0 (unreliable)
>> [36587.216349] [c000002ff60874e0] [c000000000bc8c94] panic+0x140/0x308
>> [36587.216407] [c000002ff6087570] [c000000000282530]
>> out_of_memory+0x4e0/0x650
>> [36587.216465] [c000002ff6087610] [c00000000028a154]
>> __alloc_pages_nodemask+0xf34/0x10b0
>> [36587.216534] [c000002ff6087810] [c00000000030b800]
>> alloc_pages_current+0xc0/0x1d0
>> [36587.216603] [c000002ff6087870] [c00000000031907c]
>> new_slab+0x46c/0x7d0
>> [36587.216661] [c000002ff6087950] [c00000000031bf00]
>> ___slab_alloc+0x570/0x670
>> [36587.216718] [c000002ff6087a70] [c00000000031c05c]
>> __slab_alloc+0x5c/0x90
>> [36587.216776] [c000002ff6087ad0] [c00000000031c1f4]
>> kmem_cache_alloc+0x164/0x300
>> [36587.216845] [c000002ff6087b20] [c0000000002de120]
>> mmap_region+0x3e0/0x6e0
>> [36587.216903] [c000002ff6087c00] [c0000000002de7cc] do_mmap+0x3ac/0x480
>> [36587.216960] [c000002ff6087c80] [c0000000002af244]
>> vm_mmap_pgoff+0x114/0x160
>> [36587.217018] [c000002ff6087d60] [c0000000002db6b0]
>> SyS_mmap_pgoff+0x230/0x300
>> [36587.217087] [c000002ff6087de0] [c000000000014eac] sys_mmap+0x8c/0xd0
>> [36587.217145] [c000002ff6087e30] [c00000000000b184]
>> system_call+0x38/0xe0
>> [36587.217481] ---[ end Kernel panic - not syncing: Out of memory and no
>> killable processes...
>> [36587.217481]
>>
>> Signed-off-by: Wen Xiong <[email protected]>
>> Reviewed by: Brian King <[email protected]>
>> Tested by: Murali Iyer <[email protected]>
>> ---
>> fs/block_dev.c | 4 ++++
>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 519599d..e871444 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -264,6 +264,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>>
>> if (unlikely(bio.bi_error))
>> return bio.bi_error;
>> +
>> + if (bio_integrity(&bio))
>> + bio_integrity_free(&bio);
>> +
>> return ret;
>> }
>
> If this went through blk-throttle, then we'd also need to drop
> the task association. Should this just use __bio_free(&bio)?

Something like the below. It also fixes the case where we had an
error; your patch is still leaking in that case.


diff --git a/block/bio.c b/block/bio.c
index 888e7801c638..bdc6e6541278 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -240,7 +240,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
return bvl;
}

-static void __bio_free(struct bio *bio)
+void __bio_free(struct bio *bio)
{
bio_disassociate_task(bio);

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 519599dddd36..d929f886ee7b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -263,7 +263,10 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
kfree(vecs);

if (unlikely(bio.bi_error))
- return bio.bi_error;
+ ret = bio.bi_error;
+
+ __bio_free(&bio);
+
return ret;
}

diff --git a/include/linux/bio.h b/include/linux/bio.h
index d1b04b0e99cf..4a29dc3db8cb 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -380,6 +380,7 @@ extern mempool_t *biovec_create_pool(int pool_entries);

extern struct bio *bio_alloc_bioset(gfp_t, unsigned int, struct bio_set *);
extern void bio_put(struct bio *);
+extern void __bio_free(struct bio *);

extern void __bio_clone_fast(struct bio *, struct bio *);
extern struct bio *bio_clone_fast(struct bio *, gfp_t, struct bio_set *);

--
Jens Axboe

2017-06-28 17:51:38

by Jens Axboe

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On 06/28/2017 11:10 AM, Keith Busch wrote:
> On Wed, Jun 28, 2017 at 11:32:51AM -0500, [email protected] wrote:
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 519599d..e871444 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -264,6 +264,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>>
>> if (unlikely(bio.bi_error))
>> return bio.bi_error;
>> +
>> + if (bio_integrity(&bio))
>> + bio_integrity_free(&bio);
>> +
>> return ret;
>> }
>
> We don't want to leak the integrity payload in case of bi_error either.

Yep, see my follow up email on this.

--
Jens Axboe

2017-06-28 18:31:48

by Christoph Hellwig

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On Wed, Jun 28, 2017 at 01:10:31PM -0400, Keith Busch wrote:
> On Wed, Jun 28, 2017 at 11:32:51AM -0500, [email protected] wrote:
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 519599d..e871444 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -264,6 +264,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
> >
> > if (unlikely(bio.bi_error))
> > return bio.bi_error;
> > +
> > + if (bio_integrity(&bio))
> > + bio_integrity_free(&bio);
> > +
> > return ret;
> > }
>
> We don't want to leak the integrity payload in case of bi_error either.

And we should just call __bio_free. Which btw every user of bio_init
probably needs.

2017-06-28 18:34:48

by Jens Axboe

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On 06/28/2017 12:31 PM, Christoph Hellwig wrote:
> On Wed, Jun 28, 2017 at 01:10:31PM -0400, Keith Busch wrote:
>> On Wed, Jun 28, 2017 at 11:32:51AM -0500, [email protected] wrote:
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index 519599d..e871444 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -264,6 +264,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>>>
>>> if (unlikely(bio.bi_error))
>>> return bio.bi_error;
>>> +
>>> + if (bio_integrity(&bio))
>>> + bio_integrity_free(&bio);
>>> +
>>> return ret;
>>> }
>>
>> We don't want to leak the integrity payload in case of bi_error either.
>
> And we should just call __bio_free. Which btw every user of bio_init
> probably needs.

That's what I sent out. Here it is again. We should get this into 4.12,
so would be great with a review or two.


diff --git a/block/bio.c b/block/bio.c
index 888e7801c638..bdc6e6541278 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -240,7 +240,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
return bvl;
}

-static void __bio_free(struct bio *bio)
+void __bio_free(struct bio *bio)
{
bio_disassociate_task(bio);

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 519599dddd36..d929f886ee7b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -263,7 +263,10 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
kfree(vecs);

if (unlikely(bio.bi_error))
- return bio.bi_error;
+ ret = bio.bi_error;
+
+ __bio_free(&bio);
+
return ret;
}

diff --git a/include/linux/bio.h b/include/linux/bio.h
index d1b04b0e99cf..4a29dc3db8cb 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -380,6 +380,7 @@ extern mempool_t *biovec_create_pool(int pool_entries);

extern struct bio *bio_alloc_bioset(gfp_t, unsigned int, struct bio_set *);
extern void bio_put(struct bio *);
+extern void __bio_free(struct bio *);

extern void __bio_clone_fast(struct bio *, struct bio *);
extern struct bio *bio_clone_fast(struct bio *, gfp_t, struct bio_set *);

--
Jens Axboe

2017-06-28 18:38:26

by Christoph Hellwig

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On Wed, Jun 28, 2017 at 12:34:15PM -0600, Jens Axboe wrote:
> That's what I sent out.

Where? Didn't see that anywhere..

> Here it is again. We should get this into 4.12,
> so would be great with a review or two.

Can we rename __bio_free to bio_uninit and add a comment to bio_init
that it must be paired with bio_uninit?

Except for that this looks fine, although there are a lot more callers
that should get this treatment..

2017-06-28 18:44:31

by Jens Axboe

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On 06/28/2017 12:38 PM, Christoph Hellwig wrote:
> On Wed, Jun 28, 2017 at 12:34:15PM -0600, Jens Axboe wrote:
>> That's what I sent out.
>
> Where? Didn't see that anywhere..

Looks like you weren't CC'ed on the original thread. About an hour ago.

>> Here it is again. We should get this into 4.12,
>> so would be great with a review or two.
>
> Can we rename __bio_free to bio_uninit and add a comment to bio_init
> that it must be paried with bio_uninit?

Let's keep it small for 4.12. We can do a cleanup on top of this for
4.13.

> Except for that this looks fine, although there are a lot more callers
> that should get this treatment..

Should only be an issue for on-stack bios, since we don't go through
the put/free path. Did a quick grep; looks like this is one of 3. One
is floppy, which probably neither has DIF nor uses blk-throttle. Then
there's one in dm-bufio, which I didn't look at too closely. The last
one is this one.

--
Jens Axboe

2017-06-28 18:52:42

by Christoph Hellwig

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On Wed, Jun 28, 2017 at 12:44:00PM -0600, Jens Axboe wrote:
> On 06/28/2017 12:38 PM, Christoph Hellwig wrote:
> > On Wed, Jun 28, 2017 at 12:34:15PM -0600, Jens Axboe wrote:
> >> That's what I sent out.
> >
> > Where? Didn't see that anywhere..
>
> Looks like you weren't CC'ed on the original thread. About an hour ago.
>
> >> Here it is again. We should get this into 4.12,
> >> so would be great with a review or two.
> >
> > Can we rename __bio_free to bio_uninit and add a comment to bio_init
> > that it must be paried with bio_uninit?
>
> Let's keep it small for 4.12. We can do a cleanup on top of this for
> 4.13.

The rename is only two additional lines in the patch; it's not going to
make a difference.

> > Except for that this looks fine, although there are a lot more callers
> > that should get this treatment..
>
> Should only be an issue for on-stack bio's, since we don't go through
> the put/free path. Did a quick grep, looks like this is one of 3. One
> is floppy, which probably neither has DIF or uses blk-throttle. Then
> there's one in dm-bufio, didn't look too closely at that. Last one is
> this one.

Well, it's really all callers but bio_alloc_bioset itself that
will need this handling, as only bios that come from bio_alloc_bioset
will be freed through bio_free. Most of them probably don't
support DIF, but they'll also miss the bio_disassociate_task call
this way, and will leak I/O context and css references if block
cgroup support is enabled.
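
The reason only bio_alloc_bioset()-based bios get this teardown is visible in
the normal release path, sketched approximately below: bio_put() is the only
road to bio_free(), and bio_free() is the only caller of __bio_free() outside
bio_reset(). A bio set up with bio_init() on the stack or embedded in another
structure never travels this road:

/* Approximate shape of the release path in block/bio.c at the time: */
void bio_put(struct bio *bio)
{
	if (!bio_flagged(bio, BIO_REFFED))
		bio_free(bio);
	else if (atomic_dec_and_test(&bio->__bi_cnt))
		bio_free(bio);			/* last reference frees it */
}

static void bio_free(struct bio *bio)
{
	struct bio_set *bs = bio->bi_pool;

	__bio_free(bio);	/* bio_disassociate_task() + bio_integrity_free() */

	if (bs) {
		/* ... return the bvec array and the bio to the bio_set mempools ... */
	} else {
		kfree(bio);	/* bio_kmalloc() case */
	}
}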

2017-06-28 19:00:17

by Jens Axboe

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On 06/28/2017 12:52 PM, Christoph Hellwig wrote:
> On Wed, Jun 28, 2017 at 12:44:00PM -0600, Jens Axboe wrote:
>> On 06/28/2017 12:38 PM, Christoph Hellwig wrote:
>>> On Wed, Jun 28, 2017 at 12:34:15PM -0600, Jens Axboe wrote:
>>>> That's what I sent out.
>>>
>>> Where? Didn't see that anywhere..
>>
>> Looks like you weren't CC'ed on the original thread. About an hour ago.
>>
>>>> Here it is again. We should get this into 4.12,
>>>> so would be great with a review or two.
>>>
>>> Can we rename __bio_free to bio_uninit and add a comment to bio_init
>>> that it must be paried with bio_uninit?
>>
>> Let's keep it small for 4.12. We can do a cleanup on top of this for
>> 4.13.
>
> The rename is two additional lines for the patch, it's not going to
> make a difference..

Sure, why not...

>>> Except for that this looks fine, although there are a lot more callers
>>> that should get this treatment..
>>
>> Should only be an issue for on-stack bio's, since we don't go through
>> the put/free path. Did a quick grep, looks like this is one of 3. One
>> is floppy, which probably neither has DIF or uses blk-throttle. Then
>> there's one in dm-bufio, didn't look too closely at that. Last one is
>> this one.
>
> Well, it's really all callers but bio_alloc_bioset itself that
> will need this handling, as only bios that come from bio_alloc_bioset
> will be freed through bio_free. Most of them probably don't
> support DIF, but they'll also miss the bio_disassociate_task call
> this way, and will leak I/O context and css references if block
> cgroup support is enabled.

I guess local allocs too will be affected.

I'm baffled we've had this issue for so long, and nobody's seen it.
I knew DIF was never used (basically), but blk-throttle must have some
users at least.



diff --git a/block/bio.c b/block/bio.c
index 888e7801c638..f877d5e6d17f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -240,7 +240,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
return bvl;
}

-static void __bio_free(struct bio *bio)
+void bio_uninit(struct bio *bio)
{
bio_disassociate_task(bio);

@@ -253,7 +253,7 @@ static void bio_free(struct bio *bio)
struct bio_set *bs = bio->bi_pool;
void *p;

- __bio_free(bio);
+ bio_uninit(bio);

if (bs) {
bvec_free(bs->bvec_pool, bio->bi_io_vec, BVEC_POOL_IDX(bio));
@@ -271,6 +271,11 @@ static void bio_free(struct bio *bio)
}
}

+/*
+ * Users of this function have their own bio allocation. Subsequently,
+ * they must remember to pair any call to bio_init() with bio_uninit()
+ * when IO has completed, or when the bio is released.
+ */
void bio_init(struct bio *bio, struct bio_vec *table,
unsigned short max_vecs)
{
@@ -297,7 +302,7 @@ void bio_reset(struct bio *bio)
{
unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);

- __bio_free(bio);
+ bio_uninit(bio);

memset(bio, 0, BIO_RESET_BYTES);
bio->bi_flags = flags;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 519599dddd36..0a7404ef9335 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -263,7 +263,10 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
kfree(vecs);

if (unlikely(bio.bi_error))
- return bio.bi_error;
+ ret = bio.bi_error;
+
+ bio_uninit(&bio);
+
return ret;
}

diff --git a/include/linux/bio.h b/include/linux/bio.h
index d1b04b0e99cf..a7e29fa0981f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -426,6 +426,7 @@ extern void bio_advance(struct bio *, unsigned);

extern void bio_init(struct bio *bio, struct bio_vec *table,
unsigned short max_vecs);
+extern void bio_uninit(struct bio *);
extern void bio_reset(struct bio *);
void bio_chain(struct bio *, struct bio *);


--
Jens Axboe
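
With the rename applied, an on-stack bio user pairs bio_init() with
bio_uninit() roughly as in the schematic below. This is an illustrative,
hypothetical helper (example_sync_read() is not an existing kernel function),
using the ~4.12-era interfaces discussed in this thread:

static int example_sync_read(struct block_device *bdev, struct page *page,
			     sector_t sector)
{
	struct bio bio;
	struct bio_vec bvec;
	int ret;

	bio_init(&bio, &bvec, 1);	/* on-stack bio: never reaches bio_free() */
	bio.bi_bdev = bdev;
	bio.bi_iter.bi_sector = sector;
	bio.bi_opf = REQ_OP_READ;
	bio_add_page(&bio, page, PAGE_SIZE, 0);

	ret = submit_bio_wait(&bio);	/* submission may attach an integrity
					 * payload and blkcg/task references */

	bio_uninit(&bio);		/* pairs with bio_init(): frees the
					 * integrity payload and drops the
					 * task/cgroup association */
	return ret;
}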

2017-06-28 21:11:42

by Shaohua Li

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On Wed, Jun 28, 2017 at 12:57:50PM -0600, Jens Axboe wrote:
> On 06/28/2017 12:52 PM, Christoph Hellwig wrote:
> > On Wed, Jun 28, 2017 at 12:44:00PM -0600, Jens Axboe wrote:
> >> On 06/28/2017 12:38 PM, Christoph Hellwig wrote:
> >>> On Wed, Jun 28, 2017 at 12:34:15PM -0600, Jens Axboe wrote:
> >>>> That's what I sent out.
> >>>
> >>> Where? Didn't see that anywhere..
> >>
> >> Looks like you weren't CC'ed on the original thread. About an hour ago.
> >>
> >>>> Here it is again. We should get this into 4.12,
> >>>> so would be great with a review or two.
> >>>
> >>> Can we rename __bio_free to bio_uninit and add a comment to bio_init
> >>> that it must be paried with bio_uninit?
> >>
> >> Let's keep it small for 4.12. We can do a cleanup on top of this for
> >> 4.13.
> >
> > The rename is two additional lines for the patch, it's not going to
> > make a difference..
>
> Sure, why not...
>
> >>> Except for that this looks fine, although there are a lot more callers
> >>> that should get this treatment..
> >>
> >> Should only be an issue for on-stack bio's, since we don't go through
> >> the put/free path. Did a quick grep, looks like this is one of 3. One
> >> is floppy, which probably neither has DIF or uses blk-throttle. Then
> >> there's one in dm-bufio, didn't look too closely at that. Last one is
> >> this one.
> >
> > Well, it's really all callers but bio_alloc_bioset itself that
> > will need this handling, as only bios that come from bio_alloc_bioset
> > will be freed through bio_free. Most of them probably don't
> > support DIF, but they'll also miss the bio_disassociate_task call
> > this way, and will leak I/O context and css references if block
> > cgroup support is enabled.
>
> I guess local allocs too will be affected.
>
> I'm baffled we've had this issue for so long, and nobody's seen it.
> I knew DIF was never used (basically), but blk-throttle must have some
> users at least.

I fixed the same issue in the blktrace patches. It did the free in bio_endio.


>
>
> diff --git a/block/bio.c b/block/bio.c
> index 888e7801c638..f877d5e6d17f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -240,7 +240,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
> return bvl;
> }
>
> -static void __bio_free(struct bio *bio)
> +void bio_uninit(struct bio *bio)
> {
> bio_disassociate_task(bio);
>
> @@ -253,7 +253,7 @@ static void bio_free(struct bio *bio)
> struct bio_set *bs = bio->bi_pool;
> void *p;
>
> - __bio_free(bio);
> + bio_uninit(bio);
>
> if (bs) {
> bvec_free(bs->bvec_pool, bio->bi_io_vec, BVEC_POOL_IDX(bio));
> @@ -271,6 +271,11 @@ static void bio_free(struct bio *bio)
> }
> }
>
> +/*
> + * Users of this function have their own bio allocation. Subsequently,
> + * they must remember to pair any call to bio_init() with bio_uninit()
> + * when IO has completed, or when the bio is released.
> + */
> void bio_init(struct bio *bio, struct bio_vec *table,
> unsigned short max_vecs)
> {
> @@ -297,7 +302,7 @@ void bio_reset(struct bio *bio)
> {
> unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);
>
> - __bio_free(bio);
> + bio_uninit(bio);
>
> memset(bio, 0, BIO_RESET_BYTES);
> bio->bi_flags = flags;
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 519599dddd36..0a7404ef9335 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -263,7 +263,10 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> kfree(vecs);
>
> if (unlikely(bio.bi_error))
> - return bio.bi_error;
> + ret = bio.bi_error;
> +
> + bio_uninit(&bio);
> +
> return ret;
> }
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index d1b04b0e99cf..a7e29fa0981f 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -426,6 +426,7 @@ extern void bio_advance(struct bio *, unsigned);
>
> extern void bio_init(struct bio *bio, struct bio_vec *table,
> unsigned short max_vecs);
> +extern void bio_uninit(struct bio *);
> extern void bio_reset(struct bio *);
> void bio_chain(struct bio *, struct bio *);
>
>
> --
> Jens Axboe
>

2017-06-28 21:25:47

by Christoph Hellwig

Subject: Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

On Wed, Jun 28, 2017 at 02:11:31PM -0700, Shaohua Li wrote:
> I fixed the same issue in the blktrace patches. It did the free in bio_endio.

Moving all the frees to bio_endio does indeed seem sensible,
as it avoids the additional calls for the bio_init()ed bios.