LinuxLists.cc - [PULL] vhost: cleanups and fixes

2017-12-08 15:48:05

Subject: [PULL] vhost: cleanups and fixes

The following changes since commit d9e427f6ab8142d6868eb719e6a7851aafea56b6:

virtio_balloon: fix increment of vb->num_pfns in fill_balloon() (2017-12-01 16:55:45 +0200)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 03e9f8a05bce7330bcd9c5cc54c8e42d0fcbf993:

virtio_net: fix return value check in receive_mergeable() (2017-12-07 18:34:52 +0200)

----------------------------------------------------------------
virtio: bugfixes

A couple of minor bugfixes.

Signed-off-by: Michael S. Tsirkin <[email protected]>

----------------------------------------------------------------
Yunjian Wang (1):
virtio_net: fix return value check in receive_mergeable()

weiping zhang (2):
virtio_mmio: add cleanup for virtio_mmio_probe
virtio_mmio: add cleanup for virtio_mmio_remove

drivers/net/virtio_net.c | 2 +-
drivers/virtio/virtio_mmio.c | 57 ++++++++++++++++++++++++++++++++++++--------
2 files changed, 48 insertions(+), 11 deletions(-)

2018-06-11 18:33:39

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PULL] vhost: cleanups and fixes

On Mon, Jun 11, 2018 at 9:24 AM Michael S. Tsirkin <[email protected]> wrote:
>
> virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

Is this really a good idea?

Plus it seems entirely broken.

The report_pfn_range() callback is done under the zone lock, but
virtio_balloon_send_free_pages() (which is the only callback used that
I can find) does add_one_sg(), which does virtqueue_add_inbuf(vq, &sg,
1, vq, GFP_KERNEL);

So now we apparently do a GFP_KERNEL allocation insider the mm zone
lock, which is broken on just _so_ many levels.

Pulled and then unpulled again.

Either somebody needs to explain why I'm wrong and you can re-submit
this, or this kind of garbage needs to go away.

I do *not* want to be in the situation where I pull stuff from the
virtio people that adds completely broken core VM functionality.

Because if I'm in that situation, I will stop pulling from you guys.
Seriously. You have *no* place sending me broken shit that is outside
the virtio layer.

Subsystems that break code MM will get shunned. You just aren't
important enough to allow you breaking code VM.

Linus

2018-06-11 20:23:04

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PULL] vhost: cleanups and fixes

On Mon, Jun 11, 2018 at 11:32 AM Linus Torvalds
<[email protected]> wrote:
>
> So now we apparently do a GFP_KERNEL allocation insider the mm zone
> lock, which is broken on just _so_ many levels.

Oh, I see the comment about how it doesn't actually do an allocation
at all because it's a single-entry.

Still too damn ugly to live, and much too fragile. No way in hell do
we even _hint_ at a GFP_KERNEL when we're inside a context that can't
do any memory allocation at all.

Plus I'm not convinced it's a "no allocation" path even despite that
comment, because it also does a "dma_map_page()" etc, which can cause
allocations to do the dma mapping thing afaik. No?

Maybe there's some reason why that doesn't happen either, but
basically this whole callchain looks *way* to complicated to be used
under a core VM spinlock.

Linus

2018-06-12 01:46:52

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PULL] vhost: cleanups and fixes

On Mon, Jun 11, 2018 at 11:44:05AM -0700, Linus Torvalds wrote:
> On Mon, Jun 11, 2018 at 11:32 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > So now we apparently do a GFP_KERNEL allocation insider the mm zone
> > lock, which is broken on just _so_ many levels.
>
> Oh, I see the comment about how it doesn't actually do an allocation
> at all because it's a single-entry.
>
> Still too damn ugly to live, and much too fragile. No way in hell do
> we even _hint_ at a GFP_KERNEL when we're inside a context that can't
> do any memory allocation at all.
>
> Plus I'm not convinced it's a "no allocation" path even despite that
> comment, because it also does a "dma_map_page()" etc, which can cause
> allocations to do the dma mapping thing afaik. No?

Well no because DMA is triggered by the IOMMU flag and
that is always off for the balloon. But I hear what you
are saying about it being fragile.

> Maybe there's some reason why that doesn't happen either, but
> basically this whole callchain looks *way* to complicated to be used
> under a core VM spinlock.
>
> Linus

Maybe it will help to have GFP_NONE which will make any allocation
fail if attempted. Linus, would this address your comment?
--
MST

2018-06-12 01:58:44

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PULL] vhost: cleanups and fixes

On Mon, Jun 11, 2018 at 11:32:41AM -0700, Linus Torvalds wrote:
> On Mon, Jun 11, 2018 at 9:24 AM Michael S. Tsirkin <[email protected]> wrote:
> >
> > virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
>
> Is this really a good idea?

Well knowing which pages are unused does seem to be useful. Do you
refer to this generally or to the idea of scanning the free list,
or to the specific implementation issues you listed below?

> Plus it seems entirely broken.
>
> The report_pfn_range() callback is done under the zone lock, but
> virtio_balloon_send_free_pages() (which is the only callback used that
> I can find) does add_one_sg(), which does virtqueue_add_inbuf(vq, &sg,
> 1, vq, GFP_KERNEL);
>
> So now we apparently do a GFP_KERNEL allocation insider the mm zone
> lock, which is broken on just _so_ many levels.
>
> Pulled and then unpulled again.
>
> Either somebody needs to explain why I'm wrong and you can re-submit
> this, or this kind of garbage needs to go away.
>
> I do *not* want to be in the situation where I pull stuff from the
> virtio people that adds completely broken core VM functionality.
>
> Because if I'm in that situation, I will stop pulling from you guys.
> Seriously. You have *no* place sending me broken shit that is outside
> the virtio layer.
>
> Subsystems that break code MM will get shunned. You just aren't
> important enough to allow you breaking code VM.
>
> Linus

So no, it doesn't do any allocations - GFP_KERNEL or otherwise, but I
agree how it might be confusing.

So I'll drop this for now, but it would be nice if there is a bit more
direction on this feature since I know several people are trying to work
on this.

--
MST

2018-06-12 02:00:34

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PULL] vhost: cleanups and fixes

On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin <[email protected]> wrote:
>
> Maybe it will help to have GFP_NONE which will make any allocation
> fail if attempted. Linus, would this address your comment?

It would definitely have helped me initially overlook that call chain.

But then when I started looking at the whole dma_map_page() thing, it
just raised my hackles again.

I would seriously suggest having a much simpler version for the "no
allocation, no dma mapping" case, so that it's *obvious* that that
never happens.

So instead of having virtio_balloon_send_free_pages() call a really
generic complex chain of functions that in _some_ cases can do memory
allocation, why isn't there a short-circuited "vitruque_add_datum()"
that is guaranteed to never do anything like that?

Honestly, I look at "add_one_sg()" and it really doesn't make me
happy. It looks hacky as hell. If I read the code right, you're really
trying to just queue up a simple tuple of <pfn,len>, except you encode
it as a page pointer in order to play games with the SG logic, and
then you hmap that to the ring, except in this case it's all a fake
ring that just adds the cpu-physical address instead.

And to figuer that out, it's like five layers of indirection through
different helper functions that *can* do more generic things but in
this case don't.

And you do all of this from a core VM callback function with some
_really_ core VM locks held.

That makes no sense to me.

How about this:

- get rid of all that code

- make the core VM callback save the "these are the free memory
regions" in a fixed and limited array. One that DOES JUST THAT. No
crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
size, pre-allocated for that virtio instance.

- make it obvious that what you do in that sequence is ten
instructions and no allocations ("Look ma, I wrote a value to an array
and incremented the array idex, and I'M DONE")

- then in that workqueue entry that you start *anyway*, you empty the
array and do all the crazy virtio stuff.

In fact, while at it, just simplify the VM interface too. Instead of
traversing a random number of buddy lists, just trraverse *one* - the
top-level one. Are you seriously ever going to shrink or mark
read-only anythin *but* something big enough to be in the maximum
order?

MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
want the balloon code to work on smaller things, particularly since
the whole interface is fundamentally racy and opportunistic to begin
with?

The whole sequence of events really looks "this is too much
complexity, and way too fragile" to me at so many levels.

Linus

2018-06-12 11:02:16

by Wang, Wei W

[permalink] [raw]

Subject: Re: [PULL] vhost: cleanups and fixes

On 06/12/2018 09:59 AM, Linus Torvalds wrote:
> On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin <[email protected]> wrote:
>> Maybe it will help to have GFP_NONE which will make any allocation
>> fail if attempted. Linus, would this address your comment?
> It would definitely have helped me initially overlook that call chain.
>
> But then when I started looking at the whole dma_map_page() thing, it
> just raised my hackles again.
>
> I would seriously suggest having a much simpler version for the "no
> allocation, no dma mapping" case, so that it's *obvious* that that
> never happens.
>
> So instead of having virtio_balloon_send_free_pages() call a really
> generic complex chain of functions that in _some_ cases can do memory
> allocation, why isn't there a short-circuited "vitruque_add_datum()"
> that is guaranteed to never do anything like that?
>
> Honestly, I look at "add_one_sg()" and it really doesn't make me
> happy. It looks hacky as hell. If I read the code right, you're really
> trying to just queue up a simple tuple of <pfn,len>, except you encode
> it as a page pointer in order to play games with the SG logic, and
> then you hmap that to the ring, except in this case it's all a fake
> ring that just adds the cpu-physical address instead.
>
> And to figuer that out, it's like five layers of indirection through
> different helper functions that *can* do more generic things but in
> this case don't.
>
> And you do all of this from a core VM callback function with some
> _really_ core VM locks held.
>
> That makes no sense to me.
>
> How about this:
>
> - get rid of all that code
>
> - make the core VM callback save the "these are the free memory
> regions" in a fixed and limited array. One that DOES JUST THAT. No
> crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
> size, pre-allocated for that virtio instance.
>
> - make it obvious that what you do in that sequence is ten
> instructions and no allocations ("Look ma, I wrote a value to an array
> and incremented the array idex, and I'M DONE")
>
> - then in that workqueue entry that you start *anyway*, you empty the
> array and do all the crazy virtio stuff.
>
> In fact, while at it, just simplify the VM interface too. Instead of
> traversing a random number of buddy lists, just trraverse *one* - the
> top-level one. Are you seriously ever going to shrink or mark
> read-only anythin *but* something big enough to be in the maximum
> order?
>
> MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
> want the balloon code to work on smaller things, particularly since
> the whole interface is fundamentally racy and opportunistic to begin
> with?

OK, I will implement a new version based on the suggestions. Thanks.

Best,
Wei

2018-06-14 15:02:47

by Nitesh Lal

[permalink] [raw]

Subject: Re: [PULL] vhost: cleanups and fixes

Hi Wei,

On 06/12/2018 07:05 AM, Wei Wang wrote:
> On 06/12/2018 09:59 AM, Linus Torvalds wrote:
>> On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin <[email protected]>
>> wrote:
>>> Maybe it will help to have GFP_NONE which will make any allocation
>>> fail if attempted. Linus, would this address your comment?
>> It would definitely have helped me initially overlook that call chain.
>>
>> But then when I started looking at the whole dma_map_page() thing, it
>> just raised my hackles again.
>>
>> I would seriously suggest having a much simpler version for the "no
>> allocation, no dma mapping" case, so that it's *obvious* that that
>> never happens.
>>
>> So instead of having virtio_balloon_send_free_pages() call a really
>> generic complex chain of functions that in _some_ cases can do memory
>> allocation, why isn't there a short-circuited "vitruque_add_datum()"
>> that is guaranteed to never do anything like that?
>>
>> Honestly, I look at "add_one_sg()" and it really doesn't make me
>> happy. It looks hacky as hell. If I read the code right, you're really
>> trying to just queue up a simple tuple of <pfn,len>, except you encode
>> it as a page pointer in order to play games with the SG logic, and
>> then you hmap that to the ring, except in this case it's all a fake
>> ring that just adds the cpu-physical address instead.
>>
>> And to figuer that out, it's like five layers of indirection through
>> different helper functions that *can* do more generic things but in
>> this case don't.
>>
>> And you do all of this from a core VM callback function with some
>> _really_ core VM locks held.
>>
>> That makes no sense to me.
>>
>> How about this:
>>
>> - get rid of all that code
>>
>> - make the core VM callback save the "these are the free memory
>> regions" in a fixed and limited array. One that DOES JUST THAT. No
>> crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
>> size, pre-allocated for that virtio instance.
>>
>> - make it obvious that what you do in that sequence is ten
>> instructions and no allocations ("Look ma, I wrote a value to an array
>> and incremented the array idex, and I'M DONE")
>>
>> - then in that workqueue entry that you start *anyway*, you empty the
>> array and do all the crazy virtio stuff.
>>
>> In fact, while at it, just simplify the VM interface too. Instead of
>> traversing a random number of buddy lists, just trraverse *one* - the
>> top-level one. Are you seriously ever going to shrink or mark
>> read-only anythin *but* something big enough to be in the maximum
>> order?
>>
>> MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
>> want the balloon code to work on smaller things, particularly since
>> the whole interface is fundamentally racy and opportunistic to begin
>> with?
>
> OK, I will implement a new version based on the suggestions. Thanks.

I have been working on a similar series [1] that is more generic, which
solves the problem of giving unused memory back to the host and could be
used to solve the migration problem as well. Can you take a look and see
if you can use my series in some way?

[1] https://www.spinics.net/lists/kvm/msg170113.html

>
> Best,
> Wei
>

--
Regards
Nitesh

Attachments:

signature.asc (836.00 B)
OpenPGP digital signature

2018-06-15 03:50:08

by Wang, Wei W

[permalink] [raw]

Subject: Re: [PULL] vhost: cleanups and fixes

On 06/14/2018 11:01 PM, Nitesh Narayan Lal wrote:
> Hi Wei,
>
>
> On 06/12/2018 07:05 AM, Wei Wang wrote:
>> On 06/12/2018 09:59 AM, Linus Torvalds wrote:
>>> On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin <[email protected]>
>>> wrote:
>>>> Maybe it will help to have GFP_NONE which will make any allocation
>>>> fail if attempted. Linus, would this address your comment?
>>> It would definitely have helped me initially overlook that call chain.
>>>
>>> But then when I started looking at the whole dma_map_page() thing, it
>>> just raised my hackles again.
>>>
>>> I would seriously suggest having a much simpler version for the "no
>>> allocation, no dma mapping" case, so that it's *obvious* that that
>>> never happens.
>>>
>>> So instead of having virtio_balloon_send_free_pages() call a really
>>> generic complex chain of functions that in _some_ cases can do memory
>>> allocation, why isn't there a short-circuited "vitruque_add_datum()"
>>> that is guaranteed to never do anything like that?
>>>
>>> Honestly, I look at "add_one_sg()" and it really doesn't make me
>>> happy. It looks hacky as hell. If I read the code right, you're really
>>> trying to just queue up a simple tuple of <pfn,len>, except you encode
>>> it as a page pointer in order to play games with the SG logic, and
>>> then you hmap that to the ring, except in this case it's all a fake
>>> ring that just adds the cpu-physical address instead.
>>>
>>> And to figuer that out, it's like five layers of indirection through
>>> different helper functions that *can* do more generic things but in
>>> this case don't.
>>>
>>> And you do all of this from a core VM callback function with some
>>> _really_ core VM locks held.
>>>
>>> That makes no sense to me.
>>>
>>> How about this:
>>>
>>> - get rid of all that code
>>>
>>> - make the core VM callback save the "these are the free memory
>>> regions" in a fixed and limited array. One that DOES JUST THAT. No
>>> crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
>>> size, pre-allocated for that virtio instance.
>>>
>>> - make it obvious that what you do in that sequence is ten
>>> instructions and no allocations ("Look ma, I wrote a value to an array
>>> and incremented the array idex, and I'M DONE")
>>>
>>> - then in that workqueue entry that you start *anyway*, you empty the
>>> array and do all the crazy virtio stuff.
>>>
>>> In fact, while at it, just simplify the VM interface too. Instead of
>>> traversing a random number of buddy lists, just trraverse *one* - the
>>> top-level one. Are you seriously ever going to shrink or mark
>>> read-only anythin *but* something big enough to be in the maximum
>>> order?
>>>
>>> MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
>>> want the balloon code to work on smaller things, particularly since
>>> the whole interface is fundamentally racy and opportunistic to begin
>>> with?
>> OK, I will implement a new version based on the suggestions. Thanks.
> I have been working on a similar series [1] that is more generic, which
> solves the problem of giving unused memory back to the host and could be
> used to solve the migration problem as well. Can you take a look and see
> if you can use my series in some way?

Hi Nitesh,

I actually checked the last version, which dates back to last year. It
seems the new version does not have fundamental differences.

Actually there are obvious differences between the two series. This
series provides a simple lightweight method (will continue to post out a
new version with simpler interfaces based on the above suggestions) to
offer free pages hints, and the hints are quit helpful for usages like
accelerating live migration and guest snapshot. If I read that
correctly, that series seems to provide true (guaranteed) free pages
with much more heavyweight logics, but true free pages are not necessary
for the live migration optimization, which is the goal and motivation of
this work. And from my point of view, that series seems more like an
alternative function to ballooning, which takes out free pages (or say
guest unused pages) via allocation.

I will join the discussion in that thread. Probably we would need to
think about other new usages for that series.

Best,
Wei