2014-02-08 19:12:13

by Richard Yao

[permalink] [raw]
Subject: [PATCH 0/2] Attention by Linus Torvalds needed to export symbol he wrote

Dear Linus,

Loading kernel modules off 9p-virtio in a Linux guest causes VM termination
because of a page fault in unmapped memory, so I wrote a patch to fix it. Dave
Miller initially accepted it, but then rejected it because it calls an
unexported symbol from a kernel module, which breaks the build when
CONFIG_NET_9P_VIRTIO=m is set in the kernel config:

https://groups.google.com/forum/#!topic/linux.kernel/eRR7AyLE29Y

>From what I can tell, I need the original author of a symbol to sign-off on any
patch exporting it. git blame says that the original author is you, so I am
sending this pull request to you for approval.

Richard Yao (2):
mm/vmalloc: export is_vmalloc_or_module_addr
9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers

mm/vmalloc.c | 1 +
net/9p/trans_virtio.c | 5 ++++-
2 files changed, 5 insertions(+), 1 deletion(-)

--
1.8.3.2


2014-02-08 19:12:16

by Richard Yao

[permalink] [raw]
Subject: [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr

9p-virtio needs is_vmalloc_or_module_addr exported before a patch can be
merged to prevent the virtio zero-copy routines from triggering a
hypervisor page fault when loading kernel modules:

https://groups.google.com/forum/#!topic/linux.kernel/eRR7AyLE29Y

Without this export, the kernel build breaks with that patch applied and
CONFIG_NET_9P_VIRTIO=m. With this export in place, all is well.

Signed-off-by: Richard Yao <[email protected]>
---
mm/vmalloc.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0fdf968..8a2e54f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -218,6 +218,7 @@ int is_vmalloc_or_module_addr(const void *x)
#endif
return is_vmalloc_addr(x);
}
+EXPORT_SYMBOL(is_vmalloc_or_module_addr);

/*
* Walk a vmap address to the struct page it maps.
--
1.8.3.2

2014-02-08 19:12:35

by Richard Yao

[permalink] [raw]
Subject: [PATCH 2/2] 9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers

The 9p-virtio transport does zero copy on things larger than 1024 bytes
in size. It accomplishes this by returning the physical addresses of
pages to the virtio-pci device. At present, the translation is usually a
bit shift.

However, that approach produces an invalid page address when we
read/write to vmalloc buffers, such as those used for Linux kernle
modules. This causes QEMU to die printing:

qemu-system-x86_64: virtio: trying to map MMIO memory

This patch enables 9p-virtio to correctly handle this case. This not
only enables us to load Linux kernel modules off virtfs, but also
enables ZFS file-based vdevs on virtfs to be used without killing QEMU.

Also, special thanks to both Avi Kivity and Alexander Graf for their
interpretation of QEMU backtraces. Without their guidence, tracking down
this bug would have taken much longer.

Signed-off-by: Richard Yao <[email protected]>
Acked-by: Alexander Graf <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
---
net/9p/trans_virtio.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index cd1e1ed..b2009bc 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -340,7 +340,10 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
int count = nr_pages;
while (nr_pages) {
s = rest_of_page(data);
- pages[index++] = kmap_to_page(data);
+ if (is_vmalloc_or_module_addr(data))
+ pages[index++] = vmalloc_to_page(data);
+ else
+ pages[index++] = kmap_to_page(data);
data += s;
nr_pages--;
}
--
1.8.3.2

2014-02-08 19:45:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr

On Sat, Feb 8, 2014 at 11:12 AM, Richard Yao <[email protected]> wrote:
> 9p-virtio needs is_vmalloc_or_module_addr exported before a patch can be
> merged to prevent the virtio zero-copy routines from triggering a
> hypervisor page fault when loading kernel modules:
>
> https://groups.google.com/forum/#!topic/linux.kernel/eRR7AyLE29Y
>
> Without this export, the kernel build breaks with that patch applied and
> CONFIG_NET_9P_VIRTIO=m. With this export in place, all is well.

I absolutely *detest* this patch.

Not so much because is_vmalloc_or_module_addr() should never be used,
but because the particular use in question is pure and utter garbage.

There are valid reasons to use "is_vmalloc_or_module_addr()", but
those are along the lines of drivers/char/mem.c, which says "return
-ENXIO for this crap".

And btw, that horrid crap called "kmap_to_page()" needs to die too.
When is it *ever* valid to use a kmap'ed page for IO? Here's a clue:
never.

I notice that we have a similar abortion in "get_kernel_page[s]()",
which probably has the same broken source.

We need to get *rid* of this crap, rather than add more of it. I
should have looked more closely, but it came in through Andrew, which
generally makes me go "ok, Andrew sends me a VM patch - apply". In
this case it was clearly a huge mistake.

So who the f*ck sends static module data as IO? Just stop doing that.
What's And that idiotic kmap_to_page() really needs to die too.

Those *disgusting* get_kernel_page[s]() functions came with a
commentary about "The initial user is expected to be NFS.." and that
is still the *only* user. The fact that *everybody* else has been able
to avoid that crap should tell us something.

Mel, Rik, this needs to die. I'm sorry I didn't notice how crap it was earlier.

And what's the backtrace that gets mentioned - but not quoted - for
the horrible 9p crap? So that we can see who the guilty party is that
thinks that it's somehow ok to pass module addresses off to other
kernel users..

Please let's just fix the real problem, don't add more horridness on
top of it because somebody messed up earlier.

Linus

2014-02-08 19:58:37

by Richard Yao

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr

On 02/08/2014 02:45 PM, Linus Torvalds wrote:
> On Sat, Feb 8, 2014 at 11:12 AM, Richard Yao <[email protected]> wrote:
> And what's the backtrace that gets mentioned - but not quoted - for
> the horrible 9p crap? So that we can see who the guilty party is that
> thinks that it's somehow ok to pass module addresses off to other
> kernel users..

My apologies for that. Here is the backtrace:

[<ffffffff814878ce>] p9_virtio_zc_request+0x45e/0x510
[<ffffffff814814ed>] p9_client_zc_rpc.constprop.16+0xfd/0x4f0
[<ffffffff814839dd>] p9_client_read+0x15d/0x240
[<ffffffff811c8440>] v9fs_fid_readn+0x50/0xa0
[<ffffffff811c84a0>] v9fs_file_readn+0x10/0x20
[<ffffffff811c84e7>] v9fs_file_read+0x37/0x70
[<ffffffff8114e3fb>] vfs_read+0x9b/0x160
[<ffffffff81153571>] kernel_read+0x41/0x60
[<ffffffff810c83ab>] copy_module_from_fd.isra.34+0xfb/0x180
[<ffffffff810cc420>] SyS_finit_module+0x70/0xd0
[<ffffffff814a08fd>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff

The addresses are being passed to permit zero-copy between the guest and
the host. You can see this in the original thread back from December,
which is from before I knew how to submit patches properly:

http://www.spinics.net/lists/linux-virtualization/msg21734.html

> Please let's just fix the real problem, don't add more horridness on
> top of it because somebody messed up earlier.

This code avoids a copy between QEMU's 9p server and a guest Linux
kernel. Without it, the 9p-virtio code must allocate additional buffers
into which QEMU will write the data and then copy from those buffers
into the actual kernel buffers. How would you like 9p-virtio to do this?


Attachments:
signature.asc (901.00 B)
OpenPGP digital signature

2014-02-08 20:06:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr

On Sat, Feb 8, 2014 at 11:58 AM, Richard Yao <[email protected]> wrote:
>
> My apologies for that. Here is the backtrace:
>
> [<ffffffff814878ce>] p9_virtio_zc_request+0x45e/0x510
> [<ffffffff814814ed>] p9_client_zc_rpc.constprop.16+0xfd/0x4f0
> [<ffffffff814839dd>] p9_client_read+0x15d/0x240
> [<ffffffff811c8440>] v9fs_fid_readn+0x50/0xa0
> [<ffffffff811c84a0>] v9fs_file_readn+0x10/0x20
> [<ffffffff811c84e7>] v9fs_file_read+0x37/0x70
> [<ffffffff8114e3fb>] vfs_read+0x9b/0x160
> [<ffffffff81153571>] kernel_read+0x41/0x60
> [<ffffffff810c83ab>] copy_module_from_fd.isra.34+0xfb/0x180

So copy_module_from_fd() does read into a vmalloc'ed buffer (which
isn't pretty, but at least it's not like using some statically
allocated module data), but that's a regular vmalloc() afaik.

So I don't see the need for "is_vmalloc_or_module_addr()". The regular
"is_vmalloc_addr()" should be fine, and *is* usable from modules (it's
inline, not exported, but it comes to the same thing wrt module use),
exactly because we have traditionally allowed vmalloc'ed memory to be
used.

So is there some reason why it's not just using that simpler function?

The kmap_to_page() thing is actually worse, but that's preexisting damage.

Linus

2014-02-08 20:44:31

by Richard Yao

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr

On 02/08/2014 03:06 PM, Linus Torvalds wrote:
> On Sat, Feb 8, 2014 at 11:58 AM, Richard Yao <[email protected]> wrote:
>>
>> My apologies for that. Here is the backtrace:
>>
>> [<ffffffff814878ce>] p9_virtio_zc_request+0x45e/0x510
>> [<ffffffff814814ed>] p9_client_zc_rpc.constprop.16+0xfd/0x4f0
>> [<ffffffff814839dd>] p9_client_read+0x15d/0x240
>> [<ffffffff811c8440>] v9fs_fid_readn+0x50/0xa0
>> [<ffffffff811c84a0>] v9fs_file_readn+0x10/0x20
>> [<ffffffff811c84e7>] v9fs_file_read+0x37/0x70
>> [<ffffffff8114e3fb>] vfs_read+0x9b/0x160
>> [<ffffffff81153571>] kernel_read+0x41/0x60
>> [<ffffffff810c83ab>] copy_module_from_fd.isra.34+0xfb/0x180
>
> So copy_module_from_fd() does read into a vmalloc'ed buffer (which
> isn't pretty, but at least it's not like using some statically
> allocated module data), but that's a regular vmalloc() afaik.
>
> So I don't see the need for "is_vmalloc_or_module_addr()". The regular
> "is_vmalloc_addr()" should be fine, and *is* usable from modules (it's
> inline, not exported, but it comes to the same thing wrt module use),
> exactly because we have traditionally allowed vmalloc'ed memory to be
> used.
>
> So is there some reason why it's not just using that simpler function?

My first instinct was to use "is_vmalloc_addr()" as you suggest.
However, is_vmalloc_addr() only applies to the vmalloc region. While all
architectures load kernel modules into virtual memory (to my knowledge),
some architectures do not load them into the vmalloc region.
is_vmalloc_or_module_addr() contains a comment that clearly states this:

int is_vmalloc_or_module_addr(const void *x)
{
/*
* ARM, x86-64 and sparc64 put modules in a special place,
* and fall back on vmalloc() if that fails. Others
* just put it in the vmalloc space.
*/
#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
unsigned long addr = (unsigned long)x;
if (addr >= MODULES_VADDR && addr < MODULES_END)
return 1;
#endif
return is_vmalloc_addr(x);
}

My inspection of the actual code agreed with that comment (at least for
S390), so I decided against using "is_vmalloc_addr()" here.


Attachments:
signature.asc (901.00 B)
OpenPGP digital signature

2014-02-08 22:24:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr

On Sat, Feb 8, 2014 at 12:44 PM, Richard Yao <[email protected]> wrote:
>
> However, is_vmalloc_addr() only applies to the vmalloc region. While all
> architectures load kernel modules into virtual memory (to my knowledge),
> some architectures do not load them into the vmalloc region.

So?

People shouldn't do IO to module data, so who cares if something is a
module address or not?

The thing is, even module *loading* doesn't do IO to the magic module
addresses - it loads the module data into regular vmalloc space, and
then copies it into the final location separately.

And no module should ever do any IO on random static data (and
certainly not on code).

So there is _zero_ reason for a driver or a filesystem to use
is_vmalloc_or_module_addr(). It's just not a valid question to ask.

If somebody uses module data/code addresses, we're *better* off with a
oops or other nasty behavior than to try to make it "work".

Linus

2014-02-08 23:39:16

by Richard Yao

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr

On Feb 8, 2014, at 5:24 PM, Linus Torvalds <[email protected]> wrote:

> On Sat, Feb 8, 2014 at 12:44 PM, Richard Yao <[email protected]> wrote:
>>
>> However, is_vmalloc_addr() only applies to the vmalloc region. While all
>> architectures load kernel modules into virtual memory (to my knowledge),
>> some architectures do not load them into the vmalloc region.
>
> So?
>
> People shouldn't do IO to module data, so who cares if something is a
> module address or not?
>
> The thing is, even module *loading* doesn't do IO to the magic module
> addresses - it loads the module data into regular vmalloc space, and
> then copies it into the final location separately.
>
> And no module should ever do any IO on random static data (and
> certainly not on code).
>
> So there is _zero_ reason for a driver or a filesystem to use
> is_vmalloc_or_module_addr(). It's just not a valid question to ask.
>
> If somebody uses module data/code addresses, we're *better* off with a
> oops or other nasty behavior than to try to make it "work".
>
> Linus

I will modify this to use is_vmalloc_addr() and send it back to the appropriate subsystem maintainer(s). Thank-you for taking the time to write that explanation.-

2014-02-10 12:10:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr

On Sat, Feb 08, 2014 at 11:45:53AM -0800, Linus Torvalds wrote:
> On Sat, Feb 8, 2014 at 11:12 AM, Richard Yao <[email protected]> wrote:
> > 9p-virtio needs is_vmalloc_or_module_addr exported before a patch can be
> > merged to prevent the virtio zero-copy routines from triggering a
> > hypervisor page fault when loading kernel modules:
> >
> > https://groups.google.com/forum/#!topic/linux.kernel/eRR7AyLE29Y
> >
> > Without this export, the kernel build breaks with that patch applied and
> > CONFIG_NET_9P_VIRTIO=m. With this export in place, all is well.
>
> I absolutely *detest* this patch.
>
> Not so much because is_vmalloc_or_module_addr() should never be used,
> but because the particular use in question is pure and utter garbage.
>
> There are valid reasons to use "is_vmalloc_or_module_addr()", but
> those are along the lines of drivers/char/mem.c, which says "return
> -ENXIO for this crap".
>
> And btw, that horrid crap called "kmap_to_page()" needs to die too.
> When is it *ever* valid to use a kmap'ed page for IO? Here's a clue:
> never.
>
> <SNIP>
>
> Those *disgusting* get_kernel_page[s]() functions came with a
> commentary about "The initial user is expected to be NFS.." and that
> is still the *only* user. The fact that *everybody* else has been able
> to avoid that crap should tell us something.

kmap_to_page, get_kernel_page and the NFS user for it are already scheduled
for death. Dave Kleikamp has been working on a series "Issue O_DIRECT aio
using bio_vec" that replaces what NFS currently does with something far
more sane. The series does not remove the helpers on top, probably because
of the 9P use of it, but if that was changed then the helpers would finally
die. I'm not sure why Dave's series never got merged so added him to the
cc to find out.

--
Mel Gorman
SUSE Labs

2014-02-10 14:41:41

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr

On 02/10/2014 06:10 AM, Mel Gorman wrote:
> On Sat, Feb 08, 2014 at 11:45:53AM -0800, Linus Torvalds wrote:
>> On Sat, Feb 8, 2014 at 11:12 AM, Richard Yao <[email protected]> wrote:
>>> 9p-virtio needs is_vmalloc_or_module_addr exported before a patch can be
>>> merged to prevent the virtio zero-copy routines from triggering a
>>> hypervisor page fault when loading kernel modules:
>>>
>>> https://groups.google.com/forum/#!topic/linux.kernel/eRR7AyLE29Y
>>>
>>> Without this export, the kernel build breaks with that patch applied and
>>> CONFIG_NET_9P_VIRTIO=m. With this export in place, all is well.
>>
>> I absolutely *detest* this patch.
>>
>> Not so much because is_vmalloc_or_module_addr() should never be used,
>> but because the particular use in question is pure and utter garbage.
>>
>> There are valid reasons to use "is_vmalloc_or_module_addr()", but
>> those are along the lines of drivers/char/mem.c, which says "return
>> -ENXIO for this crap".
>>
>> And btw, that horrid crap called "kmap_to_page()" needs to die too.
>> When is it *ever* valid to use a kmap'ed page for IO? Here's a clue:
>> never.
>>
>> <SNIP>
>>
>> Those *disgusting* get_kernel_page[s]() functions came with a
>> commentary about "The initial user is expected to be NFS.." and that
>> is still the *only* user. The fact that *everybody* else has been able
>> to avoid that crap should tell us something.
>
> kmap_to_page, get_kernel_page and the NFS user for it are already scheduled
> for death. Dave Kleikamp has been working on a series "Issue O_DIRECT aio
> using bio_vec" that replaces what NFS currently does with something far
> more sane. The series does not remove the helpers on top, probably because
> of the 9P use of it, but if that was changed then the helpers would finally
> die. I'm not sure why Dave's series never got merged so added him to the
> cc to find out.

The patchset I submitted was too ugly to live. Linus and Al Viro both
had numerous issues with the implementation, but not the overall concept.

Al has the same idea for a io-vector container that can hold either an
iovec or a kernel vector and is starting to work on his own
implementation. Once he settles on the particulars of new
aio_write/aio_read methods, I should be able to get the rest of my
patchset to fall into line with it.

http://www.spinics.net/lists/linux-fsdevel/msg71938.html

Shaggy

2014-02-10 19:34:05

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr

On 02/08/2014 02:24 PM, Linus Torvalds wrote:
> On Sat, Feb 8, 2014 at 12:44 PM, Richard Yao <[email protected]> wrote:
>>
>> However, is_vmalloc_addr() only applies to the vmalloc region. While all
>> architectures load kernel modules into virtual memory (to my knowledge),
>> some architectures do not load them into the vmalloc region.
>
> So?
>
> People shouldn't do IO to module data, so who cares if something is a
> module address or not?
>
> The thing is, even module *loading* doesn't do IO to the magic module
> addresses - it loads the module data into regular vmalloc space, and
> then copies it into the final location separately.
>
> And no module should ever do any IO on random static data (and
> certainly not on code).
>
> So there is _zero_ reason for a driver or a filesystem to use
> is_vmalloc_or_module_addr(). It's just not a valid question to ask.
>
> If somebody uses module data/code addresses, we're *better* off with a
> oops or other nasty behavior than to try to make it "work".

I agree that reading to module space is awful, but is it obviously
terrible for a module to do this:

static const char header[] = {...};
kernel_write(file, header, sizeof(header), 0);

The current nasty behavior is doing the I/O to the wrong place if the
appropriate CONFIG_DEBUG option isn't set. That IMO sucks.

--Andy

2014-02-10 19:43:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr

On Mon, Feb 10, 2014 at 11:33 AM, Andy Lutomirski <[email protected]> wrote:
>
> I agree that reading to module space is awful, but is it obviously
> terrible for a module to do this:
>
> static const char header[] = {...};
> kernel_write(file, header, sizeof(header), 0);

Yes, it is terrible. Don't do it. It shouldn't work.

Now, whether it is "obvious" or not is immaterial. It might be hard to
notice, but let's face it, we are kernel programmers. Our interfaces
aren't designed for convenience, they are for working well and
efficiently. And if that happens to mean that you shouldn't do IO on
kmap'ed pages, or use random static data in your modules, that is what
it means.

We have lots of other rules too. People should deal with it, and
realize that in the kernel we absolutely *have* to try to minimize the
problem space as much as possible.

The usual computer sciencey approach of "make things as generic as
possible so that you can reuse the code" stuff is generally bullshit
even in other contexts (example: STL), but it's _particularly_ bad for
the kernel. It's much better to have strict rules about what is
acceptable.

Linus