On Thu, Jan 19, 2023 at 03:57:17PM +0200, Alexander Shishkin wrote:
> From: Andi Kleen <[email protected]>
>
> The ADD_PORT operation reads and sanity checks the port id multiple
> times from the untrusted host. This is not safe because a malicious
> host could change it between reads.
>
> Read the port id only once and cache it for subsequent uses.
>
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Alexander Shishkin <[email protected]>
> Cc: Amit Shah <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> ---
> drivers/char/virtio_console.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> index f4fd5fe7cd3a..6599c2956ba4 100644
> --- a/drivers/char/virtio_console.c
> +++ b/drivers/char/virtio_console.c
> @@ -1563,10 +1563,13 @@ static void handle_control_message(struct virtio_device *vdev,
> struct port *port;
> size_t name_size;
> int err;
> + unsigned id;
>
> cpkt = (struct virtio_console_control *)(buf->buf + buf->offset);
>
> - port = find_port_by_id(portdev, virtio32_to_cpu(vdev, cpkt->id));
> + /* Make sure the host cannot change id under us */
> + id = virtio32_to_cpu(vdev, READ_ONCE(cpkt->id));

Why READ_ONCE()?

And how can it change under us? Is the message still under control of
the "host"? If so, that feels wrong as this is all in kernel memory,
not userspace memory right?

If you are dealing with memory from a different process that you do not
trust, then you need to copy EVERYTHING at once. Don't piecemeal copy
bits and bobs in all different places please. Do it once and then parse
the local structure properly.

Otherwise this is going to be impossible to actually maintain over
time...

thanks,

greg k-h
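Greg's suggestion (copy the untrusted structure out once, then validate and parse only the local copy) can be sketched in userspace C. This is a hypothetical model, not driver code. `struct ctrl_hdr` mirrors the `id`/`event`/`value` fields the driver reads, `snapshot_ctrl_hdr()` is an invented helper, and byte-order conversion such as `virtio32_to_cpu()` is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the control header fields (id, event, value).
 * Byte-order conversion is omitted to keep the sketch small. */
struct ctrl_hdr {
	uint32_t id;
	uint16_t event;
	uint16_t value;
};

/*
 * Copy the header out of memory the other side may still be writing,
 * exactly once. All later validation and dispatch must read *out only.
 * Returns 0 on success, -1 if the buffer cannot hold a whole header.
 */
static int snapshot_ctrl_hdr(const void *shared, size_t len,
			     struct ctrl_hdr *out)
{
	assert(shared != NULL && out != NULL);
	if (len < sizeof(*out))
		return -1;
	memcpy(out, shared, sizeof(*out));
	return 0;
}
```

The copy does not stop the other side from writing; it only guarantees that every later check and use sees the same bytes. A concurrent writer can still leave a torn snapshot, which is why the copy has to be validated as a whole before anything acts on it.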
Greg Kroah-Hartman <[email protected]> writes:
> On Thu, Jan 19, 2023 at 03:57:17PM +0200, Alexander Shishkin wrote:
>> From: Andi Kleen <[email protected]>
>>
>> The ADD_PORT operation reads and sanity checks the port id multiple
>> times from the untrusted host. This is not safe because a malicious
>> host could change it between reads.
>>
>> Read the port id only once and cache it for subsequent uses.
>>
>> Signed-off-by: Andi Kleen <[email protected]>
>> Signed-off-by: Alexander Shishkin <[email protected]>
>> Cc: Amit Shah <[email protected]>
>> Cc: Arnd Bergmann <[email protected]>
>> Cc: Greg Kroah-Hartman <[email protected]>
>> ---
>> drivers/char/virtio_console.c | 10 ++++++----
>> 1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
>> index f4fd5fe7cd3a..6599c2956ba4 100644
>> --- a/drivers/char/virtio_console.c
>> +++ b/drivers/char/virtio_console.c
>> @@ -1563,10 +1563,13 @@ static void handle_control_message(struct virtio_device *vdev,
>> struct port *port;
>> size_t name_size;
>> int err;
>> + unsigned id;
>>
>> cpkt = (struct virtio_console_control *)(buf->buf + buf->offset);
>>
>> - port = find_port_by_id(portdev, virtio32_to_cpu(vdev, cpkt->id));
>> + /* Make sure the host cannot change id under us */
>> + id = virtio32_to_cpu(vdev, READ_ONCE(cpkt->id));
>
> Why READ_ONCE()?
>
> And how can it change under us? Is the message still under control of
> the "host"? If so, that feels wrong as this is all in kernel memory,
> not userspace memory right?
>
> If you are dealing with memory from a different process that you do not
> trust, then you need to copy EVERYTHING at once. Don't piecemeal copy
> bits and bobs in all different places please. Do it once and then parse
> the local structure properly.

This is the device memory or the VM host memory, not userspace or
another process. And it can change under us willy-nilly.

The thing is, we only need to cache two things to correctly process the
request. Copying everything, on the other hand, would involve the entire
buffer, not just the *cpkt, but also stuff that follows, which also
differs between different event types. And we also don't care if the
rest of it changes under us.

> Otherwise this is going to be impossible to actually maintain over
> time...

An 'id' can't possibly be worse to maintain than multiple instances of
'virtio32_to_cpu(vdev, cpkt->id)' sprinkled around the code.

Thanks,
--
Alex
On Thu, Jan 19, 2023 at 07:48:35PM +0200, Alexander Shishkin wrote:
> Greg Kroah-Hartman <[email protected]> writes:
>
> > On Thu, Jan 19, 2023 at 03:57:17PM +0200, Alexander Shishkin wrote:
> >> From: Andi Kleen <[email protected]>
> >>
> >> The ADD_PORT operation reads and sanity checks the port id multiple
> >> times from the untrusted host. This is not safe because a malicious
> >> host could change it between reads.
> >>
> >> Read the port id only once and cache it for subsequent uses.
> >>
> >> Signed-off-by: Andi Kleen <[email protected]>
> >> Signed-off-by: Alexander Shishkin <[email protected]>
> >> Cc: Amit Shah <[email protected]>
> >> Cc: Arnd Bergmann <[email protected]>
> >> Cc: Greg Kroah-Hartman <[email protected]>
> >> ---
> >> drivers/char/virtio_console.c | 10 ++++++----
> >> 1 file changed, 6 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> >> index f4fd5fe7cd3a..6599c2956ba4 100644
> >> --- a/drivers/char/virtio_console.c
> >> +++ b/drivers/char/virtio_console.c
> >> @@ -1563,10 +1563,13 @@ static void handle_control_message(struct virtio_device *vdev,
> >> struct port *port;
> >> size_t name_size;
> >> int err;
> >> + unsigned id;
> >>
> >> cpkt = (struct virtio_console_control *)(buf->buf + buf->offset);
> >>
> >> - port = find_port_by_id(portdev, virtio32_to_cpu(vdev, cpkt->id));
> >> + /* Make sure the host cannot change id under us */
> >> + id = virtio32_to_cpu(vdev, READ_ONCE(cpkt->id));
> >
> > Why READ_ONCE()?
> >
> > And how can it change under us? Is the message still under control of
> > the "host"? If so, that feels wrong as this is all in kernel memory,
> > not userspace memory right?
> >
> > If you are dealing with memory from a different process that you do not
> > trust, then you need to copy EVERYTHING at once. Don't piecemeal copy
> > bits and bobs in all different places please. Do it once and then parse
> > the local structure properly.
>
> This is the device memory or the VM host memory, not userspace or
> another process. And it can change under us willy-nilly.
Then you need to copy it out once, and then only deal with the local
copy. Otherwise you have an incomplete snapshot.
> The thing is, we only need to cache two things to correctly process the
> request. Copying everything, on the other hand, would involve the entire
> buffer, not just the *cpkt, but also stuff that follows, which also
> differs between different event types. And we also don't care if the
> rest of it changes under us.

That feels broken if you do not "trust" that other side. And what
prevents the buffer from changing after you validated the other part?

For virtio, I thought you always implied that you did trust the other
side, when has that changed? Where was that new security model for the
kernel discussed?

Are you sure this is even viable? What is the threat model you are
attempting to add to the driver here?

> > Otherwise this is going to be impossible to actually maintain over
> > time...
>
> An 'id' can't possibly be worse to maintain than multiple instances of
> 'virtio32_to_cpu(vdev, cpkt->id)' sprinkled around the code.
Again, copy what you want out and then act on that. If it can change
under you, and you do not trust it, then you have to work only on a
snapshot that you have verified.
thanks,
greg k-h
Greg Kroah-Hartman <[email protected]> writes:
> Then you need to copy it out once, and then only deal with the local
> copy. Otherwise you have an incomplete snapshot.

Ok, would you be partial to something like this:

From 1bc9bb84004154376c2a0cf643d53257da6d1cd7 Mon Sep 17 00:00:00 2001
From: Alexander Shishkin <[email protected]>
Date: Thu, 19 Jan 2023 21:59:02 +0200
Subject: [PATCH] virtio console: Keep a local copy of the control structure

When handling control messages, instead of peeking at the device memory
to obtain bits of the control structure, take a snapshot of it once and
use it instead, to prevent it from changing under us. This avoids races
between port id validation and control event decoding, which can lead
to, for example, a NULL dereference in port removal of a nonexistent
port.

The control structure is small enough (8 bytes) that it can be cached
directly on the stack.

Signed-off-by: Alexander Shishkin <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Amit Shah <[email protected]>
---
drivers/char/virtio_console.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 6a821118d553..42be0991a72f 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1559,23 +1559,24 @@ static void handle_control_message(struct virtio_device *vdev,
struct ports_device *portdev,
struct port_buffer *buf)
{
- struct virtio_console_control *cpkt;
+ struct virtio_console_control cpkt;
struct port *port;
size_t name_size;
int err;

- cpkt = (struct virtio_console_control *)(buf->buf + buf->offset);
+ /* Keep a local copy of the control structure */
+ memcpy(&cpkt, buf->buf + buf->offset, sizeof(cpkt));

- port = find_port_by_id(portdev, virtio32_to_cpu(vdev, cpkt->id));
+ port = find_port_by_id(portdev, virtio32_to_cpu(vdev, cpkt.id));
if (!port &&
- cpkt->event != cpu_to_virtio16(vdev, VIRTIO_CONSOLE_PORT_ADD)) {
+ cpkt.event != cpu_to_virtio16(vdev, VIRTIO_CONSOLE_PORT_ADD)) {
/* No valid header at start of buffer. Drop it. */
dev_dbg(&portdev->vdev->dev,
- "Invalid index %u in control packet\n", cpkt->id);
+ "Invalid index %u in control packet\n", cpkt.id);
return;
}

- switch (virtio16_to_cpu(vdev, cpkt->event)) {
+ switch (virtio16_to_cpu(vdev, cpkt.event)) {
case VIRTIO_CONSOLE_PORT_ADD:
if (port) {
dev_dbg(&portdev->vdev->dev,
@@ -1583,21 +1584,21 @@ static void handle_control_message(struct virtio_device *vdev,
send_control_msg(port, VIRTIO_CONSOLE_PORT_READY, 1);
break;
}
- if (virtio32_to_cpu(vdev, cpkt->id) >=
+ if (virtio32_to_cpu(vdev, cpkt.id) >=
portdev->max_nr_ports) {
dev_warn(&portdev->vdev->dev,
"Request for adding port with "
"out-of-bound id %u, max. supported id: %u\n",
- cpkt->id, portdev->max_nr_ports - 1);
+ cpkt.id, portdev->max_nr_ports - 1);
break;
}
- add_port(portdev, virtio32_to_cpu(vdev, cpkt->id));
+ add_port(portdev, virtio32_to_cpu(vdev, cpkt.id));
break;
case VIRTIO_CONSOLE_PORT_REMOVE:
unplug_port(port);
break;
case VIRTIO_CONSOLE_CONSOLE_PORT:
- if (!cpkt->value)
+ if (!cpkt.value)
break;
if (is_console_port(port))
break;
@@ -1618,7 +1619,7 @@ static void handle_control_message(struct virtio_device *vdev,
if (!is_console_port(port))
break;

- memcpy(&size, buf->buf + buf->offset + sizeof(*cpkt),
+ memcpy(&size, buf->buf + buf->offset + sizeof(cpkt),
sizeof(size));
set_console_size(port, size.rows, size.cols);

@@ -1627,7 +1628,7 @@ static void handle_control_message(struct virtio_device *vdev,
break;
}
case VIRTIO_CONSOLE_PORT_OPEN:
- port->host_connected = virtio16_to_cpu(vdev, cpkt->value);
+ port->host_connected = virtio16_to_cpu(vdev, cpkt.value);
wake_up_interruptible(&port->waitqueue);
/*
* If the host port got closed and the host had any
@@ -1658,7 +1659,7 @@ static void handle_control_message(struct virtio_device *vdev,
* Skip the size of the header and the cpkt to get the size
* of the name that was sent
*/
- name_size = buf->len - buf->offset - sizeof(*cpkt) + 1;
+ name_size = buf->len - buf->offset - sizeof(cpkt) + 1;

port->name = kmalloc(name_size, GFP_KERNEL);
if (!port->name) {
@@ -1666,7 +1667,7 @@ static void handle_control_message(struct virtio_device *vdev,
"Not enough space to store port name\n");
break;
}
- strncpy(port->name, buf->buf + buf->offset + sizeof(*cpkt),
+ strncpy(port->name, buf->buf + buf->offset + sizeof(cpkt),
name_size - 1);
port->name[name_size - 1] = 0;

--
2.39.0
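The "small enough (8 bytes)" claim in the commit message, and the fact that swapping `sizeof(*cpkt)` for `sizeof(cpkt)` leaves the payload offsets unchanged, can be checked with a small userspace sketch. `struct ctrl_hdr` is a hypothetical mirror of the three fields the patch touches (a 32-bit `id`, then 16-bit `event` and `value`); `payload_offset()` and `port_name_size()` are invented helpers that restate the patch's arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ctrl_hdr {
	uint32_t id;
	uint16_t event;
	uint16_t value;
};

/* Compile-time checks that the on-stack copy really is 8 bytes and
 * that the fields sit where a 32+16+16-bit header expects them. */
_Static_assert(sizeof(struct ctrl_hdr) == 8, "control header is 8 bytes");
_Static_assert(offsetof(struct ctrl_hdr, event) == 4, "event follows id");
_Static_assert(offsetof(struct ctrl_hdr, value) == 6, "value follows event");

/* Start of the data after the header, as in buf->offset + sizeof(cpkt). */
static size_t payload_offset(size_t buf_offset)
{
	return buf_offset + sizeof(struct ctrl_hdr);
}

/* Port-name buffer size including the NUL terminator, as in
 * buf->len - buf->offset - sizeof(cpkt) + 1. The caller must ensure the
 * buffer holds at least a full header, or the subtraction wraps. */
static size_t port_name_size(size_t buf_len, size_t buf_offset)
{
	assert(buf_len >= buf_offset + sizeof(struct ctrl_hdr));
	return buf_len - buf_offset - sizeof(struct ctrl_hdr) + 1;
}
```

Because `sizeof(cpkt)` on the local struct and `sizeof(*cpkt)` on the old pointer name the same type, the offsets used for the resize payload and the port name come out the same before and after the patch.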
On Thu, Jan 19, 2023 at 10:13:18PM +0200, Alexander Shishkin wrote:
> Greg Kroah-Hartman <[email protected]> writes:
>
> > Then you need to copy it out once, and then only deal with the local
> > copy. Otherwise you have an incomplete snapshot.
>
> Ok, would you be partial to something like this:
>
> From 1bc9bb84004154376c2a0cf643d53257da6d1cd7 Mon Sep 17 00:00:00 2001
> From: Alexander Shishkin <[email protected]>
> Date: Thu, 19 Jan 2023 21:59:02 +0200
> Subject: [PATCH] virtio console: Keep a local copy of the control structure
>
> When handling control messages, instead of peeking at the device memory
> to obtain bits of the control structure, take a snapshot of it once and
> use it instead, to prevent it from changing under us. This avoids races
> between port id validation and control event decoding, which can lead
> to, for example, a NULL dereference in port removal of a nonexistent
> port.
>
> The control structure is small enough (8 bytes) that it can be cached
> directly on the stack.
>
> Signed-off-by: Alexander Shishkin <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Amit Shah <[email protected]>
> ---
> drivers/char/virtio_console.c | 29 +++++++++++++++--------------
> 1 file changed, 15 insertions(+), 14 deletions(-)
Yes, this looks much better, thanks!
Reviewed-by: Greg Kroah-Hartman <[email protected]>
On Thu, Jan 19, 2023 at 10:13:18PM +0200, Alexander Shishkin wrote:
> Greg Kroah-Hartman <[email protected]> writes:
>
> > Then you need to copy it out once, and then only deal with the local
> > copy. Otherwise you have an incomplete snapshot.
>
> Ok, would you be partial to something like this:
>
> From 1bc9bb84004154376c2a0cf643d53257da6d1cd7 Mon Sep 17 00:00:00 2001
> From: Alexander Shishkin <[email protected]>
> Date: Thu, 19 Jan 2023 21:59:02 +0200
> Subject: [PATCH] virtio console: Keep a local copy of the control structure
>
> When handling control messages, instead of peeking at the device memory
> to obtain bits of the control structure,
Except the message makes it seem that we are getting data from
device memory, when we do nothing of the kind.
> take a snapshot of it once and
> use it instead, to prevent it from changing under us. This avoids races
> between port id validation and control event decoding, which can lead
> to, for example, a NULL dereference in port removal of a nonexistent
> port.
>
> The control structure is small enough (8 bytes) that it can be cached
> directly on the stack.
I still have no real idea why we want a copy here.
If device can poke anywhere at memory then it can crash kernel anyway.
If there's a bounce buffer or an iommu or some other protection
in place, then this memory can no longer change by the time
we look at it.
> Signed-off-by: Alexander Shishkin <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Amit Shah <[email protected]>
> ---
> drivers/char/virtio_console.c | 29 +++++++++++++++--------------
> 1 file changed, 15 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> index 6a821118d553..42be0991a72f 100644
> --- a/drivers/char/virtio_console.c
> +++ b/drivers/char/virtio_console.c
> @@ -1559,23 +1559,24 @@ static void handle_control_message(struct virtio_device *vdev,
> struct ports_device *portdev,
> struct port_buffer *buf)
> {
> - struct virtio_console_control *cpkt;
> + struct virtio_console_control cpkt;
> struct port *port;
> size_t name_size;
> int err;
>
> - cpkt = (struct virtio_console_control *)(buf->buf + buf->offset);
> + /* Keep a local copy of the control structure */
> + memcpy(&cpkt, buf->buf + buf->offset, sizeof(cpkt));
>
> - port = find_port_by_id(portdev, virtio32_to_cpu(vdev, cpkt->id));
> + port = find_port_by_id(portdev, virtio32_to_cpu(vdev, cpkt.id));
> if (!port &&
> - cpkt->event != cpu_to_virtio16(vdev, VIRTIO_CONSOLE_PORT_ADD)) {
> + cpkt.event != cpu_to_virtio16(vdev, VIRTIO_CONSOLE_PORT_ADD)) {
> /* No valid header at start of buffer. Drop it. */
> dev_dbg(&portdev->vdev->dev,
> - "Invalid index %u in control packet\n", cpkt->id);
> + "Invalid index %u in control packet\n", cpkt.id);
> return;
> }
>
> - switch (virtio16_to_cpu(vdev, cpkt->event)) {
> + switch (virtio16_to_cpu(vdev, cpkt.event)) {
> case VIRTIO_CONSOLE_PORT_ADD:
> if (port) {
> dev_dbg(&portdev->vdev->dev,
> @@ -1583,21 +1584,21 @@ static void handle_control_message(struct virtio_device *vdev,
> send_control_msg(port, VIRTIO_CONSOLE_PORT_READY, 1);
> break;
> }
> - if (virtio32_to_cpu(vdev, cpkt->id) >=
> + if (virtio32_to_cpu(vdev, cpkt.id) >=
> portdev->max_nr_ports) {
> dev_warn(&portdev->vdev->dev,
> "Request for adding port with "
> "out-of-bound id %u, max. supported id: %u\n",
> - cpkt->id, portdev->max_nr_ports - 1);
> + cpkt.id, portdev->max_nr_ports - 1);
> break;
> }
> - add_port(portdev, virtio32_to_cpu(vdev, cpkt->id));
> + add_port(portdev, virtio32_to_cpu(vdev, cpkt.id));
> break;
> case VIRTIO_CONSOLE_PORT_REMOVE:
> unplug_port(port);
> break;
> case VIRTIO_CONSOLE_CONSOLE_PORT:
> - if (!cpkt->value)
> + if (!cpkt.value)
> break;
> if (is_console_port(port))
> break;
> @@ -1618,7 +1619,7 @@ static void handle_control_message(struct virtio_device *vdev,
> if (!is_console_port(port))
> break;
>
> - memcpy(&size, buf->buf + buf->offset + sizeof(*cpkt),
> + memcpy(&size, buf->buf + buf->offset + sizeof(cpkt),
> sizeof(size));
> set_console_size(port, size.rows, size.cols);
>
> @@ -1627,7 +1628,7 @@ static void handle_control_message(struct virtio_device *vdev,
> break;
> }
> case VIRTIO_CONSOLE_PORT_OPEN:
> - port->host_connected = virtio16_to_cpu(vdev, cpkt->value);
> + port->host_connected = virtio16_to_cpu(vdev, cpkt.value);
> wake_up_interruptible(&port->waitqueue);
> /*
> * If the host port got closed and the host had any
> @@ -1658,7 +1659,7 @@ static void handle_control_message(struct virtio_device *vdev,
> * Skip the size of the header and the cpkt to get the size
> * of the name that was sent
> */
> - name_size = buf->len - buf->offset - sizeof(*cpkt) + 1;
> + name_size = buf->len - buf->offset - sizeof(cpkt) + 1;
>
> port->name = kmalloc(name_size, GFP_KERNEL);
> if (!port->name) {
> @@ -1666,7 +1667,7 @@ static void handle_control_message(struct virtio_device *vdev,
> "Not enough space to store port name\n");
> break;
> }
> - strncpy(port->name, buf->buf + buf->offset + sizeof(*cpkt),
> + strncpy(port->name, buf->buf + buf->offset + sizeof(cpkt),
> name_size - 1);
> port->name[name_size - 1] = 0;
>
> --
> 2.39.0
"Michael S. Tsirkin" <[email protected]> writes:
> On Thu, Jan 19, 2023 at 10:13:18PM +0200, Alexander Shishkin wrote:
>> When handling control messages, instead of peeking at the device memory
>> to obtain bits of the control structure,
>
> Except the message makes it seem that we are getting data from
> device memory, when we do nothing of the kind.
We can be, see below.
>> take a snapshot of it once and
>> use it instead, to prevent it from changing under us. This avoids races
>> between port id validation and control event decoding, which can lead
>> to, for example, a NULL dereference in port removal of a nonexistent
>> port.
>>
>> The control structure is small enough (8 bytes) that it can be cached
>> directly on the stack.
>
> I still have no real idea why we want a copy here.
> If device can poke anywhere at memory then it can crash kernel anyway.
> If there's a bounce buffer or an iommu or some other protection
> in place, then this memory can no longer change by the time
> we look at it.
We can have shared pages between the host and guest without bounce
buffers in between, so they can be both looking directly at the same
page.
Regards,
--
Alex
On Fri, Jan 27, 2023 at 01:55:43PM +0200, Alexander Shishkin wrote:
> "Michael S. Tsirkin" <[email protected]> writes:
>
> > On Thu, Jan 19, 2023 at 10:13:18PM +0200, Alexander Shishkin wrote:
> >> When handling control messages, instead of peeking at the device memory
> >> to obtain bits of the control structure,
> >
> > Except the message makes it seem that we are getting data from
> > device memory, when we do nothing of the kind.
>
> We can be, see below.
>
> >> take a snapshot of it once and
> >> use it instead, to prevent it from changing under us. This avoids races
> >> between port id validation and control event decoding, which can lead
> >> to, for example, a NULL dereference in port removal of a nonexistent
> >> port.
> >>
> >> The control structure is small enough (8 bytes) that it can be cached
> >> directly on the stack.
> >
> > I still have no real idea why we want a copy here.
> > If device can poke anywhere at memory then it can crash kernel anyway.
> > If there's a bounce buffer or an iommu or some other protection
> > in place, then this memory can no longer change by the time
> > we look at it.
>
> We can have shared pages between the host and guest without bounce
> buffers in between, so they can be both looking directly at the same
> page.
>
> Regards,
How does this configuration work? What else is in this page?
> --
> Alex
"Michael S. Tsirkin" <[email protected]> writes:
> On Fri, Jan 27, 2023 at 01:55:43PM +0200, Alexander Shishkin wrote:
>> "Michael S. Tsirkin" <[email protected]> writes:
>>
>> > On Thu, Jan 19, 2023 at 10:13:18PM +0200, Alexander Shishkin wrote:
>> >> When handling control messages, instead of peeking at the device memory
>> >> to obtain bits of the control structure,
>> >
>> > Except the message makes it seem that we are getting data from
>> > device memory, when we do nothing of the kind.
>>
>> We can be, see below.
>>
>> >> take a snapshot of it once and
>> >> use it instead, to prevent it from changing under us. This avoids races
>> >> between port id validation and control event decoding, which can lead
>> >> to, for example, a NULL dereference in port removal of a nonexistent
>> >> port.
>> >>
>> >> The control structure is small enough (8 bytes) that it can be cached
>> >> directly on the stack.
>> >
>> > I still have no real idea why we want a copy here.
>> > If device can poke anywhere at memory then it can crash kernel anyway.
>> > If there's a bounce buffer or an iommu or some other protection
>> > in place, then this memory can no longer change by the time
>> > we look at it.
>>
>> We can have shared pages between the host and guest without bounce
>> buffers in between, so they can be both looking directly at the same
>> page.
>>
>> Regards,
>
> How does this configuration work? What else is in this page?
So, for example in TDX, you have certain pages as "shared", as in
between guest and hypervisor. You can have virtio ring(s) in such
pages. It's likely that there'd be a swiotlb buffer there instead, but
sharing pages between host virtio and guest virtio drivers is possible.
Apologies if the language is confusing, I hope I'm answering the
question.
Regards,
--
Alex
On Fri, Jan 27, 2023 at 02:47:55PM +0200, Alexander Shishkin wrote:
> "Michael S. Tsirkin" <[email protected]> writes:
>
> > On Fri, Jan 27, 2023 at 01:55:43PM +0200, Alexander Shishkin wrote:
> >> "Michael S. Tsirkin" <[email protected]> writes:
> >>
> >> > On Thu, Jan 19, 2023 at 10:13:18PM +0200, Alexander Shishkin wrote:
> >> >> When handling control messages, instead of peeking at the device memory
> >> >> to obtain bits of the control structure,
> >> >
> >> > Except the message makes it seem that we are getting data from
> >> > device memory, when we do nothing of the kind.
> >>
> >> We can be, see below.
> >>
> >> >> take a snapshot of it once and
> >> >> use it instead, to prevent it from changing under us. This avoids races
> >> >> between port id validation and control event decoding, which can lead
> >> >> to, for example, a NULL dereference in port removal of a nonexistent
> >> >> port.
> >> >>
> >> >> The control structure is small enough (8 bytes) that it can be cached
> >> >> directly on the stack.
> >> >
> >> > I still have no real idea why we want a copy here.
> >> > If device can poke anywhere at memory then it can crash kernel anyway.
> >> > If there's a bounce buffer or an iommu or some other protection
> >> > in place, then this memory can no longer change by the time
> >> > we look at it.
> >>
> >> We can have shared pages between the host and guest without bounce
> >> buffers in between, so they can be both looking directly at the same
> >> page.
> >>
> >> Regards,
> >
> > How does this configuration work? What else is in this page?
>
> So, for example in TDX, you have certain pages as "shared", as in
> between guest and hypervisor. You can have virtio ring(s) in such
> pages. It's likely that there'd be a swiotlb buffer there instead, but
> sharing pages between host virtio and guest virtio drivers is possible.
If it is shared, then what does this mean? Do we then need to copy
everything out of that buffer first before doing anything with it
because the data could change later on? Or do we not trust anything in
it at all and we throw it away? Or something else (trust for a short
while and then we don't?)
Please be specific as to what you want to see happen here, and why.
thanks,
greg k-h
On Fri, Jan 27, 2023 at 02:47:55PM +0200, Alexander Shishkin wrote:
> "Michael S. Tsirkin" <[email protected]> writes:
>
> > On Fri, Jan 27, 2023 at 01:55:43PM +0200, Alexander Shishkin wrote:
> >> "Michael S. Tsirkin" <[email protected]> writes:
> >>
> >> > On Thu, Jan 19, 2023 at 10:13:18PM +0200, Alexander Shishkin wrote:
> >> >> When handling control messages, instead of peeking at the device memory
> >> >> to obtain bits of the control structure,
> >> >
> >> > Except the message makes it seem that we are getting data from
> >> > device memory, when we do nothing of the kind.
> >>
> >> We can be, see below.
> >>
> >> >> take a snapshot of it once and
> >> >> use it instead, to prevent it from changing under us. This avoids races
> >> >> between port id validation and control event decoding, which can lead
> >> >> to, for example, a NULL dereference in port removal of a nonexistent
> >> >> port.
> >> >>
> >> >> The control structure is small enough (8 bytes) that it can be cached
> >> >> directly on the stack.
> >> >
> >> > I still have no real idea why we want a copy here.
> >> > If device can poke anywhere at memory then it can crash kernel anyway.
> >> > If there's a bounce buffer or an iommu or some other protection
> >> > in place, then this memory can no longer change by the time
> >> > we look at it.
> >>
> >> We can have shared pages between the host and guest without bounce
> >> buffers in between, so they can be both looking directly at the same
> >> page.
> >>
> >> Regards,
> >
> > How does this configuration work? What else is in this page?
>
> So, for example in TDX, you have certain pages as "shared", as in
> between guest and hypervisor. You can have virtio ring(s) in such
> pages.
That one's marked as dma coherent.
> It's likely that there'd be a swiotlb buffer there instead, but
> sharing pages between host virtio and guest virtio drivers is possible.
It's not something console does though, does it?
> Apologies if the language is confusing, I hope I'm answering the
> question.
>
> Regards,
> --
> Alex

I'd like an answer to when the console driver shares the buffer
in question, not to when some pages are shared in general.
--
MST
Greg Kroah-Hartman <[email protected]> writes:
> On Fri, Jan 27, 2023 at 02:47:55PM +0200, Alexander Shishkin wrote:
>> "Michael S. Tsirkin" <[email protected]> writes:
>>
>> > On Fri, Jan 27, 2023 at 01:55:43PM +0200, Alexander Shishkin wrote:
>> >> We can have shared pages between the host and guest without bounce
>> >> buffers in between, so they can be both looking directly at the same
>> >> page.
>> >>
>> >> Regards,
>> >
>> > How does this configuration work? What else is in this page?
>>
>> So, for example in TDX, you have certain pages as "shared", as in
>> between guest and hypervisor. You can have virtio ring(s) in such
>> pages. It's likely that there'd be a swiotlb buffer there instead, but
>> sharing pages between host virtio and guest virtio drivers is possible.
>
> If it is shared, then what does this mean? Do we then need to copy
> everything out of that buffer first before doing anything with it
> because the data could change later on? Or do we not trust anything in
> it at all and we throw it away? Or something else (trust for a short
> while and then we don't?)

The first one, we need a consistent view of the metadata (the cpkt in
this case), so we take a snapshot of it. Then, we validate it (because
we don't trust it) to be correct. If it is not, we discard it, otherwise
we act on it. Since this is a ring, we just move on to the next record
if there is one.

Meanwhile, in the shared page, it can change from correct to incorrect,
but it won't affect us because we have this consistent view at the
moment the snapshot was taken.
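The snapshot, validate, then act-or-discard sequence described above can be modelled as a loop over fixed-size records. This is a hypothetical userspace sketch: a flat byte array stands in for the virtqueue, `EV_PORT_ADD`/`EV_PORT_REMOVE` are illustrative event codes rather than the driver's constants, and "acting" on a record is reduced to counting it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ctrl_hdr {
	uint32_t id;
	uint16_t event;
	uint16_t value;
};

/* Illustrative event codes for this sketch only. */
enum { EV_PORT_ADD = 1, EV_PORT_REMOVE = 2 };

/*
 * Walk n records in a shared buffer. Each record is copied once, the copy
 * is validated, and invalid records are dropped before moving on.
 * Returns the number of records acted upon; *dropped gets the rest.
 */
static size_t process_records(const unsigned char *ring, size_t n,
			      uint32_t max_nr_ports, size_t *dropped)
{
	size_t acted = 0;

	assert(dropped != NULL && (ring != NULL || n == 0));
	*dropped = 0;
	for (size_t i = 0; i < n; i++) {
		struct ctrl_hdr h;

		/* Snapshot: later changes to ring[] cannot affect h. */
		memcpy(&h, ring + i * sizeof(h), sizeof(h));

		/* Validate only the local copy. */
		if (h.id >= max_nr_ports ||
		    (h.event != EV_PORT_ADD && h.event != EV_PORT_REMOVE)) {
			(*dropped)++;
			continue;
		}

		/* Act on h here (add or remove port h.id); counted only. */
		acted++;
	}
	return acted;
}
```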
> Please be specific as to what you want to see happen here, and why.

For example, if we get a control message to add a port and
cpkt->event==PORT_ADD, we skip validation of cpkt->id (port id), because
we're intending to add a new one. At this point, the device can change
cpkt->event to PORT_REMOVE, which does require a valid cpkt->id and the
subsequent code runs into a NULL dereference on the port value, which
should have been looked up from cpkt->id.

Now, if we take a snapshot of cpkt, we naturally don't have this
problem, because we're looking at a consistent state of cpkt: it's
either PORT_ADD or PORT_REMOVE all the way. Which is what this patch
does.

Does this answer your question?
Thanks,
--
Alex
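The PORT_ADD to PORT_REMOVE race described above can be reproduced deterministically in a userspace simulation. Everything here is hypothetical: the toy port table, the event codes, the result values, and especially the `host` callback, which stands in for the hypervisor rewriting the shared buffer between the guest's two reads (in the real driver that window is a genuine data race, not a function call). `handle_double_fetch()` follows the shape of the pre-patch code, where `event` is read once for the port-lookup check and the compiler is free to read it again for the `switch`; `handle_snapshot()` follows the patched shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ctrl_hdr {
	uint32_t id;
	uint16_t event;
	uint16_t value;
};

enum { EV_PORT_ADD = 1, EV_PORT_REMOVE = 2 };	/* illustrative codes */
enum { DROPPED = 0, ADDED = 1, REMOVED = 2, NULL_DEREF = -1 };

/* Hypothetical hook modelling the host rewriting shared memory
 * between two guest reads; NULL means no interference. */
typedef void (*host_write_fn)(struct ctrl_hdr *shared);

/* Toy port table: only port 0 exists. */
static const char *const ports[4] = { "port0", NULL, NULL, NULL };

static const char *find_port(uint32_t id)
{
	return id < 4 ? ports[id] : NULL;
}

static int dispatch(const char *port, uint16_t event)
{
	switch (event) {
	case EV_PORT_ADD:
		return ADDED;
	case EV_PORT_REMOVE:
		/* The real handler would dereference port here. */
		return port ? REMOVED : NULL_DEREF;
	default:
		return DROPPED;
	}
}

/* Double fetch: the check and the dispatch read shared->event separately. */
static int handle_double_fetch(struct ctrl_hdr *shared, host_write_fn host)
{
	assert(shared != NULL);
	const char *port = find_port(shared->id);

	if (!port && shared->event != EV_PORT_ADD)	/* fetch #1 */
		return DROPPED;
	if (host)
		host(shared);
	return dispatch(port, shared->event);		/* fetch #2 */
}

/* Snapshot: one copy, and both the check and the dispatch use it. */
static int handle_snapshot(struct ctrl_hdr *shared, host_write_fn host)
{
	struct ctrl_hdr h;
	const char *port;

	assert(shared != NULL);
	memcpy(&h, shared, sizeof(h));
	port = find_port(h.id);
	if (!port && h.event != EV_PORT_ADD)
		return DROPPED;
	if (host)
		host(shared);		/* changes shared, not h */
	return dispatch(port, h.event);
}

/* The malicious update from the scenario: ADD becomes REMOVE. */
static void flip_to_remove(struct ctrl_hdr *shared)
{
	shared->event = EV_PORT_REMOVE;
}
```

With `flip_to_remove` as the interfering host, the double-fetch handler reaches the `NULL_DEREF` path for a nonexistent port, while the snapshot handler still sees PORT_ADD and takes the add path.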
On Fri, Jan 27, 2023 at 04:17:46PM +0200, Alexander Shishkin wrote:
> Greg Kroah-Hartman <[email protected]> writes:
>
> > On Fri, Jan 27, 2023 at 02:47:55PM +0200, Alexander Shishkin wrote:
> >> "Michael S. Tsirkin" <[email protected]> writes:
> >>
> >> > On Fri, Jan 27, 2023 at 01:55:43PM +0200, Alexander Shishkin wrote:
> >> >> We can have shared pages between the host and guest without bounce
> >> >> buffers in between, so they can be both looking directly at the same
> >> >> page.
> >> >>
> >> >> Regards,
> >> >
> >> > How does this configuration work? What else is in this page?
> >>
> >> So, for example in TDX, you have certain pages as "shared", as in
> >> between guest and hypervisor. You can have virtio ring(s) in such
> >> pages. It's likely that there'd be a swiotlb buffer there instead, but
> >> sharing pages between host virtio and guest virtio drivers is possible.
> >
> > If it is shared, then what does this mean? Do we then need to copy
> > everything out of that buffer first before doing anything with it
> > because the data could change later on? Or do we not trust anything in
> > it at all and we throw it away? Or something else (trust for a short
> > while and then we don't?)
>
> The first one, we need a consistent view of the metadata (the cpkt in
> this case), so we take a snapshot of it. Then, we validate it (because
> we don't trust it) to be correct. If it is not, we discard it, otherwise
> we act on it. Since this is a ring, we just move on to the next record
> if there is one.
So you do an additional extra copy of everything, making the bounce
buffer useless? :)
> Meanwhile, in the shared page, it can change from correct to incorrect,
> but it won't affect us because we have this consistent view at the
> moment the snapshot was taken.
Wonderful, copy everything out then, the whole page, don't do it
piecemeal field by field. And then justify it to everyone whose
throughput you just tanked...
good luck!
greg k-h
On Fri, Jan 27, 2023 at 04:17:46PM +0200, Alexander Shishkin wrote:
> Greg Kroah-Hartman <[email protected]> writes:
>
> > On Fri, Jan 27, 2023 at 02:47:55PM +0200, Alexander Shishkin wrote:
> >> "Michael S. Tsirkin" <[email protected]> writes:
> >>
> >> > On Fri, Jan 27, 2023 at 01:55:43PM +0200, Alexander Shishkin wrote:
> >> >> We can have shared pages between the host and guest without bounce
> >> >> buffers in between, so they can be both looking directly at the same
> >> >> page.
> >> >>
> >> >> Regards,
> >> >
> >> > How does this configuration work? What else is in this page?
> >>
> >> So, for example in TDX, you have certain pages as "shared", as in
> >> between guest and hypervisor. You can have virtio ring(s) in such
> >> pages. It's likely that there'd be a swiotlb buffer there instead, but
> >> sharing pages between host virtio and guest virtio drivers is possible.
> >
> > If it is shared, then what does this mean? Do we then need to copy
> > everything out of that buffer first before doing anything with it
> > because the data could change later on? Or do we not trust anything in
> > it at all and we throw it away? Or something else (trust for a short
> > while and then we don't?)
>
> The first one, we need a consistent view of the metadata (the cpkt in
> this case), so we take a snapshot of it. Then, we validate it (because
> we don't trust it) to be correct. If it is not, we discard it, otherwise
> we act on it. Since this is a ring, we just move on to the next record
> if there is one.
>
> Meanwhile, in the shared page, it can change from correct to incorrect,
> but it won't affect us because we have this consistent view at the
> moment the snapshot was taken.
>
> > Please be specific as to what you want to see happen here, and why.
>
> For example, if we get a control message to add a port and
> cpkt->event==PORT_ADD, we skip validation of cpkt->id (port id), because
> we're intending to add a new one. At this point, the device can change
> cpkt->event to PORT_REMOVE, which does require a valid cpkt->id and the
> subsequent code runs into a NULL dereference on the port value, which
> should have been looked up from cpkt->id.
>
> Now, if we take a snapshot of cpkt, we naturally don't have this
> problem, because we're looking at a consistent state of cpkt: it's
> either PORT_ADD or PORT_REMOVE all the way. Which is what this patch
> does.
>
> Does this answer your question?
>
> Thanks,
> --
> Alex
Not sure about Greg, but it doesn't answer my question: either the
bad device has access to all memory, at which point it's not clear why
it would change cpkt->event rather than, e.g., the stack; or it's
restricted to accessing only memory mapped through the DMA API, which
is not the case here.
--
MST
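Alex's PORT_ADD/PORT_REMOVE scenario quoted above is a classic double fetch. The userspace C sketch below contrasts the two patterns under stated assumptions: `struct ctrl_msg` only mirrors the layout of `struct virtio_console_control`, while `find_port()`, the `device_race_fn` hook, the result codes and the event values are hypothetical. The hook simply makes the device's concurrent write deterministic, and `NULL_PORT_USE` marks where the real driver would dereference a NULL port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ctrl_msg { uint32_t id; uint16_t event; uint16_t value; };
enum { CTRL_PORT_ADD = 1, CTRL_PORT_REMOVE = 2 }; /* illustrative values */
enum result { ADDED, REMOVED, IGNORED, NULL_PORT_USE };

struct port { uint32_t id; };
static struct port known_port = { .id = 7 };

/* Hypothetical lookup: only port 7 exists. */
static struct port *find_port(uint32_t id)
{
	return id == known_port.id ? &known_port : NULL;
}

/* Stand-in for a device rewriting shared memory while we handle it. */
typedef void (*device_race_fn)(volatile struct ctrl_msg *shared);

/* Double fetch: 'event' is read once for the check, again for dispatch. */
static enum result handle_double_fetch(volatile struct ctrl_msg *shared,
				       device_race_fn race)
{
	struct port *port = NULL;

	if (shared->event != CTRL_PORT_ADD) {  /* PORT_ADD skips the lookup */
		port = find_port(shared->id);
		if (!port)
			return IGNORED;
	}
	if (race)
		race(shared);
	switch (shared->event) {               /* second fetch may differ */
	case CTRL_PORT_ADD:
		return ADDED;
	case CTRL_PORT_REMOVE:
		return port ? REMOVED : NULL_PORT_USE;
	default:
		return IGNORED;
	}
}

/* Snapshot: each field is fetched exactly once into a private copy. */
static enum result handle_snapshot(volatile struct ctrl_msg *shared,
				   device_race_fn race)
{
	struct ctrl_msg msg = { .id = shared->id, .event = shared->event,
				.value = shared->value };
	struct port *port = NULL;

	if (msg.event != CTRL_PORT_ADD) {
		port = find_port(msg.id);
		if (!port)
			return IGNORED;
	}
	if (race)
		race(shared);                  /* changes shared memory, not msg */
	switch (msg.event) {
	case CTRL_PORT_ADD:
		return ADDED;
	case CTRL_PORT_REMOVE:
		return port ? REMOVED : NULL_PORT_USE;
	default:
		return IGNORED;
	}
}

/* Hypothetical "malicious device": turns the message into PORT_REMOVE. */
static void flip_to_remove(volatile struct ctrl_msg *shared)
{
	shared->event = CTRL_PORT_REMOVE;
}
```

With `flip_to_remove()` as the hook, a PORT_ADD message for an unknown id sends `handle_double_fetch()` down the remove path with no port, while `handle_snapshot()` still sees PORT_ADD. The sketch illustrates only the consistency argument; it does not address the threat-model question raised above.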
> On Fri, Jan 27, 2023 at 04:17:46PM +0200, Alexander Shishkin wrote:
> > Greg Kroah-Hartman <[email protected]> writes:
> >
> > > On Fri, Jan 27, 2023 at 02:47:55PM +0200, Alexander Shishkin wrote:
> > >> "Michael S. Tsirkin" <[email protected]> writes:
> > >>
> > >> > On Fri, Jan 27, 2023 at 01:55:43PM +0200, Alexander Shishkin wrote:
> > >> >> We can have shared pages between the host and guest without bounce
> > >> >> buffers in between, so they can be both looking directly at the same
> > >> >> page.
> > >> >>
> > >> >> Regards,
> > >> >
> > >> > How does this configuration work? What else is in this page?
> > >>
> > >> So, for example in TDX, you have certain pages as "shared", as in
> > >> between guest and hypervisor. You can have virtio ring(s) in such
> > >> pages. It's likely that there'd be a swiotlb buffer there instead, but
> > >> sharing pages between host virtio and guest virtio drivers is possible.
> > >
> > > If it is shared, then what does this mean? Do we then need to copy
> > > everything out of that buffer first before doing anything with it
> > > because the data could change later on? Or do we not trust anything in
> > > it at all and we throw it away? Or something else (trust for a short
> > > while and then we don't?)
> >
> > The first one, we need a consistent view of the metadata (the cpkt in
> > this case), so we take a snapshot of it. Then, we validate it (because
> > we don't trust it) to be correct. If it is not, we discard it, otherwise
> > we act on it. Since this is a ring, we just move on to the next record
> > if there is one.
> >
> > Meanwhile, in the shared page, it can change from correct to incorrect,
> > but it won't affect us because we have this consistent view at the
> > moment the snapshot was taken.
> >
> > > Please be specific as to what you want to see happen here, and why.
> >
> > For example, if we get a control message to add a port and
> > cpkt->event==PORT_ADD, we skip validation of cpkt->id (port id), because
> > we're intending to add a new one. At this point, the device can change
> > cpkt->event to PORT_REMOVE, which does require a valid cpkt->id and the
> > subsequent code runs into a NULL dereference on the port value, which
> > should have been looked up from cpkt->id.
> >
> > Now, if we take a snapshot of cpkt, we naturally don't have this
> > problem, because we're looking at a consistent state of cpkt: it's
> > either PORT_ADD or PORT_REMOVE all the way. Which is what this patch
> > does.
> >
> > Does this answer your question?
> >
> > Thanks,
> > --
> > Alex
>
>
> Not sure about Greg, but it doesn't answer my question: either the
> bad device has access to all memory, at which point it's not clear why
> it would change cpkt->event rather than, e.g., the stack; or it's
> restricted to accessing only memory mapped through the DMA API, which
> is not the case here.
For TDX guests we do enforce that virtio goes through the DMA API only;
Alex has a patch queued for that as well.
But I'm not sure whether this addresses your concern here.
Best Regards,
Elena.