2019-03-07 16:37:35

by Mohammed Gamal

[permalink] [raw]
Subject: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write

This patch adds a check for the presence of the ring buffer in
hv_get_bytes_to_read/write() to avoid possible NULL pointer dereferences.
If the ring buffer is not yet allocated, return 0 bytes to be read/written.

The root cause is that code that accesses the ring buffer including
hv_get_bytes_to_read/write() could be vulnerable to the race condition
discussed in https://lkml.org/lkml/2018/10/18/779

This race is being addressed by the patch series by Kimberly Brown in
https://lkml.org/lkml/2019/2/21/1236 which is not final yet

Signed-off-by: Mohammed Gamal <[email protected]>
---
include/linux/hyperv.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 64698ec8f2ac..7b2f566250b2 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -148,6 +148,9 @@ static inline u32 hv_get_bytes_to_read(const struct hv_ring_buffer_info *rbi)
{
u32 read_loc, write_loc, dsize, read;

+ if (!rbi->ring_buffer)
+ return 0;
+
dsize = rbi->ring_datasize;
read_loc = rbi->ring_buffer->read_index;
write_loc = READ_ONCE(rbi->ring_buffer->write_index);
@@ -162,6 +165,9 @@ static inline u32 hv_get_bytes_to_write(const struct hv_ring_buffer_info *rbi)
{
u32 read_loc, write_loc, dsize, write;

+ if (!rbi->ring_buffer)
+ return 0;
+
dsize = rbi->ring_datasize;
read_loc = READ_ONCE(rbi->ring_buffer->read_index);
write_loc = rbi->ring_buffer->write_index;
--
2.18.1



2019-03-07 17:34:05

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write

From: Mohammed Gamal <[email protected]> Sent: Thursday, March 7, 2019 8:36 AM
>
> This patch adds a check for the presence of the ring buffer in
> hv_get_bytes_to_read/write() to avoid possible NULL pointer dereferences.
> If the ring buffer is not yet allocated, return 0 bytes to be read/written.
>
> The root cause is that code that accesses the ring buffer including
> hv_get_bytes_to_read/write() could be vulnerable to the race condition
> discussed in https://lkml.org/lkml/2018/10/18/779>
>
> This race is being addressed by the patch series by Kimberly Brown in
> https://lkml.org/lkml/2019/2/21/1236 which is not final yet
>
> Signed-off-by: Mohammed Gamal <[email protected]>

Could you elaborate on the code paths where
hv_get_bytes_to_read/write() could be called when the ring buffer
isn't yet allocated? My sense is that Kim Brown's patch will address
all of the code paths that involved sysfs access from outside the
driver. And within a driver, the ring buffer should never be accessed
unless it is already allocated. Is there another code path we're not
aware of? I'm wondering if these changes are really needed once
Kim Brown's patch is finished.

Michael

2019-03-07 18:35:31

by Mohammed Gamal

[permalink] [raw]
Subject: Re: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write

On Thu, 2019-03-07 at 17:33 +0000, Michael Kelley wrote:
> From: Mohammed Gamal <[email protected]> Sent: Thursday, March 7,
> 2019 8:36 AM
> >
> > This patch adds a check for the presence of the ring buffer in
> > hv_get_bytes_to_read/write() to avoid possible NULL pointer
> > dereferences.
> > If the ring buffer is not yet allocated, return 0 bytes to be
> > read/written.
> >
> > The root cause is that code that accesses the ring buffer including
> > hv_get_bytes_to_read/write() could be vulnerable to the race
> > condition
> > discussed in https://lkml.org/lkml/2018/10/18/779>;
> >
> > This race is being addressed by the patch series by Kimberly Brown
> > in
> > https://lkml.org/lkml/2019/2/21/1236 which is not final yet
> >
> > Signed-off-by: Mohammed Gamal <[email protected]>
>
> Could you elaborate on the code paths where
> hv_get_bytes_to_read/write() could be called when the ring buffer
> isn't yet allocated?  My sense is that Kim Brown's patch will address
> all of the code paths that involved sysfs access from outside the
> driver.  And within a driver, the ring buffer should never be
> accessed
> unless it is already allocated.  Is there another code path we're not
> aware of?  I'm wondering if these changes are really needed once
> Kim Brown's patch is finished.
>
> Michael

I've seen one instance of the race in the netvsc driver when running
traffic through it with iperf3 while continuously changing the channel
settings.

The following code path deallocates the ring buffer:
netvsc_set_channels() -> netvsc_detach() ->
rndis_filter_device_remove() -> netvsc_device_remove() -> vmbus_close()
-> vmbus_free_ring() -> hv_ringbuffer_cleanup().

netvsc_send_pkt() -> hv_get_bytes_to_write() might get called
concurrently after vmbus_close() and before vmbus_open() returns and
sets up the new ring buffer.

The race is fairly hard to reproduce on recent upstream kernels, but I
still managed to reproduce it.

2019-03-07 19:35:39

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write

From: Mohammed Gamal <[email protected]> Sent: Thursday, March 7, 2019 10:32 AM
> >
> > Could you elaborate on the code paths where
> > hv_get_bytes_to_read/write() could be called when the ring buffer
> > isn't yet allocated?  My sense is that Kim Brown's patch will address
> > all of the code paths that involved sysfs access from outside the
> > driver.  And within a driver, the ring buffer should never be
> > accessed
> > unless it is already allocated.  Is there another code path we're not
> > aware of?  I'm wondering if these changes are really needed once
> > Kim Brown's patch is finished.
> >
> > Michael
>
> I've seen one instance of the race in the netvsc driver when running
> traffic through it with iperf3 while continuously changing the channel
> settings.
>
> The following code path deallocates the ring buffer:
> netvsc_set_channels() -> netvsc_detach() ->
> rndis_filter_device_remove() -> netvsc_device_remove() -> vmbus_close()
> -> vmbus_free_ring() -> hv_ringbuffer_cleanup().
>
> netvsc_send_pkt() -> hv_get_bytes_to_write() might get called
> concurrently after vmbus_close() and before vmbus_open() returns and
> sets up the new ring buffer.
>
> The race is fairly hard to reproduce on recent upstream kernels, but I
> still managed to reproduce it.

My thought is that a race like the above needs to be addressed in the
netvsc driver. The race may have other problems beyond just
accessing the ring buffer before it is (re)allocated. While adding the tests
in hv_get_bytes_to_read/write() isn't harmful, doing so has the potential
to mask the real problem. These routines are also somewhat performance
sensitive so we don't want any unnecessary overhead.

Michael

2019-03-13 10:26:09

by Mohammed Gamal

[permalink] [raw]
Subject: Re: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write

On Tue, 2019-03-12 at 18:02 +0000, Haiyang Zhang wrote:
>  
>  
> > -----Original Message-----
> > From: Mohammed Gamal <[email protected]>
> > Sent: Thursday, March 7, 2019 1:32 PM
> > To: Michael Kelley <[email protected]>; [email protected]
> el.org;
> > kimbrownkd <[email protected]>
> > Cc: Sasha Levin <[email protected]>; Dexuan Cui
> > <[email protected]>; Stephen Hemminger <[email protected]>;
> > Long Li <[email protected]>; KY Srinivasan <[email protected]>;
> Haiyang
> > Zhang <[email protected]>; vkuznets <[email protected]>;
> linux-
> > [email protected]
> > Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> > hv_get_bytes_to_read/write
> >
> > On Thu, 2019-03-07 at 17:33 +0000, Michael Kelley wrote:
> > > From: Mohammed Gamal <[email protected]> Sent: Thursday, March 7,
> > > 2019 8:36 AM
> > > >
> > > > This patch adds a check for the presence of the ring buffer in
> > > > hv_get_bytes_to_read/write() to avoid possible NULL pointer
> > > > dereferences.
> > > > If the ring buffer is not yet allocated, return 0 bytes to be
> > > > read/written.
> > > >
> > > > The root cause is that code that accesses the ring buffer
> including
> > > > hv_get_bytes_to_read/write() could be vulnerable to the race
> > > > condition discussed in
> > > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F
> %2Flk
> > > >
> > ml.org%2Flkml%2F2018%2F10%2F18%2F779&amp;data=02%7C01%7Chaiyangz
> > %40m
> > > >
> > icrosoft.com%7C73af013c14034bb0b1ad08d6a32b419c%7C72f988bf86f141af9
> > 1
> > > >
> > ab2d7cd011db47%7C1%7C0%7C636875803518430021&amp;sdata=b51Xc5GUN
> > nHX0K
> > > > 08LrH3ShTyFcRZ4mYHUATd%2BDpvYDw%3D&amp;reserved=0>;
> > > >
> > > > This race is being addressed by the patch series by Kimberly
> Brown
> > > > in
> > > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F
> %2Flk
> > > >
> > ml.org%2Flkml%2F2019%2F2%2F21%2F1236&amp;data=02%7C01%7Chaiyangz
> > %40m
> > > >
> > icrosoft.com%7C73af013c14034bb0b1ad08d6a32b419c%7C72f988bf86f141af9
> > 1
> > > >
> > ab2d7cd011db47%7C1%7C0%7C636875803518430021&amp;sdata=js1ff15Gbk7
> > 0MD
> > > > A2hkMZExxvAAbDuKDhfBvc5ZrckzM%3D&amp;reserved=0 which is not
> > final
> > > > yet
> > > >
> > > > Signed-off-by: Mohammed Gamal <[email protected]>
> > >
> > > Could you elaborate on the code paths where
> > > hv_get_bytes_to_read/write() could be called when the ring buffer
> > > isn't yet allocated?  My sense is that Kim Brown's patch will
> address
> > > all of the code paths that involved sysfs access from outside the
> > > driver.  And within a driver, the ring buffer should never be
> accessed
> > > unless it is already allocated.  Is there another code path we're
> not
> > > aware of?  I'm wondering if these changes are really needed once
> Kim
> > > Brown's patch is finished.
> > >
> > > Michael
> >
> > I've seen one instance of the race in the netvsc driver when
> running traffic
> > through it with iperf3 while continuously changing the channel
> settings.
> >
> > The following code path deallocates the ring buffer:
> > netvsc_set_channels() -> netvsc_detach() ->
> > rndis_filter_device_remove() -> netvsc_device_remove() ->
> vmbus_close()
> > -> vmbus_free_ring() -> hv_ringbuffer_cleanup().
> >
> > netvsc_send_pkt() -> hv_get_bytes_to_write() might get called
> concurrently
> > after vmbus_close() and before vmbus_open() returns and sets up the
> new ring
> > buffer.
> >
> > The race is fairly hard to reproduce on recent upstream kernels,
> but I still
> > managed to reproduce it.
>  
> Looking at the code from netvsc_detach() –
>          netif_tx_disable(ndev) is called before
> rndis_filter_device_remove(hdev, nvdev).
> So there should be no call to netvsc_send_pkt() after detaching.
> What’s the crash stack trace?
>  
> static int netvsc_detach(struct net_device *ndev,
>                          struct netvsc_device *nvdev)
> {
>         struct net_device_context *ndev_ctx = netdev_priv(ndev);
>         struct hv_device *hdev = ndev_ctx->device_ctx;
>         int ret;
>  
>         /* Don't try continuing to try and setup sub channels */
>         if (cancel_work_sync(&nvdev->subchan_work))
>                 nvdev->num_chn = 1;
>  
>         /* If device was up (receiving) then shutdown */
>         if (netif_running(ndev)) {
>                 netif_tx_disable(ndev);
>  
>                 ret = rndis_filter_close(nvdev);
>                 if (ret) {
>                         netdev_err(ndev,
>                                    "unable to close device (ret
> %d).\n", ret);
>                         return ret;
>                 }
>  
>                 ret = netvsc_wait_until_empty(nvdev);
>                 if (ret) {
>                         netdev_err(ndev,
>                                    "Ring buffer not empty after
> closing rndis\n");
>                         return ret;
>                 }
>         }
>  
>         netif_device_detach(ndev);
>  
>         rndis_filter_device_remove(hdev, nvdev);
>  
>         return 0;
> }
>  
> Thanks,
> Haiyang

Here is one stack trace on a 4.18 kernel, the most recent kernel I
managed to reproduce this bug on. 
I haven't managed to reproduce on 5.0.0 yet, but I guess some recent
changes to the netvsc driver could be masking the problem, as I tried
backporting those changes to older RHEL 7 kernels and still managed to
reproduce the problem there. I could however be wrong, and any pointers
are still appreciated:

[  545.308507] BUG: unable to handle kernel NULL pointer dereference at
0000000000000004
[  545.308656] PGD 0 P4D 0 
[  545.308763] Oops: 0000 [#1] SMP PTI
[  545.308855] CPU: 2 PID: 1800 Comm: iperf3 Kdump: loaded Not tainted
4.18.0-64.el8.test.x86_64 #1
[  545.308990] Hardware name: Microsoft Corporation Virtual
Machine/Virtual Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
[  545.309143] RIP: 0010:netvsc_send+0x2c9/0xce0 [hv_netvsc]
[  545.309298] Code: 4c 8b b1 c0 00 00 00 4f 8d 2c 64 49 c1 e5 07 4d 03
ae c0 03 00 00 48 8b 84 03 30 01 00 00 4c 89 6c 24 18 48 8b 90 20 01 00
00 <8b> 72 04 8b 0a 8b 90 38 01 00 00 89 f7 01 f2 29 cf 29 ca 39 ce
 0f
[  545.309321] RSP: 0018:ffffb8a305d5b6c0 EFLAGS: 00010282
[  545.309321] RAX: ffff926928bd7000 RBX: ffff92687dbe0000 RCX:
ffff92687d5bec00
[  545.309321] RDX: 0000000000000000 RSI: ffff92691b61c654 RDI:
0000000000000000
[  545.309321] RBP: ffff926915dcde28 R08: ffff926915dcde00 R09:
0000000000000000
[  545.309321] R10: 00000000000db61c R11: 0000000000000f7e R12:
0000000000000001
[  545.309321] R13: ffff926931808180 R14: ffff926931801000 R15:
0000000000000000
[  545.309321] FS:  00007feca6a4b740(0000) GS:ffff926940080000(0000)
knlGS:0000000000000000
[  545.309321] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  545.309321] CR2: 0000000000000004 CR3: 00000000dfccc004 CR4:
00000000003606e0
[  545.309321] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  545.309321] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  545.309321] Call Trace:
[  545.309321]  netvsc_start_xmit+0x3c9/0x800 [hv_netvsc]
[  545.309321]  ? __switch_to_asm+0x34/0x70
[  545.309321]  ? __switch_to_asm+0x34/0x70
[  545.309321]  ? ___slab_alloc+0x269/0x4e0
[  545.309321]  ? __alloc_skb+0x82/0x1c0
[  545.309321]  ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
[  545.309321]  ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
[  545.309321]  ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
[  545.309321]  ? _cond_resched+0x15/0x30
[  545.309321]  ? netif_skb_features+0x118/0x280
[  545.309321]  dev_hard_start_xmit+0xa5/0x210
[  545.309321]  sch_direct_xmit+0x14f/0x340
[  545.309321]  __dev_queue_xmit+0x799/0x8f0
[  545.309321]  ip_finish_output2+0x2e0/0x430
[  545.309321]  ? ip_finish_output+0x139/0x270
[  545.309321]  ip_output+0x6c/0xe0
[  545.309321]  ? ip_append_data.part.50+0xc0/0xc0
[  545.309321]  ip_send_skb+0x15/0x40
[  545.309321]  udp_send_skb.isra.43+0x153/0x340
[  545.309321]  udp_sendmsg+0xac2/0xd30
[  545.309321]  ? set_fd_set.part.7+0x40/0x40
[  545.309321]  ? set_fd_set.part.7+0x40/0x40
[  545.309321]  ? __check_object_size+0xa3/0x181
[  545.309321]  ? sock_has_perm+0x78/0xa0
[  545.309321]  ? core_sys_select+0x242/0x2f0
[  545.309321]  ? sock_sendmsg+0x36/0x40
[  545.309321]  ? udp_push_pending_frames+0x60/0x60
[  545.309321]  sock_sendmsg+0x36/0x40
[  545.309321]  sock_write_iter+0x8f/0xf0
[  545.309321]  __vfs_write+0x156/0x1a0
[  545.309321]  vfs_write+0xa5/0x1a0
[  545.309321]  ksys_write+0x4f/0xb0
[  545.309321]  do_syscall_64+0x5b/0x1b0
[  545.309321]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  545.309321] RIP: 0033:0x7feca5fb5348
[  545.309321] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00
00 f3 0f 1e fa 48 8d 05 d5 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f
05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  545.309321] RSP: 002b:00007ffc3ff1f108 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[  545.309321] RAX: ffffffffffffffda RBX: 00000000000005a8 RCX:
00007feca5fb5348
[  545.309321] RDX: 00000000000005a8 RSI: 00007feca6a59000 RDI:
0000000000000009
[  545.309321] RBP: 00007feca6a59000 R08: 0000000000000002 R09:
00cd09a3238b4e43
[  545.309321] R10: 0002961ecea49016 R11: 0000000000000246 R12:
0000000000000009
[  545.309321] R13: 00000000000005a8 R14: 00007ffc3ff1f180 R15:
0000563c1e05b260
[  545.309321] Modules linked in: nft_chain_nat_ipv6 nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 nft_chain_route_ipv6 nft_chain_nat_ipv4
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
nft_chain_route_ipv4 nf_conntrack ip_set nf_tables nfnetlink vfat fat
sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
intel_rapl_perf sg hv_utils hv_balloon pcspkr joydev xfs libcrc32c
sd_mod sr_mod cdrom serio_raw hv_storvsc hv_netvsc scsi_transport_fc
hyperv_fb hyperv_keyboard hid_hyperv crc32c_intel hv_vmbus dm_mirror
dm_region_hash dm_log dm_mod [last unloaded: nft_compat]
[  545.309321] CR2: 0000000000000004

From the stack trace netvsc_send+0x2c9 points to this line:

static inline u32 hv_get_bytes_to_write(const struct hv_ring_bu
ffer_info *rbi)
{
        u32 read_loc, write_loc, dsize, write;

        dsize = rbi->ring_datasize;
        read_loc = READ_ONCE(rbi->ring_buffer->read_index); <---------
        write_loc = rbi->ring_buffer->write_index;

        write = write_loc >= read_loc ? dsize - (write_loc - read_loc)
                read_loc - write_loc;
        return write;
}

which gets called from netvsc_send_pkt().