2022-05-03 00:07:43

by Larry Finger

Subject: Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines

Jason,

I maintain VirtualBox for openSUSE. When kernel 5.18-rc1 was released, I fixed
the usual set of API changes needed to compile the external kernel modules for
VB. Despite a clean compile, I am still getting random crashes in the VMs. For
Linux instances, the desktop disappears, but for Windows guests, the VM crashes
with unhandled kernel exceptions. As I have no experience tracing such crashes,
I decided to bisect the kernel to find the commit that started these problems.

Surprisingly, the bisection pointed to commit 6e8ec2552c7d ("random: use
computational hash for entropy extraction"). I am very sure of the bisection as
the kernel built from the commit that immediately precedes this one,
cfb92440ee71 - a tag commit by Linus, runs correctly.

Note that I do not believe there is anything wrong with your changes to the
random number generators. It seems to be a problem with the way the emulator is
accessing them. The VirtualBox code is quite complicated, and I am no expert
with C++.

Are there changes that would be required to the X86_64 emulator's access to the
random number code as a result of your changes? I have found places where the
emulator accesses /dev/urandom or /dev/random. There are also places that use
the rdrand and rdseed instructions.

Thanks for reading this,

Larry


2022-05-03 00:46:57

by Jason A. Donenfeld

Subject: Re: Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines

Hi Larry,

On Sun, May 01, 2022 at 04:07:39PM -0500, Larry Finger wrote:
> 1. Yes, the problem happens with 5.18-rc4 and -rc5.

Do you still have your bisection logs handy? Something about this seems
a bit fishy to me, and it might be helpful.

> 2. My answer here will be incomplete. There are no stacktraces or console output

You're going to have to make it more complete somehow...

> on the host from any of the guest crashes, either in dmesg or under journalctl.
> The desktop just disappears. The VirtualBox log files show nothing for the Linux

What do you mean "just disappears"? What is the "desktop"? Do you mean
that the X server segfaults or something? Can you attach a debugger
somewhere and try again? There's got to be something you can do to get
more info.

> guest, and the following for the Windows instance:
>
> 00:00:57.908011 GUI: UIMachineLogicNormal::sltCheckForRequestedVisualStateType:
> Requested-state=0, Machine-state=5
> 00:01:24.502961 GIM: HyperV: Guest indicates a fatal condition! P0=0x1e
> P1=0xffffffffc0000005 P2=0xfffff8054c61e97c P3=0x0 P4=0x28
> 00:01:24.503053 GIMHv: BugCheck 1e {ffffffffc0000005, fffff8054c61e97c, 0, 28}
> 00:01:24.503054 KMODE_EXCEPTION_NOT_HANDLED
> 00:01:24.503054 P1: ffffffffc0000005 - exception code - STATUS_ACCESS_VIOLATION
> 00:01:24.503054 P2: fffff8054c61e97c - EIP/RIP
> 00:01:24.503054 P3: 0000000000000000 - Xcpt param #0
> 00:01:24.503054 P4: 0000000000000028 - Xcpt param #1
>
> Running a 3rd party dump analyzer shows that the crash happens at
> ntoskrnl.exe+3f7d50. I have installed the Windows debugger, but I think the
> learning curve will be steep. At this point, I have no further info available.

Can you email me the minidump files from the crash? In another life
that's not supposed to intersect with lkml, windbg keeps me up at
night...

Also, if you've got some easy steps at repro, that'd be helpful. If I
have to install openSUSE in a VM or something and type some commands and
twiddle things here and there, let me know what it takes to get an
environment going. Or, better, if you've got a VM already baked with vbox
installed in it with a VM inside of that that exhibits the issue, that'd
let me take a look.

Jason

2022-05-18 04:51:06

by Larry Finger

Subject: Re: Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines

On 5/17/22 12:27, Vadim Galitsin wrote:
> Hi Larry and Jason,
>
> I am from VirtualBox team. I noticed your conversation here:
>
> https://lore.kernel.org/lkml/[email protected]/T/#mea7aa731b5524a05ac3b3e8588c0c42235bb33d6
>
> Please let me add my 5c. I agree with Larry: the issue started happening with
> 6e8ec2552c7d. I did not do a complete bisection, but rather tried this revision
> and the one before it (dcd03ba15947cbad1a34cfed370c4feb41058469, with which I do
> not see the issue).
>
> For me this issue is quite reproducible with an Ubuntu 20.04 Linux guest (other
> guests are also affected). It happens even if no VBox Guest Additions are
> installed in the guest. The guest kernel version does not matter much. Running
> kernel 5.18-rc1+ on the host side is essential.
>
> The first way for me to reproduce it is to run the stress-ng(1) tool inside the
> guest and perform random mouse cursor movements (basically, generating mouse or
> keyboard interrupts is somehow essential here). The tool will report the
> following error:
>
> root@test-VirtualBox:~# stress-ng --vm 4 -t 10
> stress-ng: info:  [5463] dispatching hogs: 4 vm
> stress-ng: fail:  [5464] stress-ng-vm: detected 194065152 bit errors while
> stressing memory
> stress-ng: error: [5463] process 5464 (stress-ng-vm) terminated with an error,
> exit status=1 (stress-ng core failure)
> stress-ng: info:  [5463] unsuccessful run completed in 10.06s
>
> This approach does not work in 100% of cases, but triggers the issue quite
> frequently.
>
> The second approach is much more reliable for me. I basically start compiling
> a kernel inside the guest (say, with make -j4) and start moving the mouse (or
> generate keyboard interrupts by pressing keys randomly). In this case, gcc
> processes will randomly segfault.
>
> Important note: if I do not touch the mouse or keyboard in either case above,
> everything works normally.
>
> My initial guess was that this might have something to do with kstack
> randomization, but booting the host kernel with randomize_kstack_offset=0 does
> not seem to change anything in this regard.
>
> I am currently running out of ideas about what exactly might trigger such
> behavior. Hopefully, this additional info sheds some light.
>
> Best regards,
> Vadim
>

Vadim,

I had an extended email exchange with Jason Donenfeld over this issue. Sorry
that most of this was private; some large files needed to be transmitted
that were not appropriate for LKML. LKML is added back in to this reply.

My test for the fault was to start a VM running Windows 10 and use Edge to load
the VirtualBox web page. Usually within a few seconds, Edge or Windows would
crash. In the latter case, the log for the VM might show an unhandled exception
while in kernel mode. I thought the browser was hitting the random number
generator hard, but there is mouse activity, of course.

Jason has created a patch entitled "random: do not use input pool from hard
IRQs" that fixes the problem for me. It can be found at
https://lore.kernel.org/lkml/[email protected]/. I had
expected this patch to be merged into the mainline kernel by now. Jason should
be able to shed light on any delays.

The bottom line and good news for Oracle/VirtualBox and those of us who package
VB for distros is that this is a kernel regression, which is a conclusion I
hesitated to make earlier. It is not a problem with VirtualBox; VB just exposes
the kernel problem.

I certainly hope that this problem is fixed before 5.18 is released. If not, I
will need to campaign to prevent openSUSE Tumbleweed from switching to 5.18.
That would normally happen with the release of 5.18.1!

Larry