Hi!
I would like to publish two debug features which were needed for other stuff
I work on.
One is the reworked lx-symbols script which now actually works on at least
gdb 9.1 (gdb 9.2 was reported to fail to load the debug symbols from the kernel
for some reason, not related to this patch) and upstream qemu.
The other feature is the ability to trap all guest exceptions (on SVM for now)
and see them in kvmtrace prior to potential merge to double/triple fault.
This can be very useful and I already had to manually patch KVM a few
times for this.
I will, once time permits, implement this feature on Intel as well.
Best regards,
Maxim Levitsky
Maxim Levitsky (3):
scripts/gdb: rework lx-symbols gdb script
KVM: x86: guest debug: don't inject interrupts while single stepping
KVM: SVM: allow to intercept all exceptions for debug
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/svm/svm.c | 77 ++++++++++++++++++++++-
arch/x86/kvm/svm/svm.h | 5 +-
arch/x86/kvm/x86.c | 11 +++-
kernel/module.c | 8 ++-
scripts/gdb/linux/symbols.py | 106 +++++++++++++++++++++++---------
6 files changed, 174 insertions(+), 35 deletions(-)
--
2.26.2
This change greatly helps with two issues:
* Resuming from a breakpoint is much more reliable.
When resuming execution from a breakpoint, with interrupts enabled, more often
than not, KVM would inject an interrupt and make the CPU jump immediately to
the interrupt handler and eventually return to the breakpoint, to trigger it
again.
From the user point of view it looks like the CPU never executed a
single instruction and in some cases that can even prevent forward progress,
for example, when the breakpoint is placed by an automated script
(e.g lx-symbols), which does something in response to the breakpoint and then
continues the guest automatically.
If the script execution takes enough time for another interrupt to arrive,
the guest will be stuck on the same breakpoint RIP forever.
* Normal single stepping is much more predictable, since it won't land the
debugger into an interrupt handler, so it is much more usable.
(If entry to an interrupt handler is desired, the user can still place a
breakpoint at it and resume the guest, which won't activate this workaround
and let the gdb still stop at the interrupt handler)
Since this change is only active when guest is debugged, it won't affect
KVM running normal 'production' VMs.
Signed-off-by: Maxim Levitsky <[email protected]>
Tested-by: Stefano Garzarella <[email protected]>
---
arch/x86/kvm/x86.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a9d95f90a0487..b75d990fcf12b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
can_inject = false;
}
+ /*
+ * Don't inject interrupts while single stepping to make guest debug easier
+ */
+ if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+ return;
+
/*
* Finally, inject interrupt events. If an event cannot be injected
* due to architectural conditions (e.g. IF=0) a window-open exit
--
2.26.2
Fix several issues that are present in lx-symbols script:
* Track module unloads by placing another software breakpoint at 'free_module'
(force uninline this symbol just in case), and use remove-symbol-file
gdb command to unload the symobls of the module that is unloading.
That gives the gdb a chance to mark all software breakpoints from
this module as pending again.
Also remove the module from the 'known' module list once it is unloaded.
* Since we now track module unload, we don't need to reload all
symbols anymore when 'known' module loaded again (that can't happen anymore).
This allows reloading a module in the debugged kernel to finish much faster,
while lx-symbols tracks module loads and unloads.
* Disable/enable all gdb breakpoints on both module load and unload breakpoint
hits, and not only in 'load_all_symbols' as was done before.
(load_all_symbols is no longer called on breakpoint hit)
That allows gdb to avoid getting confused about the state of the (now two)
internal breakpoints we place.
Otherwise it will leave them in the kernel code segment, when continuing
which triggers a guest kernel panic as soon as it skips over the 'int3'
instruction and executes the garbage tail of the optcode on which
the breakpoint was placed.
Signed-off-by: Maxim Levitsky <[email protected]>
---
kernel/module.c | 8 ++-
scripts/gdb/linux/symbols.py | 106 +++++++++++++++++++++++++----------
2 files changed, 83 insertions(+), 31 deletions(-)
diff --git a/kernel/module.c b/kernel/module.c
index 30479355ab850..ea81fc06ea1f5 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -901,8 +901,12 @@ int module_refcount(struct module *mod)
}
EXPORT_SYMBOL(module_refcount);
-/* This exists whether we can unload or not */
-static void free_module(struct module *mod);
+/* This exists whether we can unload or not
+ * Keep it uninlined to provide a reliable breakpoint target,
+ * e.g. for the gdb helper command 'lx-symbols'.
+ */
+
+static noinline void free_module(struct module *mod);
SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
unsigned int, flags)
diff --git a/scripts/gdb/linux/symbols.py b/scripts/gdb/linux/symbols.py
index 1be9763cf8bb2..4ce879548a1ae 100644
--- a/scripts/gdb/linux/symbols.py
+++ b/scripts/gdb/linux/symbols.py
@@ -17,6 +17,24 @@ import re
from linux import modules, utils
+def save_state():
+ breakpoints = []
+ if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None:
+ for bp in gdb.breakpoints():
+ breakpoints.append({'breakpoint': bp, 'enabled': bp.enabled})
+ bp.enabled = False
+
+ show_pagination = gdb.execute("show pagination", to_string=True)
+ pagination = show_pagination.endswith("on.\n")
+ gdb.execute("set pagination off")
+
+ return {"breakpoints":breakpoints, "show_pagination": show_pagination}
+
+def load_state(state):
+ for breakpoint in state["breakpoints"]:
+ breakpoint['breakpoint'].enabled = breakpoint['enabled']
+ gdb.execute("set pagination %s" % ("on" if state["show_pagination"] else "off"))
+
if hasattr(gdb, 'Breakpoint'):
class LoadModuleBreakpoint(gdb.Breakpoint):
@@ -30,26 +48,38 @@ if hasattr(gdb, 'Breakpoint'):
module_name = module['name'].string()
cmd = self.gdb_command
+ # module already loaded, false alarm
+ if module_name in cmd.loaded_modules:
+ return False
+
# enforce update if object file is not found
cmd.module_files_updated = False
# Disable pagination while reporting symbol (re-)loading.
# The console input is blocked in this context so that we would
# get stuck waiting for the user to acknowledge paged output.
- show_pagination = gdb.execute("show pagination", to_string=True)
- pagination = show_pagination.endswith("on.\n")
- gdb.execute("set pagination off")
+ state = save_state()
+ cmd.load_module_symbols(module)
+ load_state(state)
+ return False
- if module_name in cmd.loaded_modules:
- gdb.write("refreshing all symbols to reload module "
- "'{0}'\n".format(module_name))
- cmd.load_all_symbols()
- else:
- cmd.load_module_symbols(module)
+ class UnLoadModuleBreakpoint(gdb.Breakpoint):
+ def __init__(self, spec, gdb_command):
+ super(UnLoadModuleBreakpoint, self).__init__(spec, internal=True)
+ self.silent = True
+ self.gdb_command = gdb_command
+
+ def stop(self):
+ module = gdb.parse_and_eval("mod")
+ module_name = module['name'].string()
+ cmd = self.gdb_command
- # restore pagination state
- gdb.execute("set pagination %s" % ("on" if pagination else "off"))
+ if not module_name in cmd.loaded_modules:
+ return False
+ state = save_state()
+ cmd.unload_module_symbols(module)
+ load_state(state)
return False
@@ -64,8 +94,9 @@ lx-symbols command."""
module_paths = []
module_files = []
module_files_updated = False
- loaded_modules = []
- breakpoint = None
+ loaded_modules = {}
+ module_load_breakpoint = None
+ module_unload_breakpoint = None
def __init__(self):
super(LxSymbols, self).__init__("lx-symbols", gdb.COMMAND_FILES,
@@ -129,21 +160,32 @@ lx-symbols command."""
filename=module_file,
addr=module_addr,
sections=self._section_arguments(module))
+
gdb.execute(cmdline, to_string=True)
- if module_name not in self.loaded_modules:
- self.loaded_modules.append(module_name)
+ self.loaded_modules[module_name] = {"module_file": module_file,
+ "module_addr": module_addr}
else:
gdb.write("no module object found for '{0}'\n".format(module_name))
+ def unload_module_symbols(self, module):
+ module_name = module['name'].string()
+
+ module_file = self.loaded_modules[module_name]["module_file"]
+ module_addr = self.loaded_modules[module_name]["module_addr"]
+
+ gdb.write("unloading @{addr}: {filename}\n".format(
+ addr=module_addr, filename=module_file))
+ cmdline = "remove-symbol-file {filename}".format(
+ filename=module_file)
+
+ gdb.execute(cmdline, to_string=True)
+ del self.loaded_modules[module_name]
+
+
def load_all_symbols(self):
gdb.write("loading vmlinux\n")
- # Dropping symbols will disable all breakpoints. So save their states
- # and restore them afterward.
- saved_states = []
- if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None:
- for bp in gdb.breakpoints():
- saved_states.append({'breakpoint': bp, 'enabled': bp.enabled})
+ state = save_state()
# drop all current symbols and reload vmlinux
orig_vmlinux = 'vmlinux'
@@ -153,15 +195,14 @@ lx-symbols command."""
gdb.execute("symbol-file", to_string=True)
gdb.execute("symbol-file {0}".format(orig_vmlinux))
- self.loaded_modules = []
+ self.loaded_modules = {}
module_list = modules.module_list()
if not module_list:
gdb.write("no modules found\n")
else:
[self.load_module_symbols(module) for module in module_list]
- for saved_state in saved_states:
- saved_state['breakpoint'].enabled = saved_state['enabled']
+ load_state(state)
def invoke(self, arg, from_tty):
self.module_paths = [os.path.expanduser(p) for p in arg.split()]
@@ -174,11 +215,18 @@ lx-symbols command."""
self.load_all_symbols()
if hasattr(gdb, 'Breakpoint'):
- if self.breakpoint is not None:
- self.breakpoint.delete()
- self.breakpoint = None
- self.breakpoint = LoadModuleBreakpoint(
- "kernel/module.c:do_init_module", self)
+ if self.module_load_breakpoint is not None:
+ self.module_load_breakpoint.delete()
+ self.module_load_breakpoint = None
+ self.module_load_breakpoint = \
+ LoadModuleBreakpoint("kernel/module.c:do_init_module", self)
+
+ if self.module_unload_breakpoint is not None:
+ self.module_unload_breakpoint.delete()
+ self.module_unload_breakpoint = None
+ self.module_unload_breakpoint = \
+ UnLoadModuleBreakpoint("kernel/module.c:free_module", self)
+
else:
gdb.write("Note: symbol update on module loading not supported "
"with this gdb version\n")
--
2.26.2
Add a new debug module param 'debug_intercept_exceptions' which will allow the
KVM to intercept any guest exception, and forward it to the guest.
This can be very useful for guest debugging and/or KVM debugging with kvm trace.
This is not intended to be used on production systems.
This is based on an idea first shown here:
https://patchwork.kernel.org/project/kvm/patch/[email protected]/
CC: Borislav Petkov <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/svm/svm.c | 77 ++++++++++++++++++++++++++++++++-
arch/x86/kvm/svm/svm.h | 5 ++-
arch/x86/kvm/x86.c | 5 ++-
4 files changed, 85 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a52f973bdff6d..c8f44a88b3153 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1564,6 +1564,8 @@ int kvm_emulate_rdpmc(struct kvm_vcpu *vcpu);
void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long payload);
+void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
+ u32 error_code, unsigned long payload);
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 271196400495f..94156a367a663 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -197,6 +197,9 @@ module_param(sev_es, int, 0444);
bool __read_mostly dump_invalid_vmcb;
module_param(dump_invalid_vmcb, bool, 0644);
+uint debug_intercept_exceptions;
+module_param(debug_intercept_exceptions, uint, 0444);
+
static bool svm_gp_erratum_intercept = true;
static u8 rsm_ins_bytes[] = "\x0f\xaa";
@@ -220,6 +223,8 @@ static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000};
#define MSRS_RANGE_SIZE 2048
#define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2)
+static void init_debug_exceptions_intercept(struct vcpu_svm *svm);
+
u32 svm_msrpm_offset(u32 msr)
{
u32 offset;
@@ -1137,6 +1142,8 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
set_exception_intercept(svm, MC_VECTOR);
set_exception_intercept(svm, AC_VECTOR);
set_exception_intercept(svm, DB_VECTOR);
+
+ init_debug_exceptions_intercept(svm);
/*
* Guest access to VMware backdoor ports could legitimately
* trigger #GP because of TSS I/O permission bitmap.
@@ -1913,6 +1920,17 @@ static int pf_interception(struct kvm_vcpu *vcpu)
u64 fault_address = svm->vmcb->control.exit_info_2;
u64 error_code = svm->vmcb->control.exit_info_1;
+ if ((debug_intercept_exceptions & (1 << PF_VECTOR)))
+ if (npt_enabled && !vcpu->arch.apf.host_apf_flags) {
+ /* If #PF was only intercepted for debug, inject
+ * it directly to the guest, since the mmu code
+ * is not ready to deal with such page faults
+ */
+ kvm_queue_exception_e_p(vcpu, PF_VECTOR,
+ error_code, fault_address);
+ return 1;
+ }
+
return kvm_handle_page_fault(vcpu, error_code, fault_address,
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
svm->vmcb->control.insn_bytes : NULL,
@@ -3025,7 +3043,7 @@ static int invpcid_interception(struct kvm_vcpu *vcpu)
return kvm_handle_invpcid(vcpu, type, gva);
}
-static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
+static int (*svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[SVM_EXIT_READ_CR0] = cr_interception,
[SVM_EXIT_READ_CR3] = cr_interception,
[SVM_EXIT_READ_CR4] = cr_interception,
@@ -3099,6 +3117,63 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[SVM_EXIT_VMGEXIT] = sev_handle_vmgexit,
};
+static int generic_exception_interception(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Generic exception handler which forwards a guest exception
+ * as-is to the guest.
+ * For exceptions that don't have a special intercept handler.
+ *
+ * Used for 'debug_intercept_exceptions' KVM debug feature only.
+ */
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int exc = svm->vmcb->control.exit_code - SVM_EXIT_EXCP_BASE;
+
+ WARN_ON(exc < 0 || exc > 31);
+
+ if (exc == TS_VECTOR) {
+ /*
+ * SVM doesn't provide us with an error code to be able to
+ * re-inject the #TS exception, so just disable its
+ * interception, and let the guest re-execute the instruction.
+ */
+ vmcb_clr_intercept(&svm->vmcb01.ptr->control,
+ INTERCEPT_EXCEPTION_OFFSET + TS_VECTOR);
+ recalc_intercepts(svm);
+ return 1;
+ } else if (exc == DF_VECTOR) {
+ /* SVM doesn't provide us with an error code for the #DF */
+ kvm_queue_exception_e(vcpu, exc, 0);
+ return 1;
+ }
+
+ if (x86_exception_has_error_code(exc))
+ kvm_queue_exception_e(vcpu, exc, svm->vmcb->control.exit_info_1);
+ else
+ kvm_queue_exception(vcpu, exc);
+ return 1;
+}
+
+static void init_debug_exceptions_intercept(struct vcpu_svm *svm)
+{
+ int exc;
+
+ for (exc = 0 ; exc < 32 ; exc++) {
+ if (!(debug_intercept_exceptions & (1 << exc)))
+ continue;
+
+ /* Those are defined to have undefined behavior in the SVM spec */
+ if (exc == 2 || exc == 9)
+ continue;
+
+ set_exception_intercept(svm, exc);
+
+ if (!svm_exit_handlers[SVM_EXIT_EXCP_BASE + exc])
+ svm_exit_handlers[SVM_EXIT_EXCP_BASE + exc] =
+ generic_exception_interception;
+ }
+}
+
static void dump_vmcb(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 8e276c4fb33df..e0ff9ca996df8 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -32,6 +32,7 @@ static const u32 host_save_user_msrs[] = {
#define MSRPM_OFFSETS 16
extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
extern bool npt_enabled;
+extern uint debug_intercept_exceptions;
enum {
VMCB_INTERCEPTS, /* Intercept vectors, TSC offset,
@@ -333,7 +334,9 @@ static inline void clr_exception_intercept(struct vcpu_svm *svm, u32 bit)
struct vmcb *vmcb = svm->vmcb01.ptr;
WARN_ON_ONCE(bit >= 32);
- vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit);
+
+ if (!((1 << bit) & debug_intercept_exceptions))
+ vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit);
recalc_intercepts(svm);
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b75d990fcf12b..be509944622bd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -627,12 +627,13 @@ void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr,
}
EXPORT_SYMBOL_GPL(kvm_queue_exception_p);
-static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
- u32 error_code, unsigned long payload)
+void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
+ u32 error_code, unsigned long payload)
{
kvm_multiple_exception(vcpu, nr, true, error_code,
true, payload, false);
}
+EXPORT_SYMBOL_GPL(kvm_queue_exception_e_p);
int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err)
{
--
2.26.2
On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> This change greatly helps with two issues:
>
> * Resuming from a breakpoint is much more reliable.
>
> When resuming execution from a breakpoint, with interrupts enabled, more often
> than not, KVM would inject an interrupt and make the CPU jump immediately to
> the interrupt handler and eventually return to the breakpoint, to trigger it
> again.
>
> From the user point of view it looks like the CPU never executed a
> single instruction and in some cases that can even prevent forward progress,
> for example, when the breakpoint is placed by an automated script
> (e.g lx-symbols), which does something in response to the breakpoint and then
> continues the guest automatically.
> If the script execution takes enough time for another interrupt to arrive,
> the guest will be stuck on the same breakpoint RIP forever.
>
> * Normal single stepping is much more predictable, since it won't land the
> debugger into an interrupt handler, so it is much more usable.
>
> (If entry to an interrupt handler is desired, the user can still place a
> breakpoint at it and resume the guest, which won't activate this workaround
> and let the gdb still stop at the interrupt handler)
>
> Since this change is only active when guest is debugged, it won't affect
> KVM running normal 'production' VMs.
>
>
> Signed-off-by: Maxim Levitsky <[email protected]>
> Tested-by: Stefano Garzarella <[email protected]>
> ---
> arch/x86/kvm/x86.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a9d95f90a0487..b75d990fcf12b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
> can_inject = false;
> }
>
> + /*
> + * Don't inject interrupts while single stepping to make guest debug easier
> + */
> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> + return;
Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
blocking at the start of single-stepping, unwind at the end? Deviating this far
from architectural behavior will end in tears at some point.
> +
> /*
> * Finally, inject interrupt events. If an event cannot be injected
> * due to architectural conditions (e.g. IF=0) a window-open exit
> --
> 2.26.2
>
Hi Maxim,
On Tue, Mar 16, 2021 at 12:10:20AM +0200, Maxim Levitsky wrote:
> -static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> +static int (*svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
Can you keep this const and always set the necessary handlers? If
exceptions are not intercepted they will not be used.
> @@ -333,7 +334,9 @@ static inline void clr_exception_intercept(struct vcpu_svm *svm, u32 bit)
> struct vmcb *vmcb = svm->vmcb01.ptr;
>
> WARN_ON_ONCE(bit >= 32);
> - vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit);
> +
> + if (!((1 << bit) & debug_intercept_exceptions))
> + vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit);
This will break SEV-ES guests, as those will not cause an intercept but
now start to get #VC exceptions on every other exception that is raised.
SEV-ES guests are not prepared for that and will not even boot, so
please don't enable this feature for them.
On Tue, Mar 16, 2021 at 12:10:20AM +0200, Maxim Levitsky wrote:
> Add a new debug module param 'debug_intercept_exceptions' which will allow the
> KVM to intercept any guest exception, and forward it to the guest.
>
> This can be very useful for guest debugging and/or KVM debugging with kvm trace.
> This is not intended to be used on production systems.
>
> This is based on an idea first shown here:
> https://patchwork.kernel.org/project/kvm/patch/[email protected]/
>
> CC: Borislav Petkov <[email protected]>
> Signed-off-by: Maxim Levitsky <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/kvm/svm/svm.c | 77 ++++++++++++++++++++++++++++++++-
> arch/x86/kvm/svm/svm.h | 5 ++-
> arch/x86/kvm/x86.c | 5 ++-
> 4 files changed, 85 insertions(+), 4 deletions(-)
Looks interesting, I'll give it a try when I get a chance in the coming days.
Thx!
--
Regards/Gruss,
Boris.
SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg
On 16.03.21 00:37, Sean Christopherson wrote:
> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
>> This change greatly helps with two issues:
>>
>> * Resuming from a breakpoint is much more reliable.
>>
>> When resuming execution from a breakpoint, with interrupts enabled, more often
>> than not, KVM would inject an interrupt and make the CPU jump immediately to
>> the interrupt handler and eventually return to the breakpoint, to trigger it
>> again.
>>
>> From the user point of view it looks like the CPU never executed a
>> single instruction and in some cases that can even prevent forward progress,
>> for example, when the breakpoint is placed by an automated script
>> (e.g lx-symbols), which does something in response to the breakpoint and then
>> continues the guest automatically.
>> If the script execution takes enough time for another interrupt to arrive,
>> the guest will be stuck on the same breakpoint RIP forever.
>>
>> * Normal single stepping is much more predictable, since it won't land the
>> debugger into an interrupt handler, so it is much more usable.
>>
>> (If entry to an interrupt handler is desired, the user can still place a
>> breakpoint at it and resume the guest, which won't activate this workaround
>> and let the gdb still stop at the interrupt handler)
>>
>> Since this change is only active when guest is debugged, it won't affect
>> KVM running normal 'production' VMs.
>>
>>
>> Signed-off-by: Maxim Levitsky <[email protected]>
>> Tested-by: Stefano Garzarella <[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index a9d95f90a0487..b75d990fcf12b 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
>> can_inject = false;
>> }
>>
>> + /*
>> + * Don't inject interrupts while single stepping to make guest debug easier
>> + */
>> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
>> + return;
>
> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
> blocking at the start of single-stepping, unwind at the end? Deviating this far
> from architectural behavior will end in tears at some point.
>
Does this happen to address this suspicious workaround in the kernel?
/*
* The kernel doesn't use TF single-step outside of:
*
* - Kprobes, consumed through kprobe_debug_handler()
* - KGDB, consumed through notify_debug()
*
* So if we get here with DR_STEP set, something is wonky.
*
* A known way to trigger this is through QEMU's GDB stub,
* which leaks #DB into the guest and causes IST recursion.
*/
if (WARN_ON_ONCE(dr6 & DR_STEP))
regs->flags &= ~X86_EFLAGS_TF;
(arch/x86/kernel/traps.c, exc_debug_kernel)
I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh,
yeah, question to myself as well, dancing around broken guest debugging
for a long time while trying to fix other issues...
Jan
>> +
>> /*
>> * Finally, inject interrupt events. If an event cannot be injected
>> * due to architectural conditions (e.g. IF=0) a window-open exit
>> --
>> 2.26.2
>>
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
On Mon, 2021-03-15 at 16:37 -0700, Sean Christopherson wrote:
> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> > This change greatly helps with two issues:
> >
> > * Resuming from a breakpoint is much more reliable.
> >
> > When resuming execution from a breakpoint, with interrupts enabled, more often
> > than not, KVM would inject an interrupt and make the CPU jump immediately to
> > the interrupt handler and eventually return to the breakpoint, to trigger it
> > again.
> >
> > From the user point of view it looks like the CPU never executed a
> > single instruction and in some cases that can even prevent forward progress,
> > for example, when the breakpoint is placed by an automated script
> > (e.g lx-symbols), which does something in response to the breakpoint and then
> > continues the guest automatically.
> > If the script execution takes enough time for another interrupt to arrive,
> > the guest will be stuck on the same breakpoint RIP forever.
> >
> > * Normal single stepping is much more predictable, since it won't land the
> > debugger into an interrupt handler, so it is much more usable.
> >
> > (If entry to an interrupt handler is desired, the user can still place a
> > breakpoint at it and resume the guest, which won't activate this workaround
> > and let the gdb still stop at the interrupt handler)
> >
> > Since this change is only active when guest is debugged, it won't affect
> > KVM running normal 'production' VMs.
> >
> >
> > Signed-off-by: Maxim Levitsky <[email protected]>
> > Tested-by: Stefano Garzarella <[email protected]>
> > ---
> > arch/x86/kvm/x86.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index a9d95f90a0487..b75d990fcf12b 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
> > can_inject = false;
> > }
> >
> > + /*
> > + * Don't inject interrupts while single stepping to make guest debug easier
> > + */
> > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> > + return;
>
> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
> blocking at the start of single-stepping, unwind at the end? Deviating this far
> from architectural behavior will end in tears at some point.
I don't worry about NMI, but for IRQs, userspace can clear EFLAGS.IF, but that
can be messy to unwind, if an instruction that clears the interrupt flag was
single stepped over.
There is also notion of interrupt shadow but it also is reserved for things
like delaying interrupts for one cycle after sti, and such.
IMHO KVM_GUESTDBG_SINGLESTEP is already non architectural feature (userspace
basically tell the KVM to single step the guest but it doesn't set TF flag
or something like that), so changing its definition shouldn't be a problem.
If you worry about some automated script breaking due to the change,
(I expect that KVM_GUESTDBG_SINGLESTEP is mostly used manually, especially
since single stepping is never 100% reliable due to various issues like that),
I can add another flag to it which will block all the interrupts.
(like say KVM_GUESTDBG_BLOCKEVENTS).
In fact qemu already has single step flags, enabled over special qemu gdb extension
'maintenance packet qqemu.sstepbits'
Those single step flags allow to disable interrupts and qemu timers during the single stepping,
(and both modes are enabled by default)
However kvm code in qemu ignores these bits.
What do you think?
Best regards,
Maxim Levitsky
>
> > +
> > /*
> > * Finally, inject interrupt events. If an event cannot be injected
> > * due to architectural conditions (e.g. IF=0) a window-open exit
> > --
> > 2.26.2
> >
On Tue, 2021-03-16 at 09:32 +0100, Joerg Roedel wrote:
> Hi Maxim,
>
> On Tue, Mar 16, 2021 at 12:10:20AM +0200, Maxim Levitsky wrote:
> > -static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> > +static int (*svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>
> Can you keep this const and always set the necessary handlers? If
> exceptions are not intercepted they will not be used.
>
> > @@ -333,7 +334,9 @@ static inline void clr_exception_intercept(struct vcpu_svm *svm, u32 bit)
> > struct vmcb *vmcb = svm->vmcb01.ptr;
> >
> > WARN_ON_ONCE(bit >= 32);
> > - vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit);
> > +
> > + if (!((1 << bit) & debug_intercept_exceptions))
> > + vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit);
>
> This will break SEV-ES guests, as those will not cause an intercept but
> now start to get #VC exceptions on every other exception that is raised.
> SEV-ES guests are not prepared for that and will not even boot, so
> please don't enable this feature for them.
I agree but what is wrong with that?
This is a debug feature, and it only can be enabled by the root,
and so someone might actually want this case to happen
(e.g to see if a SEV guest can cope with extra #VC exceptions).
I have nothing against not allowing this for SEV-ES guests though.
What do you think?
Best regards,
Maxim Levitsky
On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
> On 16.03.21 00:37, Sean Christopherson wrote:
> > On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> > > This change greatly helps with two issues:
> > >
> > > * Resuming from a breakpoint is much more reliable.
> > >
> > > When resuming execution from a breakpoint, with interrupts enabled, more often
> > > than not, KVM would inject an interrupt and make the CPU jump immediately to
> > > the interrupt handler and eventually return to the breakpoint, to trigger it
> > > again.
> > >
> > > From the user point of view it looks like the CPU never executed a
> > > single instruction and in some cases that can even prevent forward progress,
> > > for example, when the breakpoint is placed by an automated script
> > > (e.g lx-symbols), which does something in response to the breakpoint and then
> > > continues the guest automatically.
> > > If the script execution takes enough time for another interrupt to arrive,
> > > the guest will be stuck on the same breakpoint RIP forever.
> > >
> > > * Normal single stepping is much more predictable, since it won't land the
> > > debugger into an interrupt handler, so it is much more usable.
> > >
> > > (If entry to an interrupt handler is desired, the user can still place a
> > > breakpoint at it and resume the guest, which won't activate this workaround
> > > and let the gdb still stop at the interrupt handler)
> > >
> > > Since this change is only active when guest is debugged, it won't affect
> > > KVM running normal 'production' VMs.
> > >
> > >
> > > Signed-off-by: Maxim Levitsky <[email protected]>
> > > Tested-by: Stefano Garzarella <[email protected]>
> > > ---
> > > arch/x86/kvm/x86.c | 6 ++++++
> > > 1 file changed, 6 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index a9d95f90a0487..b75d990fcf12b 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
> > > can_inject = false;
> > > }
> > >
> > > + /*
> > > + * Don't inject interrupts while single stepping to make guest debug easier
> > > + */
> > > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> > > + return;
> >
> > Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
> > blocking at the start of single-stepping, unwind at the end? Deviating this far
> > from architectural behavior will end in tears at some point.
> >
>
> Does this happen to address this suspicious workaround in the kernel?
>
> /*
> * The kernel doesn't use TF single-step outside of:
> *
> * - Kprobes, consumed through kprobe_debug_handler()
> * - KGDB, consumed through notify_debug()
> *
> * So if we get here with DR_STEP set, something is wonky.
> *
> * A known way to trigger this is through QEMU's GDB stub,
> * which leaks #DB into the guest and causes IST recursion.
> */
> if (WARN_ON_ONCE(dr6 & DR_STEP))
> regs->flags &= ~X86_EFLAGS_TF;
>
> (arch/x86/kernel/traps.c, exc_debug_kernel)
>
> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh,
> yeah, question to myself as well, dancing around broken guest debugging
> for a long time while trying to fix other issues...
To be honest I didn't see that warning even once, but I can imagine KVM
leaking #DB due to bugs in that code. That area historically didn't receive
much attention since it can only be triggered by
KVM_GET/SET_GUEST_DEBUG which isn't used in production.
The only issue that I on the other hand did
see which is mostly gdb fault is that it fails to remove a software breakpoint
when resuming over it, if that breakpoint's python handler messes up
with gdb's symbols, which is what lx-symbols does.
And that despite the fact that lx-symbol doesn't mess with the object
(that is the kernel) where the breakpoint is defined.
Just adding/removing one symbol file is enough to trigger this issue.
Since lx-symbols already works this around when it reloads all symbols,
I extended that workaround to happen also when loading/unloading
only a single symbol file.
Best regards,
Maxim Levitsky
>
> Jan
>
> > > +
> > > /*
> > > * Finally, inject interrupt events. If an event cannot be injected
> > > * due to architectural conditions (e.g. IF=0) a window-open exit
> > > --
> > > 2.26.2
> > >
On 16.03.21 11:59, Maxim Levitsky wrote:
> On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
>> On 16.03.21 00:37, Sean Christopherson wrote:
>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
>>>> This change greatly helps with two issues:
>>>>
>>>> * Resuming from a breakpoint is much more reliable.
>>>>
>>>> When resuming execution from a breakpoint, with interrupts enabled, more often
>>>> than not, KVM would inject an interrupt and make the CPU jump immediately to
>>>> the interrupt handler and eventually return to the breakpoint, to trigger it
>>>> again.
>>>>
>>>> From the user point of view it looks like the CPU never executed a
>>>> single instruction and in some cases that can even prevent forward progress,
>>>> for example, when the breakpoint is placed by an automated script
>>>> (e.g lx-symbols), which does something in response to the breakpoint and then
>>>> continues the guest automatically.
>>>> If the script execution takes enough time for another interrupt to arrive,
>>>> the guest will be stuck on the same breakpoint RIP forever.
>>>>
>>>> * Normal single stepping is much more predictable, since it won't land the
>>>> debugger into an interrupt handler, so it is much more usable.
>>>>
>>>> (If entry to an interrupt handler is desired, the user can still place a
>>>> breakpoint at it and resume the guest, which won't activate this workaround
>>>> and let the gdb still stop at the interrupt handler)
>>>>
>>>> Since this change is only active when guest is debugged, it won't affect
>>>> KVM running normal 'production' VMs.
>>>>
>>>>
>>>> Signed-off-by: Maxim Levitsky <[email protected]>
>>>> Tested-by: Stefano Garzarella <[email protected]>
>>>> ---
>>>> arch/x86/kvm/x86.c | 6 ++++++
>>>> 1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index a9d95f90a0487..b75d990fcf12b 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
>>>> can_inject = false;
>>>> }
>>>>
>>>> + /*
>>>> + * Don't inject interrupts while single stepping to make guest debug easier
>>>> + */
>>>> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
>>>> + return;
>>>
>>> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
>>> blocking at the start of single-stepping, unwind at the end? Deviating this far
>>> from architectural behavior will end in tears at some point.
>>>
>>
>> Does this happen to address this suspicious workaround in the kernel?
>>
>> /*
>> * The kernel doesn't use TF single-step outside of:
>> *
>> * - Kprobes, consumed through kprobe_debug_handler()
>> * - KGDB, consumed through notify_debug()
>> *
>> * So if we get here with DR_STEP set, something is wonky.
>> *
>> * A known way to trigger this is through QEMU's GDB stub,
>> * which leaks #DB into the guest and causes IST recursion.
>> */
>> if (WARN_ON_ONCE(dr6 & DR_STEP))
>> regs->flags &= ~X86_EFLAGS_TF;
>>
>> (arch/x86/kernel/traps.c, exc_debug_kernel)
>>
>> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh,
>> yeah, question to myself as well, dancing around broken guest debugging
>> for a long time while trying to fix other issues...
>
> To be honest I didn't see that warning even once, but I can imagine KVM
> leaking #DB due to bugs in that code. That area historically didn't receive
> much attention since it can only be triggered by
> KVM_GET/SET_GUEST_DEBUG which isn't used in production.
I've triggered it recently while debugging a guest, that's why I got
aware of the code path. Long ago, all this used to work (soft BPs,
single-stepping etc.)
>
> The only issue that I on the other hand did
> see which is mostly gdb fault is that it fails to remove a software breakpoint
> when resuming over it, if that breakpoint's python handler messes up
> with gdb's symbols, which is what lx-symbols does.
>
> And that despite the fact that lx-symbol doesn't mess with the object
> (that is the kernel) where the breakpoint is defined.
>
> Just adding/removing one symbol file is enough to trigger this issue.
>
> Since lx-symbols already works this around when it reloads all symbols,
> I extended that workaround to happen also when loading/unloading
> only a single symbol file.
You have no issue with interactive debugging when NOT using gdb scripts
/ lx-symbol?
Jan
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
On Tue, 2021-03-16 at 14:38 +0100, Jan Kiszka wrote:
> On 15.03.21 23:10, Maxim Levitsky wrote:
> > Fix several issues that are present in lx-symbols script:
> >
> > * Track module unloads by placing another software breakpoint at 'free_module'
> > (force uninline this symbol just in case), and use remove-symbol-file
> > gdb command to unload the symobls of the module that is unloading.
> >
> > That gives the gdb a chance to mark all software breakpoints from
> > this module as pending again.
> > Also remove the module from the 'known' module list once it is unloaded.
> >
> > * Since we now track module unload, we don't need to reload all
> > symbols anymore when 'known' module loaded again (that can't happen anymore).
> > This allows reloading a module in the debugged kernel to finish much faster,
> > while lx-symbols tracks module loads and unloads.
> >
> > * Disable/enable all gdb breakpoints on both module load and unload breakpoint
> > hits, and not only in 'load_all_symbols' as was done before.
> > (load_all_symbols is no longer called on breakpoint hit)
> > That allows gdb to avoid getting confused about the state of the (now two)
> > internal breakpoints we place.
> >
> > Otherwise it will leave them in the kernel code segment, when continuing
> > which triggers a guest kernel panic as soon as it skips over the 'int3'
> > instruction and executes the garbage tail of the optcode on which
> > the breakpoint was placed.
> >
> > Signed-off-by: Maxim Levitsky <[email protected]>
> > ---
> > kernel/module.c | 8 ++-
> > scripts/gdb/linux/symbols.py | 106 +++++++++++++++++++++++++----------
> > 2 files changed, 83 insertions(+), 31 deletions(-)
> >
> > diff --git a/kernel/module.c b/kernel/module.c
> > index 30479355ab850..ea81fc06ea1f5 100644
> > --- a/kernel/module.c
> > +++ b/kernel/module.c
> > @@ -901,8 +901,12 @@ int module_refcount(struct module *mod)
> > }
> > EXPORT_SYMBOL(module_refcount);
> >
> > -/* This exists whether we can unload or not */
> > -static void free_module(struct module *mod);
> > +/* This exists whether we can unload or not
> > + * Keep it uninlined to provide a reliable breakpoint target,
> > + * e.g. for the gdb helper command 'lx-symbols'.
> > + */
> > +
> > +static noinline void free_module(struct module *mod);
> >
> > SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
> > unsigned int, flags)
> > diff --git a/scripts/gdb/linux/symbols.py b/scripts/gdb/linux/symbols.py
> > index 1be9763cf8bb2..4ce879548a1ae 100644
> > --- a/scripts/gdb/linux/symbols.py
> > +++ b/scripts/gdb/linux/symbols.py
> > @@ -17,6 +17,24 @@ import re
> >
> > from linux import modules, utils
> >
> > +def save_state():
>
> Naming is a bit too generic. And it's not only saving the state, it's
> also disabling things.
>
> > + breakpoints = []
> > + if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None:
> > + for bp in gdb.breakpoints():
> > + breakpoints.append({'breakpoint': bp, 'enabled': bp.enabled})
> > + bp.enabled = False
> > +
> > + show_pagination = gdb.execute("show pagination", to_string=True)
> > + pagination = show_pagination.endswith("on.\n")
> > + gdb.execute("set pagination off")
> > +
> > + return {"breakpoints":breakpoints, "show_pagination": show_pagination}
> > +
> > +def load_state(state):
>
> Maybe rather something with "restore", to make naming balanced. Or is
> there a use case where "state" is not coming from the function above?
I didn't put much thought into naming these functions.
I'll think of something better.
>
> > + for breakpoint in state["breakpoints"]:
> > + breakpoint['breakpoint'].enabled = breakpoint['enabled']
> > + gdb.execute("set pagination %s" % ("on" if state["show_pagination"] else "off"))
> > +
> >
> > if hasattr(gdb, 'Breakpoint'):
> > class LoadModuleBreakpoint(gdb.Breakpoint):
> > @@ -30,26 +48,38 @@ if hasattr(gdb, 'Breakpoint'):
> > module_name = module['name'].string()
> > cmd = self.gdb_command
> >
> > + # module already loaded, false alarm
> > + if module_name in cmd.loaded_modules:
> > + return False
>
> Possibly at all, now that we track unloading? Can our state tracking
> become out-of-sync?
Sadly yes and that happens a lot unless kvm is patched to
avoid injecting interrupts on a single step.
What is happening is that this breakpoint is hit, then symbols
are loaded which is a relatively slow process, and by the time
the gdb resumes the guest, a timer interrupt is already pending
(kernel local apic timer is running while vcpus are stopped),
which makes the guest kernel take the interrupt (interrupts are
not disabled in these two intercepted functions) and eventually
return to the breakpoint, and trigger its python handler again.
This happens so often that especially with multiple vcpus,
it is possible to enter live-lock like state where guest+gdb
are stuck in this loop forever.
When KVM is patched, that indeed shouldn't happen but the check
won't hurt here.
Plus due to the feedback I received on the patch 2, I will end up
implementing the 'don't inject interrupts on single step' as
a new KVM debug feature flag, thus you will need a newer qemu
to make use of it.
>
> > +
> > # enforce update if object file is not found
> > cmd.module_files_updated = False
> >
> > # Disable pagination while reporting symbol (re-)loading.
> > # The console input is blocked in this context so that we would
> > # get stuck waiting for the user to acknowledge paged output.
> > - show_pagination = gdb.execute("show pagination", to_string=True)
> > - pagination = show_pagination.endswith("on.\n")
> > - gdb.execute("set pagination off")
> > + state = save_state()
> > + cmd.load_module_symbols(module)
> > + load_state(state)
> > + return False
> >
> > - if module_name in cmd.loaded_modules:
> > - gdb.write("refreshing all symbols to reload module "
> > - "'{0}'\n".format(module_name))
> > - cmd.load_all_symbols()
> > - else:
> > - cmd.load_module_symbols(module)
> > + class UnLoadModuleBreakpoint(gdb.Breakpoint):
> > + def __init__(self, spec, gdb_command):
> > + super(UnLoadModuleBreakpoint, self).__init__(spec, internal=True)
> > + self.silent = True
> > + self.gdb_command = gdb_command
> > +
> > + def stop(self):
> > + module = gdb.parse_and_eval("mod")
> > + module_name = module['name'].string()
> > + cmd = self.gdb_command
> >
> > - # restore pagination state
> > - gdb.execute("set pagination %s" % ("on" if pagination else "off"))
> > + if not module_name in cmd.loaded_modules:
> > + return False
> >
>
> Same question as above. For robustness, checking is not bad. But maybe
> it's worth reporting as well.
>
> > + state = save_state()
> > + cmd.unload_module_symbols(module)
> > + load_state(state)
> > return False
> >
> >
> > @@ -64,8 +94,9 @@ lx-symbols command."""
> > module_paths = []
> > module_files = []
> > module_files_updated = False
> > - loaded_modules = []
> > - breakpoint = None
> > + loaded_modules = {}
> > + module_load_breakpoint = None
> > + module_unload_breakpoint = None
> >
> > def __init__(self):
> > super(LxSymbols, self).__init__("lx-symbols", gdb.COMMAND_FILES,
> > @@ -129,21 +160,32 @@ lx-symbols command."""
> > filename=module_file,
> > addr=module_addr,
> > sections=self._section_arguments(module))
> > +
> > gdb.execute(cmdline, to_string=True)
> > - if module_name not in self.loaded_modules:
> > - self.loaded_modules.append(module_name)
> > + self.loaded_modules[module_name] = {"module_file": module_file,
> > + "module_addr": module_addr}
> > else:
> > gdb.write("no module object found for '{0}'\n".format(module_name))
> >
> > + def unload_module_symbols(self, module):
> > + module_name = module['name'].string()
> > +
> > + module_file = self.loaded_modules[module_name]["module_file"]
> > + module_addr = self.loaded_modules[module_name]["module_addr"]
> > +
> > + gdb.write("unloading @{addr}: {filename}\n".format(
> > + addr=module_addr, filename=module_file))
> > + cmdline = "remove-symbol-file {filename}".format(
> > + filename=module_file)
> > +
> > + gdb.execute(cmdline, to_string=True)
> > + del self.loaded_modules[module_name]
> > +
> > +
> > def load_all_symbols(self):
> > gdb.write("loading vmlinux\n")
> >
> > - # Dropping symbols will disable all breakpoints. So save their states
> > - # and restore them afterward.
> > - saved_states = []
> > - if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None:
> > - for bp in gdb.breakpoints():
> > - saved_states.append({'breakpoint': bp, 'enabled': bp.enabled})
> > + state = save_state()
> >
> > # drop all current symbols and reload vmlinux
> > orig_vmlinux = 'vmlinux'
> > @@ -153,15 +195,14 @@ lx-symbols command."""
> > gdb.execute("symbol-file", to_string=True)
> > gdb.execute("symbol-file {0}".format(orig_vmlinux))
> >
> > - self.loaded_modules = []
> > + self.loaded_modules = {}
> > module_list = modules.module_list()
> > if not module_list:
> > gdb.write("no modules found\n")
> > else:
> > [self.load_module_symbols(module) for module in module_list]
> >
> > - for saved_state in saved_states:
> > - saved_state['breakpoint'].enabled = saved_state['enabled']
> > + load_state(state)
> >
> > def invoke(self, arg, from_tty):
> > self.module_paths = [os.path.expanduser(p) for p in arg.split()]
> > @@ -174,11 +215,18 @@ lx-symbols command."""
> > self.load_all_symbols()
> >
> > if hasattr(gdb, 'Breakpoint'):
> > - if self.breakpoint is not None:
> > - self.breakpoint.delete()
> > - self.breakpoint = None
> > - self.breakpoint = LoadModuleBreakpoint(
> > - "kernel/module.c:do_init_module", self)
> > + if self.module_load_breakpoint is not None:
> > + self.module_load_breakpoint.delete()
> > + self.module_load_breakpoint = None
> > + self.module_load_breakpoint = \
> > + LoadModuleBreakpoint("kernel/module.c:do_init_module", self)
> > +
> > + if self.module_unload_breakpoint is not None:
> > + self.module_unload_breakpoint.delete()
> > + self.module_unload_breakpoint = None
> > + self.module_unload_breakpoint = \
> > + UnLoadModuleBreakpoint("kernel/module.c:free_module", self)
> > +
> > else:
> > gdb.write("Note: symbol update on module loading not supported "
> > "with this gdb version\n")
> >
>
> Good improvement!
Thanks a lot!
Best regards,
Maxim Levitsky
>
> Jan
>
On 16.03.21 16:49, Maxim Levitsky wrote:
> On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote:
>> On 16.03.21 15:34, Maxim Levitsky wrote:
>>> On Tue, 2021-03-16 at 14:46 +0100, Jan Kiszka wrote:
>>>> On 16.03.21 13:34, Maxim Levitsky wrote:
>>>>> On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote:
>>>>>> On 16.03.21 11:59, Maxim Levitsky wrote:
>>>>>>> On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
>>>>>>>> On 16.03.21 00:37, Sean Christopherson wrote:
>>>>>>>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
>>>>>>>>>> This change greatly helps with two issues:
>>>>>>>>>>
>>>>>>>>>> * Resuming from a breakpoint is much more reliable.
>>>>>>>>>>
>>>>>>>>>> When resuming execution from a breakpoint, with interrupts enabled, more often
>>>>>>>>>> than not, KVM would inject an interrupt and make the CPU jump immediately to
>>>>>>>>>> the interrupt handler and eventually return to the breakpoint, to trigger it
>>>>>>>>>> again.
>>>>>>>>>>
>>>>>>>>>> From the user point of view it looks like the CPU never executed a
>>>>>>>>>> single instruction and in some cases that can even prevent forward progress,
>>>>>>>>>> for example, when the breakpoint is placed by an automated script
>>>>>>>>>> (e.g lx-symbols), which does something in response to the breakpoint and then
>>>>>>>>>> continues the guest automatically.
>>>>>>>>>> If the script execution takes enough time for another interrupt to arrive,
>>>>>>>>>> the guest will be stuck on the same breakpoint RIP forever.
>>>>>>>>>>
>>>>>>>>>> * Normal single stepping is much more predictable, since it won't land the
>>>>>>>>>> debugger into an interrupt handler, so it is much more usable.
>>>>>>>>>>
>>>>>>>>>> (If entry to an interrupt handler is desired, the user can still place a
>>>>>>>>>> breakpoint at it and resume the guest, which won't activate this workaround
>>>>>>>>>> and let the gdb still stop at the interrupt handler)
>>>>>>>>>>
>>>>>>>>>> Since this change is only active when guest is debugged, it won't affect
>>>>>>>>>> KVM running normal 'production' VMs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Maxim Levitsky <[email protected]>
>>>>>>>>>> Tested-by: Stefano Garzarella <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>> arch/x86/kvm/x86.c | 6 ++++++
>>>>>>>>>> 1 file changed, 6 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>>>>>> index a9d95f90a0487..b75d990fcf12b 100644
>>>>>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>>>>>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
>>>>>>>>>> can_inject = false;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> + /*
>>>>>>>>>> + * Don't inject interrupts while single stepping to make guest debug easier
>>>>>>>>>> + */
>>>>>>>>>> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
>>>>>>>>>> + return;
>>>>>>>>>
>>>>>>>>> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
>>>>>>>>> blocking at the start of single-stepping, unwind at the end? Deviating this far
>>>>>>>>> from architectural behavior will end in tears at some point.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Does this happen to address this suspicious workaround in the kernel?
>>>>>>>>
>>>>>>>> /*
>>>>>>>> * The kernel doesn't use TF single-step outside of:
>>>>>>>> *
>>>>>>>> * - Kprobes, consumed through kprobe_debug_handler()
>>>>>>>> * - KGDB, consumed through notify_debug()
>>>>>>>> *
>>>>>>>> * So if we get here with DR_STEP set, something is wonky.
>>>>>>>> *
>>>>>>>> * A known way to trigger this is through QEMU's GDB stub,
>>>>>>>> * which leaks #DB into the guest and causes IST recursion.
>>>>>>>> */
>>>>>>>> if (WARN_ON_ONCE(dr6 & DR_STEP))
>>>>>>>> regs->flags &= ~X86_EFLAGS_TF;
>>>>>>>>
>>>>>>>> (arch/x86/kernel/traps.c, exc_debug_kernel)
>>>>>>>>
>>>>>>>> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh,
>>>>>>>> yeah, question to myself as well, dancing around broken guest debugging
>>>>>>>> for a long time while trying to fix other issues...
>>>>>>>
>>>>>>> To be honest I didn't see that warning even once, but I can imagine KVM
>>>>>>> leaking #DB due to bugs in that code. That area historically didn't receive
>>>>>>> much attention since it can only be triggered by
>>>>>>> KVM_GET/SET_GUEST_DEBUG which isn't used in production.
>>>>>>
>>>>>> I've triggered it recently while debugging a guest, that's why I got
>>>>>> aware of the code path. Long ago, all this used to work (soft BPs,
>>>>>> single-stepping etc.)
>>>>>>
>>>>>>> The only issue that I on the other hand did
>>>>>>> see which is mostly gdb fault is that it fails to remove a software breakpoint
>>>>>>> when resuming over it, if that breakpoint's python handler messes up
>>>>>>> with gdb's symbols, which is what lx-symbols does.
>>>>>>>
>>>>>>> And that despite the fact that lx-symbol doesn't mess with the object
>>>>>>> (that is the kernel) where the breakpoint is defined.
>>>>>>>
>>>>>>> Just adding/removing one symbol file is enough to trigger this issue.
>>>>>>>
>>>>>>> Since lx-symbols already works this around when it reloads all symbols,
>>>>>>> I extended that workaround to happen also when loading/unloading
>>>>>>> only a single symbol file.
>>>>>>
>>>>>> You have no issue with interactive debugging when NOT using gdb scripts
>>>>>> / lx-symbol?
>>>>>
>>>>> To be honest I don't use guest debugging that much,
>>>>> so I probably missed some issues.
>>>>>
>>>>> Now that I fixed lx-symbols though I'll probably use
>>>>> guest debugging much more.
>>>>> I will keep an eye on any issues that I find.
>>>>>
>>>>> The main push to fix lx-symbols actually came
>>>>> from me wanting to understand if there is something
>>>>> broken with KVM's guest debugging knowing that
>>>>> lx-symbols crashes the guest when module is loaded
>>>>> after lx-symbols was executed.
>>>>>
>>>>> That lx-symbols related guest crash I traced to issue
>>>>> with gdb as I explained, and the lack of blocking of the interrupts
>>>>> on single step is not a bug but more a missing feature
>>>>> that should be implemented to make single step easier to use.
>>>>
>>>> Again, this used to work fine. But maybe this patch can change the
>>>> picture by avoid that the unavoidable short TF leakage into the guest
>>>> escalates beyond the single instruction to step over.
>>>
>>>
>>> Actually now I think I understand what is going on.
>>>
>>> The TF flag isn't auto cleared as RF flag is, and if the instruction
>>> which is single stepped gets an interrupt it is pushed onto the interrupt stack.
>>> (then it is cleared for the duration of the interrupt handler)
>>> Since we use the TF flag for single stepping the guest, this indeed can
>>> cause it to be leaked.
>>>
>>> So this patch actually should mitigate this almost completely.
>>>
>>> Also now I understand why Intel has the 'monitor trap' feature, I think it
>>> is exactly for the cases when hypervisor wants to single step the guest
>>> without the fear of changing of the guest visible cpu state.
>>
>> Exactly.
>>
>>>
>>> KVM on VMX should probably switch to using monitor trap for single stepping.
>>
>> Back then, when I was hacking on the gdb-stub and KVM support, the
>> monitor trap flag was not yet broadly available, but the idea to once
>> use it was already there. Now it can be considered broadly available,
>> but it would still require some changes to get it in.
>>
>> Unfortunately, we don't have such thing with SVM, even recent versions,
>> right? So, a proper way of avoiding diverting event injections while we
>> are having the guest in an "incorrect" state should definitely be the goal.
> Yes, I am not aware of anything like monitor trap on SVM.
>
>>
>> Given that KVM knows whether TF originates solely from guest debugging
>> or was (also) injected by the guest, we should be able to identify the
>> cases where your approach is best to apply. And that without any extra
>> control knob that everyone will only forget to set.
> Well I think that the downside of this patch is that the user might actually
> want to single step into an interrupt handler,
> and this patch makes it a bit more complicated, and changes the default
> behavior.
If the default makes debugging practically impossible and breaks the
guest by leaking host state, that is also not OK.
We must not leak, that is priority one. So, even if we consider stepping
into the interrupt a use case (surely not the default one, so having
this opt-in only), we must ensure that TF will never end up on the guest
stack saved for the interrupt handler. That is KVM's default responsibility.
Jan
>
> I have no objections though to use this patch as is, or at least make this
> the new default with a new flag to override this.
>
> Sean Christopherson, what do you think?
>
> Best regards,
> Maxim Levitsky
>
>>
>> Jan
>>
>
>
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote:
> > Back then, when I was hacking on the gdb-stub and KVM support, the
> > monitor trap flag was not yet broadly available, but the idea to once
> > use it was already there. Now it can be considered broadly available,
> > but it would still require some changes to get it in.
> >
> > Unfortunately, we don't have such thing with SVM, even recent versions,
> > right? So, a proper way of avoiding diverting event injections while we
> > are having the guest in an "incorrect" state should definitely be the goal.
> Yes, I am not aware of anything like monitor trap on SVM.
>
> >
> > Given that KVM knows whether TF originates solely from guest debugging
> > or was (also) injected by the guest, we should be able to identify the
> > cases where your approach is best to apply. And that without any extra
> > control knob that everyone will only forget to set.
> Well I think that the downside of this patch is that the user might actually
> want to single step into an interrupt handler, and this patch makes it a bit
> more complicated, and changes the default behavior.
Yes. And, as is, this also blocks NMIs and SMIs. I suspect it also doesn't
prevent weirdness if the guest is running in L2, since IRQs for L1 will cause
exits from L2 during nested_ops->check_events().
> I have no objections though to use this patch as is, or at least make this
> the new default with a new flag to override this.
That's less bad, but IMO still violates the principle of least surprise, e.g.
someone that is single-stepping a guest and is expecting an IRQ to fire will be
all kinds of confused if they see all the proper IRR, ISR, EFLAGS.IF, etc...
settings, but no interrupt.
> Sean Christopherson, what do you think?
Rather than block all events in KVM, what about having QEMU "pause" the timer?
E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out
which flavor it's using), clear them to zero, then restore both when
single-stepping is disabled. I think that will work?
On Tue, Mar 16, 2021, Jan Kiszka wrote:
> On 16.03.21 17:50, Sean Christopherson wrote:
> > Rather than block all events in KVM, what about having QEMU "pause" the timer?
> > E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out
> > which flavor it's using), clear them to zero, then restore both when
> > single-stepping is disabled. I think that will work?
> >
>
> No one can stop the clock, and timers are only one source of interrupts.
> Plus they do not all come from QEMU, some also from KVM or in-kernel
> sources directly.
But are any other sources of interrupts a chronic problem? I 100% agree that
this would not be a robust solution, but neither is blocking events in KVM. At
least with this approach, the blast radius is somewhat contained.
> Would quickly become a mess.
Maybe, but it'd be Qemu's mess ;-)
On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote:
> On 16.03.21 11:59, Maxim Levitsky wrote:
> > On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
> > > On 16.03.21 00:37, Sean Christopherson wrote:
> > > > On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> > > > > This change greatly helps with two issues:
> > > > >
> > > > > * Resuming from a breakpoint is much more reliable.
> > > > >
> > > > > When resuming execution from a breakpoint, with interrupts enabled, more often
> > > > > than not, KVM would inject an interrupt and make the CPU jump immediately to
> > > > > the interrupt handler and eventually return to the breakpoint, to trigger it
> > > > > again.
> > > > >
> > > > > From the user point of view it looks like the CPU never executed a
> > > > > single instruction and in some cases that can even prevent forward progress,
> > > > > for example, when the breakpoint is placed by an automated script
> > > > > (e.g lx-symbols), which does something in response to the breakpoint and then
> > > > > continues the guest automatically.
> > > > > If the script execution takes enough time for another interrupt to arrive,
> > > > > the guest will be stuck on the same breakpoint RIP forever.
> > > > >
> > > > > * Normal single stepping is much more predictable, since it won't land the
> > > > > debugger into an interrupt handler, so it is much more usable.
> > > > >
> > > > > (If entry to an interrupt handler is desired, the user can still place a
> > > > > breakpoint at it and resume the guest, which won't activate this workaround
> > > > > and let the gdb still stop at the interrupt handler)
> > > > >
> > > > > Since this change is only active when guest is debugged, it won't affect
> > > > > KVM running normal 'production' VMs.
> > > > >
> > > > >
> > > > > Signed-off-by: Maxim Levitsky <[email protected]>
> > > > > Tested-by: Stefano Garzarella <[email protected]>
> > > > > ---
> > > > > arch/x86/kvm/x86.c | 6 ++++++
> > > > > 1 file changed, 6 insertions(+)
> > > > >
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index a9d95f90a0487..b75d990fcf12b 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
> > > > > can_inject = false;
> > > > > }
> > > > >
> > > > > + /*
> > > > > + * Don't inject interrupts while single stepping to make guest debug easier
> > > > > + */
> > > > > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> > > > > + return;
> > > >
> > > > Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
> > > > blocking at the start of single-stepping, unwind at the end? Deviating this far
> > > > from architectural behavior will end in tears at some point.
> > > >
> > >
> > > Does this happen to address this suspicious workaround in the kernel?
> > >
> > > /*
> > > * The kernel doesn't use TF single-step outside of:
> > > *
> > > * - Kprobes, consumed through kprobe_debug_handler()
> > > * - KGDB, consumed through notify_debug()
> > > *
> > > * So if we get here with DR_STEP set, something is wonky.
> > > *
> > > * A known way to trigger this is through QEMU's GDB stub,
> > > * which leaks #DB into the guest and causes IST recursion.
> > > */
> > > if (WARN_ON_ONCE(dr6 & DR_STEP))
> > > regs->flags &= ~X86_EFLAGS_TF;
> > >
> > > (arch/x86/kernel/traps.c, exc_debug_kernel)
> > >
> > > I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh,
> > > yeah, question to myself as well, dancing around broken guest debugging
> > > for a long time while trying to fix other issues...
> >
> > To be honest I didn't see that warning even once, but I can imagine KVM
> > leaking #DB due to bugs in that code. That area historically didn't receive
> > much attention since it can only be triggered by
> > KVM_GET/SET_GUEST_DEBUG which isn't used in production.
>
> I've triggered it recently while debugging a guest, that's why I got
> aware of the code path. Long ago, all this used to work (soft BPs,
> single-stepping etc.)
>
> > The only issue that I on the other hand did
> > see which is mostly gdb fault is that it fails to remove a software breakpoint
> > when resuming over it, if that breakpoint's python handler messes up
> > with gdb's symbols, which is what lx-symbols does.
> >
> > And that despite the fact that lx-symbol doesn't mess with the object
> > (that is the kernel) where the breakpoint is defined.
> >
> > Just adding/removing one symbol file is enough to trigger this issue.
> >
> > Since lx-symbols already works this around when it reloads all symbols,
> > I extended that workaround to happen also when loading/unloading
> > only a single symbol file.
>
> You have no issue with interactive debugging when NOT using gdb scripts
> / lx-symbol?
To be honest I don't use guest debugging that much,
so I probably missed some issues.
Now that I fixed lx-symbols though I'll probably use
guest debugging much more.
I will keep an eye on any issues that I find.
The main push to fix lx-symbols actually came
from me wanting to understand if there is something
broken with KVM's guest debugging knowing that
lx-symbols crashes the guest when module is loaded
after lx-symbols was executed.
That lx-symbols related guest crash I traced to issue
with gdb as I explained, and the lack of blocking of the interrupts
on single step is not a bug but more a missing feature
that should be implemented to make single step easier to use.
Another issue which isn't a bug is that you can't place a software
breakpoint if kernel is not loaded (since there is no code in memory)
or if the kernel haven't done basic paging initialization
(since there is no paging yet to know where to place the breakpoint).
Hardware breakpoints work for this fine though.
So in summary I haven't found any major issues with KVM's guest debug
yet.
If I do notice issues with guest debug, I will try to isolate
and debug them.
For the issue that you mentioned, do you have a way to reproduce it?
Best regards,
Maxim Levitsky
>
> Jan
>
On 15.03.21 23:10, Maxim Levitsky wrote:
> Fix several issues that are present in lx-symbols script:
>
> * Track module unloads by placing another software breakpoint at 'free_module'
> (force uninline this symbol just in case), and use remove-symbol-file
> gdb command to unload the symobls of the module that is unloading.
>
> That gives the gdb a chance to mark all software breakpoints from
> this module as pending again.
> Also remove the module from the 'known' module list once it is unloaded.
>
> * Since we now track module unload, we don't need to reload all
> symbols anymore when 'known' module loaded again (that can't happen anymore).
> This allows reloading a module in the debugged kernel to finish much faster,
> while lx-symbols tracks module loads and unloads.
>
> * Disable/enable all gdb breakpoints on both module load and unload breakpoint
> hits, and not only in 'load_all_symbols' as was done before.
> (load_all_symbols is no longer called on breakpoint hit)
> That allows gdb to avoid getting confused about the state of the (now two)
> internal breakpoints we place.
>
> Otherwise it will leave them in the kernel code segment, when continuing
> which triggers a guest kernel panic as soon as it skips over the 'int3'
> instruction and executes the garbage tail of the optcode on which
> the breakpoint was placed.
>
> Signed-off-by: Maxim Levitsky <[email protected]>
> ---
> kernel/module.c | 8 ++-
> scripts/gdb/linux/symbols.py | 106 +++++++++++++++++++++++++----------
> 2 files changed, 83 insertions(+), 31 deletions(-)
>
> diff --git a/kernel/module.c b/kernel/module.c
> index 30479355ab850..ea81fc06ea1f5 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -901,8 +901,12 @@ int module_refcount(struct module *mod)
> }
> EXPORT_SYMBOL(module_refcount);
>
> -/* This exists whether we can unload or not */
> -static void free_module(struct module *mod);
> +/* This exists whether we can unload or not
> + * Keep it uninlined to provide a reliable breakpoint target,
> + * e.g. for the gdb helper command 'lx-symbols'.
> + */
> +
> +static noinline void free_module(struct module *mod);
>
> SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
> unsigned int, flags)
> diff --git a/scripts/gdb/linux/symbols.py b/scripts/gdb/linux/symbols.py
> index 1be9763cf8bb2..4ce879548a1ae 100644
> --- a/scripts/gdb/linux/symbols.py
> +++ b/scripts/gdb/linux/symbols.py
> @@ -17,6 +17,24 @@ import re
>
> from linux import modules, utils
>
> +def save_state():
Naming is a bit too generic. And it's not only saving the state, it's
also disabling things.
> + breakpoints = []
> + if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None:
> + for bp in gdb.breakpoints():
> + breakpoints.append({'breakpoint': bp, 'enabled': bp.enabled})
> + bp.enabled = False
> +
> + show_pagination = gdb.execute("show pagination", to_string=True)
> + pagination = show_pagination.endswith("on.\n")
> + gdb.execute("set pagination off")
> +
> + return {"breakpoints":breakpoints, "show_pagination": show_pagination}
> +
> +def load_state(state):
Maybe rather something with "restore", to make naming balanced. Or is
there a use case where "state" is not coming from the function above?
> + for breakpoint in state["breakpoints"]:
> + breakpoint['breakpoint'].enabled = breakpoint['enabled']
> + gdb.execute("set pagination %s" % ("on" if state["show_pagination"] else "off"))
> +
>
> if hasattr(gdb, 'Breakpoint'):
> class LoadModuleBreakpoint(gdb.Breakpoint):
> @@ -30,26 +48,38 @@ if hasattr(gdb, 'Breakpoint'):
> module_name = module['name'].string()
> cmd = self.gdb_command
>
> + # module already loaded, false alarm
> + if module_name in cmd.loaded_modules:
> + return False
Possibly at all, now that we track unloading? Can our state tracking
become out-of-sync?
> +
> # enforce update if object file is not found
> cmd.module_files_updated = False
>
> # Disable pagination while reporting symbol (re-)loading.
> # The console input is blocked in this context so that we would
> # get stuck waiting for the user to acknowledge paged output.
> - show_pagination = gdb.execute("show pagination", to_string=True)
> - pagination = show_pagination.endswith("on.\n")
> - gdb.execute("set pagination off")
> + state = save_state()
> + cmd.load_module_symbols(module)
> + load_state(state)
> + return False
>
> - if module_name in cmd.loaded_modules:
> - gdb.write("refreshing all symbols to reload module "
> - "'{0}'\n".format(module_name))
> - cmd.load_all_symbols()
> - else:
> - cmd.load_module_symbols(module)
> + class UnLoadModuleBreakpoint(gdb.Breakpoint):
> + def __init__(self, spec, gdb_command):
> + super(UnLoadModuleBreakpoint, self).__init__(spec, internal=True)
> + self.silent = True
> + self.gdb_command = gdb_command
> +
> + def stop(self):
> + module = gdb.parse_and_eval("mod")
> + module_name = module['name'].string()
> + cmd = self.gdb_command
>
> - # restore pagination state
> - gdb.execute("set pagination %s" % ("on" if pagination else "off"))
> + if not module_name in cmd.loaded_modules:
> + return False
>
Same question as above. For robustness, checking is not bad. But maybe
it's worth reporting as well.
> + state = save_state()
> + cmd.unload_module_symbols(module)
> + load_state(state)
> return False
>
>
> @@ -64,8 +94,9 @@ lx-symbols command."""
> module_paths = []
> module_files = []
> module_files_updated = False
> - loaded_modules = []
> - breakpoint = None
> + loaded_modules = {}
> + module_load_breakpoint = None
> + module_unload_breakpoint = None
>
> def __init__(self):
> super(LxSymbols, self).__init__("lx-symbols", gdb.COMMAND_FILES,
> @@ -129,21 +160,32 @@ lx-symbols command."""
> filename=module_file,
> addr=module_addr,
> sections=self._section_arguments(module))
> +
> gdb.execute(cmdline, to_string=True)
> - if module_name not in self.loaded_modules:
> - self.loaded_modules.append(module_name)
> + self.loaded_modules[module_name] = {"module_file": module_file,
> + "module_addr": module_addr}
> else:
> gdb.write("no module object found for '{0}'\n".format(module_name))
>
> + def unload_module_symbols(self, module):
> + module_name = module['name'].string()
> +
> + module_file = self.loaded_modules[module_name]["module_file"]
> + module_addr = self.loaded_modules[module_name]["module_addr"]
> +
> + gdb.write("unloading @{addr}: {filename}\n".format(
> + addr=module_addr, filename=module_file))
> + cmdline = "remove-symbol-file {filename}".format(
> + filename=module_file)
> +
> + gdb.execute(cmdline, to_string=True)
> + del self.loaded_modules[module_name]
> +
> +
> def load_all_symbols(self):
> gdb.write("loading vmlinux\n")
>
> - # Dropping symbols will disable all breakpoints. So save their states
> - # and restore them afterward.
> - saved_states = []
> - if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None:
> - for bp in gdb.breakpoints():
> - saved_states.append({'breakpoint': bp, 'enabled': bp.enabled})
> + state = save_state()
>
> # drop all current symbols and reload vmlinux
> orig_vmlinux = 'vmlinux'
> @@ -153,15 +195,14 @@ lx-symbols command."""
> gdb.execute("symbol-file", to_string=True)
> gdb.execute("symbol-file {0}".format(orig_vmlinux))
>
> - self.loaded_modules = []
> + self.loaded_modules = {}
> module_list = modules.module_list()
> if not module_list:
> gdb.write("no modules found\n")
> else:
> [self.load_module_symbols(module) for module in module_list]
>
> - for saved_state in saved_states:
> - saved_state['breakpoint'].enabled = saved_state['enabled']
> + load_state(state)
>
> def invoke(self, arg, from_tty):
> self.module_paths = [os.path.expanduser(p) for p in arg.split()]
> @@ -174,11 +215,18 @@ lx-symbols command."""
> self.load_all_symbols()
>
> if hasattr(gdb, 'Breakpoint'):
> - if self.breakpoint is not None:
> - self.breakpoint.delete()
> - self.breakpoint = None
> - self.breakpoint = LoadModuleBreakpoint(
> - "kernel/module.c:do_init_module", self)
> + if self.module_load_breakpoint is not None:
> + self.module_load_breakpoint.delete()
> + self.module_load_breakpoint = None
> + self.module_load_breakpoint = \
> + LoadModuleBreakpoint("kernel/module.c:do_init_module", self)
> +
> + if self.module_unload_breakpoint is not None:
> + self.module_unload_breakpoint.delete()
> + self.module_unload_breakpoint = None
> + self.module_unload_breakpoint = \
> + UnLoadModuleBreakpoint("kernel/module.c:free_module", self)
> +
> else:
> gdb.write("Note: symbol update on module loading not supported "
> "with this gdb version\n")
>
Good improvement!
Jan
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
On 16.03.21 13:34, Maxim Levitsky wrote:
> On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote:
>> On 16.03.21 11:59, Maxim Levitsky wrote:
>>> On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
>>>> On 16.03.21 00:37, Sean Christopherson wrote:
>>>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
>>>>>> This change greatly helps with two issues:
>>>>>>
>>>>>> * Resuming from a breakpoint is much more reliable.
>>>>>>
>>>>>> When resuming execution from a breakpoint, with interrupts enabled, more often
>>>>>> than not, KVM would inject an interrupt and make the CPU jump immediately to
>>>>>> the interrupt handler and eventually return to the breakpoint, to trigger it
>>>>>> again.
>>>>>>
>>>>>> From the user point of view it looks like the CPU never executed a
>>>>>> single instruction and in some cases that can even prevent forward progress,
>>>>>> for example, when the breakpoint is placed by an automated script
>>>>>> (e.g lx-symbols), which does something in response to the breakpoint and then
>>>>>> continues the guest automatically.
>>>>>> If the script execution takes enough time for another interrupt to arrive,
>>>>>> the guest will be stuck on the same breakpoint RIP forever.
>>>>>>
>>>>>> * Normal single stepping is much more predictable, since it won't land the
>>>>>> debugger into an interrupt handler, so it is much more usable.
>>>>>>
>>>>>> (If entry to an interrupt handler is desired, the user can still place a
>>>>>> breakpoint at it and resume the guest, which won't activate this workaround
>>>>>> and let the gdb still stop at the interrupt handler)
>>>>>>
>>>>>> Since this change is only active when guest is debugged, it won't affect
>>>>>> KVM running normal 'production' VMs.
>>>>>>
>>>>>>
>>>>>> Signed-off-by: Maxim Levitsky <[email protected]>
>>>>>> Tested-by: Stefano Garzarella <[email protected]>
>>>>>> ---
>>>>>> arch/x86/kvm/x86.c | 6 ++++++
>>>>>> 1 file changed, 6 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>> index a9d95f90a0487..b75d990fcf12b 100644
>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
>>>>>> can_inject = false;
>>>>>> }
>>>>>>
>>>>>> + /*
>>>>>> + * Don't inject interrupts while single stepping to make guest debug easier
>>>>>> + */
>>>>>> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
>>>>>> + return;
>>>>>
>>>>> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
>>>>> blocking at the start of single-stepping, unwind at the end? Deviating this far
>>>>> from architectural behavior will end in tears at some point.
>>>>>
>>>>
>>>> Does this happen to address this suspicious workaround in the kernel?
>>>>
>>>> /*
>>>> * The kernel doesn't use TF single-step outside of:
>>>> *
>>>> * - Kprobes, consumed through kprobe_debug_handler()
>>>> * - KGDB, consumed through notify_debug()
>>>> *
>>>> * So if we get here with DR_STEP set, something is wonky.
>>>> *
>>>> * A known way to trigger this is through QEMU's GDB stub,
>>>> * which leaks #DB into the guest and causes IST recursion.
>>>> */
>>>> if (WARN_ON_ONCE(dr6 & DR_STEP))
>>>> regs->flags &= ~X86_EFLAGS_TF;
>>>>
>>>> (arch/x86/kernel/traps.c, exc_debug_kernel)
>>>>
>>>> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh,
>>>> yeah, question to myself as well, dancing around broken guest debugging
>>>> for a long time while trying to fix other issues...
>>>
>>> To be honest I didn't see that warning even once, but I can imagine KVM
>>> leaking #DB due to bugs in that code. That area historically didn't receive
>>> much attention since it can only be triggered by
>>> KVM_GET/SET_GUEST_DEBUG which isn't used in production.
>>
>> I've triggered it recently while debugging a guest, that's why I got
>> aware of the code path. Long ago, all this used to work (soft BPs,
>> single-stepping etc.)
>>
>>> The only issue that I on the other hand did
>>> see which is mostly gdb fault is that it fails to remove a software breakpoint
>>> when resuming over it, if that breakpoint's python handler messes up
>>> with gdb's symbols, which is what lx-symbols does.
>>>
>>> And that despite the fact that lx-symbol doesn't mess with the object
>>> (that is the kernel) where the breakpoint is defined.
>>>
>>> Just adding/removing one symbol file is enough to trigger this issue.
>>>
>>> Since lx-symbols already works this around when it reloads all symbols,
>>> I extended that workaround to happen also when loading/unloading
>>> only a single symbol file.
>>
>> You have no issue with interactive debugging when NOT using gdb scripts
>> / lx-symbol?
>
> To be honest I don't use guest debugging that much,
> so I probably missed some issues.
>
> Now that I fixed lx-symbols though I'll probably use
> guest debugging much more.
> I will keep an eye on any issues that I find.
>
> The main push to fix lx-symbols actually came
> from me wanting to understand if there is something
> broken with KVM's guest debugging knowing that
> lx-symbols crashes the guest when module is loaded
> after lx-symbols was executed.
>
> That lx-symbols related guest crash I traced to issue
> with gdb as I explained, and the lack of blocking of the interrupts
> on single step is not a bug but more a missing feature
> that should be implemented to make single step easier to use.
Again, this used to work fine. But maybe this patch can change the
picture by avoid that the unavoidable short TF leakage into the guest
escalates beyond the single instruction to step over.
>
> Another issue which isn't a bug is that you can't place a software
> breakpoint if kernel is not loaded (since there is no code in memory)
> or if the kernel haven't done basic paging initialization
> (since there is no paging yet to know where to place the breakpoint).
> Hardware breakpoints work for this fine though.
>
> So in summary I haven't found any major issues with KVM's guest debug
> yet.
>
> If I do notice issues with guest debug, I will try to isolate
> and debug them.
> For the issue that you mentioned, do you have a way to reproduce it?
I need to spend some time in factoring out a clean test setup, will come
back to you. I'm always pushing this to the back - and then grumble when
hitting it while debugging something urgent. Your patch is a nice reason
to do this systematically now.
Jan
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
On Tue, 2021-03-16 at 14:46 +0100, Jan Kiszka wrote:
> On 16.03.21 13:34, Maxim Levitsky wrote:
> > On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote:
> > > On 16.03.21 11:59, Maxim Levitsky wrote:
> > > > On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
> > > > > On 16.03.21 00:37, Sean Christopherson wrote:
> > > > > > On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> > > > > > > This change greatly helps with two issues:
> > > > > > >
> > > > > > > * Resuming from a breakpoint is much more reliable.
> > > > > > >
> > > > > > > When resuming execution from a breakpoint, with interrupts enabled, more often
> > > > > > > than not, KVM would inject an interrupt and make the CPU jump immediately to
> > > > > > > the interrupt handler and eventually return to the breakpoint, to trigger it
> > > > > > > again.
> > > > > > >
> > > > > > > From the user point of view it looks like the CPU never executed a
> > > > > > > single instruction and in some cases that can even prevent forward progress,
> > > > > > > for example, when the breakpoint is placed by an automated script
> > > > > > > (e.g lx-symbols), which does something in response to the breakpoint and then
> > > > > > > continues the guest automatically.
> > > > > > > If the script execution takes enough time for another interrupt to arrive,
> > > > > > > the guest will be stuck on the same breakpoint RIP forever.
> > > > > > >
> > > > > > > * Normal single stepping is much more predictable, since it won't land the
> > > > > > > debugger into an interrupt handler, so it is much more usable.
> > > > > > >
> > > > > > > (If entry to an interrupt handler is desired, the user can still place a
> > > > > > > breakpoint at it and resume the guest, which won't activate this workaround
> > > > > > > and let the gdb still stop at the interrupt handler)
> > > > > > >
> > > > > > > Since this change is only active when guest is debugged, it won't affect
> > > > > > > KVM running normal 'production' VMs.
> > > > > > >
> > > > > > >
> > > > > > > Signed-off-by: Maxim Levitsky <[email protected]>
> > > > > > > Tested-by: Stefano Garzarella <[email protected]>
> > > > > > > ---
> > > > > > > arch/x86/kvm/x86.c | 6 ++++++
> > > > > > > 1 file changed, 6 insertions(+)
> > > > > > >
> > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > > index a9d95f90a0487..b75d990fcf12b 100644
> > > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
> > > > > > > can_inject = false;
> > > > > > > }
> > > > > > >
> > > > > > > + /*
> > > > > > > + * Don't inject interrupts while single stepping to make guest debug easier
> > > > > > > + */
> > > > > > > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> > > > > > > + return;
> > > > > >
> > > > > > Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
> > > > > > blocking at the start of single-stepping, unwind at the end? Deviating this far
> > > > > > from architectural behavior will end in tears at some point.
> > > > > >
> > > > >
> > > > > Does this happen to address this suspicious workaround in the kernel?
> > > > >
> > > > > /*
> > > > > * The kernel doesn't use TF single-step outside of:
> > > > > *
> > > > > * - Kprobes, consumed through kprobe_debug_handler()
> > > > > * - KGDB, consumed through notify_debug()
> > > > > *
> > > > > * So if we get here with DR_STEP set, something is wonky.
> > > > > *
> > > > > * A known way to trigger this is through QEMU's GDB stub,
> > > > > * which leaks #DB into the guest and causes IST recursion.
> > > > > */
> > > > > if (WARN_ON_ONCE(dr6 & DR_STEP))
> > > > > regs->flags &= ~X86_EFLAGS_TF;
> > > > >
> > > > > (arch/x86/kernel/traps.c, exc_debug_kernel)
> > > > >
> > > > > I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh,
> > > > > yeah, question to myself as well, dancing around broken guest debugging
> > > > > for a long time while trying to fix other issues...
> > > >
> > > > To be honest I didn't see that warning even once, but I can imagine KVM
> > > > leaking #DB due to bugs in that code. That area historically didn't receive
> > > > much attention since it can only be triggered by
> > > > KVM_GET/SET_GUEST_DEBUG which isn't used in production.
> > >
> > > I've triggered it recently while debugging a guest, that's why I got
> > > aware of the code path. Long ago, all this used to work (soft BPs,
> > > single-stepping etc.)
> > >
> > > > The only issue that I on the other hand did
> > > > see which is mostly gdb fault is that it fails to remove a software breakpoint
> > > > when resuming over it, if that breakpoint's python handler messes up
> > > > with gdb's symbols, which is what lx-symbols does.
> > > >
> > > > And that despite the fact that lx-symbol doesn't mess with the object
> > > > (that is the kernel) where the breakpoint is defined.
> > > >
> > > > Just adding/removing one symbol file is enough to trigger this issue.
> > > >
> > > > Since lx-symbols already works this around when it reloads all symbols,
> > > > I extended that workaround to happen also when loading/unloading
> > > > only a single symbol file.
> > >
> > > You have no issue with interactive debugging when NOT using gdb scripts
> > > / lx-symbol?
> >
> > To be honest I don't use guest debugging that much,
> > so I probably missed some issues.
> >
> > Now that I fixed lx-symbols though I'll probably use
> > guest debugging much more.
> > I will keep an eye on any issues that I find.
> >
> > The main push to fix lx-symbols actually came
> > from me wanting to understand if there is something
> > broken with KVM's guest debugging knowing that
> > lx-symbols crashes the guest when module is loaded
> > after lx-symbols was executed.
> >
> > That lx-symbols related guest crash I traced to issue
> > with gdb as I explained, and the lack of blocking of the interrupts
> > on single step is not a bug but more a missing feature
> > that should be implemented to make single step easier to use.
>
> Again, this used to work fine. But maybe this patch can change the
> picture by avoid that the unavoidable short TF leakage into the guest
> escalates beyond the single instruction to step over.
Actually now I think I understand what is going on.
The TF flag isn't auto cleared as RF flag is, and if the instruction
which is single stepped gets an interrupt it is pushed onto the interrupt stack.
(then it is cleared for the duration of the interrupt handler)
Since we use the TF flag for single stepping the guest, this indeed can
cause it to be leaked.
So this patch actually should mitigate this almost completely.
Also now I understand why Intel has the 'monitor trap' feature, I think it
is exactly for the cases when hypervisor wants to single step the guest
without the fear of changing of the guest visible cpu state.
KVM on VMX should probably switch to using monitor trap for single stepping.
Best regards,
Maxim Levitsky
>
> > Another issue which isn't a bug is that you can't place a software
> > breakpoint if kernel is not loaded (since there is no code in memory)
> > or if the kernel haven't done basic paging initialization
> > (since there is no paging yet to know where to place the breakpoint).
> > Hardware breakpoints work for this fine though.
> >
> > So in summary I haven't found any major issues with KVM's guest debug
> > yet.
> >
> > If I do notice issues with guest debug, I will try to isolate
> > and debug them.
> > For the issue that you mentioned, do you have a way to reproduce it?
>
> I need to spend some time in factoring out a clean test setup, will come
> back to you. I'm always pushing this to the back - and then grumble when
> hitting it while debugging something urgent. Your patch is a nice reason
> to do this systematically now.
Thanks for the review,
Best regards,
Maxim Levitsky
>
> Jan
>
On 16.03.21 15:34, Maxim Levitsky wrote:
> On Tue, 2021-03-16 at 14:46 +0100, Jan Kiszka wrote:
>> On 16.03.21 13:34, Maxim Levitsky wrote:
>>> On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote:
>>>> On 16.03.21 11:59, Maxim Levitsky wrote:
>>>>> On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
>>>>>> On 16.03.21 00:37, Sean Christopherson wrote:
>>>>>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
>>>>>>>> This change greatly helps with two issues:
>>>>>>>>
>>>>>>>> * Resuming from a breakpoint is much more reliable.
>>>>>>>>
>>>>>>>> When resuming execution from a breakpoint, with interrupts enabled, more often
>>>>>>>> than not, KVM would inject an interrupt and make the CPU jump immediately to
>>>>>>>> the interrupt handler and eventually return to the breakpoint, to trigger it
>>>>>>>> again.
>>>>>>>>
>>>>>>>> From the user point of view it looks like the CPU never executed a
>>>>>>>> single instruction and in some cases that can even prevent forward progress,
>>>>>>>> for example, when the breakpoint is placed by an automated script
>>>>>>>> (e.g lx-symbols), which does something in response to the breakpoint and then
>>>>>>>> continues the guest automatically.
>>>>>>>> If the script execution takes enough time for another interrupt to arrive,
>>>>>>>> the guest will be stuck on the same breakpoint RIP forever.
>>>>>>>>
>>>>>>>> * Normal single stepping is much more predictable, since it won't land the
>>>>>>>> debugger into an interrupt handler, so it is much more usable.
>>>>>>>>
>>>>>>>> (If entry to an interrupt handler is desired, the user can still place a
>>>>>>>> breakpoint at it and resume the guest, which won't activate this workaround
>>>>>>>> and let the gdb still stop at the interrupt handler)
>>>>>>>>
>>>>>>>> Since this change is only active when guest is debugged, it won't affect
>>>>>>>> KVM running normal 'production' VMs.
>>>>>>>>
>>>>>>>>
>>>>>>>> Signed-off-by: Maxim Levitsky <[email protected]>
>>>>>>>> Tested-by: Stefano Garzarella <[email protected]>
>>>>>>>> ---
>>>>>>>> arch/x86/kvm/x86.c | 6 ++++++
>>>>>>>> 1 file changed, 6 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>>>> index a9d95f90a0487..b75d990fcf12b 100644
>>>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>>>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
>>>>>>>> can_inject = false;
>>>>>>>> }
>>>>>>>>
>>>>>>>> + /*
>>>>>>>> + * Don't inject interrupts while single stepping to make guest debug easier
>>>>>>>> + */
>>>>>>>> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
>>>>>>>> + return;
>>>>>>>
>>>>>>> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
>>>>>>> blocking at the start of single-stepping, unwind at the end? Deviating this far
>>>>>>> from architectural behavior will end in tears at some point.
>>>>>>>
>>>>>>
>>>>>> Does this happen to address this suspicious workaround in the kernel?
>>>>>>
>>>>>> /*
>>>>>> * The kernel doesn't use TF single-step outside of:
>>>>>> *
>>>>>> * - Kprobes, consumed through kprobe_debug_handler()
>>>>>> * - KGDB, consumed through notify_debug()
>>>>>> *
>>>>>> * So if we get here with DR_STEP set, something is wonky.
>>>>>> *
>>>>>> * A known way to trigger this is through QEMU's GDB stub,
>>>>>> * which leaks #DB into the guest and causes IST recursion.
>>>>>> */
>>>>>> if (WARN_ON_ONCE(dr6 & DR_STEP))
>>>>>> regs->flags &= ~X86_EFLAGS_TF;
>>>>>>
>>>>>> (arch/x86/kernel/traps.c, exc_debug_kernel)
>>>>>>
>>>>>> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh,
>>>>>> yeah, question to myself as well, dancing around broken guest debugging
>>>>>> for a long time while trying to fix other issues...
>>>>>
>>>>> To be honest I didn't see that warning even once, but I can imagine KVM
>>>>> leaking #DB due to bugs in that code. That area historically didn't receive
>>>>> much attention since it can only be triggered by
>>>>> KVM_GET/SET_GUEST_DEBUG which isn't used in production.
>>>>
>>>> I've triggered it recently while debugging a guest, that's why I got
>>>> aware of the code path. Long ago, all this used to work (soft BPs,
>>>> single-stepping etc.)
>>>>
>>>>> The only issue that I on the other hand did
>>>>> see which is mostly gdb fault is that it fails to remove a software breakpoint
>>>>> when resuming over it, if that breakpoint's python handler messes up
>>>>> with gdb's symbols, which is what lx-symbols does.
>>>>>
>>>>> And that despite the fact that lx-symbol doesn't mess with the object
>>>>> (that is the kernel) where the breakpoint is defined.
>>>>>
>>>>> Just adding/removing one symbol file is enough to trigger this issue.
>>>>>
>>>>> Since lx-symbols already works this around when it reloads all symbols,
>>>>> I extended that workaround to happen also when loading/unloading
>>>>> only a single symbol file.
>>>>
>>>> You have no issue with interactive debugging when NOT using gdb scripts
>>>> / lx-symbol?
>>>
>>> To be honest I don't use guest debugging that much,
>>> so I probably missed some issues.
>>>
>>> Now that I fixed lx-symbols though I'll probably use
>>> guest debugging much more.
>>> I will keep an eye on any issues that I find.
>>>
>>> The main push to fix lx-symbols actually came
>>> from me wanting to understand if there is something
>>> broken with KVM's guest debugging knowing that
>>> lx-symbols crashes the guest when module is loaded
>>> after lx-symbols was executed.
>>>
>>> That lx-symbols related guest crash I traced to issue
>>> with gdb as I explained, and the lack of blocking of the interrupts
>>> on single step is not a bug but more a missing feature
>>> that should be implemented to make single step easier to use.
>>
>> Again, this used to work fine. But maybe this patch can change the
>> picture by avoid that the unavoidable short TF leakage into the guest
>> escalates beyond the single instruction to step over.
>
>
> Actually now I think I understand what is going on.
>
> The TF flag isn't auto cleared as RF flag is, and if the instruction
> which is single stepped gets an interrupt it is pushed onto the interrupt stack.
> (then it is cleared for the duration of the interrupt handler)
> Since we use the TF flag for single stepping the guest, this indeed can
> cause it to be leaked.
>
> So this patch actually should mitigate this almost completely.
>
> Also now I understand why Intel has the 'monitor trap' feature, I think it
> is exactly for the cases when hypervisor wants to single step the guest
> without the fear of changing of the guest visible cpu state.
Exactly.
>
> KVM on VMX should probably switch to using monitor trap for single stepping.
Back then, when I was hacking on the gdb-stub and KVM support, the
monitor trap flag was not yet broadly available, but the idea to once
use it was already there. Now it can be considered broadly available,
but it would still require some changes to get it in.
Unfortunately, we don't have such thing with SVM, even recent versions,
right? So, a proper way of avoiding diverting event injections while we
are having the guest in an "incorrect" state should definitely be the goal.
Given that KVM knows whether TF originates solely from guest debugging
or was (also) injected by the guest, we should be able to identify the
cases where your approach is best to apply. And that without any extra
control knob that everyone will only forget to set.
Jan
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
On 16.03.21 17:50, Sean Christopherson wrote:
> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
>> On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote:
>>> Back then, when I was hacking on the gdb-stub and KVM support, the
>>> monitor trap flag was not yet broadly available, but the idea to once
>>> use it was already there. Now it can be considered broadly available,
>>> but it would still require some changes to get it in.
>>>
>>> Unfortunately, we don't have such thing with SVM, even recent versions,
>>> right? So, a proper way of avoiding diverting event injections while we
>>> are having the guest in an "incorrect" state should definitely be the goal.
>> Yes, I am not aware of anything like monitor trap on SVM.
>>
>>>
>>> Given that KVM knows whether TF originates solely from guest debugging
>>> or was (also) injected by the guest, we should be able to identify the
>>> cases where your approach is best to apply. And that without any extra
>>> control knob that everyone will only forget to set.
>> Well I think that the downside of this patch is that the user might actually
>> want to single step into an interrupt handler, and this patch makes it a bit
>> more complicated, and changes the default behavior.
>
> Yes. And, as is, this also blocks NMIs and SMIs. I suspect it also doesn't
> prevent weirdness if the guest is running in L2, since IRQs for L1 will cause
> exits from L2 during nested_ops->check_events().
>
>> I have no objections though to use this patch as is, or at least make this
>> the new default with a new flag to override this.
>
> That's less bad, but IMO still violates the principle of least surprise, e.g.
> someone that is single-stepping a guest and is expecting an IRQ to fire will be
> all kinds of confused if they see all the proper IRR, ISR, EFLAGS.IF, etc...
> settings, but no interrupt.
From my practical experience with debugging guests via single step,
seeing an interrupt in that case is everything but handy and generally
also not expected (though logical, I agree). IOW: When there is a knob
for it, it will remain off in 99% of the time.
But I see the point of having some control, in an ideal world also an
indication that there are pending events, permitting the user to decide
what to do. But I suspect the gdb frontend and protocol does not easily
permit that.
>
>> Sean Christopherson, what do you think?
>
> Rather than block all events in KVM, what about having QEMU "pause" the timer?
> E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out
> which flavor it's using), clear them to zero, then restore both when
> single-stepping is disabled. I think that will work?
>
No one can stop the clock, and timers are only one source of interrupts.
Plus they do not all come from QEMU, some also from KVM or in-kernel
sources directly. Would quickly become a mess.
Jan
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
On Tue, 2021-03-16 at 18:01 +0100, Jan Kiszka wrote:
> On 16.03.21 17:50, Sean Christopherson wrote:
> > On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> > > On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote:
> > > > Back then, when I was hacking on the gdb-stub and KVM support, the
> > > > monitor trap flag was not yet broadly available, but the idea to once
> > > > use it was already there. Now it can be considered broadly available,
> > > > but it would still require some changes to get it in.
> > > >
> > > > Unfortunately, we don't have such thing with SVM, even recent versions,
> > > > right? So, a proper way of avoiding diverting event injections while we
> > > > are having the guest in an "incorrect" state should definitely be the goal.
> > > Yes, I am not aware of anything like monitor trap on SVM.
> > >
> > > > Given that KVM knows whether TF originates solely from guest debugging
> > > > or was (also) injected by the guest, we should be able to identify the
> > > > cases where your approach is best to apply. And that without any extra
> > > > control knob that everyone will only forget to set.
> > > Well I think that the downside of this patch is that the user might actually
> > > want to single step into an interrupt handler, and this patch makes it a bit
> > > more complicated, and changes the default behavior.
> >
> > Yes. And, as is, this also blocks NMIs and SMIs. I suspect it also doesn't
> > prevent weirdness if the guest is running in L2, since IRQs for L1 will cause
> > exits from L2 during nested_ops->check_events().
> >
> > > I have no objections though to use this patch as is, or at least make this
> > > the new default with a new flag to override this.
> >
> > That's less bad, but IMO still violates the principle of least surprise, e.g.
> > someone that is single-stepping a guest and is expecting an IRQ to fire will be
> > all kinds of confused if they see all the proper IRR, ISR, EFLAGS.IF, etc...
> > settings, but no interrupt.
>
> From my practical experience with debugging guests via single step,
> seeing an interrupt in that case is everything but handy and generally
> also not expected (though logical, I agree). IOW: When there is a knob
> for it, it will remain off in 99% of the time.
>
> But I see the point of having some control, in an ideal world also an
> indication that there are pending events, permitting the user to decide
> what to do. But I suspect the gdb frontend and protocol does not easily
> permit that.
Qemu gdbstub actually does have control over suppression of the interrupts
over a single step and it is even enabled by default:
https://qemu.readthedocs.io/en/latest/system/gdb.html
(advanced debug options)
However it is currently only implemented in TCG (software emulator) mode
and not in KVM mode (I can argue that this is a qemu bug).
So my plan was to add a new kvm guest debug flag KVM_GUESTDBG_BLOCKEVENTS,
and let qemu enable it when its 'NOIRQ' mode is enabled (it is by default).
However due to the discussion in this thread about the leakage of the RFLAGS.TF,
I wonder if kvm should by default suppress events and have something like
KVM_GUESTDBG_SSTEP_ALLOW_EVENTS to override this and wire
that to qemu's NOIRQ=false case.
This will allow older qemu to work correctly and new qemu will be able to choose
the old less ideal behavior.
>
> > > Sean Christopherson, what do you think?
> >
> > Rather than block all events in KVM, what about having QEMU "pause" the timer?
> > E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out
> > which flavor it's using), clear them to zero, then restore both when
> > single-stepping is disabled. I think that will work?
> >
>
> No one can stop the clock, and timers are only one source of interrupts.
> Plus they do not all come from QEMU, some also from KVM or in-kernel
> sources directly. Would quickly become a mess.
This, plus as we see, even changing with RFLAGS.TF leaks it.
Changing things like MSR_TSC_DEADLINE will also make it visible to the guest,
sooner or later and is a mess that I rather not get into.
It is _possible_ to disable timer interrupts 'out of band', but that is messy too
if done from userspace. For example, what if the timer interrupt is already pending
in local apic, when qemu decides to single step?
Also with gdbstub the user doesn't have to stop all vcpus (there is a non-stop mode),
in which only some vcpus are stopped which is actually a very cool feature,
and of course running vcpus can raise events.
Also interrupts can indeed come from things like vhost.
Best regards,
Maxim Levitsky
> Jan
>
On 16.03.21 18:26, Sean Christopherson wrote:
> On Tue, Mar 16, 2021, Jan Kiszka wrote:
>> On 16.03.21 17:50, Sean Christopherson wrote:
>>> Rather than block all events in KVM, what about having QEMU "pause" the timer?
>>> E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out
>>> which flavor it's using), clear them to zero, then restore both when
>>> single-stepping is disabled. I think that will work?
>>>
>>
>> No one can stop the clock, and timers are only one source of interrupts.
>> Plus they do not all come from QEMU, some also from KVM or in-kernel
>> sources directly.
>
> But are any other sources of interrupts a chronic problem? I 100% agree that
If you are debugging a problem, you are not interested in seening
problems of the debugger, only real ones of your target. IOW: Yes, they
are, even if less likely - for idle VMs.
> this would not be a robust solution, but neither is blocking events in KVM. At
> least with this approach, the blast radius is somewhat contained.
>
>> Would quickly become a mess.
>
> Maybe, but it'd be Qemu's mess ;-)
>
Nope, it would spread to KVM as well, as indicated above.
Jan
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
On Tue, Mar 16, 2021 at 12:51:20PM +0200, Maxim Levitsky wrote:
> I agree but what is wrong with that?
> This is a debug feature, and it only can be enabled by the root,
> and so someone might actually want this case to happen
> (e.g to see if a SEV guest can cope with extra #VC exceptions).
That doesn't make sense, we know that and SEV-ES guest can't cope with
extra #VC exceptions, so there is no point in testing this. It is more a
way to shot oneself into the foot for the user and a potential source of
bug reports for SEV-ES guests.
> I have nothing against not allowing this for SEV-ES guests though.
> What do you think?
I think SEV-ES guests should only have the intercept bits set which
guests acutally support.
Regards,
Joerg
On Thu, 2021-03-18 at 10:19 +0100, Joerg Roedel wrote:
> On Tue, Mar 16, 2021 at 12:51:20PM +0200, Maxim Levitsky wrote:
> > I agree but what is wrong with that?
> > This is a debug feature, and it only can be enabled by the root,
> > and so someone might actually want this case to happen
> > (e.g to see if a SEV guest can cope with extra #VC exceptions).
>
> That doesn't make sense, we know that and SEV-ES guest can't cope with
> extra #VC exceptions, so there is no point in testing this. It is more a
> way to shot oneself into the foot for the user and a potential source of
> bug reports for SEV-ES guests.
But again this is a debug feature, and it is intended to allow the user
to shoot himself in the foot. Bug reports for a debug feature
are autoclosed. It is no different from say poking kernel memory with
its built-in gdbstub, for example.
Best regards,
Maxim Levitsky
>
>
> > I have nothing against not allowing this for SEV-ES guests though.
> > What do you think?
>
> I think SEV-ES guests should only have the intercept bits set which
> guests acutally support
>
> Regards,
>
> Joerg
>
On 16.03.21 13:34, Maxim Levitsky wrote:
> On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote:
>> On 16.03.21 11:59, Maxim Levitsky wrote:
>>> On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
>>>> On 16.03.21 00:37, Sean Christopherson wrote:
>>>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
>>>>>> This change greatly helps with two issues:
>>>>>>
>>>>>> * Resuming from a breakpoint is much more reliable.
>>>>>>
>>>>>> When resuming execution from a breakpoint, with interrupts enabled, more often
>>>>>> than not, KVM would inject an interrupt and make the CPU jump immediately to
>>>>>> the interrupt handler and eventually return to the breakpoint, to trigger it
>>>>>> again.
>>>>>>
>>>>>> From the user point of view it looks like the CPU never executed a
>>>>>> single instruction and in some cases that can even prevent forward progress,
>>>>>> for example, when the breakpoint is placed by an automated script
>>>>>> (e.g lx-symbols), which does something in response to the breakpoint and then
>>>>>> continues the guest automatically.
>>>>>> If the script execution takes enough time for another interrupt to arrive,
>>>>>> the guest will be stuck on the same breakpoint RIP forever.
>>>>>>
>>>>>> * Normal single stepping is much more predictable, since it won't land the
>>>>>> debugger into an interrupt handler, so it is much more usable.
>>>>>>
>>>>>> (If entry to an interrupt handler is desired, the user can still place a
>>>>>> breakpoint at it and resume the guest, which won't activate this workaround
>>>>>> and let the gdb still stop at the interrupt handler)
>>>>>>
>>>>>> Since this change is only active when guest is debugged, it won't affect
>>>>>> KVM running normal 'production' VMs.
>>>>>>
>>>>>>
>>>>>> Signed-off-by: Maxim Levitsky <[email protected]>
>>>>>> Tested-by: Stefano Garzarella <[email protected]>
>>>>>> ---
>>>>>> arch/x86/kvm/x86.c | 6 ++++++
>>>>>> 1 file changed, 6 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>> index a9d95f90a0487..b75d990fcf12b 100644
>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
>>>>>> can_inject = false;
>>>>>> }
>>>>>>
>>>>>> + /*
>>>>>> + * Don't inject interrupts while single stepping to make guest debug easier
>>>>>> + */
>>>>>> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
>>>>>> + return;
>>>>>
>>>>> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
>>>>> blocking at the start of single-stepping, unwind at the end? Deviating this far
>>>>> from architectural behavior will end in tears at some point.
>>>>>
>>>>
>>>> Does this happen to address this suspicious workaround in the kernel?
>>>>
>>>> /*
>>>> * The kernel doesn't use TF single-step outside of:
>>>> *
>>>> * - Kprobes, consumed through kprobe_debug_handler()
>>>> * - KGDB, consumed through notify_debug()
>>>> *
>>>> * So if we get here with DR_STEP set, something is wonky.
>>>> *
>>>> * A known way to trigger this is through QEMU's GDB stub,
>>>> * which leaks #DB into the guest and causes IST recursion.
>>>> */
>>>> if (WARN_ON_ONCE(dr6 & DR_STEP))
>>>> regs->flags &= ~X86_EFLAGS_TF;
>>>>
>>>> (arch/x86/kernel/traps.c, exc_debug_kernel)
>>>>
>>>> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh,
>>>> yeah, question to myself as well, dancing around broken guest debugging
>>>> for a long time while trying to fix other issues...
>>>
>>> To be honest I didn't see that warning even once, but I can imagine KVM
>>> leaking #DB due to bugs in that code. That area historically didn't receive
>>> much attention since it can only be triggered by
>>> KVM_GET/SET_GUEST_DEBUG which isn't used in production.
>>
>> I've triggered it recently while debugging a guest, that's why I got
>> aware of the code path. Long ago, all this used to work (soft BPs,
>> single-stepping etc.)
>>
>>> The only issue that I on the other hand did
>>> see which is mostly gdb fault is that it fails to remove a software breakpoint
>>> when resuming over it, if that breakpoint's python handler messes up
>>> with gdb's symbols, which is what lx-symbols does.
>>>
>>> And that despite the fact that lx-symbol doesn't mess with the object
>>> (that is the kernel) where the breakpoint is defined.
>>>
>>> Just adding/removing one symbol file is enough to trigger this issue.
>>>
>>> Since lx-symbols already works this around when it reloads all symbols,
>>> I extended that workaround to happen also when loading/unloading
>>> only a single symbol file.
>>
>> You have no issue with interactive debugging when NOT using gdb scripts
>> / lx-symbol?
>
> To be honest I don't use guest debugging that much,
> so I probably missed some issues.
>
> Now that I fixed lx-symbols though I'll probably use
> guest debugging much more.
> I will keep an eye on any issues that I find.
>
> The main push to fix lx-symbols actually came
> from me wanting to understand if there is something
> broken with KVM's guest debugging knowing that
> lx-symbols crashes the guest when module is loaded
> after lx-symbols was executed.
>
> That lx-symbols related guest crash I traced to issue
> with gdb as I explained, and the lack of blocking of the interrupts
> on single step is not a bug but more a missing feature
> that should be implemented to make single step easier to use.
>
> Another issue which isn't a bug is that you can't place a software
> breakpoint if kernel is not loaded (since there is no code in memory)
> or if the kernel haven't done basic paging initialization
> (since there is no paging yet to know where to place the breakpoint).
> Hardware breakpoints work for this fine though.
>
> So in summary I haven't found any major issues with KVM's guest debug
> yet.
>
> If I do notice issues with guest debug, I will try to isolate
> and debug them.
> For the issue that you mentioned, do you have a way to reproduce it?
To pick this up again, I did some experiments and was easily able to
reproduce the main problem:
- checkout linux master (1df27313f50 yesterday)
- build a fitting host kernel with KVM
- build a target kernel with defconfig + CONFIG_KVM_GUEST +
CONFIG_DEBUG_INFO, no gdb scripts for now
- boot the guest
qemu-system-x86_64 linux.img -enable-kvm -smp 4 -s
-kernel bzImage -append "console=ttyS0 root=... nokaslr"
- gdb vmlinux
- tar rem :1234
- b __x64_sys_execve
- continue whenever you hit the breakpoint, and the guest will
eventually hang
- or stepi, and you may end up in the interrupt handler
- specifically doing that on the serial console or setting the bp in
early boot exposes the issue
I've also seen
WARNING: CPU: 3 PID: 751 at ../arch/x86/kernel/traps.c:915
exc_debug+0x16b/0x1c0
from time to time.
When I apply this patch to the host, the problems are gone.
Interestingly, my OpenSUSE 5.3.18-lp152.66-default does not show this
behavior and /seems/ stable from quick testing. Not sure if there were
changes in upstream between that baseline and head or if Suse is
carrying a local fix.
In any case, I think we need the following:
- default disabling of event injections when guest debugging injected
TF
- possibly some control interface to allow events BUT then avoids any
TF injection to prevent guest state corruptions
- exposure of that interface to the gdb frontend, maybe by making it
available via the QEMU monitor (monitor ...)
- a for kvm-unit-tests to trigger the above corner cases
Jan
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
On Thu, Mar 18, 2021 at 11:24:25AM +0200, Maxim Levitsky wrote:
> But again this is a debug feature, and it is intended to allow the user
> to shoot himself in the foot.
And one can't debug SEV-ES guests with it, so what is the point of
enabling it for them too?
Regards,
Joerg
[only saw this now, or delivery to me was delayed - anyway]
On 16.03.21 19:02, Maxim Levitsky wrote:
> On Tue, 2021-03-16 at 18:01 +0100, Jan Kiszka wrote:
>> On 16.03.21 17:50, Sean Christopherson wrote:
>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
>>>> On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote:
>>>>> Back then, when I was hacking on the gdb-stub and KVM support, the
>>>>> monitor trap flag was not yet broadly available, but the idea to once
>>>>> use it was already there. Now it can be considered broadly available,
>>>>> but it would still require some changes to get it in.
>>>>>
>>>>> Unfortunately, we don't have such thing with SVM, even recent versions,
>>>>> right? So, a proper way of avoiding diverting event injections while we
>>>>> are having the guest in an "incorrect" state should definitely be the goal.
>>>> Yes, I am not aware of anything like monitor trap on SVM.
>>>>
>>>>> Given that KVM knows whether TF originates solely from guest debugging
>>>>> or was (also) injected by the guest, we should be able to identify the
>>>>> cases where your approach is best to apply. And that without any extra
>>>>> control knob that everyone will only forget to set.
>>>> Well I think that the downside of this patch is that the user might actually
>>>> want to single step into an interrupt handler, and this patch makes it a bit
>>>> more complicated, and changes the default behavior.
>>>
>>> Yes. And, as is, this also blocks NMIs and SMIs. I suspect it also doesn't
>>> prevent weirdness if the guest is running in L2, since IRQs for L1 will cause
>>> exits from L2 during nested_ops->check_events().
>>>
>>>> I have no objections though to use this patch as is, or at least make this
>>>> the new default with a new flag to override this.
>>>
>>> That's less bad, but IMO still violates the principle of least surprise, e.g.
>>> someone that is single-stepping a guest and is expecting an IRQ to fire will be
>>> all kinds of confused if they see all the proper IRR, ISR, EFLAGS.IF, etc...
>>> settings, but no interrupt.
>>
>> From my practical experience with debugging guests via single step,
>> seeing an interrupt in that case is everything but handy and generally
>> also not expected (though logical, I agree). IOW: When there is a knob
>> for it, it will remain off in 99% of the time.
>>
>> But I see the point of having some control, in an ideal world also an
>> indication that there are pending events, permitting the user to decide
>> what to do. But I suspect the gdb frontend and protocol does not easily
>> permit that.
>
> Qemu gdbstub actually does have control over suppression of the interrupts
> over a single step and it is even enabled by default:
>
> https://qemu.readthedocs.io/en/latest/system/gdb.html
> (advanced debug options)
>
Ah, cool! Absolutely in line with what we need.
> However it is currently only implemented in TCG (software emulator) mode
> and not in KVM mode (I can argue that this is a qemu bug).
Maybe the behavior of old KVM was not exposing the issue, thus no one
cared. As I wrote in the other mail today, even some recent kernel do
not seem to break single-stepping, for yet unknown reasons.
>
> So my plan was to add a new kvm guest debug flag KVM_GUESTDBG_BLOCKEVENTS,
> and let qemu enable it when its 'NOIRQ' mode is enabled (it is by default).
>
> However due to the discussion in this thread about the leakage of the RFLAGS.TF,
> I wonder if kvm should by default suppress events and have something like
> KVM_GUESTDBG_SSTEP_ALLOW_EVENTS to override this and wire
> that to qemu's NOIRQ=false case.
>
> This will allow older qemu to work correctly and new qemu will be able to choose
> the old less ideal behavior.
Sounds very reasonable to me.
>
>>
>>>> Sean Christopherson, what do you think?
>>>
>>> Rather than block all events in KVM, what about having QEMU "pause" the timer?
>>> E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out
>>> which flavor it's using), clear them to zero, then restore both when
>>> single-stepping is disabled. I think that will work?
>>>
>>
>> No one can stop the clock, and timers are only one source of interrupts.
>> Plus they do not all come from QEMU, some also from KVM or in-kernel
>> sources directly. Would quickly become a mess.
>
> This, plus as we see, even changing with RFLAGS.TF leaks it.
As I wrote: When we take events, the leakage must be stopped for that
case. But that might be a bit more tricky because we need to stop on the
first instruction in the interrupt handler, thus need some TF, but we
must also remove it again from the flags saved for the interrupt context
on the guest's interrupt/exception handler stack.
> Changing things like MSR_TSC_DEADLINE will also make it visible to the guest,
> sooner or later and is a mess that I rather not get into.
>
> It is _possible_ to disable timer interrupts 'out of band', but that is messy too
> if done from userspace. For example, what if the timer interrupt is already pending
> in local apic, when qemu decides to single step?
>
> Also with gdbstub the user doesn't have to stop all vcpus (there is a non-stop mode),
> in which only some vcpus are stopped which is actually a very cool feature,
> and of course running vcpus can raise events.
>
> Also interrupts can indeed come from things like vhost.
>
Exactly.
Jan
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
On Thu, Mar 18, 2021, Joerg Roedel wrote:
> On Thu, Mar 18, 2021 at 11:24:25AM +0200, Maxim Levitsky wrote:
> > But again this is a debug feature, and it is intended to allow the user
> > to shoot himself in the foot.
>
> And one can't debug SEV-ES guests with it, so what is the point of
> enabling it for them too?
Agreed. I can see myself enabling debug features by default, it would be nice
to not having to go out of my way to disable them for SEV-ES/SNP guests.
Skipping SEV-ES guests should not be difficult; KVM could probably even
print a message stating that the debug hook is being ignored. One thought would
be to snapshot debug_intercept_exceptions at VM creation, and simply zero it out
for incompatible guests. That would also allow changing debug_intercept_exceptions
without reloading KVM, which IMO would be very convenient.
On Thu, 2021-03-18 at 16:35 +0000, Sean Christopherson wrote:
> On Thu, Mar 18, 2021, Joerg Roedel wrote:
> > On Thu, Mar 18, 2021 at 11:24:25AM +0200, Maxim Levitsky wrote:
> > > But again this is a debug feature, and it is intended to allow the user
> > > to shoot himself in the foot.
> >
> > And one can't debug SEV-ES guests with it, so what is the point of
> > enabling it for them too?
You can create a special SEV-ES guest which does handle all exceptions via
#VC, or just observe it fail which can be useful for some whatever reason.
>
> Agreed. I can see myself enabling debug features by default, it would be nice
> to not having to go out of my way to disable them for SEV-ES/SNP guests.
This does sound like a valid reason to disable this for SEV-ES.
>
> Skipping SEV-ES guests should not be difficult; KVM could probably even
> print a message stating that the debug hook is being ignored. One thought would
> be to snapshot debug_intercept_exceptions at VM creation, and simply zero it out
> for incompatible guests. That would also allow changing debug_intercept_exceptions
> without reloading KVM, which IMO would be very convenient.
>
So all right I'll disable this for SEV-ES.
The idea to change the debug_intercept_exceptions on the fly is also a good idea,
I will implement it in next version of the patches.
Thanks for the review,
Best regards,
Maxim Levitsky
On Thu, Mar 18, 2021, Maxim Levitsky wrote:
> On Thu, 2021-03-18 at 16:35 +0000, Sean Christopherson wrote:
> > Skipping SEV-ES guests should not be difficult; KVM could probably even
> > print a message stating that the debug hook is being ignored. One thought would
> > be to snapshot debug_intercept_exceptions at VM creation, and simply zero it out
> > for incompatible guests. That would also allow changing debug_intercept_exceptions
> > without reloading KVM, which IMO would be very convenient.
> >
> So all right I'll disable this for SEV-ES.
Belated thought. KVM doesn't know a guest will be an SEV-ES guest until
sev_es_guest_init(), and KVM currently doesn't prevent creating vCPUs before
KVM_SEV_ES_INIT. But, I'm 99% confident that's a KVM bug. For your purposes,
I think you can assume kvm->arch.debug_intercept_exceptions will _not_ change
after vCPU creation.
> The idea to change the debug_intercept_exceptions on the fly is also a good idea,
> I will implement it in next version of the patches.
Can you also move the module param to x86? It doesn't need to be wired up for
VMX right away, but it makes sense to do it at some point, and ideally folks
won't have to update their scripts when that happens.