2014-10-20 13:53:16

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86, microcode, intel: clarify log messages

On Mon, Sep 08, 2014 at 02:37:49PM -0300, Henrique de Moraes Holschuh wrote:
> The Intel microcode update driver will skip the second hardware thread
> on hyper-threaded cores during an update run, as the first hardware
> thread will have updated the entire core. This can confuse users,
> because it will look like some CPUs were not updated in the system log.
> Attempt to clarify the log messages to hint that we might be updating
> more than one CPU (hardware thread) at a time.
>
> Make sure all driver log messages conform to this template: "microcode:
> CPU#: <message>". The <message> might refer to the core, and not to the
> hardware thread/CPU.
>
> Reword error and debug messages for clarity or style. Tag most error
> messages as "error:", and warnings as "warning:". Report conditions
> which will stop a microcode update as errors, and conditions which will
> not stop a microcode update as warnings.
>
> Signed-off-by: Henrique de Moraes Holschuh <[email protected]>
> ---
> arch/x86/kernel/cpu/microcode/intel.c | 10 +++++-----
> arch/x86/kernel/cpu/microcode/intel_early.c | 11 +++++++----
> arch/x86/kernel/cpu/microcode/intel_lib.c | 12 ++++++------
> 3 files changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
> index 2c629d1..e99cdd8 100644
> --- a/arch/x86/kernel/cpu/microcode/intel.c
> +++ b/arch/x86/kernel/cpu/microcode/intel.c
> @@ -115,7 +115,7 @@ static int collect_cpu_info(int cpu_num, struct cpu_signature *csig)
> {
> __collect_cpu_info(cpu_num, csig);
>
> - pr_info("CPU%d sig=0x%x, pf=0x%x, revision=0x%x\n",
> + pr_info("CPU%d: sig=0x%x, pf=0x%x, revision=0x%x\n",
> cpu_num, csig->sig, csig->pf, csig->rev);
>
> return 0;
> @@ -178,11 +178,11 @@ static int apply_microcode_intel(int cpu)
> rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
>
> if (val[1] != mc_intel->hdr.rev) {
> - pr_err("CPU%d update to revision 0x%x failed\n",
> + pr_err("CPU%d: update to revision 0x%x rejected by the processor\n",
> cpu_num, mc_intel->hdr.rev);
> return -1;
> }
> - pr_info("CPU%d updated to revision 0x%x, date = %04x-%02x-%02x\n",
> + pr_info("CPU%d: entire core updated to revision 0x%x, date %04x-%02x-%02x\n",

Those two above are not really needed IMO.

> cpu_num, val[1],
> mc_intel->hdr.date & 0xffff,
> mc_intel->hdr.date >> 24,
> @@ -214,7 +214,7 @@ static enum ucode_state generic_load_microcode(int cpu, void *data, size_t size,
>
> mc_size = get_totalsize(&mc_header);
> if (!mc_size || mc_size > leftover) {
> - pr_err("error! Bad data in microcode data file\n");
> + pr_err("error: invalid microcode update data\n");

What's wrong with the original message?

> break;
> }
>
> @@ -268,7 +268,7 @@ static enum ucode_state generic_load_microcode(int cpu, void *data, size_t size,
> */
> save_mc_for_early(new_mc);
>
> - pr_debug("CPU%d found a matching microcode update with version 0x%x (current=0x%x)\n",
> + pr_debug("CPU%d: found a matching microcode update with version 0x%x (current=0x%x)\n",
> cpu, new_rev, uci->cpu_sig.rev);
> out:
> return state;
> diff --git a/arch/x86/kernel/cpu/microcode/intel_early.c b/arch/x86/kernel/cpu/microcode/intel_early.c
> index b88343f..f73fc0a 100644
> --- a/arch/x86/kernel/cpu/microcode/intel_early.c
> +++ b/arch/x86/kernel/cpu/microcode/intel_early.c
> @@ -16,6 +16,9 @@
> * as published by the Free Software Foundation; either version
> * 2 of the License, or (at your option) any later version.
> */
> +
> +#define pr_fmt(fmt) "microcode: " fmt
> +
> #include <linux/module.h>
> #include <linux/mm.h>
> #include <linux/slab.h>
> @@ -418,7 +421,7 @@ static void __ref show_saved_mc(void)
> pr_debug("no microcode data saved.\n");
> return;
> }
> - pr_debug("Total microcode saved: %d\n", mc_saved_data.mc_saved_count);
> + pr_debug("total microcode entries saved: %d\n", mc_saved_data.mc_saved_count);

That should be "Total microcode patches saved" - "entries" doesn't say a whole
lot.

>
> collect_cpu_info_early(&uci);
>
> @@ -519,7 +522,7 @@ int save_mc_for_early(u8 *mc)
> */
> ret = save_microcode(&mc_saved_data, mc_saved_tmp, mc_saved_count);
> if (ret) {
> - pr_err("Cannot save microcode patch.\n");
> + pr_warn("warning: could not store microcode update data for later use.\n");

Capitalize: "Warning: could... "

otherwise that message clarification makes sense.

> goto out;
> }
>
> @@ -579,7 +582,7 @@ print_ucode_info(struct ucode_cpu_info *uci, unsigned int date)
> {
> int cpu = smp_processor_id();
>
> - pr_info("CPU%d microcode updated early to revision 0x%x, date = %04x-%02x-%02x\n",
> + pr_info("CPU%d: entire core updated early to revision 0x%x, date %04x-%02x-%02x\n",

No, please no "entire core" mentions - that'll only confuse people.
Simply think of logical cores as separate cores which share the
microcode hw. No need for more confusion.

> cpu,
> uci->cpu_sig.rev,
> date & 0xffff,
> @@ -701,7 +704,7 @@ int __init save_microcode_in_initrd_intel(void)
> microcode_pointer(mc_saved, mc_saved_in_initrd, initrd_start, count);
> ret = save_microcode(&mc_saved_data, mc_saved, count);
> if (ret)
> - pr_err("Cannot save microcode patches from initrd.\n");
> + pr_warn("warning: failed to save early microcode update data for future use\n");

This one actually loses info - the "initrd" part.

>
> show_saved_mc();
>
> diff --git a/arch/x86/kernel/cpu/microcode/intel_lib.c b/arch/x86/kernel/cpu/microcode/intel_lib.c
> index 25915e3..1cc6494 100644
> --- a/arch/x86/kernel/cpu/microcode/intel_lib.c
> +++ b/arch/x86/kernel/cpu/microcode/intel_lib.c
> @@ -64,7 +64,7 @@ int microcode_sanity_check(void *mc, int print_err)
>
> if (mc_header->ldrver != 1 || mc_header->hdrver != 1) {
> if (print_err)
> - pr_err("error! Unknown microcode update format\n");
> + pr_err("error: unknown microcode update format\n");

Actually it should be like a real sentence:

"Error: unknown ... format.\n"

> return -EINVAL;
> }
> ext_table_size = total_size - (MC_HEADER_SIZE + data_size);
> @@ -72,13 +72,13 @@ int microcode_sanity_check(void *mc, int print_err)
> if ((ext_table_size < EXT_HEADER_SIZE)
> || ((ext_table_size - EXT_HEADER_SIZE) % EXT_SIGNATURE_SIZE)) {
> if (print_err)
> - pr_err("error! Small exttable size in microcode data file\n");
> + pr_err("error: small exttable size in microcode data file\n");

That doesn't tell me a whole lot - maybe "... truncated exttable in microcode data file" ?

> return -EINVAL;
> }
> ext_header = mc + MC_HEADER_SIZE + data_size;
> if (ext_table_size != exttable_size(ext_header)) {
> if (print_err)
> - pr_err("error! Bad exttable size in microcode data file\n");
> + pr_err("error: bad exttable size in microcode data file\n");

Ditto.

> return -EFAULT;
> }
> ext_sigcount = ext_header->count;
> @@ -114,7 +114,7 @@ int microcode_sanity_check(void *mc, int print_err)
> ext_table_sum += ext_tablep[i];
> if (ext_table_sum) {
> if (print_err)
> - pr_warn("aborting, bad extended signature table checksum\n");
> + pr_err("error: bad extended signature table checksum\n");

Capitalize.

> return -EINVAL;
> }
> }
> @@ -126,7 +126,7 @@ int microcode_sanity_check(void *mc, int print_err)
> orig_sum += ((int *)mc)[i];
> if (orig_sum) {
> if (print_err)
> - pr_err("aborting, bad checksum\n");
> + pr_err("error: bad microcode update checksum\n");

Ditto.

> return -EINVAL;
> }
> if (!ext_table_size)
> @@ -140,7 +140,7 @@ int microcode_sanity_check(void *mc, int print_err)
> + (ext_sig->sig + ext_sig->pf + ext_sig->cksum);
> if (sum) {
> if (print_err)
> - pr_err("aborting, bad checksum\n");
> + pr_err("error: bad extended signature checksum\n");

"Aborting ..." was better.

> return -EINVAL;
> }
> }
> --
> 1.7.10.4
>
>

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Subject: Re: [PATCH 3/8] x86, microcode, intel: clarify log messages

On Mon, 20 Oct 2014, Borislav Petkov wrote:
> On Mon, Sep 08, 2014 at 02:37:49PM -0300, Henrique de Moraes Holschuh wrote:
> > The Intel microcode update driver will skip the second hardware thread
> > on hyper-threaded cores during an update run, as the first hardware
> > thread will have updated the entire core. This can confuse users,
> > because it will look like some CPUs were not updated in the system log.
> > Attempt to clarify the log messages to hint that we might be updating
> > more than one CPU (hardware thread) at a time.
> >
> > Make sure all driver log messages conform to this template: "microcode:
> > CPU#: <message>". The <message> might refer to the core, and not to the
> > hardware thread/CPU.
> >
> > Reword error and debug messages for clarity or style. Tag most error
> > messages as "error:", and warnings as "warning:". Report conditions
> > which will stop a microcode update as errors, and conditions which will
> > not stop a microcode update as warnings.
> >
> > Signed-off-by: Henrique de Moraes Holschuh <[email protected]>

...

> > @@ -178,11 +178,11 @@ static int apply_microcode_intel(int cpu)
> > rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
> >
> > if (val[1] != mc_intel->hdr.rev) {
> > - pr_err("CPU%d update to revision 0x%x failed\n",
> > + pr_err("CPU%d: update to revision 0x%x rejected by the processor\n",
> > cpu_num, mc_intel->hdr.rev);
> > return -1;
> > }
> > - pr_info("CPU%d updated to revision 0x%x, date = %04x-%02x-%02x\n",
> > + pr_info("CPU%d: entire core updated to revision 0x%x, date %04x-%02x-%02x\n",
>
> Those two above are not really needed IMO.

They are an attempt to make it more clear. I've got bug reports about this
sort of issue before in Debian, people do wonder why some "cpus" are
updated, and others _apparently_ are not (because there are no messages for
them).

We have more reasons why an update fail. Rejected by the processor is a
very specific one we want to single out (it basically means we are either
feeding crap to the processor _or_ violating one of the architectural
requirements).

> > @@ -214,7 +214,7 @@ static enum ucode_state generic_load_microcode(int cpu, void *data, size_t size,
> >
> > mc_size = get_totalsize(&mc_header);
> > if (!mc_size || mc_size > leftover) {
> > - pr_err("error! Bad data in microcode data file\n");
> > + pr_err("error: invalid microcode update data\n");
>
> What's wrong with the original message?

This is a cosmetic fix, so that the messages follow the same pattern.

> > @@ -418,7 +421,7 @@ static void __ref show_saved_mc(void)
> > pr_debug("no microcode data saved.\n");
> > return;
> > }
> > - pr_debug("Total microcode saved: %d\n", mc_saved_data.mc_saved_count);
> > + pr_debug("total microcode entries saved: %d\n", mc_saved_data.mc_saved_count);
>
> That should be "Total microcode patches saved" - "entries" doesn't say a whole
> lot.

We don't call them patches anywhere else in Intel's case. But since you
prefer "patch", "patch" it is.

> > ret = save_microcode(&mc_saved_data, mc_saved_tmp, mc_saved_count);
> > if (ret) {
> > - pr_err("Cannot save microcode patch.\n");
> > + pr_warn("warning: could not store microcode update data for later use.\n");
>
> Capitalize: "Warning: could... "
>
> otherwise that message clarification makes sense.

It is lowercase because we now prefix all pr_<level>() with "microcode: ",
so it will look like this:

"microcode: warning: foo"
"microcode: error: foo".

Should I capitalize this and all other messages of this sort? It doesn't
look right to me to log "microcode: Error: foo" or "microcode: Warning:
foo".

> > @@ -579,7 +582,7 @@ print_ucode_info(struct ucode_cpu_info *uci, unsigned int date)
> > {
> > int cpu = smp_processor_id();
> >
> > - pr_info("CPU%d microcode updated early to revision 0x%x, date = %04x-%02x-%02x\n",
> > + pr_info("CPU%d: entire core updated early to revision 0x%x, date %04x-%02x-%02x\n",
>
> No, please no "entire core" mentions - that'll only confuse people.
> Simply think of logical cores as separate cores which share the
> microcode hw. No need for more confusion.

The status-quo confuses people, they wonder why some "cpus" were skipped,
and they file bugs about it or ask about it in the user forums. I got a bug
report for a module-based AMD processor, and it took a few round-trips to
explain what was happening to the user...

If "entire core" is unclear, what wording would you suggest?

> > @@ -701,7 +704,7 @@ int __init save_microcode_in_initrd_intel(void)
> > microcode_pointer(mc_saved, mc_saved_in_initrd, initrd_start, count);
> > ret = save_microcode(&mc_saved_data, mc_saved, count);
> > if (ret)
> > - pr_err("Cannot save microcode patches from initrd.\n");
> > + pr_warn("warning: failed to save early microcode update data for future use\n");
>
> This one actually loses info - the "initrd" part.

However, it isn't from the initrd: it comes from the early initramfs, which
is a different kind of beast. We (distros) did have microcode data inside
the initrd/initramfs for the regular microcode driver, so the previous error
message could be misleading.

Would you perfer "warning: failed to save early initramfs microcode update
data for future use\n" ?

> > @@ -72,13 +72,13 @@ int microcode_sanity_check(void *mc, int print_err)
> > if ((ext_table_size < EXT_HEADER_SIZE)
> > || ((ext_table_size - EXT_HEADER_SIZE) % EXT_SIGNATURE_SIZE)) {
> > if (print_err)
> > - pr_err("error! Small exttable size in microcode data file\n");
> > + pr_err("error: small exttable size in microcode data file\n");
>
> That doesn't tell me a whole lot - maybe "... truncated exttable in microcode data file" ?

If it hit this codepath, it will not be due to "normal" truncation. The
caller detects truncation earlier, checking the main header total_size
against the size of the data it got from userspace or the early initramfs.

The message really means: "exttable size incompatible with the size of a
valid table".

Would you prefer this alternate wording?

> > ext_header = mc + MC_HEADER_SIZE + data_size;
> > if (ext_table_size != exttable_size(ext_header)) {
> > if (print_err)
> > - pr_err("error! Bad exttable size in microcode data file\n");
> > + pr_err("error: bad exttable size in microcode data file\n");
>
> Ditto.

I want to differentiate these two, they're caused by different issues, so
they should not be the same error message.

Would you prefer "exttable entry counter incompatible with exttable size",
which is what this message really means?

> > @@ -140,7 +140,7 @@ int microcode_sanity_check(void *mc, int print_err)
> > + (ext_sig->sig + ext_sig->pf + ext_sig->cksum);
> > if (sum) {
> > if (print_err)
> > - pr_err("aborting, bad checksum\n");
> > + pr_err("error: bad extended signature checksum\n");
>
> "Aborting ..." was better.

Well, every other error message in microcode_sanity_check() also means we
abort just the same. It doesn't make sense to have "abort" just in this
one.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2014-10-29 09:55:02

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86, microcode, intel: clarify log messages

On Tue, Oct 21, 2014 at 12:13:32PM -0200, Henrique de Moraes Holschuh wrote:
> > > @@ -178,11 +178,11 @@ static int apply_microcode_intel(int cpu)
> > > rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
> > >
> > > if (val[1] != mc_intel->hdr.rev) {
> > > - pr_err("CPU%d update to revision 0x%x failed\n",
> > > + pr_err("CPU%d: update to revision 0x%x rejected by the processor\n",
> > > cpu_num, mc_intel->hdr.rev);
> > > return -1;
> > > }
> > > - pr_info("CPU%d updated to revision 0x%x, date = %04x-%02x-%02x\n",
> > > + pr_info("CPU%d: entire core updated to revision 0x%x, date %04x-%02x-%02x\n",
> >
> > Those two above are not really needed IMO.
>
> They are an attempt to make it more clear. I've got bug reports about this
> sort of issue before in Debian, people do wonder why some "cpus" are
> updated, and others _apparently_ are not (because there are no messages for
> them).

The fact that people open bugs doesn't always mean it is really a bug :-)

And there's a very simple answer to that:

$ grep microcode /proc/cpuinfo | uniq

should give only a single line.

And I'm pretty sure people will ask about "entire core" too. You simply
can't explain to them that the microcode engine is shared between
threads in a oneliner. :-)

Hrrm, I still don't like that "entire core" thing, how about something
like this:

pr_info("CPU%d: update to revision 0x%x, date %04x-%02x-%02x\n"

this doesn't say at least that we're updating a CPU to a revision sth
sth but that we're doing the update on CPU%d. This should clearly
state what happens without making any assumptions about what the core
contains.

> We have more reasons why an update fail. Rejected by the processor is a
> very specific one we want to single out (it basically means we are either
> feeding crap to the processor _or_ violating one of the architectural
> requirements).

That all doesn't matter - you only want to be able to pinpoint which
error message you're looking at and as long as they differ, you're fine.
Otherwise this is just an unnecessary churn.

> It is lowercase because we now prefix all pr_<level>() with "microcode: ",
> so it will look like this:
>
> "microcode: warning: foo"
> "microcode: error: foo".
>
> Should I capitalize this and all other messages of this sort? It doesn't
> look right to me to log "microcode: Error: foo" or "microcode: Warning:
> foo".

Why not? You have the piece of machinery which generates the message,
i.e. "microcode" and then a sentence which starts with a capital letter.
I think this is the accepted practice in the kernel.

> The status-quo confuses people, they wonder why some "cpus" were skipped,
> and they file bugs about it or ask about it in the user forums. I got a bug
> report for a module-based AMD processor, and it took a few round-trips to
> explain what was happening to the user...
>
> If "entire core" is unclear, what wording would you suggest?

See above.

I just don't like the quantification "entire core" because later, when
CPU guys decide to split it more, we'll need to change it again.

> However, it isn't from the initrd: it comes from the early initramfs, which
> is a different kind of beast. We (distros) did have microcode data inside
> the initrd/initramfs for the regular microcode driver, so the previous error
> message could be misleading.

Wait a minute, we do supply the early microcode by adding it to the
initrd: Documentation/x86/early-microcode.txt

Also, what you do on Debian doesn't necessarily mean all distros do it,
does it?

> Would you perfer "warning: failed to save early initramfs microcode update
> data for future use\n" ?

I guess we can say something method-agnostic without explicitly naming
that method...


> > > @@ -72,13 +72,13 @@ int microcode_sanity_check(void *mc, int print_err)
> > > if ((ext_table_size < EXT_HEADER_SIZE)
> > > || ((ext_table_size - EXT_HEADER_SIZE) % EXT_SIGNATURE_SIZE)) {
> > > if (print_err)
> > > - pr_err("error! Small exttable size in microcode data file\n");
> > > + pr_err("error: small exttable size in microcode data file\n");
> >
> > That doesn't tell me a whole lot - maybe "... truncated exttable in microcode data file" ?
>
> If it hit this codepath, it will not be due to "normal" truncation. The

What is a "normal" truncation?

> caller detects truncation earlier, checking the main header total_size
> against the size of the data it got from userspace or the early initramfs.

You mean this:

if (data_size + MC_HEADER_SIZE > total_size)

?

If so, this is not truncation check - it sanity-checks what's in the
headers or the default sizes.


Btw, there's a typo in show_saved_mc()

pr_debug("mc_saved[%d]: sig=0x%x, pf=0x%x, rev=0x%x, toal size=0x%x, date = %04x-%02x-%02x\n",
^^^^
> The message really means: "exttable size incompatible with the size of a
> valid table".

Let's call this:

"Error: extended signature table size mismatch!\n"

because this is what this check does.

> I want to differentiate these two, they're caused by different issues, so
> they should not be the same error message.
>
> Would you prefer "exttable entry counter incompatible with exttable size",
> which is what this message really means?

Yep, that sounds better.

> Well, every other error message in microcode_sanity_check() also means we
> abort just the same. It doesn't make sense to have "abort" just in this
> one.

Ok.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Subject: Re: [PATCH 3/8] x86, microcode, intel: clarify log messages

On Wed, 29 Oct 2014, Borislav Petkov wrote:
> On Tue, Oct 21, 2014 at 12:13:32PM -0200, Henrique de Moraes Holschuh wrote:
> > > > @@ -178,11 +178,11 @@ static int apply_microcode_intel(int cpu)
> > > > rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
> > > >
> > > > if (val[1] != mc_intel->hdr.rev) {
> > > > - pr_err("CPU%d update to revision 0x%x failed\n",
> > > > + pr_err("CPU%d: update to revision 0x%x rejected by the processor\n",
> > > > cpu_num, mc_intel->hdr.rev);
> > > > return -1;
> > > > }
> > > > - pr_info("CPU%d updated to revision 0x%x, date = %04x-%02x-%02x\n",
> > > > + pr_info("CPU%d: entire core updated to revision 0x%x, date %04x-%02x-%02x\n",
> > >
> > > Those two above are not really needed IMO.
> >
> > They are an attempt to make it more clear. I've got bug reports about this
> > sort of issue before in Debian, people do wonder why some "cpus" are
> > updated, and others _apparently_ are not (because there are no messages for
> > them).
>
> The fact that people open bugs doesn't always mean it is really a bug :-)

Indeed.

> And I'm pretty sure people will ask about "entire core" too. You simply
> can't explain to them that the microcode engine is shared between
> threads in a oneliner. :-)

Unfortunately, I believe you're correct.

> Hrrm, I still don't like that "entire core" thing, how about something
> like this:
>
> pr_info("CPU%d: update to revision 0x%x, date %04x-%02x-%02x\n"
>
> this doesn't say at least that we're updating a CPU to a revision sth
> sth but that we're doing the update on CPU%d. This should clearly
> state what happens without making any assumptions about what the core
> contains.

Will do. I don't think this addresses what I was trying to do, however it
_is_ better, and if you accept my proposal for the change in logging, the
confusing issue will be taken care off anyway.

> > We have more reasons why an update fail. Rejected by the processor is a
> > very specific one we want to single out (it basically means we are either
> > feeding crap to the processor _or_ violating one of the architectural
> > requirements).
>
> That all doesn't matter - you only want to be able to pinpoint which
> error message you're looking at and as long as they differ, you're fine.
> Otherwise this is just an unnecessary churn.

In this case, it is useful churn IMHO: it is a building block that I used
immediately and that I will also need later for further improvements, it is
used to log that the early microcode driver failed due to an error
(something we currently don't know at all), and it can be used to log why
(since we can use it to have as many different error messages when
attempting to update the BSP microcode as we need).

> > It is lowercase because we now prefix all pr_<level>() with "microcode: ",
> > so it will look like this:
> >
> > "microcode: warning: foo"
> > "microcode: error: foo".
> >
> > Should I capitalize this and all other messages of this sort? It doesn't
> > look right to me to log "microcode: Error: foo" or "microcode: Warning:
> > foo".
>
> Why not? You have the piece of machinery which generates the message,
> i.e. "microcode" and then a sentence which starts with a capital letter.
> I think this is the accepted practice in the kernel.

This is a matter of personal taste, so I will just do it your way, and
change the template to:

<driver>: Error|Warning: foo

> I just don't like the quantification "entire core" because later, when
> CPU guys decide to split it more, we'll need to change it again.

Yes.

> > However, it isn't from the initrd: it comes from the early initramfs, which
> > is a different kind of beast. We (distros) did have microcode data inside
> > the initrd/initramfs for the regular microcode driver, so the previous error
> > message could be misleading.
>
> Wait a minute, we do supply the early microcode by adding it to the
> initrd: Documentation/x86/early-microcode.txt

As far as the kernel is concerned, there are two ways to do that: early
initramfs image (used by the early microcode drivers), or adding the regular
microcode.ko module and /lib/firmware/intel-ucode/* inside the regular
compressed initramfs, where it will be modprobed.

Both share "initramfs" (or initrd) in the name, but they are obviously two
completely different things.

As far as userspace is concerned there is at least one extra variant (which
doesn't matter as far as the kernel is concerned): Arch linux uses two
separate initramfs images, and tells the bootloader to load both. The first
one has the early initramfs with the microcode data, the second one has the
regular compressed initramfs with the boot machinery. I don't know if
anyone else is doing it that way, but it is quite cool in that it reduces
the chances of some very horrible failure modes, so I will try to switch
Debian and Ubuntu to it next year.

> > Would you perfer "warning: failed to save early initramfs microcode update
> > data for future use\n" ?
>
> I guess we can say something method-agnostic without explicitly naming
> that method...

Yes, however isn't that code used only to save the microcode data used from
the early initramfs (because otherwise it would be freed along with the rest
of __initdata), and for nothing else?

> > > > @@ -72,13 +72,13 @@ int microcode_sanity_check(void *mc, int print_err)
> > > > if ((ext_table_size < EXT_HEADER_SIZE)
> > > > || ((ext_table_size - EXT_HEADER_SIZE) % EXT_SIGNATURE_SIZE)) {
> > > > if (print_err)
> > > > - pr_err("error! Small exttable size in microcode data file\n");
> > > > + pr_err("error: small exttable size in microcode data file\n");
> > >
> > > That doesn't tell me a whole lot - maybe "... truncated exttable in microcode data file" ?
> >
> > If it hit this codepath, it will not be due to "normal" truncation. The
>
> What is a "normal" truncation?

When you're missing the rest of the data for whatever reason, e.g:
incomplete firmware file.

> > caller detects truncation earlier, checking the main header total_size
> > against the size of the data it got from userspace or the early initramfs.
>
> You mean this:
>
> if (data_size + MC_HEADER_SIZE > total_size)
>
> ?
>
> If so, this is not truncation check - it sanity-checks what's in the
> headers or the default sizes.

No, I mean this:

(early intel microcode driver caller code):
mc_size = get_totalsize(mc_header);
if (!mc_size || mc_size > leftover ||
microcode_sanity_check(ucode_ptr, 0) < 0)
...

(regular intel microcode driver caller code):
mc_size = get_totalsize(&mc_header);
if (!mc_size || mc_size > leftover) {
pr_err("error! Bad data in microcode data file\n");
break;
}

...

if (get_ucode_data(mc, ucode_ptr, mc_size) ||
microcode_sanity_check(mc, 1) < 0) {
break;
}

Anything that was just partially downloaded (and not further corrupted) will
fail these tests before microcode_sanity_check() is even called, Thus,
"caller detects truncation earlier".

> Btw, there's a typo in show_saved_mc()
>
> pr_debug("mc_saved[%d]: sig=0x%x, pf=0x%x, rev=0x%x, toal size=0x%x, date = %04x-%02x-%02x\n",
> ^^^^

Noted. Might as well fix it, too :-)

> > The message really means: "exttable size incompatible with the size of a
> > valid table".
>
> Let's call this:
>
> "Error: extended signature table size mismatch!\n"
>
> because this is what this check does.

Will do.

> > I want to differentiate these two, they're caused by different issues, so
> > they should not be the same error message.
> >
> > Would you prefer "exttable entry counter incompatible with exttable size",
> > which is what this message really means?
>
> Yep, that sounds better.

Ok, will do.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2014-11-07 17:37:53

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86, microcode, intel: clarify log messages

On Fri, Oct 31, 2014 at 06:08:02PM -0200, Henrique de Moraes Holschuh wrote:
> In this case, it is useful churn IMHO: it is a building block that I used
> immediately and that I will also need later for further improvements, it is
> used to log that the early microcode driver failed due to an error
> (something we currently don't know at all),

So we're again trying to improve at the wrong side. What we actually
need to do is fix the early loader in those cases where it is still
b0rked. If any. I have currently two more fixes for 32-bit early
loading, one generic, one for AMD, but on 64-bit it is solid. At least I
don't know about any issues.

And then once all is in place, I'd like to make the loader as silent as
possible because it is supposed to just work.

If one still wants to debug stuff then we can have our small shell
script to gather the needed info on a system and people can run it on
theirs when reporting issues.

> This is a matter of personal taste, so I will just do it your way, and
> change the template to:
>
> <driver>: Error|Warning: foo

Just look at dmesg output on any system.

> ... I don't know if anyone else is doing it that way, but it is quite
> cool in that it reduces the chances of some very horrible failure
> modes, so I will try to switch Debian and Ubuntu to it next year.

Whatever you do, just keep the users in mind. Adding the microcode to
the initrd should be trivial. Maybe it should be added by a script, or
some postinst whatever funkyness, whatever... just make sure *anyone*
can create it. Not a tutorial somewhere on the net but a single command
like

update-initramfs --add-microcode ...

or whatever, I don't care, as long as it is simple and
it always works. I don't want to be pointing people to
Documentation/x86/early-microcode.txt and ask "but but, have you done
this already".

> Yes, however isn't that code used only to save the microcode data used from
> the early initramfs (because otherwise it would be freed along with the rest
> of __initdata), and for nothing else?

What do you mean with "only"?

This code saves the microcode patch which we've used in the early
loader, into a memory buffer so that we can use it later, for example,
after resuming the box or for when cores are coming back online as a
result of a CPU hotplug operation.

This is not really important, though, I'm thinking, because we have
redundant functionality from the "late" microcode loader which updates
the microcode in its CPU hotplug callback.

For that to work, though, we need the correct microcode installed in
/lib/firmware.

So packagers should be doing two important things:

* add microcode to the initrd for early loading
* add microcode to /lib/firmware/ for late loading when needed

I think with this we're covered solid. Unless I'm missing a hole.

> When you're missing the rest of the data for whatever reason, e.g:
> incomplete firmware file.

But this is the same, right?

(ext_table_size - EXT_HEADER_SIZE) % EXT_SIGNATURE_SIZE

checks whether some of the extended signatures are incomplete. So we can
say:

pr_err("error: Small exttable size/truncated extended signature in microcode data file.\n");

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--