Subject: Re: [PATCH v2 2/2] scripts/gdb: lx-dmesg: Use explicit encoding=utf8
 errors=replace
To: Leonard Crestez <leonard.crestez@nxp.com>,
        Kieran Bingham <kieran@ksquared.org.uk>,
        Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
References: <ba6f85dbb02ca980ebd0e2399b0649423399b565.1498481469.git.leonard.crestez@nxp.com>
 <acee067f3345954ed41efb77b80eebdc038619c6.1498481469.git.leonard.crestez@nxp.com>
From: Jan Kiszka <jan.kiszka@siemens.com>
Message-ID: <df13613c-f91f-e20c-4365-f2f598348635@siemens.com>
Date: Fri, 7 Jul 2017 11:16:37 +0200
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); de; rv:1.8.1.12)
 Gecko/20080226 SUSE/2.0.0.12-1.1 Thunderbird/2.0.0.12 Mnenhy/0.7.5.666
MIME-Version: 1.0
In-Reply-To: <acee067f3345954ed41efb77b80eebdc038619c6.1498481469.git.leonard.crestez@nxp.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2866
Lines: 78

On 2017-06-26 14:52, Leonard Crestez wrote:
> Use errors=replace because it is never desirable for lx-dmesg to fail on
> string decoding errors, not even if the log buffer is corrupt and we show
> incorrect info.
> 
> The kernel will sometimes print utf8, for example the copyright symbol from
> jffs2. In order to make this work specify 'utf8' everywhere because python2
> otherwise defaults to 'ascii'.
> 
> In theory the second errors='replace' is not required because everything
> that can be decoded as utf8 should also be encodable back to utf8. But
> it's better to be extra safe here. It's worth noting that this is
> definitely not true for encoding='ascii': unknown characters are
> replaced with U+FFFD REPLACEMENT CHARACTER and they fail to encode back
> to ascii.
> 
> Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
> 
> ---
> Changes since v1:
> * Add encoding='utf8'
> * Only do an explicit encode for python2. On python3 this returns a
> bytes object which formats to b'BLAH' instead.
> * Elaborate commit message explaining what's wrong. The original patch
> was hacked together while debugging something else.
> 
> Link: https://lkml.org/lkml/2017/6/23/405
> Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
> ---
>  scripts/gdb/linux/dmesg.py | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/scripts/gdb/linux/dmesg.py b/scripts/gdb/linux/dmesg.py
> index f5a0303..6d2e09a 100644
> --- a/scripts/gdb/linux/dmesg.py
> +++ b/scripts/gdb/linux/dmesg.py
> @@ -12,6 +12,7 @@
>  #
>  
>  import gdb
> +import sys
>  
>  from linux import utils
>  
> @@ -52,13 +53,19 @@ class LxDmesg(gdb.Command):
>                  continue
>  
>              text_len = utils.read_u16(log_buf[pos + 10:pos + 12])
> -            text = log_buf[pos + 16:pos + 16 + text_len].decode()
> +            text = log_buf[pos + 16:pos + 16 + text_len].decode(
> +                encoding='utf8', errors='replace')
>              time_stamp = utils.read_u64(log_buf[pos:pos + 8])
>  
>              for line in text.splitlines():
> -                gdb.write("[{time:12.6f}] {line}\n".format(
> +                msg = u"[{time:12.6f}] {line}\n".format(
>                      time=time_stamp / 1000000000.0,
> -                    line=line))
> +                    line=line)
> +                # With python2 gdb.write will attempt to convert unicode to
> +                # ascii and might fail so pass an utf8-encoded str instead.
> +                if sys.hexversion < 0x03000000:
> +                    msg = msg.encode(encoding='utf8', errors='replace')
> +                gdb.write(msg)
>  
>              pos += length
>  
> 
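
For anyone following along, the decoding behavior described in the commit
message can be reproduced outside of gdb. This is only a minimal sketch:
the sample bytes and the helper names decode_text() and for_gdb_write() are
made up for illustration and are not part of the patch, and gdb.write()
itself is not called.

```python
import sys

# Hypothetical sample, not taken from a real log_buf: valid UTF-8
# (the copyright sign, 0xC2 0xA9) followed by one stray invalid byte.
RAW = b"jffs2: \xc2\xa9 2001-2006 Red Hat, Inc.\xff"


def decode_text(raw):
    """Decode log bytes; invalid sequences become U+FFFD instead of raising."""
    return raw.decode(encoding='utf8', errors='replace')


def for_gdb_write(msg):
    """Mirror the patch: hand python2 a utf8-encoded str, python3 the text."""
    if sys.hexversion < 0x03000000:
        return msg.encode(encoding='utf8', errors='replace')
    return msg


text = decode_text(RAW)

# Anything produced by a utf8 decode can be encoded back to utf8.
utf8_bytes = text.encode(encoding='utf8', errors='replace')

# With 'ascii' every non-ASCII byte is replaced by U+FFFD, and the
# result can no longer be strictly encoded back to ascii.
as_ascii = RAW.decode(encoding='ascii', errors='replace')
try:
    as_ascii.encode('ascii')
    ascii_roundtrip_ok = True
except UnicodeEncodeError:
    ascii_roundtrip_ok = False
```

Here the utf8 path keeps the copyright sign and only replaces the stray
byte, while the ascii path loses the symbol and cannot be encoded back
without errors='replace'.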

Acked-by: Jan Kiszka <jan.kiszka@siemens.com>

Andrew, please pick this up.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux