2018-12-12 13:14:33

by Thierry Reding

[permalink] [raw]
Subject: [PATCH 1/2] scripts/spdxcheck.py: Always open files in binary mode

From: Thierry Reding <[email protected]>

The spdxcheck script currently falls over when confronted with a binary
file (such as Documentation/logo.gif). To avoid that, always open files
in binary mode and decode line-by-line, ignoring encoding errors.

One tricky case is when piping data into the script and reading it from
standard input. By default, standard input will be opened in text mode,
so we need to reopen it in binary mode.

Signed-off-by: Thierry Reding <[email protected]>
---
scripts/spdxcheck.py | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/scripts/spdxcheck.py b/scripts/spdxcheck.py
index 5056fb3b897d..e559c6294c39 100755
--- a/scripts/spdxcheck.py
+++ b/scripts/spdxcheck.py
@@ -168,6 +168,7 @@ class id_parser(object):
self.curline = 0
try:
for line in fd:
+ line = line.decode(locale.getpreferredencoding(False), errors='ignore')
self.curline += 1
if self.curline > maxlines:
break
@@ -249,12 +250,13 @@ if __name__ == '__main__':

try:
if len(args.path) and args.path[0] == '-':
- parser.parse_lines(sys.stdin, args.maxlines, '-')
+ stdin = os.fdopen(sys.stdin.fileno(), 'rb')
+ parser.parse_lines(stdin, args.maxlines, '-')
else:
if args.path:
for p in args.path:
if os.path.isfile(p):
- parser.parse_lines(open(p), args.maxlines, p)
+ parser.parse_lines(open(p, 'rb'), args.maxlines, p)
elif os.path.isdir(p):
scan_git_subtree(repo.head.reference.commit.tree, p)
else:
--
2.19.1



2018-12-12 13:14:42

by Thierry Reding

[permalink] [raw]
Subject: [PATCH 2/2] scripts: Add spdxcheck.py self test

From: Thierry Reding <[email protected]>

Add a script that will run spdxcheck.py through a couple of self tests
to simplify validation in the future. The tests are run for both Python
2 and Python 3 to make sure all changes to the script remain compatible
across both versions.

The script tests a regular text file (Makefile) for basic sanity checks
and then runs it on a binary file (Documentation/logo.gif) to make sure
it works in both cases. It also tests opening files passed on the
command line as well as piped files read from standard input. Finally a
run on the complete tree will be performed to catch any other potential
issues.

Signed-off-by: Thierry Reding <[email protected]>
---
scripts/spdxcheck-test.sh | 12 ++++++++++++
1 file changed, 12 insertions(+)
create mode 100755 scripts/spdxcheck-test.sh

diff --git a/scripts/spdxcheck-test.sh b/scripts/spdxcheck-test.sh
new file mode 100755
index 000000000000..cfea6a0d1cc0
--- /dev/null
+++ b/scripts/spdxcheck-test.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+
+for PYTHON in python2 python3; do
+ # run check on a text and a binary file
+ for FILE in Makefile Documentation/logo.gif; do
+ $PYTHON scripts/spdxcheck.py $FILE
+ $PYTHON scripts/spdxcheck.py - < $FILE
+ done
+
+ # run check on complete tree to catch any other issues
+ $PYTHON scripts/spdxcheck.py > /dev/null
+done
--
2.19.1


2018-12-12 18:15:29

by Jeremy Cline

[permalink] [raw]
Subject: Re: [PATCH 1/2] scripts/spdxcheck.py: Always open files in binary mode

Hi,

On Wed, Dec 12, 2018 at 02:12:09PM +0100, Thierry Reding wrote:
> From: Thierry Reding <[email protected]>
>
> The spdxcheck script currently falls over when confronted with a binary
> file (such as Documentation/logo.gif). To avoid that, always open files
> in binary mode and decode line-by-line, ignoring encoding errors.
>
> One tricky case is when piping data into the script and reading it from
> standard input. By default, standard input will be opened in text mode,
> so we need to reopen it in binary mode.
>
> Signed-off-by: Thierry Reding <[email protected]>
> ---
> scripts/spdxcheck.py | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/scripts/spdxcheck.py b/scripts/spdxcheck.py
> index 5056fb3b897d..e559c6294c39 100755
> --- a/scripts/spdxcheck.py
> +++ b/scripts/spdxcheck.py
> @@ -168,6 +168,7 @@ class id_parser(object):
> self.curline = 0
> try:
> for line in fd:
> + line = line.decode(locale.getpreferredencoding(False), errors='ignore')
> self.curline += 1
> if self.curline > maxlines:
> break
> @@ -249,12 +250,13 @@ if __name__ == '__main__':
>
> try:
> if len(args.path) and args.path[0] == '-':
> - parser.parse_lines(sys.stdin, args.maxlines, '-')
> + stdin = os.fdopen(sys.stdin.fileno(), 'rb')
> + parser.parse_lines(stdin, args.maxlines, '-')
> else:
> if args.path:
> for p in args.path:
> if os.path.isfile(p):
> - parser.parse_lines(open(p), args.maxlines, p)
> + parser.parse_lines(open(p, 'rb'), args.maxlines, p)
> elif os.path.isdir(p):
> scan_git_subtree(repo.head.reference.commit.tree, p)
> else:
> --
> 2.19.1
>

It might be worth noting this fixes commit 6f4d29df66ac
("scripts/spdxcheck.py: make python3 compliant") and also Cc this for
stable since 6f4d29df66ac got backported to v4.19. While that commit
did indeed make the script work with Python 3 for piping data, it broke
Python 2 and made its way to stable.

Reviewed-by: Jeremy Cline <[email protected]>

Regards,
Jeremy

2018-12-13 02:52:27

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH 1/2] scripts/spdxcheck.py: Always open files in binary mode

On Wed, 2018-12-12 at 13:14 -0500, Jeremy Cline wrote:
> It might be worth noting this fixes commit 6f4d29df66ac
> ("scripts/spdxcheck.py: make python3 compliant")

umm, how does it do that?

> and also Cc this for
> stable since 6f4d29df66ac got backported to v4.19. While that commit
> did indeed make the script work with Python 3 for piping data, it broke
> Python 2 and made its way to stable.

I think attempting to use spdxcheck on binary files
is not useful in the first place.



2018-12-13 07:38:53

by Uwe Kleine-König

[permalink] [raw]
Subject: Re: [PATCH 1/2] scripts/spdxcheck.py: Always open files in binary mode

On Wed, Dec 12, 2018 at 01:14:10PM -0500, Jeremy Cline wrote:
> Hi,
>
> On Wed, Dec 12, 2018 at 02:12:09PM +0100, Thierry Reding wrote:
> > From: Thierry Reding <[email protected]>
> >
> > The spdxcheck script currently falls over when confronted with a binary
> > file (such as Documentation/logo.gif). To avoid that, always open files
> > in binary mode and decode line-by-line, ignoring encoding errors.

I suggest pointing out that the breakage only happens with python3 and
results in a UnicodeDecodeError.

> > One tricky case is when piping data into the script and reading it from
> > standard input. By default, standard input will be opened in text mode,
> > so we need to reopen it in binary mode.
> >
> > Signed-off-by: Thierry Reding <[email protected]>
> > ---
> > scripts/spdxcheck.py | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/scripts/spdxcheck.py b/scripts/spdxcheck.py
> > index 5056fb3b897d..e559c6294c39 100755
> > --- a/scripts/spdxcheck.py
> > +++ b/scripts/spdxcheck.py
> > @@ -168,6 +168,7 @@ class id_parser(object):
> > self.curline = 0
> > try:
> > for line in fd:
> > + line = line.decode(locale.getpreferredencoding(False), errors='ignore')
> > self.curline += 1
> > if self.curline > maxlines:
> > break
> > @@ -249,12 +250,13 @@ if __name__ == '__main__':
> >
> > try:
> > if len(args.path) and args.path[0] == '-':
> > - parser.parse_lines(sys.stdin, args.maxlines, '-')
> > + stdin = os.fdopen(sys.stdin.fileno(), 'rb')
> > + parser.parse_lines(stdin, args.maxlines, '-')
> > else:
> > if args.path:
> > for p in args.path:
> > if os.path.isfile(p):
> > - parser.parse_lines(open(p), args.maxlines, p)
> > + parser.parse_lines(open(p, 'rb'), args.maxlines, p)
> > elif os.path.isdir(p):
> > scan_git_subtree(repo.head.reference.commit.tree, p)
> > else:
> > --
> > 2.19.1
> >
>
> It might be worth noting this fixes commit 6f4d29df66ac
> ("scripts/spdxcheck.py: make python3 compliant") and also Cc this for
> stable since 6f4d29df66ac got backported to v4.19. While that commit
> did indeed make the script work with Python 3 for piping data, it broke
> Python 2 and made its way to stable.

It didn't break for me. Can you provide details about how and when it
broke for you?

Best regards
Uwe

--
Pengutronix e.K. | Uwe Kleine-K?nig |
Industrial Linux Solutions | http://www.pengutronix.de/ |

2018-12-13 15:12:32

by Jeremy Cline

[permalink] [raw]
Subject: Re: [PATCH 1/2] scripts/spdxcheck.py: Always open files in binary mode

On Thu, Dec 13, 2018 at 08:37:08AM +0100, Uwe Kleine-König wrote:
> On Wed, Dec 12, 2018 at 01:14:10PM -0500, Jeremy Cline wrote:
> > Hi,
> >
> > On Wed, Dec 12, 2018 at 02:12:09PM +0100, Thierry Reding wrote:
> > > From: Thierry Reding <[email protected]>
> > >
> > > The spdxcheck script currently falls over when confronted with a binary
> > > file (such as Documentation/logo.gif). To avoid that, always open files
> > > in binary mode and decode line-by-line, ignoring encoding errors.
>
> I suggest pointing out that the breakage only happens with python3 and
> results in a UnicodeDecodeError.
>
> > > One tricky case is when piping data into the script and reading it from
> > > standard input. By default, standard input will be opened in text mode,
> > > so we need to reopen it in binary mode.
> > >
> > > Signed-off-by: Thierry Reding <[email protected]>
> > > ---
> > > scripts/spdxcheck.py | 6 ++++--
> > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/scripts/spdxcheck.py b/scripts/spdxcheck.py
> > > index 5056fb3b897d..e559c6294c39 100755
> > > --- a/scripts/spdxcheck.py
> > > +++ b/scripts/spdxcheck.py
> > > @@ -168,6 +168,7 @@ class id_parser(object):
> > > self.curline = 0
> > > try:
> > > for line in fd:
> > > + line = line.decode(locale.getpreferredencoding(False), errors='ignore')
> > > self.curline += 1
> > > if self.curline > maxlines:
> > > break
> > > @@ -249,12 +250,13 @@ if __name__ == '__main__':
> > >
> > > try:
> > > if len(args.path) and args.path[0] == '-':
> > > - parser.parse_lines(sys.stdin, args.maxlines, '-')
> > > + stdin = os.fdopen(sys.stdin.fileno(), 'rb')
> > > + parser.parse_lines(stdin, args.maxlines, '-')
> > > else:
> > > if args.path:
> > > for p in args.path:
> > > if os.path.isfile(p):
> > > - parser.parse_lines(open(p), args.maxlines, p)
> > > + parser.parse_lines(open(p, 'rb'), args.maxlines, p)
> > > elif os.path.isdir(p):
> > > scan_git_subtree(repo.head.reference.commit.tree, p)
> > > else:
> > > --
> > > 2.19.1
> > >
> >
> > It might be worth noting this fixes commit 6f4d29df66ac
> > ("scripts/spdxcheck.py: make python3 compliant") and also Cc this for
> > stable since 6f4d29df66ac got backported to v4.19. While that commit
> > did indeed make the script work with Python 3 for piping data, it broke
> > Python 2 and made its way to stable.
>
> It didn't break for me. Can you provide details about how and when it
> broke for you?

I was wrong about it being Python 2 that broke, sorry about that.
6f4d29df66ac broke Python 3 when you run it against a sub-tree because
scan_git_tree() opens the files in binary mode, but then find is run
with a text string:

$ python3 scripts/spdxcheck.py net/
FAIL: argument should be integer or bytes-like object, not 'str'
Traceback (most recent call last):
File "scripts/spdxcheck.py", line 259, in <module>
scan_git_subtree(repo.head.reference.commit.tree, p)
File "scripts/spdxcheck.py", line 211, in scan_git_subtree
scan_git_tree(tree)
File "scripts/spdxcheck.py", line 206, in scan_git_tree
parser.parse_lines(fd, args.maxlines, el.path)
File "scripts/spdxcheck.py", line 175, in parse_lines
if line.find("SPDX-License-Identifier:") < 0:
TypeError: argument should be integer or bytes-like object, not 'str'

The reason I opened things in binary mode when I started adding Python 3
support was because not all files were valid UTF-8 (and some were
binary) so I decoded the text line-by-line and ignored any decoding
errors for simplicity's sake.

Regards,
Jeremy

2018-12-13 16:17:55

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH 1/2] scripts/spdxcheck.py: Always open files in binary mode

On Wed, Dec 12, 2018 at 06:51:19PM -0800, Joe Perches wrote:
> On Wed, 2018-12-12 at 13:14 -0500, Jeremy Cline wrote:
> > It might be worth noting this fixes commit 6f4d29df66ac
> > ("scripts/spdxcheck.py: make python3 compliant")
>
> umm, how does it do that?
>
> > and also Cc this for
> > stable since 6f4d29df66ac got backported to v4.19. While that commit
> > did indeed make the script work with Python 3 for piping data, it broke
> > Python 2 and made its way to stable.
>
> I think attempting to use spdxcheck on binary files
> is not useful in the first place.

Agreed. However, if run without arguments the tool will walk the entire
tree and inevitably run into binary files. I realize that this is kind
of a lame way of dealing with binary files, but the alternatives would
be much more complicated and involve dealing with mimetypes and such.

Thierry


Attachments:
(No filename) (888.00 B)
signature.asc (849.00 B)
Download all attachments

2018-12-13 19:42:26

by Uwe Kleine-König

[permalink] [raw]
Subject: Re: [PATCH 1/2] scripts/spdxcheck.py: Always open files in binary mode

On Thu, Dec 13, 2018 at 10:10:52AM -0500, Jeremy Cline wrote:
> On Thu, Dec 13, 2018 at 08:37:08AM +0100, Uwe Kleine-K?nig wrote:
> > It didn't break for me. Can you provide details about how and when it
> > broke for you?
>
> I was wrong about it being Python 2 that broke, sorry about that.
> 6f4d29df66ac broke Python 3 when you run it against a sub-tree because
> scan_git_tree() opens the files in binary mode, but then find is run
> with a text string:
>
> $ python3 scripts/spdxcheck.py net/
> FAIL: argument should be integer or bytes-like object, not 'str'
> Traceback (most recent call last):
> File "scripts/spdxcheck.py", line 259, in <module>
> scan_git_subtree(repo.head.reference.commit.tree, p)
> File "scripts/spdxcheck.py", line 211, in scan_git_subtree
> scan_git_tree(tree)
> File "scripts/spdxcheck.py", line 206, in scan_git_tree
> parser.parse_lines(fd, args.maxlines, el.path)
> File "scripts/spdxcheck.py", line 175, in parse_lines
> if line.find("SPDX-License-Identifier:") < 0:
> TypeError: argument should be integer or bytes-like object, not 'str'
>
> The reason I opened things in binary mode when I started adding Python 3
> support was because not all files were valid UTF-8 (and some were
> binary) so I decoded the text line-by-line and ignored any decoding
> errors for simplicity's sake.

OK I understand. The problem is that there are inconsistencies
in handling files as binaries or not that already existed before
6f4d29df66ac. Different code paths result in a different type for line
depending on how fd was opened. I fixed the cases where fd was opened
as text file and broke the cases where it was opened as binary.

So changing this to consistently using binary mode (as the patch by
Thierry does) seems the right thing to do.

Thanks
Uwe

--
Pengutronix e.K. | Uwe Kleine-K?nig |
Industrial Linux Solutions | http://www.pengutronix.de/ |