2023-02-02 03:38:48

by Masahiro Yamada

Subject: [PATCH v4 0/6] kbuild: improve source package builds


This series improves deb-pkg and (src)rpm-pkg so that they can build
without cleaning the kernel tree.
The Debian source package will switch to format 3.0 (quilt).

My next plans are:

- add 'srcdeb-pkg' target

- add more compression modes

- rewrite snap-pkg and delete the old tar macro



Masahiro Yamada (6):
kbuild: add a tool to generate a list of files ignored by git
kbuild: deb-pkg: create source package without cleaning
kbuild: rpm-pkg: build binary packages from source rpm
kbuild: srcrpm-pkg: create source package without cleaning
kbuild: deb-pkg: hide KDEB_SOURCENAME from Makefile
kbuild: deb-pkg: switch over to format 3.0 (quilt)

Makefile | 4 +
scripts/.gitignore | 1 +
scripts/Makefile | 2 +-
scripts/Makefile.package | 94 +++---
scripts/gen-exclude.c | 623 +++++++++++++++++++++++++++++++++++++++
scripts/package/mkdebian | 23 +-
scripts/package/mkspec | 8 +-
7 files changed, 706 insertions(+), 49 deletions(-)
create mode 100644 scripts/gen-exclude.c

--
2.34.1



2023-02-02 03:38:51

by Masahiro Yamada

Subject: [PATCH v4 1/6] kbuild: add a tool to generate a list of files ignored by git

In short, the motivation of this commit is to build a source package
without cleaning the source tree.

The deb-pkg and (src)rpm-pkg targets first run 'make clean' before
creating a source tarball. Otherwise build artifacts such as *.o,
*.a, etc. would be included in the tarball. Yet, the tarball ends up
containing several garbage files since 'make clean' does not clean
everything.

Cleaning the tree every time is annoying since it makes the incremental
build impossible. It is desirable to create a source tarball without
cleaning the tree.

In fact, there are some ways to achieve this.

The easiest way is 'git archive'. Actually, 'make perf-tar*-src-pkg'
does it this way, but I do not like it because it works only when the
source tree is managed by git, and all files you want in the tarball must
be committed in advance.

I want to make it work without relying on git. We can do this.

Files that are not tracked by git are generated files. We can list them
out by parsing the .gitignore files. Of course, .gitignore does not cover
all the cases, but it works well enough.

tar(1) claims to support it:

--exclude-vcs-ignores

Exclude files that match patterns read from VCS-specific ignore files.
Supported files are: .cvsignore, .gitignore, .bzrignore, and .hgignore.

The best scenario would be to use 'tar --exclude-vcs-ignores', but this
option does not work. --exclude-vcs-ignores does not understand
negation (!), a leading slash, a trailing slash, etc. So, this option
is just useless.

Hence, I wrote this gitignore parser. The previous version [1], written
in Python, was too slow. This version is implemented in C, so it works
much faster.

This tool traverses the source tree, parsing the .gitignore files. It
prints the file paths that are not tracked by git. The output can be
used for tar's --exclude-from= option.
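
The pattern features that tar mishandles (negation, a leading slash, a
trailing slash) can be illustrated with a minimal POSIX shell sketch. It is
not the C implementation in this patch: the function name classify_pattern
is hypothetical, backslash escaping and '**' are not handled, and the
directory path is not prepended to the glob. The flag names follow
struct pattern in gen-exclude.c.

```sh
#!/bin/sh
# Classify one .gitignore line into the flags kept per pattern:
#   negate     - the line starts with '!'
#   dir_only   - the line ends with '/'
#   path_match - a slash at the beginning or in the middle anchors the
#                pattern to the directory holding the .gitignore
# Hypothetical helper for illustration only; no escape or '**' handling.
classify_pattern() {
    pat=$1
    negate=0
    dir_only=0
    path_match=0

    # Blank lines and comments carry no pattern.
    case $pat in
    '' | '#'*) echo "skip"; return ;;
    esac

    # A leading '!' negates the pattern.
    case $pat in
    '!'*) negate=1; pat=${pat#"!"} ;;
    esac

    # Trailing slashes mean "directories only"; strip them from the glob.
    case $pat in
    */)
        dir_only=1
        while [ "${pat%/}" != "$pat" ]; do
            pat=${pat%/}
        done
        ;;
    esac

    # Nothing left after stripping: skip the line.
    if [ -z "$pat" ]; then
        echo "skip"
        return
    fi

    # Any remaining slash makes this a path match; drop a leading one.
    case $pat in
    */*) path_match=1; pat=${pat#/} ;;
    esac

    echo "negate=$negate dir_only=$dir_only path_match=$path_match glob=$pat"
}

classify_pattern '*.o'
classify_pattern '!/vmlinux.lds'
classify_pattern 'build/'
```

A pattern without path_match is compared against the basename only, which
is why a plain '*.o' applies at every directory level.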

[How to test this tool]

$ git clean -dfx
$ make -s -j$(nproc) defconfig all # or allmodconfig or whatever
$ git archive -o ../linux1.tar --prefix=./ HEAD
$ tar tf ../linux1.tar | LANG=C sort > ../file-list1 # files emitted by 'git archive'
$ make scripts_exclude
HOSTCC scripts/gen-exclude
$ scripts/gen-exclude --prefix=./ -o ../exclude-list
$ tar cf ../linux2.tar --exclude-from=../exclude-list .
$ tar tf ../linux2.tar | LANG=C sort > ../file-list2 # files emitted by 'tar'
$ diff ../file-list1 ../file-list2 | grep -E '^(<|>)'
< ./Documentation/devicetree/bindings/.yamllint
< ./drivers/clk/.kunitconfig
< ./drivers/gpu/drm/tests/.kunitconfig
< ./drivers/gpu/drm/vc4/tests/.kunitconfig
< ./drivers/hid/.kunitconfig
< ./fs/ext4/.kunitconfig
< ./fs/fat/.kunitconfig
< ./kernel/kcsan/.kunitconfig
< ./lib/kunit/.kunitconfig
< ./mm/kfence/.kunitconfig
< ./net/sunrpc/.kunitconfig
< ./tools/testing/selftests/arm64/tags/
< ./tools/testing/selftests/arm64/tags/.gitignore
< ./tools/testing/selftests/arm64/tags/Makefile
< ./tools/testing/selftests/arm64/tags/run_tags_test.sh
< ./tools/testing/selftests/arm64/tags/tags_test.c
< ./tools/testing/selftests/kvm/.gitignore
< ./tools/testing/selftests/kvm/Makefile
< ./tools/testing/selftests/kvm/config
< ./tools/testing/selftests/kvm/settings

The source tarball contains most of the files that are tracked by git. You
see some diffs, but that is just because some .gitignore files are wrong.

$ git ls-files -i -c --exclude-per-directory=.gitignore
Documentation/devicetree/bindings/.yamllint
drivers/clk/.kunitconfig
drivers/gpu/drm/tests/.kunitconfig
drivers/hid/.kunitconfig
fs/ext4/.kunitconfig
fs/fat/.kunitconfig
kernel/kcsan/.kunitconfig
lib/kunit/.kunitconfig
mm/kfence/.kunitconfig
tools/testing/selftests/arm64/tags/.gitignore
tools/testing/selftests/arm64/tags/Makefile
tools/testing/selftests/arm64/tags/run_tags_test.sh
tools/testing/selftests/arm64/tags/tags_test.c
tools/testing/selftests/kvm/.gitignore
tools/testing/selftests/kvm/Makefile
tools/testing/selftests/kvm/config
tools/testing/selftests/kvm/settings

[1]: https://lore.kernel.org/all/[email protected]/

Signed-off-by: Masahiro Yamada <[email protected]>
---

(no changes since v3)

Changes in v3:
- Various code refactoring: remove struct gitignore, remove next: label etc.
- Support --extra-pattern option

Changes in v2:
- Reimplement in C

Makefile | 4 +
scripts/.gitignore | 1 +
scripts/Makefile | 2 +-
scripts/gen-exclude.c | 623 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 629 insertions(+), 1 deletion(-)
create mode 100644 scripts/gen-exclude.c

diff --git a/Makefile b/Makefile
index 2faf872b6808..35b294cc6f32 100644
--- a/Makefile
+++ b/Makefile
@@ -1652,6 +1652,10 @@ distclean: mrproper
%pkg: include/config/kernel.release FORCE
$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.package $@

+PHONY += scripts_exclude
+scripts_exclude: scripts_basic
+ $(Q)$(MAKE) $(build)=scripts scripts/gen-exclude
+
# Brief documentation of the typical targets used
# ---------------------------------------------------------------------------

diff --git a/scripts/.gitignore b/scripts/.gitignore
index 6e9ce6720a05..7f433bc1461c 100644
--- a/scripts/.gitignore
+++ b/scripts/.gitignore
@@ -1,5 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
/asn1_compiler
+/gen-exclude
/generate_rust_target
/insert-sys-cert
/kallsyms
diff --git a/scripts/Makefile b/scripts/Makefile
index 32b6ba722728..5dcd7f57607f 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -38,7 +38,7 @@ HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED
endif

# The following programs are only built on demand
-hostprogs += unifdef
+hostprogs += gen-exclude unifdef

# The module linker script is preprocessed on demand
targets += module.lds
diff --git a/scripts/gen-exclude.c b/scripts/gen-exclude.c
new file mode 100644
index 000000000000..5c4ecd902290
--- /dev/null
+++ b/scripts/gen-exclude.c
@@ -0,0 +1,623 @@
+// SPDX-License-Identifier: GPL-2.0-only
+//
+// Traverse the source tree, parsing all .gitignore files, and print file paths
+// that are not tracked by git.
+// The output is suitable for the --exclude-from option of tar.
+// This is useful until the --exclude-vcs-ignores option works correctly.
+//
+// Copyright (C) 2023 Masahiro Yamada <[email protected]>
+
+#include <dirent.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <fnmatch.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+// struct pattern - represent an ignore pattern (a line in .gitignore)
+// @negate: negate the pattern (prefixing '!')
+// @dir_only: only matches directories (trailing '/')
+// @path_match: true if the glob pattern is a path instead of a file name
+// @double_asterisk: true if the glob pattern contains double asterisks ('**')
+// @glob: glob pattern
+struct pattern {
+ bool negate;
+ bool dir_only;
+ bool path_match;
+ bool double_asterisk;
+ char glob[];
+};
+
+struct pattern **patterns;
+static int nr_patterns, alloced_patterns;
+
+// Remember the number of patterns at each directory level
+static int *nr_patterns_at;
+// Track the current/max directory level;
+static int depth, max_depth;
+static bool debug_on;
+static FILE *out_fp;
+static char *prefix = "";
+static char *progname;
+
+static void __attribute__((noreturn)) perror_exit(const char *s)
+{
+ perror(s);
+
+ exit(EXIT_FAILURE);
+}
+
+static void __attribute__((noreturn)) error_exit(const char *fmt, ...)
+{
+ va_list args;
+
+ fprintf(stderr, "%s: error: ", progname);
+
+ va_start(args, fmt);
+ vfprintf(stderr, fmt, args);
+ va_end(args);
+
+ exit(EXIT_FAILURE);
+}
+
+static void debug(const char *fmt, ...)
+{
+ va_list args;
+ int i;
+
+ if (!debug_on)
+ return;
+
+ fprintf(stderr, "[DEBUG]");
+
+ for (i = 0; i < depth * 2; i++)
+ fputc(' ', stderr);
+
+ va_start(args, fmt);
+ vfprintf(stderr, fmt, args);
+ va_end(args);
+}
+
+static void *xrealloc(void *ptr, size_t size)
+{
+ ptr = realloc(ptr, size);
+ if (!ptr)
+ perror_exit(progname);
+
+ return ptr;
+}
+
+static void *xmalloc(size_t size)
+{
+ return xrealloc(NULL, size);
+}
+
+static char *xstrdup(const char *s)
+{
+ char *new = strdup(s);
+
+ if (!new)
+ perror_exit(progname);
+
+ return new;
+}
+
+static bool simple_match(const char *string, const char *pattern)
+{
+ return fnmatch(pattern, string, FNM_PATHNAME) == 0;
+}
+
+// Handle double asterisks ("**") matching.
+// FIXME:
+// This function does not work if double asterisks appear multiple times,
+// like "foo/**/bar/**/baz".
+static bool double_asterisk_match(const char *path, const char *pattern)
+{
+ bool result = false;
+ int slash_diff = 0;
+ char *modified_pattern, *q;
+ const char *p;
+ size_t len;
+
+ for (p = path; *p; p++)
+ if (*p == '/')
+ slash_diff++;
+
+ for (p = pattern; *p; p++)
+ if (*p == '/')
+ slash_diff--;
+
+ len = strlen(pattern) + 1;
+
+ if (slash_diff > 0)
+ len += slash_diff * 2;
+ modified_pattern = xmalloc(len);
+
+ q = modified_pattern;
+ for (p = pattern; *p; p++) {
+ if (!strncmp(p, "**/", 3)) {
+ // "**/" means zero or more sequences of "*/".
+ // "foo**/bar" matches "foobar", "foo*/bar",
+ // "foo*/*/bar", etc.
+ while (slash_diff-- > 0) {
+ *q++ = '*';
+ *q++ = '/';
+ }
+
+ if (slash_diff == 0) {
+ *q++ = '*';
+ *q++ = '/';
+ }
+
+ if (slash_diff < 0)
+ slash_diff++;
+
+ p += 2;
+ } else if (!strcmp(p, "/**")) {
+ // A trailing "/**" matches everything inside.
+ while (slash_diff-- >= 0) {
+ *q++ = '/';
+ *q++ = '*';
+ }
+
+ p += 2;
+ } else {
+ // Copy other patterns as-is.
+ // Other consecutive asterisks are considered regular
+ // asterisks. fnmatch() already handles them like that.
+ *q++ = *p;
+ }
+ }
+
+ *q = '\0';
+
+ result = simple_match(path, modified_pattern);
+
+ free(modified_pattern);
+
+ return result;
+}
+
+// Return true if the given path is ignored by git.
+static bool is_ignored(const char *path, const char *name, bool is_dir)
+{
+ int i;
+
+ // Search the patterns in the reverse order because the last matching
+ // pattern wins.
+ for (i = nr_patterns - 1; i >= 0; i--) {
+ struct pattern *p = patterns[i];
+
+ if (!is_dir && p->dir_only)
+ continue;
+
+ if (!p->path_match) {
+ // If the pattern has no slash at the beginning or
+ // middle, it matches against the basename. Most cases
+ // fall into this and work well with double asterisks.
+ if (!simple_match(name, p->glob))
+ continue;
+ } else if (!p->double_asterisk) {
+ // Unless the pattern has double asterisks, it is still
+ // simple but matches against the path instead.
+ if (!simple_match(path, p->glob))
+ continue;
+ } else {
+ // Double asterisks with a slash. Complex, but rare.
+ if (!double_asterisk_match(path, p->glob))
+ continue;
+ }
+
+ debug("%s: matches %s%s%s\n", path, p->negate ? "!" : "",
+ p->glob, p->dir_only ? "/" : "");
+
+ return !p->negate;
+ }
+
+ debug("%s: no match\n", path);
+
+ return false;
+}
+
+// Return the length of the initial segment of the string that does not contain
+// the unquoted sequence of the given character. Similar to strcspn() in libc.
+static size_t strcspn_trailer(const char *str, char c)
+{
+ bool quoted = false;
+ size_t len = strlen(str);
+ size_t spn = len;
+ const char *s;
+
+ for (s = str; *s; s++) {
+ if (!quoted && *s == c) {
+ if (s - str < spn)
+ spn = s - str;
+ } else {
+ spn = len;
+
+ if (!quoted && *s == '\\')
+ quoted = true;
+ else
+ quoted = false;
+ }
+ }
+
+ return spn;
+}
+
+// Add a gitignore pattern.
+static void add_pattern(char *s, const char *dirpath)
+{
+ bool negate = false;
+ bool dir_only = false;
+ bool path_match = false;
+ bool double_asterisk = false;
+ char *e = s + strlen(s);
+ struct pattern *p;
+ size_t len;
+
+ // Skip comments
+ if (*s == '#')
+ return;
+
+ // Trailing spaces are ignored unless they are quoted with backslash.
+ e = s + strcspn_trailer(s, ' ');
+ *e = '\0';
+
+ // The prefix '!' negates the pattern
+ if (*s == '!') {
+ s++;
+ negate = true;
+ }
+
+ // If there is slash(es) that is not escaped at the end of the pattern,
+ // it matches only directories.
+ len = strcspn_trailer(s, '/');
+ if (s + len < e) {
+ dir_only = true;
+ e = s + len;
+ *e = '\0';
+ }
+
+ // Skip if the line gets empty
+ if (*s == '\0')
+ return;
+
+ // Double asterisk is tricky. Mark it to handle it specially later.
+ if (strstr(s, "**/") || strstr(s, "/**"))
+ double_asterisk = true;
+
+ // If there is a slash at the beginning or middle, the pattern
+ // is relative to the directory level of the .gitignore.
+ if (strchr(s, '/')) {
+ if (*s == '/')
+ s++;
+ path_match = true;
+ }
+
+ len = e - s;
+
+ // We need more room to store dirpath and '/'
+ if (path_match)
+ len += strlen(dirpath) + 1;
+
+ p = xmalloc(sizeof(*p) + len + 1);
+ p->negate = negate;
+ p->dir_only = dir_only;
+ p->path_match = path_match;
+ p->double_asterisk = double_asterisk;
+ p->glob[0] = '\0';
+
+ if (path_match) {
+ strcat(p->glob, dirpath);
+ strcat(p->glob, "/");
+ }
+
+ strcat(p->glob, s);
+
+ debug("Add pattern: %s%s%s\n", negate ? "!" : "", p->glob,
+ dir_only ? "/" : "");
+
+ if (nr_patterns >= alloced_patterns) {
+ alloced_patterns += 128;
+ patterns = xrealloc(patterns,
+ sizeof(*patterns) * alloced_patterns);
+ }
+
+ patterns[nr_patterns++] = p;
+}
+
+static void *load_gitignore(const char *dirpath)
+{
+ struct stat st;
+ char path[PATH_MAX], *buf;
+ int fd, ret;
+
+ ret = snprintf(path, sizeof(path), "%s/.gitignore", dirpath);
+ if (ret >= sizeof(path))
+ error_exit("%s: too long path was truncated\n", path);
+
+ // If .gitignore does not exist in this directory, open() fails.
+ // It is ok, just skip it.
+ fd = open(path, O_RDONLY);
+ if (fd < 0)
+ return NULL;
+
+ if (fstat(fd, &st) < 0)
+ perror_exit(path);
+
+ buf = xmalloc(st.st_size + 1);
+ if (read(fd, buf, st.st_size) != st.st_size)
+ perror_exit(path);
+
+ buf[st.st_size] = '\0';
+ if (close(fd))
+ perror_exit(path);
+
+ return buf;
+}
+
+// Parse '.gitignore' in the given directory.
+static void parse_gitignore(const char *dirpath)
+{
+ char *buf, *s, *next;
+
+ buf = load_gitignore(dirpath);
+ if (!buf)
+ return;
+
+ debug("Parse %s/.gitignore\n", dirpath);
+
+ for (s = buf; *s; s = next) {
+ next = s;
+
+ while (*next != '\0' && *next != '\n')
+ next++;
+
+ if (*next != '\0') {
+ *next = '\0';
+ next++;
+ }
+
+ add_pattern(s, dirpath);
+ }
+
+ free(buf);
+}
+
+// Save the current number of patterns and increment the depth
+static void increment_depth(void)
+{
+ if (depth >= max_depth) {
+ max_depth += 1;
+ nr_patterns_at = xrealloc(nr_patterns_at,
+ sizeof(*nr_patterns_at) * max_depth);
+ }
+
+ nr_patterns_at[depth] = nr_patterns;
+ depth++;
+}
+
+// Decrement the depth, and free up the patterns of this directory level.
+static void decrement_depth(void)
+{
+ depth--;
+ if (depth < 0)
+ error_exit("BUG\n");
+
+ while (nr_patterns > nr_patterns_at[depth])
+ free(patterns[--nr_patterns]);
+}
+
+// If we find an ignored path, print it.
+static void print_path(const char *path)
+{
+ // The path always starts with "./". If not, it is a bug.
+ if (strlen(path) < 2)
+ error_exit("BUG\n");
+
+ // Replace the root directory with the prefix you like.
+ // This is useful for the tar command.
+ fprintf(out_fp, "%s%s\n", prefix, path + 2);
+}
+
+// Traverse the entire directory tree, parsing .gitignore files.
+// Print file paths that are not tracked by git.
+//
+// Return true if all files under the directory are ignored, false otherwise.
+static bool traverse_directory(const char *dirpath)
+{
+ bool all_ignored = true;
+ DIR *dirp;
+
+ debug("Enter[%d]: %s\n", depth, dirpath);
+ increment_depth();
+
+ // We do not know whether .gitignore exists in this directory or not.
+ // Anyway, try to open it.
+ parse_gitignore(dirpath);
+
+ dirp = opendir(dirpath);
+ if (!dirp)
+ perror_exit(dirpath);
+
+ while (1) {
+ char path[PATH_MAX];
+ struct dirent *d;
+ int ret;
+
+ errno = 0;
+ d = readdir(dirp);
+ if (!d) {
+ // readdir() returns NULL on the end of the directory
+ // stream, and also on an error. To distinguish them,
+ // errno should be checked.
+ if (errno)
+ perror_exit(dirpath);
+ break;
+ }
+
+ if (!strcmp(d->d_name, "..") || !strcmp(d->d_name, "."))
+ continue;
+
+ ret = snprintf(path, sizeof(path), "%s/%s", dirpath, d->d_name);
+ if (ret >= sizeof(path))
+ error_exit("%s: too long path was truncated\n", path);
+
+ if (is_ignored(path, d->d_name, d->d_type & DT_DIR)) {
+ debug("Ignore: %s\n", path);
+ print_path(path);
+ } else {
+ if ((d->d_type & DT_DIR) && !(d->d_type & DT_LNK)) {
+ if (!traverse_directory(path))
+ all_ignored = false;
+ } else {
+ all_ignored = false;
+ }
+ }
+ }
+
+ if (closedir(dirp))
+ perror_exit(dirpath);
+
+ // If all the files under this directory are ignored, let's ignore this
+ // directory as well in order to avoid empty directories in the tarball.
+ if (all_ignored) {
+ debug("Ignore: %s (due to all files inside ignored)\n", dirpath);
+ print_path(dirpath);
+ }
+
+ decrement_depth();
+ debug("Leave[%d]: %s\n", depth, dirpath);
+
+ return all_ignored;
+}
+
+// Register hard-coded ignore patterns.
+static void add_fixed_patterns(void)
+{
+ const char * const fixed_patterns[] = {
+ ".git/",
+ };
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(fixed_patterns); i++) {
+ char *s = xstrdup(fixed_patterns[i]);
+
+ add_pattern(s, ".");
+ free(s);
+ }
+}
+
+static void usage(void)
+{
+ fprintf(stderr,
+ "usage: %s [options]\n"
+ "\n"
+ "Print files that are ignored by git\n"
+ "\n"
+ "options:\n"
+ " -d, --debug print debug messages to stderr\n"
+ " -e, --extra-pattern PATTERN Add extra ignore patterns. This behaves like it is prepended to the top .gitignore\n"
+ " -h, --help show this help message and exit\n"
+ " -o, --output FILE output to a file (default: '-', i.e. stdout)\n"
+ " -p, --prefix PREFIX prefix added to each path (default: empty string)\n"
+ " -r, --rootdir DIR root of the source tree (default: current working directory):\n",
+ progname);
+}
+
+int main(int argc, char *argv[])
+{
+ const char *output = "-";
+ const char *rootdir = ".";
+
+ progname = strrchr(argv[0], '/');
+ if (progname)
+ progname++;
+ else
+ progname = argv[0];
+
+ while (1) {
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"extra-pattern", required_argument, NULL, 'e'},
+ {"help", no_argument, NULL, 'h'},
+ {"output", required_argument, NULL, 'o'},
+ {"prefix", required_argument, NULL, 'p'},
+ {"rootdir", required_argument, NULL, 'r'},
+ {},
+ };
+
+ int c = getopt_long(argc, argv, "de:ho:p:r:", long_options, NULL);
+
+ if (c == -1)
+ break;
+
+ switch (c) {
+ case 'd':
+ debug_on = true;
+ break;
+ case 'e':
+ add_pattern(optarg, ".");
+ break;
+ case 'h':
+ usage();
+ exit(0);
+ case 'o':
+ output = optarg;
+ break;
+ case 'p':
+ prefix = optarg;
+ break;
+ case 'r':
+ rootdir = optarg;
+ break;
+ case '?':
+ usage();
+ /* fallthrough */
+ default:
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ if (chdir(rootdir))
+ perror_exit(rootdir);
+
+ if (strcmp(output, "-")) {
+ out_fp = fopen(output, "w");
+ if (!out_fp)
+ perror_exit(output);
+ } else {
+ out_fp = stdout;
+ }
+
+ add_fixed_patterns();
+
+ traverse_directory(".");
+
+ if (depth != 0)
+ error_exit("BUG\n");
+
+ while (nr_patterns > 0)
+ free(patterns[--nr_patterns]);
+ free(patterns);
+ free(nr_patterns_at);
+
+ fflush(out_fp);
+ if (ferror(out_fp))
+ error_exit("not all data was written to the output\n");
+
+ if (fclose(out_fp))
+ perror_exit(output);
+
+ return 0;
+}
--
2.34.1


2023-02-02 03:39:16

by Masahiro Yamada

Subject: [PATCH v4 2/6] kbuild: deb-pkg: create source package without cleaning

If you run 'make deb-pkg', all objects are lost due to 'make clean',
which makes incremental builds impossible.

Instead of cleaning, pass the exclude list to tar's --exclude-from
option.
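
As a minimal sketch of that idea, the following builds a throwaway tree with
one stale object file and archives it with an exclude list instead of
cleaning it. The exclude list is written by hand as a stand-in for the
output of gen-exclude, the function name demo_exclude_from and the file
names are hypothetical, and GNU tar is assumed.

```sh
#!/bin/sh
# Archive a tree that still contains a build artifact, leaving the artifact
# out of the tarball by way of an exclude list rather than 'make clean'.
demo_exclude_from() {
    work=$(mktemp -d) || return 1
    mkdir -p "$work/tree/src"
    : > "$work/tree/src/main.c"
    : > "$work/tree/src/main.o"

    # Hand-written stand-in for the list that gen-exclude would generate.
    # The paths use the same "./" prefix as the archive members below.
    printf '%s\n' './src/main.o' > "$work/exclude-list"

    # The tarball lives outside the tree, so it cannot include itself.
    tar -c -f "$work/src.tar" -C "$work/tree" \
        --exclude-from="$work/exclude-list" .

    # List the members: main.c is present, main.o is not.
    tar -t -f "$work/src.tar"

    rm -rf "$work"
}

demo_exclude_from
```

The object file stays on disk, so a later incremental build can reuse it.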

Previously, *.diff.gz contained some checked-in files such as
.clang-format and .cocciconfig.

With this commit, *.diff.gz will only contain the .config and debian/.
The other source files will go into the tarball.
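
The effect of the extend-diff-ignore regexes that this patch writes to
debian/source/local-options can be checked with a small sketch. It assumes
that the regexes are matched against paths relative to the source root, and
it uses 'grep -E' as a stand-in for the Perl regex engine of dpkg-source;
the function name is_diff_ignored is hypothetical.

```sh
#!/bin/sh
# The extend-diff-ignore regexes from debian/source/local-options,
# one per line. grep treats newline-separated patterns as alternatives.
patterns='^[^.d]
^\.[^c]
^\.c($|[^o])
^\.co($|[^n])
^\.con($|[^f])
^\.conf($|[^i])
^\.confi($|[^g])
^\.config.
^d($|[^e])
^de($|[^b])
^deb($|[^i])
^debi($|[^a])
^debia($|[^n])
^debian[^/]'

# Succeed if the path matches any regex, i.e. it is left out of the diff.
is_diff_ignored() {
    printf '%s\n' "$1" | grep -Eq -e "$patterns"
}

for path in .config .config.old debian/rules .clang-format Makefile; do
    if is_diff_ignored "$path"; then
        echo "ignored: $path"
    else
        echo "kept:    $path"
    fi
done
```

Only .config and paths under debian/ escape every regex, which is the
"anything except .config or debian/" behavior described in the comment in
mkdebian.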

Signed-off-by: Masahiro Yamada <[email protected]>
---

Changes in v4:
- Fix a typo in comment

Changes in v3:
- Add --extra-pattern='*.rej'
- Exclude symlinks at the toplevel
- Add --sort=name tar option

scripts/Makefile.package | 38 +++++++++++++++++++++++++++++++++-----
scripts/package/mkdebian | 25 +++++++++++++++++++++++++
2 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/scripts/Makefile.package b/scripts/Makefile.package
index dfbf40454a99..14567043a8af 100644
--- a/scripts/Makefile.package
+++ b/scripts/Makefile.package
@@ -50,6 +50,32 @@ fi ; \
tar -I $(KGZIP) -c $(RCS_TAR_IGNORE) -f $(2).tar.gz \
--transform 's:^:$(2)/:S' $(TAR_CONTENT) $(3)

+# Source Tarball
+# ---------------------------------------------------------------------------
+
+PHONY += gen-exclude
+gen-exclude:
+ $(Q)$(MAKE) -f $(srctree)/Makefile scripts_exclude
+
+# - Commit 1f5d3a6b6532e25a5cdf1f311956b2b03d343a48 removed '*.rej' from
+# .gitignore, but it is definitely a generated file.
+# - The kernel tree has no symlink at the toplevel. If it does, it is a
+# generated one.
+quiet_cmd_exclude_list = GEN $@
+ cmd_exclude_list = \
+ scripts/gen-exclude --extra-pattern='*.rej' --prefix=./ --rootdir=$(srctree) > $@; \
+ find . -maxdepth 1 -type l >> $@; \
+ echo "./$@" >> $@
+
+.exclude-list: gen-exclude
+ $(call cmd,exclude_list)
+
+quiet_cmd_tar = TAR $@
+ cmd_tar = tar -I $(KGZIP) -c -f $@ -C $(srctree) --exclude-from=$< --exclude=./$@ --sort=name --transform 's:^\.:linux:S' .
+
+%.tar.gz: .exclude-list
+ $(call cmd,tar)
+
# rpm-pkg
# ---------------------------------------------------------------------------
PHONY += rpm-pkg
@@ -81,12 +107,11 @@ binrpm-pkg:

PHONY += deb-pkg
deb-pkg:
- $(MAKE) clean
$(CONFIG_SHELL) $(srctree)/scripts/package/mkdebian
- $(call cmd,src_tar,$(KDEB_SOURCENAME))
- origversion=$$(dpkg-parsechangelog -SVersion |sed 's/-[^-]*$$//');\
- mv $(KDEB_SOURCENAME).tar.gz ../$(KDEB_SOURCENAME)_$${origversion}.orig.tar.gz
- +dpkg-buildpackage -r$(KBUILD_PKG_ROOTCMD) -a$$(cat debian/arch) $(DPKG_FLAGS) --source-option=-sP -i.git -us -uc
+ $(Q)origversion=$$(dpkg-parsechangelog -SVersion |sed 's/-[^-]*$$//');\
+ $(MAKE) -f $(srctree)/scripts/Makefile.package ../$(KDEB_SOURCENAME)_$${origversion}.orig.tar.gz
+ +dpkg-buildpackage -r$(KBUILD_PKG_ROOTCMD) -a$$(cat debian/arch) $(DPKG_FLAGS) \
+ --build=source,binary --source-option=-sP -nc -us -uc

PHONY += bindeb-pkg
bindeb-pkg:
@@ -174,4 +199,7 @@ help:
@echo ' perf-tarxz-src-pkg - Build $(perf-tar).tar.xz source tarball'
@echo ' perf-tarzst-src-pkg - Build $(perf-tar).tar.zst source tarball'

+PHONY += FORCE
+FORCE:
+
.PHONY: $(PHONY)
diff --git a/scripts/package/mkdebian b/scripts/package/mkdebian
index c3bbef7a6754..2f612617cbcf 100755
--- a/scripts/package/mkdebian
+++ b/scripts/package/mkdebian
@@ -84,6 +84,8 @@ set_debarch() {
fi
}

+rm -rf debian
+
# Some variables and settings used throughout the script
version=$KERNELRELEASE
if [ -n "$KDEB_PKGVERSION" ]; then
@@ -135,6 +137,29 @@ fi
mkdir -p debian/source/
echo "1.0" > debian/source/format

+# Ugly: ignore anything except .config or debian/
+# (is there a cleaner way to do this?)
+cat<<'EOF' > debian/source/local-options
+diff-ignore
+
+extend-diff-ignore = ^[^.d]
+
+extend-diff-ignore = ^\.[^c]
+extend-diff-ignore = ^\.c($|[^o])
+extend-diff-ignore = ^\.co($|[^n])
+extend-diff-ignore = ^\.con($|[^f])
+extend-diff-ignore = ^\.conf($|[^i])
+extend-diff-ignore = ^\.confi($|[^g])
+extend-diff-ignore = ^\.config.
+
+extend-diff-ignore = ^d($|[^e])
+extend-diff-ignore = ^de($|[^b])
+extend-diff-ignore = ^deb($|[^i])
+extend-diff-ignore = ^debi($|[^a])
+extend-diff-ignore = ^debia($|[^n])
+extend-diff-ignore = ^debian[^/]
+EOF
+
echo $debarch > debian/arch
extra_build_depends=", $(if_enabled_echo CONFIG_UNWINDER_ORC libelf-dev:native)"
extra_build_depends="$extra_build_depends, $(if_enabled_echo CONFIG_SYSTEM_TRUSTED_KEYRING libssl-dev:native)"
--
2.34.1


2023-02-02 03:39:34

by Masahiro Yamada

Subject: [PATCH v4 3/6] kbuild: rpm-pkg: build binary packages from source rpm

The build rules of rpm-pkg and srcrpm-pkg are almost the same.
Remove the code duplication.

Change rpm-pkg to build binary packages from the source package generated
by srcrpm-pkg.

This changes the output directory of the srpm generated by 'make rpm-pkg'
because srcrpm-pkg overrides _srcrpmdir.

Signed-off-by: Masahiro Yamada <[email protected]>
---

(no changes since v3)

Changes in v3:
- Explain that the source package location will be changed.

scripts/Makefile.package | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/scripts/Makefile.package b/scripts/Makefile.package
index 14567043a8af..ebf3db81b994 100644
--- a/scripts/Makefile.package
+++ b/scripts/Makefile.package
@@ -79,11 +79,9 @@ quiet_cmd_tar = TAR $@
# rpm-pkg
# ---------------------------------------------------------------------------
PHONY += rpm-pkg
-rpm-pkg:
- $(MAKE) clean
- $(CONFIG_SHELL) $(MKSPEC) >$(objtree)/kernel.spec
- $(call cmd,src_tar,$(KERNELPATH),kernel.spec)
- +rpmbuild $(RPMOPTS) --target $(UTS_MACHINE)-linux -ta $(KERNELPATH).tar.gz \
+rpm-pkg: srpm = $(shell rpmspec --srpm --query --queryformat='%{name}-%{VERSION}-%{RELEASE}.src.rpm' kernel.spec)
+rpm-pkg: srcrpm-pkg
+ +rpmbuild $(RPMOPTS) --target $(UTS_MACHINE)-linux -rb $(srpm) \
--define='_smp_mflags %{nil}'

# srcrpm-pkg
--
2.34.1


2023-02-02 03:39:41

by Masahiro Yamada

Subject: [PATCH v4 4/6] kbuild: srcrpm-pkg: create source package without cleaning

If you run 'make (src)rpm-pkg', all objects are lost due to 'make clean',
which makes incremental builds impossible.

Instead of cleaning, pass the exclude list to tar's --exclude-from
option.

Previously, the .config was contained in the source tarball.

With this commit, the source rpm consists of separate linux.tar.gz
and .config.

Remove stale comments. Now, 'make (src)rpm-pkg' works with the O= option.

Signed-off-by: Masahiro Yamada <[email protected]>
---

Changes in v4:
- Do not delete the old tar command because it is still used
by snap-pkg although snap-pkg is broken, and it does not work at all.

scripts/Makefile.package | 29 +++--------------------------
scripts/package/mkspec | 8 ++++----
2 files changed, 7 insertions(+), 30 deletions(-)

diff --git a/scripts/Makefile.package b/scripts/Makefile.package
index ebf3db81b994..6732632a0259 100644
--- a/scripts/Makefile.package
+++ b/scripts/Makefile.package
@@ -3,27 +3,6 @@

include $(srctree)/scripts/Kbuild.include

-# RPM target
-# ---------------------------------------------------------------------------
-# The rpm target generates two rpm files:
-# /usr/src/packages/SRPMS/kernel-2.6.7rc2-1.src.rpm
-# /usr/src/packages/RPMS/i386/kernel-2.6.7rc2-1.<arch>.rpm
-# The src.rpm files includes all source for the kernel being built
-# The <arch>.rpm includes kernel configuration, modules etc.
-#
-# Process to create the rpm files
-# a) clean the kernel
-# b) Generate .spec file
-# c) Build a tar ball, using symlink to make kernel version
-# first entry in the path
-# d) and pack the result to a tar.gz file
-# e) generate the rpm files, based on kernel.spec
-# - Use /. to avoid tar packing just the symlink
-
-# Note that the rpm-pkg target cannot be used with KBUILD_OUTPUT,
-# but the binrpm-pkg target can; for some reason O= gets ignored.
-
-# Remove hyphens since they have special meaning in RPM filenames
KERNELPATH := kernel-$(subst -,_,$(KERNELRELEASE))
KDEB_SOURCENAME ?= linux-upstream
KBUILD_PKG_ROOTCMD ?="fakeroot -u"
@@ -87,12 +66,10 @@ rpm-pkg: srcrpm-pkg
# srcrpm-pkg
# ---------------------------------------------------------------------------
PHONY += srcrpm-pkg
-srcrpm-pkg:
- $(MAKE) clean
+srcrpm-pkg: linux.tar.gz
$(CONFIG_SHELL) $(MKSPEC) >$(objtree)/kernel.spec
- $(call cmd,src_tar,$(KERNELPATH),kernel.spec)
- +rpmbuild $(RPMOPTS) --target $(UTS_MACHINE)-linux -ts $(KERNELPATH).tar.gz \
- --define='_smp_mflags %{nil}' --define='_srcrpmdir $(srctree)'
+ +rpmbuild $(RPMOPTS) --target $(UTS_MACHINE)-linux -bs kernel.spec \
+ --define='_smp_mflags %{nil}' --define='_sourcedir .' --define='_srcrpmdir .'

# binrpm-pkg
# ---------------------------------------------------------------------------
diff --git a/scripts/package/mkspec b/scripts/package/mkspec
index 108c0cb95436..83a64d9d7372 100755
--- a/scripts/package/mkspec
+++ b/scripts/package/mkspec
@@ -47,7 +47,8 @@ sed -e '/^DEL/d' -e 's/^\t*//' <<EOF
Group: System Environment/Kernel
Vendor: The Linux Community
URL: https://www.kernel.org
-$S Source: kernel-$__KERNELRELEASE.tar.gz
+$S Source0: linux.tar.gz
+$S Source1: .config
Provides: $PROVIDES
$S BuildRequires: bc binutils bison dwarves
$S BuildRequires: (elfutils-libelf-devel or libelf-devel) flex
@@ -83,9 +84,8 @@ $S$M This package provides kernel headers and makefiles sufficient to build modu
$S$M against the $__KERNELRELEASE kernel package.
$S$M
$S %prep
-$S %setup -q
-$S rm -f scripts/basic/fixdep scripts/kconfig/conf
-$S rm -f tools/objtool/{fixdep,objtool}
+$S %setup -q -n linux
+$S cp %{SOURCE1} .
$S
$S %build
$S $MAKE %{?_smp_mflags} KERNELRELEASE=$KERNELRELEASE KBUILD_BUILD_VERSION=%{release}
--
2.34.1


2023-02-02 03:40:31

by Masahiro Yamada

Subject: [PATCH v4 5/6] kbuild: deb-pkg: hide KDEB_SOURCENAME from Makefile

scripts/Makefile.package does not need to know the value of
KDEB_SOURCENAME because the source name can be taken from
debian/changelog by using dpkg-parsechangelog.

Move the default of KDEB_SOURCENAME (i.e. linux-upstream) to
scripts/package/mkdebian.
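
How the .orig tarball name is composed can be sketched as follows. The sed
expression and the ':-' default expansion are the ones used in this patch;
the function names and the sample version strings are hypothetical, and the
source name is taken from the environment here rather than from
dpkg-parsechangelog so that the sketch runs without dpkg.

```sh
#!/bin/sh
# Strip the Debian revision, i.e. the last '-' and everything after it.
orig_version() {
    printf '%s\n' "$1" | sed 's/-[^-]*$//'
}

# Compose "<source>_<upstream version>.orig.tar.gz", falling back to the
# default source name when KDEB_SOURCENAME is unset or empty.
orig_tarball() {
    srcname=${KDEB_SOURCENAME:-linux-upstream}
    echo "${srcname}_$(orig_version "$1").orig.tar.gz"
}

orig_version 6.2.0-rc6-3
orig_tarball 6.2.0-rc6-3
```

Because only the last hyphen-separated field is removed, hyphens inside the
upstream version, as in the '-rc6' part above, are preserved.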

Signed-off-by: Masahiro Yamada <[email protected]>
---

(no changes since v3)

Changes in v3:
- Move cmd_debianize

Changes in v2:
- New patch

scripts/Makefile.package | 23 +++++++++++++++--------
scripts/package/mkdebian | 2 +-
2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/scripts/Makefile.package b/scripts/Makefile.package
index 6732632a0259..5135a5419a72 100644
--- a/scripts/Makefile.package
+++ b/scripts/Makefile.package
@@ -4,9 +4,7 @@
include $(srctree)/scripts/Kbuild.include

KERNELPATH := kernel-$(subst -,_,$(KERNELRELEASE))
-KDEB_SOURCENAME ?= linux-upstream
KBUILD_PKG_ROOTCMD ?="fakeroot -u"
-export KDEB_SOURCENAME
# Include only those top-level files that are needed by make, plus the GPL copy
TAR_CONTENT := Documentation LICENSES arch block certs crypto drivers fs \
include init io_uring ipc kernel lib mm net rust \
@@ -80,17 +78,26 @@ binrpm-pkg:
+rpmbuild $(RPMOPTS) --define "_builddir $(objtree)" --target \
$(UTS_MACHINE)-linux -bb $(objtree)/binkernel.spec

+quiet_cmd_debianize = GEN $@
+ cmd_debianize = $(srctree)/scripts/package/mkdebian
+
+PHONY += debian
+debian:
+ $(call cmd,debianize)
+
+PHONY += debian-tarball
+debian-tarball: source = $(shell dpkg-parsechangelog -S Source)
+debian-tarball: orig-version = $(shell dpkg-parsechangelog -S Version | sed 's/-[^-]*$$//')
+debian-tarball: debian
+ $(Q)$(MAKE) -f $(srctree)/scripts/Makefile.package ../$(source)_$(orig-version).orig.tar.gz
+
PHONY += deb-pkg
-deb-pkg:
- $(CONFIG_SHELL) $(srctree)/scripts/package/mkdebian
- $(Q)origversion=$$(dpkg-parsechangelog -SVersion |sed 's/-[^-]*$$//');\
- $(MAKE) -f $(srctree)/scripts/Makefile.package ../$(KDEB_SOURCENAME)_$${origversion}.orig.tar.gz
+deb-pkg: debian-tarball
+dpkg-buildpackage -r$(KBUILD_PKG_ROOTCMD) -a$$(cat debian/arch) $(DPKG_FLAGS) \
--build=source,binary --source-option=-sP -nc -us -uc

PHONY += bindeb-pkg
-bindeb-pkg:
- $(CONFIG_SHELL) $(srctree)/scripts/package/mkdebian
+bindeb-pkg: debian
+dpkg-buildpackage -r$(KBUILD_PKG_ROOTCMD) -a$$(cat debian/arch) $(DPKG_FLAGS) -b -nc -uc

PHONY += intdeb-pkg
diff --git a/scripts/package/mkdebian b/scripts/package/mkdebian
index 2f612617cbcf..0c1ed6215a02 100755
--- a/scripts/package/mkdebian
+++ b/scripts/package/mkdebian
@@ -95,7 +95,7 @@ else
revision=$($srctree/init/build-version)
packageversion=$version-$revision
fi
-sourcename=$KDEB_SOURCENAME
+sourcename=${KDEB_SOURCENAME:-linux-upstream}

if [ "$ARCH" = "um" ] ; then
packagename=user-mode-linux
--
2.34.1


2023-02-02 03:40:41

by Masahiro Yamada

Subject: [PATCH v4 6/6] kbuild: deb-pkg: switch over to format 3.0 (quilt)

Switch from "1.0" to "3.0 (quilt)" because it works more cleanly.

All files except .config and debian/ go into the .orig tarball.
The .config is carried as a single patch, debian/patches/config, so the
ugly extend-diff-ignore patterns can be deleted.
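
The patch generation can be tried in isolation with something like the
following sketch. It runs in a temporary directory, uses a two-line dummy
.config (hypothetical content), and writes to config.patch instead of
debian/patches/config; the header lines and the diff/tail pipeline follow
what mkdebian does.

```shell
# Work in a scratch directory so that nothing in the real tree is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-line stand-in for a real kernel .config.
printf 'CONFIG_FOO=y\nCONFIG_BAR=m\n' > .config

{
    echo "Subject: Add .config"
    echo
    echo "--- /dev/null"
    echo "+++ linux/.config"
    # diff exits with status 1 when the files differ, which is expected
    # here, so that status is discarded. tail -n +3 drops the two header
    # lines printed by diff; they are replaced by the lines echoed above.
    { diff -u /dev/null .config || true; } | tail -n +3
} > config.patch

cat config.patch
```

The result is an ordinary "new file" patch whose target path starts with
linux/, so it applies at the default strip level of 1.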

The debian tarball will be compressed into *.debian.tar.xz by default.
If you would like to use a different compression mode, you can pass it
on the command line, for example DPKG_FLAGS=-Zgzip.

The .orig tarball only supports gzip for now. The combination of
gzip and xz is somewhat clumsy, but it is not a practical problem.

Signed-off-by: Masahiro Yamada <[email protected]>
---

Changes in v4:
- New patch

scripts/Makefile.package | 2 +-
scripts/package/mkdebian | 42 +++++++++++++++++-----------------------
2 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/scripts/Makefile.package b/scripts/Makefile.package
index 5135a5419a72..454268a37af1 100644
--- a/scripts/Makefile.package
+++ b/scripts/Makefile.package
@@ -94,7 +94,7 @@ debian-tarball: debian
PHONY += deb-pkg
deb-pkg: debian-tarball
+dpkg-buildpackage -r$(KBUILD_PKG_ROOTCMD) -a$$(cat debian/arch) $(DPKG_FLAGS) \
- --build=source,binary --source-option=-sP -nc -us -uc
+ --build=source,binary -nc -us -uc

PHONY += bindeb-pkg
bindeb-pkg: debian
diff --git a/scripts/package/mkdebian b/scripts/package/mkdebian
index 0c1ed6215a02..1ab4c6ee76d9 100755
--- a/scripts/package/mkdebian
+++ b/scripts/package/mkdebian
@@ -135,30 +135,24 @@ else
fi

mkdir -p debian/source/
-echo "1.0" > debian/source/format
-
-# Ugly: ignore anything except .config or debian/
-# (is there a cleaner way to do this?)
-cat<<'EOF' > debian/source/local-options
-diff-ignore
-
-extend-diff-ignore = ^[^.d]
-
-extend-diff-ignore = ^\.[^c]
-extend-diff-ignore = ^\.c($|[^o])
-extend-diff-ignore = ^\.co($|[^n])
-extend-diff-ignore = ^\.con($|[^f])
-extend-diff-ignore = ^\.conf($|[^i])
-extend-diff-ignore = ^\.confi($|[^g])
-extend-diff-ignore = ^\.config.
-
-extend-diff-ignore = ^d($|[^e])
-extend-diff-ignore = ^de($|[^b])
-extend-diff-ignore = ^deb($|[^i])
-extend-diff-ignore = ^debi($|[^a])
-extend-diff-ignore = ^debia($|[^n])
-extend-diff-ignore = ^debian[^/]
-EOF
+echo "3.0 (quilt)" > debian/source/format
+
+{
+ echo "diff-ignore"
+ echo "extend-diff-ignore = .*"
+} > debian/source/local-options
+
+# Add .config as a patch
+mkdir -p debian/patches
+{
+ echo "Subject: Add .config"
+ echo "Author: ${maintainer}"
+ echo
+ echo "--- /dev/null"
+ echo "+++ linux/.config"
+ diff -u /dev/null .config | tail -n +3
+} > debian/patches/config
+echo config > debian/patches/series

echo $debarch > debian/arch
extra_build_depends=", $(if_enabled_echo CONFIG_UNWINDER_ORC libelf-dev:native)"
--
2.34.1


2023-02-02 03:50:52

by Masahiro Yamada

Subject: Re: [PATCH v4 1/6] kbuild: add a tool to generate a list of files ignored by git

On Thu, Feb 2, 2023 at 12:38 PM Masahiro Yamada <[email protected]> wrote:
>
> In short, the motivation of this commit is to build a source package
> without cleaning the source tree.
>
> The deb-pkg and (src)rpm-pkg targets first run 'make clean' before
> creating a source tarball. Otherwise build artifacts such as *.o,
> *.a, etc. would be included in the tarball. Yet, the tarball ends up
> containing several garbage files since 'make clean' does not clean
> everything.
>
> Cleaning the tree every time is annoying since it makes the incremental
> build impossible. It is desirable to create a source tarball without
> cleaning the tree.
>
> In fact, there are some ways to archive this.

This should read "achieve this".



>
> The easiest way is 'git archive'. Actually, 'make perf-tar*-src-pkg'
> does this way, but I do not like it because it works only when the source
> tree is managed by git, and all files you want in the tarball must be
> committed in advance.
>
> I want to make it work without relying on git. We can do this.
>
> Files that are not tracked by git are generated files. We can list them
> out by parsing the .gitignore files. Of course, .gitignore does not cover
> all the cases, but it works well enough.
>
> tar(1) claims to support it:
>
> --exclude-vcs-ignores
>
> Exclude files that match patterns read from VCS-specific ignore files.
> Supported files are: .cvsignore, .gitignore, .bzrignore, and .hgignore.
>
> The best scenario would be to use 'tar --exclude-vcs-ignores', but this
> option does not work. --exclude-vcs-ignore does not understand any of
> the negation (!), preceding slash, following slash, etc.. So, this option
> is just useless.
>
> Hence, I wrote this gitignore parser. The previous version [1], written
> in Python, was so slow. This version is implemented in C, so it works
> much faster.
>
> This tool traverses the source tree, parsing the .gitignore files. It
> prints the file paths that are not tracked by git. The output can be
> used for tar's --exclude-from= option.
>
> [How to test this tool]
>
> $ git clean -dfx
> $ make -s -j$(nproc) defconfig all # or allmodconifg or whatever
> $ git archive -o ../linux1.tar --prefix=./ HEAD
> $ tar tf ../linux1.tar | LANG=C sort > ../file-list1 # files emitted by 'git archive'
> $ make scripts_exclude
> HOSTCC scripts/gen-exclude
> $ scripts/gen-exclude --prefix=./ -o ../exclude-list
> $ tar cf ../linux2.tar --exclude-from=../exclude-list .
> $ tar tf ../linux2.tar | LANG=C sort > ../file-list2 # files emitted by 'tar'
> $ diff ../file-list1 ../file-list2 | grep -E '^(<|>)'
> < ./Documentation/devicetree/bindings/.yamllint
> < ./drivers/clk/.kunitconfig
> < ./drivers/gpu/drm/tests/.kunitconfig
> < ./drivers/gpu/drm/vc4/tests/.kunitconfig
> < ./drivers/hid/.kunitconfig
> < ./fs/ext4/.kunitconfig
> < ./fs/fat/.kunitconfig
> < ./kernel/kcsan/.kunitconfig
> < ./lib/kunit/.kunitconfig
> < ./mm/kfence/.kunitconfig
> < ./net/sunrpc/.kunitconfig
> < ./tools/testing/selftests/arm64/tags/
> < ./tools/testing/selftests/arm64/tags/.gitignore
> < ./tools/testing/selftests/arm64/tags/Makefile
> < ./tools/testing/selftests/arm64/tags/run_tags_test.sh
> < ./tools/testing/selftests/arm64/tags/tags_test.c
> < ./tools/testing/selftests/kvm/.gitignore
> < ./tools/testing/selftests/kvm/Makefile
> < ./tools/testing/selftests/kvm/config
> < ./tools/testing/selftests/kvm/settings
>
> The source tarball contains most of files that are tracked by git. You
> see some diffs, but it is just because some .gitignore files are wrong.
>
> $ git ls-files -i -c --exclude-per-directory=.gitignore
> Documentation/devicetree/bindings/.yamllint
> drivers/clk/.kunitconfig
> drivers/gpu/drm/tests/.kunitconfig
> drivers/hid/.kunitconfig
> fs/ext4/.kunitconfig
> fs/fat/.kunitconfig
> kernel/kcsan/.kunitconfig
> lib/kunit/.kunitconfig
> mm/kfence/.kunitconfig
> tools/testing/selftests/arm64/tags/.gitignore
> tools/testing/selftests/arm64/tags/Makefile
> tools/testing/selftests/arm64/tags/run_tags_test.sh
> tools/testing/selftests/arm64/tags/tags_test.c
> tools/testing/selftests/kvm/.gitignore
> tools/testing/selftests/kvm/Makefile
> tools/testing/selftests/kvm/config
> tools/testing/selftests/kvm/settings
>
> [1]: https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Masahiro Yamada <[email protected]>
> ---
>
> (no changes since v3)
>
> Changes in v3:
> - Various code refactoring: remove struct gitignore, remove next: label etc.
> - Support --extra-pattern option
>
> Changes in v2:
> - Reimplement in C
>
> Makefile | 4 +
> scripts/.gitignore | 1 +
> scripts/Makefile | 2 +-
> scripts/gen-exclude.c | 623 ++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 629 insertions(+), 1 deletion(-)
> create mode 100644 scripts/gen-exclude.c
>
> diff --git a/Makefile b/Makefile
> index 2faf872b6808..35b294cc6f32 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1652,6 +1652,10 @@ distclean: mrproper
> %pkg: include/config/kernel.release FORCE
> $(Q)$(MAKE) -f $(srctree)/scripts/Makefile.package $@
>
> +PHONY += scripts_exclude
> +scripts_exclude: scripts_basic
> + $(Q)$(MAKE) $(build)=scripts scripts/gen-exclude
> +
> # Brief documentation of the typical targets used
> # ---------------------------------------------------------------------------
>
> diff --git a/scripts/.gitignore b/scripts/.gitignore
> index 6e9ce6720a05..7f433bc1461c 100644
> --- a/scripts/.gitignore
> +++ b/scripts/.gitignore
> @@ -1,5 +1,6 @@
> # SPDX-License-Identifier: GPL-2.0-only
> /asn1_compiler
> +/gen-exclude
> /generate_rust_target
> /insert-sys-cert
> /kallsyms
> diff --git a/scripts/Makefile b/scripts/Makefile
> index 32b6ba722728..5dcd7f57607f 100644
> --- a/scripts/Makefile
> +++ b/scripts/Makefile
> @@ -38,7 +38,7 @@ HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED
> endif
>
> # The following programs are only built on demand
> -hostprogs += unifdef
> +hostprogs += gen-exclude unifdef
>
> # The module linker script is preprocessed on demand
> targets += module.lds
> diff --git a/scripts/gen-exclude.c b/scripts/gen-exclude.c
> new file mode 100644
> index 000000000000..5c4ecd902290
> --- /dev/null
> +++ b/scripts/gen-exclude.c
> @@ -0,0 +1,623 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +//
> +// Traverse the source tree, parsing all .gitignore files, and print file paths
> +// that are not tracked by git.
> +// The output is suitable to the --exclude-from option of tar.
> +// This is useful until the --exclude-vcs-ignores option gets working correctly.
> +//
> +// Copyright (C) 2023 Masahiro Yamada <[email protected]>
> +
> +#include <dirent.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <fnmatch.h>
> +#include <getopt.h>
> +#include <stdarg.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +
> +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
> +
> +// struct pattern - represent an ignore pattern (a line in .gitignroe)
> +// @negate: negate the pattern (prefixing '!')
> +// @dir_only: only matches directories (trailing '/')
> +// @path_match: true if the glob pattern is a path instead of a file name
> +// @double_asterisk: true if the glob pattern contains double asterisks ('**')
> +// @glob: glob pattern
> +struct pattern {
> + bool negate;
> + bool dir_only;
> + bool path_match;
> + bool double_asterisk;
> + char glob[];
> +};
> +
> +struct pattern **patterns;
> +static int nr_patterns, alloced_patterns;
> +
> +// Remember the number of patterns at each directory level
> +static int *nr_patterns_at;
> +// Track the current/max directory level;
> +static int depth, max_depth;
> +static bool debug_on;
> +static FILE *out_fp;
> +static char *prefix = "";
> +static char *progname;
> +
> +static void __attribute__((noreturn)) perror_exit(const char *s)
> +{
> + perror(s);
> +
> + exit(EXIT_FAILURE);
> +}
> +
> +static void __attribute__((noreturn)) error_exit(const char *fmt, ...)
> +{
> + va_list args;
> +
> + fprintf(stderr, "%s: error: ", progname);
> +
> + va_start(args, fmt);
> + vfprintf(stderr, fmt, args);
> + va_end(args);
> +
> + exit(EXIT_FAILURE);
> +}
> +
> +static void debug(const char *fmt, ...)
> +{
> + va_list args;
> + int i;
> +
> + if (!debug_on)
> + return;
> +
> + fprintf(stderr, "[DEBUG]");
> +
> + for (i = 0; i < depth * 2; i++)
> + fputc(' ', stderr);
> +
> + va_start(args, fmt);
> + vfprintf(stderr, fmt, args);
> + va_end(args);
> +}
> +
> +static void *xrealloc(void *ptr, size_t size)
> +{
> + ptr = realloc(ptr, size);
> + if (!ptr)
> + perror_exit(progname);
> +
> + return ptr;
> +}
> +
> +static void *xmalloc(size_t size)
> +{
> + return xrealloc(NULL, size);
> +}
> +
> +static char *xstrdup(const char *s)
> +{
> + char *new = strdup(s);
> +
> + if (!new)
> + perror_exit(progname);
> +
> + return new;
> +}
> +
> +static bool simple_match(const char *string, const char *pattern)
> +{
> + return fnmatch(pattern, string, FNM_PATHNAME) == 0;
> +}
> +
> +// Handle double asterisks ("**") matching.
> +// FIXME:
> +// This function does not work if double asterisks apppear multiple times,
> +// like "foo/**/bar/**/baz".
> +static bool double_asterisk_match(const char *path, const char *pattern)
> +{
> + bool result = false;
> + int slash_diff = 0;
> + char *modified_pattern, *q;
> + const char *p;
> + size_t len;
> +
> + for (p = path; *p; p++)
> + if (*p == '/')
> + slash_diff++;
> +
> + for (p = pattern; *p; p++)
> + if (*p == '/')
> + slash_diff--;
> +
> + len = strlen(pattern) + 1;
> +
> + if (slash_diff > 0)
> + len += slash_diff * 2;
> + modified_pattern = xmalloc(len);
> +
> + q = modified_pattern;
> + for (p = pattern; *p; p++) {
> + if (!strncmp(p, "**/", 3)) {
> + // "**/" means zero of more sequences of '*/".
> + // "foo**/bar" matches "foobar", "foo*/bar",
> + // "foo*/*/bar", etc.
> + while (slash_diff-- > 0) {
> + *q++ = '*';
> + *q++ = '/';
> + }
> +
> + if (slash_diff == 0) {
> + *q++ = '*';
> + *q++ = '/';
> + }
> +
> + if (slash_diff < 0)
> + slash_diff++;
> +
> + p += 2;
> + } else if (!strcmp(p, "/**")) {
> + // A trailing "/**" matches everything inside.
> + while (slash_diff-- >= 0) {
> + *q++ = '/';
> + *q++ = '*';
> + }
> +
> + p += 2;
> + } else {
> + // Copy other patterns as-is.
> + // Other consecutive asterisks are considered regular
> + // asterisks. fnmatch() already handles them like that.
> + *q++ = *p;
> + }
> + }
> +
> + *q = '\0';
> +
> + result = simple_match(path, modified_pattern);
> +
> + free(modified_pattern);
> +
> + return result;
> +}
> +
> +// Return true if the given path is ignored by git.
> +static bool is_ignored(const char *path, const char *name, bool is_dir)
> +{
> + int i;
> +
> + // Search the patterns in the reverse order because the last matching
> + // pattern wins.
> + for (i = nr_patterns - 1; i >= 0; i--) {
> + struct pattern *p = patterns[i];
> +
> + if (!is_dir && p->dir_only)
> + continue;
> +
> + if (!p->path_match) {
> + // If the pattern has no slash at the beginning or
> + // middle, it matches against the basename. Most cases
> + // fall into this and work well with double asterisks.
> + if (!simple_match(name, p->glob))
> + continue;
> + } else if (!p->double_asterisk) {
> + // Unless the pattern has double asterisks, it is still
> + // simple but matches against the path instead.
> + if (!simple_match(path, p->glob))
> + continue;
> + } else {
> + // Double asterisks with a slash. Complex, but rare.
> + if (!double_asterisk_match(path, p->glob))
> + continue;
> + }
> +
> + debug("%s: matches %s%s%s\n", path, p->negate ? "!" : "",
> + p->glob, p->dir_only ? "/" : "");
> +
> + return !p->negate;
> + }
> +
> + debug("%s: no match\n", path);
> +
> + return false;
> +}
> +
> +// Return the length of the initial segment of the string that does not contain
> +// the unquoted sequence of the given character. Similar to strcspn() in libc.
> +static size_t strcspn_trailer(const char *str, char c)
> +{
> + bool quoted = false;
> + size_t len = strlen(str);
> + size_t spn = len;
> + const char *s;
> +
> + for (s = str; *s; s++) {
> + if (!quoted && *s == c) {
> + if (s - str < spn)
> + spn = s - str;
> + } else {
> + spn = len;
> +
> + if (!quoted && *s == '\\')
> + quoted = true;
> + else
> + quoted = false;
> + }
> + }
> +
> + return spn;
> +}
> +
> +// Add an gitignore pattern.
> +static void add_pattern(char *s, const char *dirpath)
> +{
> + bool negate = false;
> + bool dir_only = false;
> + bool path_match = false;
> + bool double_asterisk = false;
> + char *e = s + strlen(s);
> + struct pattern *p;
> + size_t len;
> +
> + // Skip comments
> + if (*s == '#')
> + return;
> +
> + // Trailing spaces are ignored unless they are quoted with backslash.
> + e = s + strcspn_trailer(s, ' ');
> + *e = '\0';
> +
> + // The prefix '!' negates the pattern
> + if (*s == '!') {
> + s++;
> + negate = true;
> + }
> +
> + // If there is slash(es) that is not escaped at the end of the pattern,
> + // it matches only directories.
> + len = strcspn_trailer(s, '/');
> + if (s + len < e) {
> + dir_only = true;
> + e = s + len;
> + *e = '\0';
> + }
> +
> + // Skip if the line gets empty
> + if (*s == '\0')
> + return;
> +
> + // Double asterisk is tricky. Mark it to handle it specially later.
> + if (strstr(s, "**/") || strstr(s, "/**"))
> + double_asterisk = true;
> +
> + // If there is a slash at the beginning or middle, the pattern
> + // is relative to the directory level of the .gitignore.
> + if (strchr(s, '/')) {
> + if (*s == '/')
> + s++;
> + path_match = true;
> + }
> +
> + len = e - s;
> +
> + // We need more room to store dirpath and '/'
> + if (path_match)
> + len += strlen(dirpath) + 1;
> +
> + p = xmalloc(sizeof(*p) + len + 1);
> + p->negate = negate;
> + p->dir_only = dir_only;
> + p->path_match = path_match;
> + p->double_asterisk = double_asterisk;
> + p->glob[0] = '\0';
> +
> + if (path_match) {
> + strcat(p->glob, dirpath);
> + strcat(p->glob, "/");
> + }
> +
> + strcat(p->glob, s);
> +
> + debug("Add pattern: %s%s%s\n", negate ? "!" : "", p->glob,
> + dir_only ? "/" : "");
> +
> + if (nr_patterns >= alloced_patterns) {
> + alloced_patterns += 128;
> + patterns = xrealloc(patterns,
> + sizeof(*patterns) * alloced_patterns);
> + }
> +
> + patterns[nr_patterns++] = p;
> +}
> +
> +static void *load_gitignore(const char *dirpath)
> +{
> + struct stat st;
> + char path[PATH_MAX], *buf;
> + int fd, ret;
> +
> + ret = snprintf(path, sizeof(path), "%s/.gitignore", dirpath);
> + if (ret >= sizeof(path))
> + error_exit("%s: too long path was truncated\n", path);
> +
> + // If .gitignore does not exist in this directory, open() fails.
> + // It is ok, just skip it.
> + fd = open(path, O_RDONLY);
> + if (fd < 0)
> + return NULL;
> +
> + if (fstat(fd, &st) < 0)
> + perror_exit(path);
> +
> + buf = xmalloc(st.st_size + 1);
> + if (read(fd, buf, st.st_size) != st.st_size)
> + perror_exit(path);
> +
> + buf[st.st_size] = '\0';
> + if (close(fd))
> + perror_exit(path);
> +
> + return buf;
> +}
> +
> +// Parse '.gitignore' in the given directory.
> +static void parse_gitignore(const char *dirpath)
> +{
> + char *buf, *s, *next;
> +
> + buf = load_gitignore(dirpath);
> + if (!buf)
> + return;
> +
> + debug("Parse %s/.gitignore\n", dirpath);
> +
> + for (s = buf; *s; s = next) {
> + next = s;
> +
> + while (*next != '\0' && *next != '\n')
> + next++;
> +
> + if (*next != '\0') {
> + *next = '\0';
> + next++;
> + }
> +
> + add_pattern(s, dirpath);
> + }
> +
> + free(buf);
> +}
> +
> +// Save the current number of patterns and increment the depth
> +static void increment_depth(void)
> +{
> + if (depth >= max_depth) {
> + max_depth += 1;
> + nr_patterns_at = xrealloc(nr_patterns_at,
> + sizeof(*nr_patterns_at) * max_depth);
> + }
> +
> + nr_patterns_at[depth] = nr_patterns;
> + depth++;
> +}
> +
> +// Decrement the depth, and free up the patterns of this directory level.
> +static void decrement_depth(void)
> +{
> + depth--;
> + if (depth < 0)
> + error_exit("BUG\n");
> +
> + while (nr_patterns > nr_patterns_at[depth])
> + free(patterns[--nr_patterns]);
> +}
> +
> +// If we find an ignored path, print it.
> +static void print_path(const char *path)
> +{
> + // The path always start with "./". If not, it is a bug.
> + if (strlen(path) < 2)
> + error_exit("BUG\n");
> +
> + // Replace the root directory with the prefix you like.
> + // This is useful for the tar command.
> + fprintf(out_fp, "%s%s\n", prefix, path + 2);
> +}
> +
> +// Traverse the entire directory tree, parsing .gitignore files.
> +// Print file paths that are not tracked by git.
> +//
> +// Return true if all files under the directory are ignored, false otherwise.
> +static bool traverse_directory(const char *dirpath)
> +{
> + bool all_ignored = true;
> + DIR *dirp;
> +
> + debug("Enter[%d]: %s\n", depth, dirpath);
> + increment_depth();
> +
> + // We do not know whether .gitignore exists in this directory or not.
> + // Anyway, try to open it.
> + parse_gitignore(dirpath);
> +
> + dirp = opendir(dirpath);
> + if (!dirp)
> + perror_exit(dirpath);
> +
> + while (1) {
> + char path[PATH_MAX];
> + struct dirent *d;
> + int ret;
> +
> + errno = 0;
> + d = readdir(dirp);
> + if (!d) {
> + // readdir() returns NULL on the end of the directory
> + // steam, and also on an error. To distinguish them,
> + // errno should be checked.
> + if (errno)
> + perror_exit(dirpath);
> + break;
> + }
> +
> + if (!strcmp(d->d_name, "..") || !strcmp(d->d_name, "."))
> + continue;
> +
> + ret = snprintf(path, sizeof(path), "%s/%s", dirpath, d->d_name);
> + if (ret >= sizeof(path))
> + error_exit("%s: too long path was truncated\n", path);
> +
> + if (is_ignored(path, d->d_name, d->d_type & DT_DIR)) {
> + debug("Ignore: %s\n", path);
> + print_path(path);
> + } else {
> + if ((d->d_type & DT_DIR) && !(d->d_type & DT_LNK)) {
> + if (!traverse_directory(path))
> + all_ignored = false;
> + } else {
> + all_ignored = false;
> + }
> + }
> + }
> +
> + if (closedir(dirp))
> + perror_exit(dirpath);
> +
> + // If all the files under this directory are ignored, let's ignore this
> + // directory as well in order to avoid empty directories in the tarball.
> + if (all_ignored) {
> + debug("Ignore: %s (due to all files inside ignored)\n", dirpath);
> + print_path(dirpath);
> + }
> +
> + decrement_depth();
> + debug("Leave[%d]: %s\n", depth, dirpath);
> +
> + return all_ignored;
> +}
> +
> +// Register hard-coded ignore patterns.
> +static void add_fixed_patterns(void)
> +{
> + const char * const fixed_patterns[] = {
> + ".git/",
> + };
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(fixed_patterns); i++) {
> + char *s = xstrdup(fixed_patterns[i]);
> +
> + add_pattern(s, ".");
> + free(s);
> + }
> +}
> +
> +static void usage(void)
> +{
> + fprintf(stderr,
> + "usage: %s [options]\n"
> + "\n"
> + "Print files that are not ignored by git\n"
> + "\n"
> + "options:\n"
> + " -d, --debug print debug messages to stderr\n"
> + " -e, --extra-pattern PATTERN Add extra ignore patterns. This behaves like it is prepended to the top .gitignore\n"
> + " -h, --help show this help message and exit\n"
> + " -o, --output FILE output to a file (default: '-', i.e. stdout)\n"
> + " -p, --prefix PREFIX prefix added to each path (default: empty string)\n"
> + " -r, --rootdir DIR root of the source tree (default: current working directory):\n",
> + progname);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + const char *output = "-";
> + const char *rootdir = ".";
> +
> + progname = strrchr(argv[0], '/');
> + if (progname)
> + progname++;
> + else
> + progname = argv[0];
> +
> + while (1) {
> + static struct option long_options[] = {
> + {"debug", no_argument, NULL, 'd'},
> + {"extra-pattern", required_argument, NULL, 'e'},
> + {"help", no_argument, NULL, 'h'},
> + {"output", required_argument, NULL, 'o'},
> + {"prefix", required_argument, NULL, 'p'},
> + {"rootdir", required_argument, NULL, 'r'},
> + {},
> + };
> +
> + int c = getopt_long(argc, argv, "de:ho:p:r:", long_options, NULL);
> +
> + if (c == -1)
> + break;
> +
> + switch (c) {
> + case 'd':
> + debug_on = true;
> + break;
> + case 'e':
> + add_pattern(optarg, ".");
> + break;
> + case 'h':
> + usage();
> + exit(0);
> + case 'o':
> + output = optarg;
> + break;
> + case 'p':
> + prefix = optarg;
> + break;
> + case 'r':
> + rootdir = optarg;
> + break;
> + case '?':
> + usage();
> + /* fallthrough */
> + default:
> + exit(EXIT_FAILURE);
> + }
> + }
> +
> + if (chdir(rootdir))
> + perror_exit(rootdir);
> +
> + if (strcmp(output, "-")) {
> + out_fp = fopen(output, "w");
> + if (!out_fp)
> + perror_exit(output);
> + } else {
> + out_fp = stdout;
> + }
> +
> + add_fixed_patterns();
> +
> + traverse_directory(".");
> +
> + if (depth != 0)
> + error_exit("BUG\n");
> +
> + while (nr_patterns > 0)
> + free(patterns[--nr_patterns]);
> + free(patterns);
> + free(nr_patterns_at);
> +
> + fflush(out_fp);
> + if (ferror(out_fp))
> + error_exit("not all data was written to the output\n");
> +
> + if (fclose(out_fp))
> + perror_exit(output);
> +
> + return 0;
> +}
> --
> 2.34.1
>


--
Best Regards
Masahiro Yamada

2023-02-02 11:08:43

by Nicolas Schier

Subject: Re: [PATCH v4 1/6] kbuild: add a tool to generate a list of files ignored by git

On Thu, Feb 02, 2023 at 12:37:11PM +0900 Masahiro Yamada wrote:
> In short, the motivation of this commit is to build a source package
> without cleaning the source tree.
>
> The deb-pkg and (src)rpm-pkg targets first run 'make clean' before
> creating a source tarball. Otherwise build artifacts such as *.o,
> *.a, etc. would be included in the tarball. Yet, the tarball ends up
> containing several garbage files since 'make clean' does not clean
> everything.
>
> Cleaning the tree every time is annoying since it makes the incremental
> build impossible. It is desirable to create a source tarball without
> cleaning the tree.
>
> In fact, there are some ways to archive this.
>
> The easiest way is 'git archive'. Actually, 'make perf-tar*-src-pkg'
> does this way, but I do not like it because it works only when the source
> tree is managed by git, and all files you want in the tarball must be
> committed in advance.
>
> I want to make it work without relying on git. We can do this.
>
> Files that are not tracked by git are generated files. We can list them
> out by parsing the .gitignore files. Of course, .gitignore does not cover
> all the cases, but it works well enough.
>
> tar(1) claims to support it:
>
> --exclude-vcs-ignores
>
> Exclude files that match patterns read from VCS-specific ignore files.
> Supported files are: .cvsignore, .gitignore, .bzrignore, and .hgignore.
>
> The best scenario would be to use 'tar --exclude-vcs-ignores', but this
> option does not work. --exclude-vcs-ignore does not understand any of
> the negation (!), preceding slash, following slash, etc.. So, this option
> is just useless.
>
> Hence, I wrote this gitignore parser. The previous version [1], written
> in Python, was so slow. This version is implemented in C, so it works
> much faster.
>
> This tool traverses the source tree, parsing the .gitignore files. It
> prints the file paths that are not tracked by git. The output can be
> used for tar's --exclude-from= option.
>
> [How to test this tool]
>
> $ git clean -dfx
> $ make -s -j$(nproc) defconfig all # or allmodconifg or whatever
> $ git archive -o ../linux1.tar --prefix=./ HEAD
> $ tar tf ../linux1.tar | LANG=C sort > ../file-list1 # files emitted by 'git archive'
> $ make scripts_exclude
> HOSTCC scripts/gen-exclude
> $ scripts/gen-exclude --prefix=./ -o ../exclude-list
> $ tar cf ../linux2.tar --exclude-from=../exclude-list .
> $ tar tf ../linux2.tar | LANG=C sort > ../file-list2 # files emitted by 'tar'
> $ diff ../file-list1 ../file-list2 | grep -E '^(<|>)'
> < ./Documentation/devicetree/bindings/.yamllint
> < ./drivers/clk/.kunitconfig
> < ./drivers/gpu/drm/tests/.kunitconfig
> < ./drivers/gpu/drm/vc4/tests/.kunitconfig
> < ./drivers/hid/.kunitconfig
> < ./fs/ext4/.kunitconfig
> < ./fs/fat/.kunitconfig
> < ./kernel/kcsan/.kunitconfig
> < ./lib/kunit/.kunitconfig
> < ./mm/kfence/.kunitconfig
> < ./net/sunrpc/.kunitconfig
> < ./tools/testing/selftests/arm64/tags/
> < ./tools/testing/selftests/arm64/tags/.gitignore
> < ./tools/testing/selftests/arm64/tags/Makefile
> < ./tools/testing/selftests/arm64/tags/run_tags_test.sh
> < ./tools/testing/selftests/arm64/tags/tags_test.c
> < ./tools/testing/selftests/kvm/.gitignore
> < ./tools/testing/selftests/kvm/Makefile
> < ./tools/testing/selftests/kvm/config
> < ./tools/testing/selftests/kvm/settings
>
> The source tarball contains most of files that are tracked by git. You
> see some diffs, but it is just because some .gitignore files are wrong.
>
> $ git ls-files -i -c --exclude-per-directory=.gitignore
> Documentation/devicetree/bindings/.yamllint
> drivers/clk/.kunitconfig
> drivers/gpu/drm/tests/.kunitconfig
> drivers/hid/.kunitconfig
> fs/ext4/.kunitconfig
> fs/fat/.kunitconfig
> kernel/kcsan/.kunitconfig
> lib/kunit/.kunitconfig
> mm/kfence/.kunitconfig
> tools/testing/selftests/arm64/tags/.gitignore
> tools/testing/selftests/arm64/tags/Makefile
> tools/testing/selftests/arm64/tags/run_tags_test.sh
> tools/testing/selftests/arm64/tags/tags_test.c
> tools/testing/selftests/kvm/.gitignore
> tools/testing/selftests/kvm/Makefile
> tools/testing/selftests/kvm/config
> tools/testing/selftests/kvm/settings
>
> [1]: https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Masahiro Yamada <[email protected]>
> ---
>
> (no changes since v3)
>
> Changes in v3:
> - Various code refactoring: remove struct gitignore, remove next: label etc.
> - Support --extra-pattern option
>
> Changes in v2:
> - Reimplement in C
>
> Makefile | 4 +
> scripts/.gitignore | 1 +
> scripts/Makefile | 2 +-
> scripts/gen-exclude.c | 623 ++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 629 insertions(+), 1 deletion(-)
> create mode 100644 scripts/gen-exclude.c
>
> diff --git a/Makefile b/Makefile
> index 2faf872b6808..35b294cc6f32 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1652,6 +1652,10 @@ distclean: mrproper
> %pkg: include/config/kernel.release FORCE
> $(Q)$(MAKE) -f $(srctree)/scripts/Makefile.package $@
>
> +PHONY += scripts_exclude
> +scripts_exclude: scripts_basic
> + $(Q)$(MAKE) $(build)=scripts scripts/gen-exclude
> +
> # Brief documentation of the typical targets used
> # ---------------------------------------------------------------------------
>
> diff --git a/scripts/.gitignore b/scripts/.gitignore
> index 6e9ce6720a05..7f433bc1461c 100644
> --- a/scripts/.gitignore
> +++ b/scripts/.gitignore
> @@ -1,5 +1,6 @@
> # SPDX-License-Identifier: GPL-2.0-only
> /asn1_compiler
> +/gen-exclude
> /generate_rust_target
> /insert-sys-cert
> /kallsyms
> diff --git a/scripts/Makefile b/scripts/Makefile
> index 32b6ba722728..5dcd7f57607f 100644
> --- a/scripts/Makefile
> +++ b/scripts/Makefile
> @@ -38,7 +38,7 @@ HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED
> endif
>
> # The following programs are only built on demand
> -hostprogs += unifdef
> +hostprogs += gen-exclude unifdef
>
> # The module linker script is preprocessed on demand
> targets += module.lds
> diff --git a/scripts/gen-exclude.c b/scripts/gen-exclude.c
> new file mode 100644
> index 000000000000..5c4ecd902290
> --- /dev/null
> +++ b/scripts/gen-exclude.c
> @@ -0,0 +1,623 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +//
> +// Traverse the source tree, parsing all .gitignore files, and print file paths
> +// that are not tracked by git.
> +// The output is suitable to the --exclude-from option of tar.
> +// This is useful until the --exclude-vcs-ignores option gets working correctly.
> +//
> +// Copyright (C) 2023 Masahiro Yamada <[email protected]>
> +
> +#include <dirent.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <fnmatch.h>
> +#include <getopt.h>
> +#include <stdarg.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +
> +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
> +
> +// struct pattern - represent an ignore pattern (a line in .gitignroe)
> +// @negate: negate the pattern (prefixing '!')
> +// @dir_only: only matches directories (trailing '/')
> +// @path_match: true if the glob pattern is a path instead of a file name
> +// @double_asterisk: true if the glob pattern contains double asterisks ('**')
> +// @glob: glob pattern
> +struct pattern {
> + bool negate;
> + bool dir_only;
> + bool path_match;
> + bool double_asterisk;
> + char glob[];
> +};
> +
> +struct pattern **patterns;

Is there a reason why patterns is not static? (sparse asked)

> +static int nr_patterns, alloced_patterns;
> +
> +// Remember the number of patterns at each directory level
> +static int *nr_patterns_at;
> +// Track the current/max directory level;
> +static int depth, max_depth;
> +static bool debug_on;
> +static FILE *out_fp;
> +static char *prefix = "";
> +static char *progname;
> +
> +static void __attribute__((noreturn)) perror_exit(const char *s)
> +{
> + perror(s);
> +
> + exit(EXIT_FAILURE);
> +}
> +
> +static void __attribute__((noreturn)) error_exit(const char *fmt, ...)
> +{
> + va_list args;
> +
> + fprintf(stderr, "%s: error: ", progname);
> +
> + va_start(args, fmt);
> + vfprintf(stderr, fmt, args);
> + va_end(args);
> +
> + exit(EXIT_FAILURE);
> +}
> +
> +static void debug(const char *fmt, ...)
> +{
> + va_list args;
> + int i;
> +
> + if (!debug_on)
> + return;
> +
> + fprintf(stderr, "[DEBUG]");
> +
> + for (i = 0; i < depth * 2; i++)
> + fputc(' ', stderr);
> +
> + va_start(args, fmt);
> + vfprintf(stderr, fmt, args);
> + va_end(args);
> +}
> +
> +static void *xrealloc(void *ptr, size_t size)
> +{
> + ptr = realloc(ptr, size);
> + if (!ptr)
> + perror_exit(progname);
> +
> + return ptr;
> +}
> +
> +static void *xmalloc(size_t size)
> +{
> + return xrealloc(NULL, size);
> +}
> +
> +static char *xstrdup(const char *s)
> +{
> + char *new = strdup(s);
> +
> + if (!new)
> + perror_exit(progname);
> +
> + return new;
> +}
> +
> +static bool simple_match(const char *string, const char *pattern)
> +{
> + return fnmatch(pattern, string, FNM_PATHNAME) == 0;
> +}
> +
> +// Handle double asterisks ("**") matching.
> +// FIXME:
> +// This function does not work if double asterisks apppear multiple times,
> +// like "foo/**/bar/**/baz".
> +static bool double_asterisk_match(const char *path, const char *pattern)
> +{
> + bool result = false;
> + int slash_diff = 0;
> + char *modified_pattern, *q;
> + const char *p;
> + size_t len;
> +
> + for (p = path; *p; p++)
> + if (*p == '/')
> + slash_diff++;
> +
> + for (p = pattern; *p; p++)
> + if (*p == '/')
> + slash_diff--;
> +
> + len = strlen(pattern) + 1;
> +
> + if (slash_diff > 0)
> + len += slash_diff * 2;
> + modified_pattern = xmalloc(len);
> +
> + q = modified_pattern;
> + for (p = pattern; *p; p++) {
> + if (!strncmp(p, "**/", 3)) {
> + // "**/" means zero of more sequences of '*/".
> + // "foo**/bar" matches "foobar", "foo*/bar",
> + // "foo*/*/bar", etc.
> + while (slash_diff-- > 0) {
> + *q++ = '*';
> + *q++ = '/';
> + }
> +
> + if (slash_diff == 0) {
> + *q++ = '*';
> + *q++ = '/';
> + }
> +
> + if (slash_diff < 0)
> + slash_diff++;
> +
> + p += 2;
> + } else if (!strcmp(p, "/**")) {
> + // A trailing "/**" matches everything inside.

In v2 you also checked against "(*p + 3) == '\0'". Is the explicit check
against end-of-string really not needed here? (pattern = "whatever/**/*.tmp"?)

> + while (slash_diff-- >= 0) {
> + *q++ = '/';
> + *q++ = '*';
> + }
> +
> + p += 2;
> + } else {
> + // Copy other patterns as-is.
> + // Other consecutive asterisks are considered regular
> + // asterisks. fnmatch() already handles them like that.
> + *q++ = *p;
> + }
> + }
> +
> + *q = '\0';
> +
> + result = simple_match(path, modified_pattern);
> +
> + free(modified_pattern);
> +
> + return result;
> +}
> +
> +// Return true if the given path is ignored by git.
> +static bool is_ignored(const char *path, const char *name, bool is_dir)
> +{
> + int i;
> +
> + // Search the patterns in the reverse order because the last matching
> + // pattern wins.
> + for (i = nr_patterns - 1; i >= 0; i--) {
> + struct pattern *p = patterns[i];
> +
> + if (!is_dir && p->dir_only)
> + continue;
> +
> + if (!p->path_match) {
> + // If the pattern has no slash at the beginning or
> + // middle, it matches against the basename. Most cases
> + // fall into this and work well with double asterisks.
> + if (!simple_match(name, p->glob))
> + continue;
> + } else if (!p->double_asterisk) {
> + // Unless the pattern has double asterisks, it is still
> + // simple but matches against the path instead.
> + if (!simple_match(path, p->glob))
> + continue;
> + } else {
> + // Double asterisks with a slash. Complex, but rare.
> + if (!double_asterisk_match(path, p->glob))
> + continue;
> + }
> +
> + debug("%s: matches %s%s%s\n", path, p->negate ? "!" : "",
> + p->glob, p->dir_only ? "/" : "");
> +
> + return !p->negate;
> + }
> +
> + debug("%s: no match\n", path);
> +
> + return false;
> +}
> +
> +// Return the length of the initial segment of the string that does not contain
> +// the unquoted sequence of the given character. Similar to strcspn() in libc.

I stumbled over that comment, and it took me quite some time to match it to
strcspn_trailer()'s behaviour. I expect it to strip all unescaped occurrences
of c at the end of str and return the resulting strlen. After reading it
several times, I can get a match. I _think_ the main confusion came from my
(quite imperfect) English:

"one two "
^^^ initial segment of string not containing unquoted c ??

^^^^^^^ substr that is considered by strcspn_trailer

But this is just about a comment and I'm sure I understand what is intended.
No action required.

> +static size_t strcspn_trailer(const char *str, char c)
> +{
> + bool quoted = false;
> + size_t len = strlen(str);
> + size_t spn = len;
> + const char *s;
> +
> + for (s = str; *s; s++) {
> + if (!quoted && *s == c) {
> + if (s - str < spn)
> + spn = s - str;
> + } else {
> + spn = len;

Is this really intended? Or 'spn = str - s + 1'?

> +
> + if (!quoted && *s == '\\')
> + quoted = true;
> + else
> + quoted = false;
> + }
> + }
> +
> + return spn;
> +}
> +
> +// Add an gitignore pattern.
> +static void add_pattern(char *s, const char *dirpath)
> +{
> + bool negate = false;
> + bool dir_only = false;
> + bool path_match = false;
> + bool double_asterisk = false;
> + char *e = s + strlen(s);
> + struct pattern *p;
> + size_t len;
> +
> + // Skip comments
> + if (*s == '#')
> + return;
> +
> + // Trailing spaces are ignored unless they are quoted with backslash.
> + e = s + strcspn_trailer(s, ' ');
> + *e = '\0';
> +
> + // The prefix '!' negates the pattern
> + if (*s == '!') {
> + s++;
> + negate = true;
> + }
> +
> + // If there is slash(es) that is not escaped at the end of the pattern,
> + // it matches only directories.

Are escaped slashes allowed in file names in git? I think use of original
strcspn() would have been enough.

> + len = strcspn_trailer(s, '/');
> + if (s + len < e) {
> + dir_only = true;
> + e = s + len;
> + *e = '\0';
> + }
> +
> + // Skip if the line gets empty
> + if (*s == '\0')
> + return;
> +
> + // Double asterisk is tricky. Mark it to handle it specially later.
> + if (strstr(s, "**/") || strstr(s, "/**"))
> + double_asterisk = true;
> +
> + // If there is a slash at the beginning or middle, the pattern
> + // is relative to the directory level of the .gitignore.
> + if (strchr(s, '/')) {
> + if (*s == '/')
> + s++;
> + path_match = true;
> + }
> +
> + len = e - s;
> +
> + // We need more room to store dirpath and '/'
> + if (path_match)
> + len += strlen(dirpath) + 1;
> +
> + p = xmalloc(sizeof(*p) + len + 1);
> + p->negate = negate;
> + p->dir_only = dir_only;
> + p->path_match = path_match;
> + p->double_asterisk = double_asterisk;
> + p->glob[0] = '\0';

(bike-shedding)
	*p = (struct pattern) {
		.negate = negate,
		.dir_only = dir_only,
		.path_match = path_match,
		.double_asterisk = double_asterisk,
	};


> +
> + if (path_match) {
> + strcat(p->glob, dirpath);
> + strcat(p->glob, "/");
> + }
> +
> + strcat(p->glob, s);
> +
> + debug("Add pattern: %s%s%s\n", negate ? "!" : "", p->glob,
> + dir_only ? "/" : "");
> +
> + if (nr_patterns >= alloced_patterns) {
> + alloced_patterns += 128;
> + patterns = xrealloc(patterns,
> + sizeof(*patterns) * alloced_patterns);
> + }
> +
> + patterns[nr_patterns++] = p;
> +}
> +
> +static void *load_gitignore(const char *dirpath)
> +{
> + struct stat st;
> + char path[PATH_MAX], *buf;
> + int fd, ret;
> +
> + ret = snprintf(path, sizeof(path), "%s/.gitignore", dirpath);
> + if (ret >= sizeof(path))
> + error_exit("%s: too long path was truncated\n", path);
> +
> + // If .gitignore does not exist in this directory, open() fails.
> + // It is ok, just skip it.
> + fd = open(path, O_RDONLY);
> + if (fd < 0)
> + return NULL;

Why don't you check against errno == 2 (ENOENT)? I assume no other
errno value is expected, but to me it feels a bit odd not to check it
and exit loudly if something (unlikely) like EMFILE causes open() to
fail.

> +
> + if (fstat(fd, &st) < 0)
> + perror_exit(path);
> +
> + buf = xmalloc(st.st_size + 1);
> + if (read(fd, buf, st.st_size) != st.st_size)
> + perror_exit(path);
> +
> + buf[st.st_size] = '\0';
> + if (close(fd))
> + perror_exit(path);
> +
> + return buf;
> +}
> +
> +// Parse '.gitignore' in the given directory.
> +static void parse_gitignore(const char *dirpath)
> +{
> + char *buf, *s, *next;
> +
> + buf = load_gitignore(dirpath);
> + if (!buf)
> + return;
> +
> + debug("Parse %s/.gitignore\n", dirpath);
> +
> + for (s = buf; *s; s = next) {
> + next = s;
> +
> + while (*next != '\0' && *next != '\n')

Not relevant for in-tree use: git does not complain about '\0' in a .gitignore
but also handles the remaining part of the file.

> + next++;
> +
> + if (*next != '\0') {
> + *next = '\0';
> + next++;
> + }
> +
> + add_pattern(s, dirpath);
> + }
> +
> + free(buf);
> +}
> +
> +// Save the current number of patterns and increment the depth
> +static void increment_depth(void)
> +{
> + if (depth >= max_depth) {
> + max_depth += 1;
> + nr_patterns_at = xrealloc(nr_patterns_at,
> + sizeof(*nr_patterns_at) * max_depth);
> + }
> +
> + nr_patterns_at[depth] = nr_patterns;
> + depth++;
> +}
> +
> +// Decrement the depth, and free up the patterns of this directory level.
> +static void decrement_depth(void)
> +{
> + depth--;
> + if (depth < 0)
> + error_exit("BUG\n");
> +
> + while (nr_patterns > nr_patterns_at[depth])
> + free(patterns[--nr_patterns]);
> +}
> +
> +// If we find an ignored path, print it.
> +static void print_path(const char *path)
> +{
> + // The path always start with "./". If not, it is a bug.
> + if (strlen(path) < 2)
> + error_exit("BUG\n");
> +
> + // Replace the root directory with the prefix you like.
> + // This is useful for the tar command.
> + fprintf(out_fp, "%s%s\n", prefix, path + 2);
> +}
> +
> +// Traverse the entire directory tree, parsing .gitignore files.
> +// Print file paths that are not tracked by git.
> +//
> +// Return true if all files under the directory are ignored, false otherwise.
> +static bool traverse_directory(const char *dirpath)
> +{
> + bool all_ignored = true;
> + DIR *dirp;
> +
> + debug("Enter[%d]: %s\n", depth, dirpath);
> + increment_depth();
> +
> + // We do not know whether .gitignore exists in this directory or not.
> + // Anyway, try to open it.
> + parse_gitignore(dirpath);
> +
> + dirp = opendir(dirpath);
> + if (!dirp)
> + perror_exit(dirpath);
> +
> + while (1) {
> + char path[PATH_MAX];
> + struct dirent *d;
> + int ret;
> +
> + errno = 0;
> + d = readdir(dirp);
> + if (!d) {
> + // readdir() returns NULL on the end of the directory
> + // steam, and also on an error. To distinguish them,
> + // errno should be checked.
> + if (errno)
> + perror_exit(dirpath);
> + break;
> + }
> +
> + if (!strcmp(d->d_name, "..") || !strcmp(d->d_name, "."))
> + continue;
> +
> + ret = snprintf(path, sizeof(path), "%s/%s", dirpath, d->d_name);
> + if (ret >= sizeof(path))
> + error_exit("%s: too long path was truncated\n", path);
> +
> + if (is_ignored(path, d->d_name, d->d_type & DT_DIR)) {
> + debug("Ignore: %s\n", path);
> + print_path(path);
> + } else {
> + if ((d->d_type & DT_DIR) && !(d->d_type & DT_LNK)) {
> + if (!traverse_directory(path))
> + all_ignored = false;
> + } else {
> + all_ignored = false;
> + }
> + }
> + }
> +
> + if (closedir(dirp))
> + perror_exit(dirpath);
> +
> + // If all the files under this directory are ignored, let's ignore this
> + // directory as well in order to avoid empty directories in the tarball.
> + if (all_ignored) {
> + debug("Ignore: %s (due to all files inside ignored)\n", dirpath);
> + print_path(dirpath);
> + }
> +
> + decrement_depth();
> + debug("Leave[%d]: %s\n", depth, dirpath);
> +
> + return all_ignored;
> +}
> +
> +// Register hard-coded ignore patterns.
> +static void add_fixed_patterns(void)
> +{
> + const char * const fixed_patterns[] = {
> + ".git/",
> + };
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(fixed_patterns); i++) {
> + char *s = xstrdup(fixed_patterns[i]);
> +
> + add_pattern(s, ".");
> + free(s);
> + }
> +}
> +
> +static void usage(void)
> +{
> + fprintf(stderr,
> + "usage: %s [options]\n"
> + "\n"
> + "Print files that are not ignored by git\n"
> + "\n"
> + "options:\n"
> + " -d, --debug print debug messages to stderr\n"
> + " -e, --extra-pattern PATTERN Add extra ignore patterns. This behaves like it is prepended to the top .gitignore\n"
> + " -h, --help show this help message and exit\n"
> + " -o, --output FILE output to a file (default: '-', i.e. stdout)\n"
> + " -p, --prefix PREFIX prefix added to each path (default: empty string)\n"
> + " -r, --rootdir DIR root of the source tree (default: current working directory):\n",
> + progname);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + const char *output = "-";
> + const char *rootdir = ".";
> +
> + progname = strrchr(argv[0], '/');
> + if (progname)
> + progname++;
> + else
> + progname = argv[0];
> +
> + while (1) {
> + static struct option long_options[] = {
> + {"debug", no_argument, NULL, 'd'},
> + {"extra-pattern", required_argument, NULL, 'e'},
> + {"help", no_argument, NULL, 'h'},
> + {"output", required_argument, NULL, 'o'},
> + {"prefix", required_argument, NULL, 'p'},
> + {"rootdir", required_argument, NULL, 'r'},
> + {},
> + };
> +
> + int c = getopt_long(argc, argv, "de:ho:p:r:", long_options, NULL);
> +
> + if (c == -1)
> + break;
> +
> + switch (c) {
> + case 'd':
> + debug_on = true;
> + break;
> + case 'e':
> + add_pattern(optarg, ".");
> + break;
> + case 'h':
> + usage();
> + exit(0);
> + case 'o':
> + output = optarg;
> + break;
> + case 'p':
> + prefix = optarg;
> + break;
> + case 'r':
> + rootdir = optarg;
> + break;
> + case '?':
> + usage();
> + /* fallthrough */
> + default:
> + exit(EXIT_FAILURE);
> + }
> + }
> +
> + if (chdir(rootdir))
> + perror_exit(rootdir);
> +
> + if (strcmp(output, "-")) {
> + out_fp = fopen(output, "w");
> + if (!out_fp)
> + perror_exit(output);
> + } else {
> + out_fp = stdout;
> + }
> +
> + add_fixed_patterns();
> +
> + traverse_directory(".");
> +
> + if (depth != 0)
> + error_exit("BUG\n");
> +
> + while (nr_patterns > 0)
> + free(patterns[--nr_patterns]);
> + free(patterns);
> + free(nr_patterns_at);
> +
> + fflush(out_fp);
> + if (ferror(out_fp))
> + error_exit("not all data was written to the output\n");
> +
> + if (fclose(out_fp))
> + perror_exit(output);
> +
> + return 0;
> +}
> --
> 2.34.1

I like the idea of gen-exclude.

Testing with some strange patterns seems to reveal some missing points. It
should not be problematic, as nobody wants to write such .gitignore patterns,
but for completeness:

$ mkdir -p test/foo/bar
$ touch test/foo/bar/baz.tmp
$ cat <<-eof >test/.gitignore
**/*.tmp
**/baz.tmp
foo/**/*.tmp
**/bar/baz.tmp
/**/*.tmp
eof
$ cd test
$ ../scripts/gen-exclude --debug
[DEBUG]Add pattern: .git/
[DEBUG]Enter[0]: .
[DEBUG] ./test: no match
[DEBUG] Enter[1]: ./test
[DEBUG] Parse ./test/.gitignore
[DEBUG] Add pattern: ./test/**/*.tmp
[DEBUG] Add pattern: ./test/**/baz.tmp
[DEBUG] Add pattern: ./test/foo/**/*.tmp
[DEBUG] Add pattern: ./test/**/bar/baz.tmp
[DEBUG] Add pattern: ./test/**/*.tmp
[DEBUG] ./test/.gitignore: no match
[DEBUG] ./test/foo: no match
[DEBUG] Enter[2]: ./test/foo
[DEBUG] ./test/foo/bar: no match
[DEBUG] Enter[3]: ./test/foo/bar
[DEBUG] ./test/foo/bar/baz.tmp: no match
[DEBUG] Leave[3]: ./test/foo/bar
[DEBUG] Leave[2]: ./test/foo
[DEBUG] Leave[1]: ./test
[DEBUG]Leave[0]: .

Thus, no match. Everything else I tested did what I expected.

Reviewed-by: Nicolas Schier <[email protected]>
Tested-by: Nicolas Schier <[email protected]>

Kind regards,
Nicolas



2023-02-06 03:30:27

by Masahiro Yamada

Subject: Re: [PATCH v4 1/6] kbuild: add a tool to generate a list of files ignored by git

On Thu, Feb 2, 2023 at 8:08 PM Nicolas Schier <[email protected]> wrote:
>
> On Thu, Feb 02, 2023 at 12:37:11PM +0900 Masahiro Yamada wrote:
> > In short, the motivation of this commit is to build a source package
> > without cleaning the source tree.
> >
> > The deb-pkg and (src)rpm-pkg targets first run 'make clean' before
> > creating a source tarball. Otherwise build artifacts such as *.o,
> > *.a, etc. would be included in the tarball. Yet, the tarball ends up
> > containing several garbage files since 'make clean' does not clean
> > everything.
> >
> > Cleaning the tree every time is annoying since it makes the incremental
> > build impossible. It is desirable to create a source tarball without
> > cleaning the tree.
> >
> > In fact, there are some ways to archive this.
> >
> > The easiest way is 'git archive'. Actually, 'make perf-tar*-src-pkg'
> > does this way, but I do not like it because it works only when the source
> > tree is managed by git, and all files you want in the tarball must be
> > committed in advance.
> >
> > I want to make it work without relying on git. We can do this.
> >
> > Files that are not tracked by git are generated files. We can list them
> > out by parsing the .gitignore files. Of course, .gitignore does not cover
> > all the cases, but it works well enough.
> >
> > tar(1) claims to support it:
> >
> > --exclude-vcs-ignores
> >
> > Exclude files that match patterns read from VCS-specific ignore files.
> > Supported files are: .cvsignore, .gitignore, .bzrignore, and .hgignore.
> >
> > The best scenario would be to use 'tar --exclude-vcs-ignores', but this
> > option does not work. --exclude-vcs-ignore does not understand any of
> > the negation (!), preceding slash, following slash, etc.. So, this option
> > is just useless.
> >
> > Hence, I wrote this gitignore parser. The previous version [1], written
> > in Python, was so slow. This version is implemented in C, so it works
> > much faster.
> >
> > This tool traverses the source tree, parsing the .gitignore files. It
> > prints the file paths that are not tracked by git. The output can be
> > used for tar's --exclude-from= option.
> >
> > [How to test this tool]
> >
> > $ git clean -dfx
> > $ make -s -j$(nproc) defconfig all # or allmodconifg or whatever
> > $ git archive -o ../linux1.tar --prefix=./ HEAD
> > $ tar tf ../linux1.tar | LANG=C sort > ../file-list1 # files emitted by 'git archive'
> > $ make scripts_exclude
> > HOSTCC scripts/gen-exclude
> > $ scripts/gen-exclude --prefix=./ -o ../exclude-list
> > $ tar cf ../linux2.tar --exclude-from=../exclude-list .
> > $ tar tf ../linux2.tar | LANG=C sort > ../file-list2 # files emitted by 'tar'
> > $ diff ../file-list1 ../file-list2 | grep -E '^(<|>)'
> > < ./Documentation/devicetree/bindings/.yamllint
> > < ./drivers/clk/.kunitconfig
> > < ./drivers/gpu/drm/tests/.kunitconfig
> > < ./drivers/gpu/drm/vc4/tests/.kunitconfig
> > < ./drivers/hid/.kunitconfig
> > < ./fs/ext4/.kunitconfig
> > < ./fs/fat/.kunitconfig
> > < ./kernel/kcsan/.kunitconfig
> > < ./lib/kunit/.kunitconfig
> > < ./mm/kfence/.kunitconfig
> > < ./net/sunrpc/.kunitconfig
> > < ./tools/testing/selftests/arm64/tags/
> > < ./tools/testing/selftests/arm64/tags/.gitignore
> > < ./tools/testing/selftests/arm64/tags/Makefile
> > < ./tools/testing/selftests/arm64/tags/run_tags_test.sh
> > < ./tools/testing/selftests/arm64/tags/tags_test.c
> > < ./tools/testing/selftests/kvm/.gitignore
> > < ./tools/testing/selftests/kvm/Makefile
> > < ./tools/testing/selftests/kvm/config
> > < ./tools/testing/selftests/kvm/settings
> >
> > The source tarball contains most of files that are tracked by git. You
> > see some diffs, but it is just because some .gitignore files are wrong.
> >
> > $ git ls-files -i -c --exclude-per-directory=.gitignore
> > Documentation/devicetree/bindings/.yamllint
> > drivers/clk/.kunitconfig
> > drivers/gpu/drm/tests/.kunitconfig
> > drivers/hid/.kunitconfig
> > fs/ext4/.kunitconfig
> > fs/fat/.kunitconfig
> > kernel/kcsan/.kunitconfig
> > lib/kunit/.kunitconfig
> > mm/kfence/.kunitconfig
> > tools/testing/selftests/arm64/tags/.gitignore
> > tools/testing/selftests/arm64/tags/Makefile
> > tools/testing/selftests/arm64/tags/run_tags_test.sh
> > tools/testing/selftests/arm64/tags/tags_test.c
> > tools/testing/selftests/kvm/.gitignore
> > tools/testing/selftests/kvm/Makefile
> > tools/testing/selftests/kvm/config
> > tools/testing/selftests/kvm/settings
> >
> > [1]: https://lore.kernel.org/all/[email protected]/
> >
> > Signed-off-by: Masahiro Yamada <[email protected]>
> > ---
> >
> > (no changes since v3)
> >
> > Changes in v3:
> > - Various code refactoring: remove struct gitignore, remove next: label etc.
> > - Support --extra-pattern option
> >
> > Changes in v2:
> > - Reimplement in C
> >
> > Makefile | 4 +
> > scripts/.gitignore | 1 +
> > scripts/Makefile | 2 +-
> > scripts/gen-exclude.c | 623 ++++++++++++++++++++++++++++++++++++++++++
> > 4 files changed, 629 insertions(+), 1 deletion(-)
> > create mode 100644 scripts/gen-exclude.c
> >
> > diff --git a/Makefile b/Makefile
> > index 2faf872b6808..35b294cc6f32 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -1652,6 +1652,10 @@ distclean: mrproper
> > %pkg: include/config/kernel.release FORCE
> > $(Q)$(MAKE) -f $(srctree)/scripts/Makefile.package $@
> >
> > +PHONY += scripts_exclude
> > +scripts_exclude: scripts_basic
> > + $(Q)$(MAKE) $(build)=scripts scripts/gen-exclude
> > +
> > # Brief documentation of the typical targets used
> > # ---------------------------------------------------------------------------
> >
> > diff --git a/scripts/.gitignore b/scripts/.gitignore
> > index 6e9ce6720a05..7f433bc1461c 100644
> > --- a/scripts/.gitignore
> > +++ b/scripts/.gitignore
> > @@ -1,5 +1,6 @@
> > # SPDX-License-Identifier: GPL-2.0-only
> > /asn1_compiler
> > +/gen-exclude
> > /generate_rust_target
> > /insert-sys-cert
> > /kallsyms
> > diff --git a/scripts/Makefile b/scripts/Makefile
> > index 32b6ba722728..5dcd7f57607f 100644
> > --- a/scripts/Makefile
> > +++ b/scripts/Makefile
> > @@ -38,7 +38,7 @@ HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED
> > endif
> >
> > # The following programs are only built on demand
> > -hostprogs += unifdef
> > +hostprogs += gen-exclude unifdef
> >
> > # The module linker script is preprocessed on demand
> > targets += module.lds
> > diff --git a/scripts/gen-exclude.c b/scripts/gen-exclude.c
> > new file mode 100644
> > index 000000000000..5c4ecd902290
> > --- /dev/null
> > +++ b/scripts/gen-exclude.c
> > @@ -0,0 +1,623 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +//
> > +// Traverse the source tree, parsing all .gitignore files, and print file paths
> > +// that are not tracked by git.
> > +// The output is suitable to the --exclude-from option of tar.
> > +// This is useful until the --exclude-vcs-ignores option gets working correctly.
> > +//
> > +// Copyright (C) 2023 Masahiro Yamada <[email protected]>
> > +
> > +#include <dirent.h>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <fnmatch.h>
> > +#include <getopt.h>
> > +#include <stdarg.h>
> > +#include <stdbool.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/stat.h>
> > +#include <sys/types.h>
> > +#include <unistd.h>
> > +
> > +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
> > +
> > +// struct pattern - represent an ignore pattern (a line in .gitignroe)
> > +// @negate: negate the pattern (prefixing '!')
> > +// @dir_only: only matches directories (trailing '/')
> > +// @path_match: true if the glob pattern is a path instead of a file name
> > +// @double_asterisk: true if the glob pattern contains double asterisks ('**')
> > +// @glob: glob pattern
> > +struct pattern {
> > + bool negate;
> > + bool dir_only;
> > + bool path_match;
> > + bool double_asterisk;
> > + char glob[];
> > +};
> > +
> > +struct pattern **patterns;
>
> Is there a reason why patterns is not static? (sparse asked)


No reason - I just forgot to run sparse.
Thanks for catching it.






> > + q = modified_pattern;
> > + for (p = pattern; *p; p++) {
> > + if (!strncmp(p, "**/", 3)) {
> > + // "**/" means zero of more sequences of '*/".
> > + // "foo**/bar" matches "foobar", "foo*/bar",
> > + // "foo*/*/bar", etc.
> > + while (slash_diff-- > 0) {
> > + *q++ = '*';
> > + *q++ = '/';
> > + }
> > +
> > + if (slash_diff == 0) {
> > + *q++ = '*';
> > + *q++ = '/';
> > + }
> > +
> > + if (slash_diff < 0)
> > + slash_diff++;
> > +
> > + p += 2;
> > + } else if (!strcmp(p, "/**")) {
> > + // A trailing "/**" matches everything inside.
>
> In v2 you also checked against "(*p + 3) == '\0'". Is the explicit check
> against end-of-string really not needed here? (pattern = "whatever/**/*.tmp"?)


This detects a trailing "/**".


See this documentation:
https://github.com/git/git/blob/v2.39.1/Documentation/gitignore.txt#L123



"whatever/**/*.tmp" is detected by the previous
if (!strncmp(p, "**/", 3))


strcmp(p, "/**") only matches the pattern at the end,
while strncmp(p, "**/", 3) matches the pattern anywhere.


Anyway, I will throw away this code in v5.
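
To illustrate the difference between the two checks discussed above, here is
a minimal sketch. The helper names are hypothetical and are not part of
gen-exclude.c; they only isolate the strncmp()/strcmp() behaviour.

```c
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: true if "**/" occurs at any position in the pattern.
// strncmp() compares only the first 3 characters at each position, so
// whatever follows them does not matter.
static bool has_inner_double_asterisk(const char *pattern)
{
	const char *p;

	for (p = pattern; *p; p++)
		if (!strncmp(p, "**/", 3))
			return true;

	return false;
}

// Hypothetical helper: true only if the pattern ends with "/**".
// strcmp() compares up to and including the terminating '\0', so it can
// succeed only when "/**" is the final three characters.
static bool has_trailing_double_asterisk(const char *pattern)
{
	const char *p;

	for (p = pattern; *p; p++)
		if (!strcmp(p, "/**"))
			return true;

	return false;
}
```

With "whatever/**/*.tmp" only the first helper returns true; with
"whatever/**" only the second one does.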






> > +}
> > +
> > +// Return the length of the initial segment of the string that does not contain
> > +// the unquoted sequence of the given character. Similar to strcspn() in libc.
>
> I stumbled over that comment, and it took me quite some time to match it to
> strcspn_trailer()'s behaviour. I expect it to strip all unescaped occurrences
> of c at the end of str and return the resulting strlen. After reading it
> several times, I can get a match. I _think_ the main confusion came from my
> (quite imperfect) English:
>
> "one two "
> ^^^ initial segment of string not containing unquoted c ??
>
> ^^^^^^^ substr that is considered by strcspn_trailer
>
> But this is just about a comment and I'm sure I understand what is intended.
> No action required.


I am not good at English.

Indeed, this comment is really confusing.

Something like the following would have been clearer.

// This function strips the unescaped sequence of the given char from the end
// of the string, and returns the length of the resulting substring.





>
> > +static size_t strcspn_trailer(const char *str, char c)
> > +{
> > + bool quoted = false;
> > + size_t len = strlen(str);
> > + size_t spn = len;
> > + const char *s;
> > +
> > + for (s = str; *s; s++) {
> > + if (!quoted && *s == c) {
> > + if (s - str < spn)
> > + spn = s - str;
> > + } else {
> > + spn = len;
>
> Is this really intended? Or 'spn = str - s + 1'?


I think you meant 'spn = s - str + 1'.

My code works, but I think yours is cleaner
because it does not require 'len'.
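
For illustration, the variant without 'len' could look roughly like the
sketch below. The function name strip_trailer_len() is hypothetical, and this
is not necessarily what v5 will contain.

```c
#include <stdbool.h>
#include <stddef.h>

// Strip the unescaped run of 'c' from the end of 'str' and return the
// length of what remains. A character preceded by an unescaped backslash
// counts as escaped and is never stripped.
static size_t strip_trailer_len(const char *str, char c)
{
	bool quoted = false;
	size_t spn = 0;
	const char *s;

	for (s = str; *s; s++) {
		// An unescaped 'c' may belong to the trailing run, so do
		// not advance spn past it yet.
		if (!quoted && *s == c)
			continue;

		// Any other character ends the run: keep everything up to
		// and including this character.
		spn = s - str + 1;

		quoted = !quoted && *s == '\\';
	}

	return spn;
}
```

Because spn only moves forward when a character other than an unescaped 'c'
is seen, the initial value 0 also covers empty strings and strings that
consist only of 'c'.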




BTW, I read the Git source code.

Git's implementation is here:
https://github.com/git/git/blob/v2.39.1/dir.c#L934






>
> > +
> > + if (!quoted && *s == '\\')
> > + quoted = true;
> > + else
> > + quoted = false;
> > + }
> > + }
> > +
> > + return spn;
> > +}
> > +
> > +// Add an gitignore pattern.
> > +static void add_pattern(char *s, const char *dirpath)
> > +{
> > + bool negate = false;
> > + bool dir_only = false;
> > + bool path_match = false;
> > + bool double_asterisk = false;
> > + char *e = s + strlen(s);
> > + struct pattern *p;
> > + size_t len;
> > +
> > + // Skip comments
> > + if (*s == '#')
> > + return;
> > +
> > + // Trailing spaces are ignored unless they are quoted with backslash.
> > + e = s + strcspn_trailer(s, ' ');
> > + *e = '\0';
> > +
> > + // The prefix '!' negates the pattern
> > + if (*s == '!') {
> > + s++;
> > + negate = true;
> > + }
> > +
> > + // If there is slash(es) that is not escaped at the end of the pattern,
> > + // it matches only directories.
>
> Are escaped slashes allowed in file names in git? I think use of original
> strcspn() would have been enough.


Perhaps, I had some reason to implement it like this, but
I cannot recall it.



Anyway, Git's implementation is very simple:

https://github.com/git/git/blob/v2.39.1/dir.c#L634

I will follow that.
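
A minimal sketch of such a simpler check, assuming it is enough to look only
at the first and the last character of the pattern. The function name
parse_flags() and its signature are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Detect the '!' prefix and a single trailing '/'.
// On return, *pattern points past the '!' (if any), and the return value
// is the pattern length without the trailing '/'.
static size_t parse_flags(const char **pattern, bool *negate, bool *dir_only)
{
	const char *p = *pattern;
	size_t len;

	*negate = false;
	*dir_only = false;

	if (*p == '!') {
		*negate = true;
		p++;
	}

	len = strlen(p);
	if (len && p[len - 1] == '/') {
		// A trailing slash means "directories only".
		*dir_only = true;
		len--;
	}

	*pattern = p;

	return len;
}
```

For "!build/", this reports negate and dir_only as true and leaves the
five characters "build" as the glob.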





>
> > +
> > + if (path_match) {
> > + strcat(p->glob, dirpath);
> > + strcat(p->glob, "/");
> > + }
> > +
> > + strcat(p->glob, s);
> > +
> > + debug("Add pattern: %s%s%s\n", negate ? "!" : "", p->glob,
> > + dir_only ? "/" : "");
> > +
> > + if (nr_patterns >= alloced_patterns) {
> > + alloced_patterns += 128;
> > + patterns = xrealloc(patterns,
> > + sizeof(*patterns) * alloced_patterns);
> > + }
> > +
> > + patterns[nr_patterns++] = p;
> > +}
> > +
> > +static void *load_gitignore(const char *dirpath)
> > +{
> > + struct stat st;
> > + char path[PATH_MAX], *buf;
> > + int fd, ret;
> > +
> > + ret = snprintf(path, sizeof(path), "%s/.gitignore", dirpath);
> > + if (ret >= sizeof(path))
> > + error_exit("%s: too long path was truncated\n", path);
> > +
> > + // If .gitignore does not exist in this directory, open() fails.
> > + // It is ok, just skip it.
> > + fd = open(path, O_RDONLY);
> > + if (fd < 0)
> > + return NULL;
>
> Why don't you check against errno == 2 (ENOENT)? I assume no other
> errno value is expected, but to me it feels a bit odd not to check it
> and exit loudly if something (unlikely) like EMFILE causes open() to
> fail.


Good suggestion.

I will fix it.

Git also checks this:

https://github.com/git/git/blob/v2.39.1/wrapper.c#L399
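
The suggested check could be sketched as follows. open_optional() is a
hypothetical helper name, and tolerating only ENOENT is an assumption made
for this example; the actual fix may tolerate other errno values too.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

// Open a file that is allowed to be missing.
// Return the file descriptor, or -1 if the file does not exist.
// Any other open() failure (EMFILE, EACCES, ...) is fatal.
static int open_optional(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0 && errno != ENOENT) {
		perror(path);
		exit(EXIT_FAILURE);
	}

	return fd;
}
```

The caller can keep treating a negative return value as "no .gitignore in
this directory", while unexpected errors no longer pass silently.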


>
> > +
> > + if (fstat(fd, &st) < 0)
> > + perror_exit(path);
> > +
> > + buf = xmalloc(st.st_size + 1);
> > + if (read(fd, buf, st.st_size) != st.st_size)
> > + perror_exit(path);
> > +
> > + buf[st.st_size] = '\0';
> > + if (close(fd))
> > + perror_exit(path);
> > +
> > + return buf;
> > +}
> > +
> > +// Parse '.gitignore' in the given directory.
> > +static void parse_gitignore(const char *dirpath)
> > +{
> > + char *buf, *s, *next;
> > +
> > + buf = load_gitignore(dirpath);
> > + if (!buf)
> > + return;
> > +
> > + debug("Parse %s/.gitignore\n", dirpath);
> > +
> > + for (s = buf; *s; s = next) {
> > + next = s;
> > +
> > + while (*next != '\0' && *next != '\n')
>
> Not relevant for in-tree use: git does not complain about '\0' in a .gitignore
> but also handles the remaining part of the file.
>


You are right.

I confirmed it from the source code:
https://github.com/git/git/blob/v2.39.1/dir.c#L1141


I will follow that.
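
One way to keep going after a '\0' is to walk the buffer by its size instead
of stopping at the first NUL. The following is only a sketch; for_each_line()
and count_line() are hypothetical names, and the caller is assumed to have
allocated size + 1 bytes with buf[size] set to '\0', as load_gitignore() does.

```c
#include <stddef.h>

// Callback invoked once per line (hypothetical type for this sketch)
typedef void (*line_fn)(char *line, void *ctx);

// Split buf[0..size) at '\n' and pass each line to fn.
// Embedded '\0' bytes do not stop the loop. They only truncate the line
// they appear in when the callback reads it as a C string.
static void for_each_line(char *buf, size_t size, line_fn fn, void *ctx)
{
	char *line = buf;
	size_t i;

	for (i = 0; i < size; i++) {
		if (buf[i] != '\n')
			continue;

		buf[i] = '\0';
		fn(line, ctx);
		line = buf + i + 1;
	}

	// Last line without a trailing newline; relies on buf[size] == '\0'
	if (line < buf + size)
		fn(line, ctx);
}

// Example callback: count the lines seen
static void count_line(char *line, void *ctx)
{
	(void)line;
	(*(int *)ctx)++;
}
```

For a buffer holding "a", NUL, "b", newline, "c", newline, "d", the callback
runs three times, and the first line is seen as just "a".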





>
> Testing with some strange patterns seems to reveal some missing points. It
> should not be problematic, as nobody wants to write such .gitignore patterns,
> but for completeness:
>
> $ mkdir -p test/foo/bar
> $ touch test/foo/bar/baz.tmp
> $ cat <<-eof >test/.gitignore
> **/*.tmp
> **/baz.tmp
> foo/**/*.tmp
> **/bar/baz.tmp
> /**/*.tmp
> eof
> $ cd test
> $ ../scripts/gen-exclude --debug
> [DEBUG]Add pattern: .git/
> [DEBUG]Enter[0]: .
> [DEBUG] ./test: no match
> [DEBUG] Enter[1]: ./test
> [DEBUG] Parse ./test/.gitignore
> [DEBUG] Add pattern: ./test/**/*.tmp
> [DEBUG] Add pattern: ./test/**/baz.tmp
> [DEBUG] Add pattern: ./test/foo/**/*.tmp
> [DEBUG] Add pattern: ./test/**/bar/baz.tmp
> [DEBUG] Add pattern: ./test/**/*.tmp
> [DEBUG] ./test/.gitignore: no match
> [DEBUG] ./test/foo: no match
> [DEBUG] Enter[2]: ./test/foo
> [DEBUG] ./test/foo/bar: no match
> [DEBUG] Enter[3]: ./test/foo/bar
> [DEBUG] ./test/foo/bar/baz.tmp: no match
> [DEBUG] Leave[3]: ./test/foo/bar
> [DEBUG] Leave[2]: ./test/foo
> [DEBUG] Leave[1]: ./test
> [DEBUG]Leave[0]: .
>
> Thus, no match. Everything else I tested did what I expected.


You are right.

test/foo/bar/baz.tmp must be ignored.


I read the code because I was curious how Git does this.

Git has its own fnmatch()-like matcher, wildmatch(), which also supports double asterisks.
https://github.com/git/git/blob/v2.39.1/wildmatch.c#L55


I cannot write such clever code, so I will
import the matching code in v5.
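
Just to illustrate what the double asterisk has to do, here is a much
simplified sketch. It is NOT Git's wildmatch() and not what v5 will
contain: the function names are hypothetical, there is no bracket or
escape handling, and patterns without a slash are not matched against
the basename as .gitignore requires. It only handles '*', '?', and a
'**' that fills a whole path component.

```c
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: match pattern position p (inside the pattern
// starting at pat0) against the path s.
static bool match_from(const char *pat0, const char *p, const char *s)
{
	if (*p == '\0')
		return *s == '\0';

	// "**" is special only when it fills a whole path component.
	if (p[0] == '*' && p[1] == '*' && (p == pat0 || p[-1] == '/')) {
		if (p[2] == '\0')
			return true;	// trailing "**" matches the rest

		if (p[2] == '/') {
			const char *t = s;

			// "**/" matches zero directories ...
			if (match_from(pat0, p + 3, t))
				return true;

			// ... or any number of leading directories.
			while ((t = strchr(t, '/')) != NULL) {
				t++;
				if (match_from(pat0, p + 3, t))
					return true;
			}
			return false;
		}
	}

	// A single '*' matches any run of characters except '/'.
	if (*p == '*') {
		for (const char *t = s; ; t++) {
			if (match_from(pat0, p + 1, t))
				return true;
			if (*t == '\0' || *t == '/')
				return false;
		}
	}

	// '?' matches exactly one character except '/'.
	if (*p == '?')
		return *s != '\0' && *s != '/' &&
		       match_from(pat0, p + 1, s + 1);

	return *p == *s && match_from(pat0, p + 1, s + 1);
}

// Hypothetical entry point: 'path' is relative to the directory that
// contains the .gitignore. A leading '/' only anchors the pattern to
// that directory, so it is dropped here.
static bool gitignore_like_match(const char *pattern, const char *path)
{
	if (*pattern == '/')
		pattern++;

	return match_from(pattern, pattern, path);
}
```

With this sketch, all five patterns from your test/.gitignore match
"foo/bar/baz.tmp", while "foo/*.tmp" does not, because a single
asterisk never crosses a slash.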


v5 is almost ready for submission.
The code has grown to about 1000 lines, but I can live with that.



In my local test, v5 worked correctly.


[DEBUG] Add pattern: .git/
[DEBUG] Enter[0]: .
[DEBUG] ./test: no match
[DEBUG] Enter[1]: ./test
[DEBUG] Parse ./test/.gitignore
[DEBUG] Add pattern: **/*.tmp
[DEBUG] Add pattern: **/baz.tmp
[DEBUG] Add pattern: foo/**/*.tmp
[DEBUG] Add pattern: **/bar/baz.tmp
[DEBUG] Add pattern: /**/*.tmp
[DEBUG] ./test/foo: no match
[DEBUG] Enter[2]: ./test/foo
[DEBUG] ./test/foo/bar: no match
[DEBUG] Enter[3]: ./test/foo/bar
[DEBUG] ./test/foo/bar/baz.tmp: matches /**/*.tmp (./test/.gitignore)
[DEBUG] Ignore: ./test/foo/bar/baz.tmp
test/foo/bar/baz.tmp
[DEBUG] Ignore: ./test/foo/bar (due to all files inside ignored)
test/foo/bar
[DEBUG] Leave[3]: ./test/foo/bar
[DEBUG] Ignore: ./test/foo (due to all files inside ignored)
test/foo
[DEBUG] Leave[2]: ./test/foo
[DEBUG] ./test/.gitignore: no match
[DEBUG] Leave[1]: ./test
[DEBUG] Leave[0]: .

>
> Reviewed-by: Nicolas Schier <[email protected]>
> Tested-by: Nicolas Schier <[email protected]>


Thanks for your close review, as always.



>
> Kind regards,
> Nicolas

--
Best Regards
Masahiro Yamada