LinuxLists.cc - email as a bona fide git transport

2019-10-16 14:21:15

Subject: email as a bona fide git transport

(cross-posted to git, LKML, and the kernel workflows mailing lists.)

Hi all,

I've been following Konstantin Ryabitsev's quest for better development
and communication tools for the kernel [1][2][3], and I would like to
propose a relatively straightforward idea which I think could bring a
lot to the table.

Step 1:

* git send-email needs to include parent SHA1s and generally all the
information needed to perfectly recreate the commit when applied so
that all the SHA1s remain the same

* git am (or an alternative command) needs to recreate the commit
perfectly when applied, including applying it to the correct parent

Having these two will allow a perfect mapping between email and git;
essentially email just becomes a transport for git. There are a lot of
advantages to this, particularly that you have a stable way to refer to
a patch or commit (despite it appearing on a mailing list), and there
is no need for "changeset IDs" or whatever, since you can just use the
git SHA1 which is unique, unambiguous, and stable.
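
A minimal sketch of why this can work, assuming a throwaway repository and made-up names, addresses and dates (none of this comes from the attached patches): a commit ID is determined entirely by the tree, the parents, the author and committer fields, and the raw message, so a receiver who is given all of them can rebuild the identical commit with 'git commit-tree'.

```shell
# Throwaway repository; all names, addresses and dates are hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Pin the identity and date fields that are hashed into every commit.
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='C O Mitter' GIT_COMMITTER_EMAIL='committer@example.com'
export GIT_AUTHOR_DATE='1571221375 +0200' GIT_COMMITTER_DATE='1571221375 +0200'

echo hello > file
git add file
git commit -q -m 'base'
echo world >> file
git commit -q -a -m 'change'
orig=$(git rev-parse HEAD)

# Everything a receiver needs besides the diff: tree, parent, raw message.
tree=$(git rev-parse 'HEAD^{tree}')
parent=$(git rev-parse 'HEAD^')
message=$(git cat-file commit HEAD | sed '1,/^$/d')

# The same inputs give the same commit ID...
same=$(printf '%s\n' "$message" | git commit-tree -p "$parent" "$tree")
# ...while a different parent gives a different one.
other=$(printf '%s\n' "$message" | git commit-tree -p "$orig" "$tree")

echo "original:   $orig"
echo "recreated:  $same"
echo "reparented: $other"
```

The first two IDs should match and the third should differ, which is why the email has to carry the parent SHA1 and the committer fields in addition to what 'git format-patch' sends today.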

As a rough proof of concept I've attached 3 git patches which implement
this. There are issues to work out like exact format, encodings, mail
mangling, error handling, etc., but hopefully the git community can
help out here. (Improvement suggestions are welcome!)

Step 2:

* A bot that follows LKML (and other lists) and imports patchsets into
a git repository hosted on git.kernel.org

* The bot can add git notes with URLs to lore (and/or other mailing
list archives) and store them in e.g. refs/notes/lore,
refs/notes/lkml, etc.

(For those who don't use git notes yet: they are essentially small
bits of information you can add to a commit without changing its SHA1,
and you can configure tools like 'git log' to show these at the bottom
of a commit. Notes can also exist in a repo completely separate from
the commits they attach data to, so there is _zero_ overhead for those
who don't want to use this.)

* Maintainers can either pull patchsets directly from this bot-
maintained repo OR they can continue to apply patches from their inbox
(the result should be the same either way) OR they can continue in the
old-style process (at least for a while) and just not have the
benefits of the new process.
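
A rough sketch of the notes part, assuming a throwaway repository and a hypothetical archive URL (the message ID below is invented): the bot attaches the URL to a commit under refs/notes/lore, and anyone who has fetched that ref can ask 'git log' to display it. The commit's SHA1 is unchanged by the note.

```shell
# Throwaway repository; identities and the URL are hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='C O Mitter' GIT_COMMITTER_EMAIL='committer@example.com'

echo hello > file
git add file
git commit -q -m 'example patch'
before=$(git rev-parse HEAD)

# What the bot would do: store the archive URL as a note in refs/notes/lore.
url='https://lore.kernel.org/r/example-message-id@example.org'
git notes --ref=lore add -m "$url" HEAD
after=$(git rev-parse HEAD)

# What a reader would do: ask 'git log' to show notes from that ref.
git --no-pager log -1 --notes=lore
note=$(git notes --ref=lore show HEAD)
```

Because the notes live in their own ref, a repository that never fetches refs/notes/lore sees no difference at all.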

Step 3:

* Instead of describing a patchset in a separate introduction email, we
can create a merge commit between the parent of the first commit in
the series and the last and put the patchset description in the merge
commit [5]. This means the patchset description also gets to be part
of git history.

(This would require support for git send-email/am to be able to send
and apply merge commits -- at least those which have the same tree as
one of the parents. This is _not_ yet supported in my proposed git
patches.)

* stable SHA1s means we can refer to previous versions of a patchset by
SHA1 rather than archive links. I propose a new changelog tag for
this, maybe "Previous:" or maybe even a full list of "v1:", "v2:",
etc. with a SHA1 or ref. Note that these SHA1s do *not* need to exist
in Linus's repo, but those who want can pull those branches from the
bot-maintained repo on git.kernel.org.

Advantages:

- we can keep using email to post patches/patchsets

- the process is opt-in (but should be encouraged) for both authors and
maintainers, and the transition can happen over time

- there is a central repo for convenience, but it is not necessary for
development to happen and is not a single point of failure -- it's
more like Linus's repo and can be moved or even replicated from
scratch by somebody else simply by having mailing list archives

- allows quick lookup of patch/patchset <-> email discussion within git

- allows diffing between versions of a single logical patchset

- patchset descriptions naturally become part of the changelog that ends
up in Linus's tree

Disadvantages:

- requires patching git

- requires a bot to continuously create branches for patchsets sent to
mailing lists

- increased storage/bandwidth for git.kernel.org (?)

- may need a couple of new wrapper scripts to automate patchset
construction/versioning

Thoughts?

Vegard

PS: Eric Wong described something that comes quite close to this idea,
but AFAICT without actually recreating commits exactly. I've included
the link for completeness. [4]

[1]: https://lwn.net/Articles/793037/ "Ryabitsev: Patches carved into
developer sigchains"

[2]: https://lwn.net/Articles/799134/ "Defragmenting the kernel
development process"

[3]:
https://lore.kernel.org/workflows/[email protected]/

[4]: https://lore.kernel.org/workflows/20191008003931.y4rc2dp64gbhv5ju@dcvr/

[5]: To create this merge commit one could use something like this (bash):

# usage: patchset BASE [PREVIOUS_VERSION]
patchset () {
    start=$1
    prev=$2

    # construct tentative commit message
    commit_editmsg="$(git rev-parse --git-dir)/COMMIT_EDITMSG"
    (
        if [ -z "$prev" ]
        then
            echo 'Patchset title'
            echo
            echo Commits:
            echo
            git log --oneline "$start"..HEAD
        else
            git show --format=format:%B --no-patch "$prev"
            echo "Previous-version: $(git rev-parse "$prev")"
        fi
    ) > "${commit_editmsg}"

    ${EDITOR} "${commit_editmsg}"

    # merge commit with BASE and HEAD as parents, reusing HEAD's tree
    merge=$(git commit-tree -p "$start" -p HEAD -F "${commit_editmsg}" \
        "$(git rev-parse 'HEAD^{tree}')")
    echo "$merge"
}

This will open the editor to edit the patchset description and create a
merge commit that encompasses the patches in the patchset (use sha1^- to
view the patches in it).

Attachments:

0001-format-patch-add-complete.patch (3.76 kB)
0002-mailinfo-collect-commit-metadata-from-mail.patch (5.84 kB)
0003-am-add-exact.patch (5.77 kB)

2019-10-16 14:56:10

by Willy Tarreau

Subject: Re: email as a bona fide git transport

Hi Vegard,

On Wed, Oct 16, 2019 at 12:22:54PM +0200, Vegard Nossum wrote:
> (cross-posted to git, LKML, and the kernel workflows mailing lists.)
>
> Hi all,
>
> I've been following Konstantin Ryabitsev's quest for better development
> and communication tools for the kernel [1][2][3], and I would like to
> propose a relatively straightforward idea which I think could bring a
> lot to the table.
>
> Step 1:
>
> * git send-email needs to include parent SHA1s and generally all the
> information needed to perfectly recreate the commit when applied so
> that all the SHA1s remain the same
>
> * git am (or an alternative command) needs to recreate the commit
> perfectly when applied, including applying it to the correct parent
>
> Having these two will allow a perfect mapping between email and git;
> essentially email just becomes a transport for git. There are a lot of
> advantages to this, particularly that you have a stable way to refer to
> a patch or commit (despite it appearing on a mailing list), and there
> is no need for "changeset IDs" or whatever, since you can just use the
> git SHA1 which is unique, unambiguous, and stable.

I agree this would be great; I have missed this a number of times,
even though I'm aware of git-send-pack/git-receive-pack. The text format
is way more convenient for a lot of reasons. It could also help with
Greg's idea of using the commit IDs to reference bugs, as such IDs could
remain stable within a series before it is merged, and as such referenced
in subsequent commit messages. It could also be useful to avoid losing
notes related to a patch once it's merged.

> Step 3:
>
> * Instead of describing a patchset in a separate introduction email, we
> can create a merge commit between the parent of the first commit in
> the series and the last and put the patchset description in the merge
> commit [5]. This means the patchset description also gets to be part
> of git history.
>
> (This would require support for git send-email/am to be able to send
> and apply merge commits -- at least those which have the same tree as
> one of the parents. This is _not_ yet supported in my proposed git
> patches.)

That's a good idea, as we've all seen long series with a very detailed
description in patch 0 and much less context in subsequent patches, thus
losing the context once merged.

> * stable SHA1s means we can refer to previous versions of a patchset by
> SHA1 rather than archive links. I propose a new changelog tag for
> this, maybe "Previous:" or maybe even a full list of "v1:", "v2:",
> etc. with a SHA1 or ref. Note that these SHA1s do *not* need to exist
> in Linus's repo, but those who want can pull those branches from the
> bot-maintained repo on git.kernel.org.

For me this mainly brings the benefit of finally having a unique identifier
for multiple iterations of a patchset. It then becomes easier to use this
identifier to designate the functional work, regardless of the number of
updates it gets. Of course it's never that black and white since such work
may itself merge multiple other patchsets but for most use cases it can
help.

Willy

2019-10-16 16:39:41

by Pratyush Yadav

Subject: Re: email as a bona fide git transport

Hi Vegard,

On 16/10/19 12:22PM, Vegard Nossum wrote:
> (cross-posted to git, LKML, and the kernel workflows mailing lists.)
>
> Hi all,
>
> I've been following Konstantin Ryabitsev's quest for better development
> and communication tools for the kernel [1][2][3], and I would like to
> propose a relatively straightforward idea which I think could bring a
> lot to the table.
>
> Step 1:
>
> * git send-email needs to include parent SHA1s and generally all the
> information needed to perfectly recreate the commit when applied so
> that all the SHA1s remain the same
>
> * git am (or an alternative command) needs to recreate the commit
> perfectly when applied, including applying it to the correct parent
>
> Having these two will allow a perfect mapping between email and git;
> essentially email just becomes a transport for git. There are a lot of
> advantages to this, particularly that you have a stable way to refer to
> a patch or commit (despite it appearing on a mailing list), and there
> is no need for "changeset IDs" or whatever, since you can just use the
> git SHA1 which is unique, unambiguous, and stable.

FWIW, I like the idea.

> As a rough proof of concept I've attached 3 git patches which implement
> this. There are issues to work out like exact format, encodings, mail
> mangling, error handling, etc., but hopefully the git community can
> help out here. (Improvement suggestions are welcome!)
>
> Step 2:
>
> * A bot that follows LKML (and other lists) and imports patchsets into
> a git repository hosted on git.kernel.org
>
> * The bot can add git notes with URLs to lore (and/or other mailing
> list archives) and store them in e.g. refs/notes/lore,
> refs/notes/lkml, etc.
>
> (For those who don't use git notes yet: they are essentially small
> bits of information you can add to a commit without changing its SHA1,
> and you can configure tools like 'git log' to show these at the bottom
> of a commit. Notes can also exist in a repo completely separate from
> the commits they attach data to, so there is _zero_ overhead for those
> who don't want to use this.)
>
> * Maintainers can either pull patchsets directly from this bot-
> maintained repo OR they can continue to apply patches from their inbox
> (the result should be the same either way) OR they can continue in the
> old-style process (at least for a while) and just not have the
> benefits of the new process.
>
> Step 3:
>
> * Instead of describing a patchset in a separate introduction email, we
> can create a merge commit between the parent of the first commit in
> the series and the last and put the patchset description in the merge
> commit [5]. This means the patchset description also gets to be part
> of git history.
>
> (This would require support for git send-email/am to be able to send
> and apply merge commits -- at least those which have the same tree as
> one of the parents. This is _not_ yet supported in my proposed git
> patches.)

Can sending merge commits via email work with your proposed '--exact'?
Say I'm the maintainer, and you fork off a feature branch off my master,
add a few commits that introduce your new feature, and then merge it
into my master, and then send those commits, including the merge.

Now in that scenario, say the tip of your feature branch was X and the
tip of my 'master' was Y when you sent your patches. Now while your
patches are still being reviewed, I merge in some other branch creating
a merge commit Z on my master.

Now your merge's first parent was Y and second parent was X. But now the
tip of my master is Z, so the first parent of your merge needs to be Z,
not Y. Changing the first parent would mean a different commit hash.

So, the way I see it, your proposed merge commits via email can't work
with '--exact'. Do I understand this situation correctly? Am I missing
something?

Maybe a better idea would be to allow 'am' to create these merges
locally when applying the patches. That would mean having to merge the
separate branch along with applying the patches, otherwise the cover
letter text is lost. This might not be something everyone wants. I for
one don't. When I apply patches via 'am', I first keep them on a
separate branch, test them out, and then merge them into 'master'.

So yet another alternative could be to save the cover letter as the
branch description. This branch description can then be used to generate
the merge message. IIRC, Denton Liu is working on generating the cover
letter text from branch description, so this feature would be like its
inverse.

> * stable SHA1s means we can refer to previous versions of a patchset by
> SHA1 rather than archive links. I propose a new changelog tag for
> this, maybe "Previous:" or maybe even a full list of "v1:", "v2:",
> etc. with a SHA1 or ref. Note that these SHA1s do *not* need to exist
> in Linus's repo, but those who want can pull those branches from the
> bot-maintained repo on git.kernel.org.
>
> Advantages:
>
> - we can keep using email to post patches/patchsets
>
> - the process is opt-in (but should be encouraged) for both authors and
> maintainers, and the transition can happen over time
>
> - there is a central repo for convenience, but it is not necessary for
> development to happen and is not a single point of failure -- it's
> more like Linus's repo and can be moved or even replicated from
> scratch by somebody else simply by having mailing list archives
>
> - allows quick lookup of patch/patchset <-> email discussion within git
>
> - allows diffing between versions of a single logical patchset
>
> - patchset descriptions naturally become part of the changelog that ends
> up in Linus's tree
>
> Disadvantages:
>
> - requires patching git
>
> - requires a bot to continuously create branches for patchsets sent to
> mailing lists
>
> - increased storage/bandwidth for git.kernel.org (?)
>
> - may need a couple of new wrapper scripts to automate patchset
> construction/versioning

Just to play the devil's advocate, even though I'm in favor of something
like this, I'll add in another disadvantage:

- The maintainer can't make small edits before pushing the changes out.

I do that every now and then for git-gui, and Junio does that sometimes
for Git. I don't know if the folks over at Linux do something like this,
but using '--exact' would mean that contributors would have to send a
re-roll for even minor changes. It's mostly an inconvenience rather than a
problem, but I thought I'd point it out.

> Thoughts?

One more question, not strictly related to your proposal: right now,
when I apply patches from contributors, I pass '-s' to 'am', so the
applied commit would have my sign-off. The way I see it, that sign-off
is supposed to signify that I have the right to push out the commit to
the "main" repo, just like the author's sign-off means that they have
the right to send me that commit.

Looking at git.git, I notice that Junio does the same. The new '--exact'
would be incompatible with '-s', correct (since the commit message has
changed, the SHA1 would also change)? So firstly, make sure you account
for something like that if you haven't already (I haven't found the time
to read your patches yet). Secondly, is it all right for the maintainer
to just not sign-off on the commits they push out?

--
Regards,
Pratyush Yadav

2019-10-16 17:05:01

by Laurent Pinchart

Subject: Re: email as a bona fide git transport

On Thu, Oct 17, 2019 at 06:30:29PM -0700, Greg KH wrote:
> On Thu, Oct 17, 2019 at 04:45:32PM -0400, Konstantin Ryabitsev wrote:
> > On Thu, Oct 17, 2019 at 01:43:43PM -0700, Greg KH wrote:
> >>> I wonder if it'd be also possible to then embed gpg signatures over
> >>> send-mail payloads so as they can be transparently transferred to the
> >>> commit.
> >>
> >> That's a crazy idea. It would be nice if we could do that, I like it :)
> >
> > It could only possibly work if nobody ever adds their own "Signed-Off-By" or
> > any other bylines. I expect this is a deal-breaker for most maintainers.
>
> Yeah it is :(
>
> But, if we could just have the signature on the code change, not the
> changelog text, that would help with that issue.

I ran into a related issue recently when thinking about how to implement
server-side workflows (for a non-kernel project). My goal is to ensure a
patch can only be pushed to the master branch if it has received review.
The easy way to do so is to check the Reviewed-by tags, but those can
easily be forged. I was thus wondering if we should have a way to sign
tags (as in commit message tags, not git tags).

--
Regards,

Laurent Pinchart

2019-10-20 06:31:36

by Vegard Nossum

Subject: Re: email as a bona fide git transport

On 10/20/19 5:17 AM, Willy Tarreau wrote:
> On Fri, Oct 18, 2019 at 03:14:56PM -0400, Theodore Y. Ts'o wrote:
>> On Fri, Oct 18, 2019 at 06:50:51PM +0200, Vegard Nossum wrote:
>>> The problem I ran into with putting the metadata at the end was
>>> detecting where the diff ends. A comment in 'git apply' suggested that
>>> detecting the difference between "--" as a diff/signature separator and
>>> as part of the diff is nontrivial in the sense that you need to actually
>>> do some parsing and keep track of hunk sizes.
>>
>> Could we cheat by having "git format-patch" add a "Diff-size" in the
>> header which gives the number of lines in the diff so git am can just
>> count lines to find the Trailer section?
>
> Be careful with this, it starts like this and ends up with non-editable
> patches. I'd rather have git-am use best-effort detection of the end.

Expect filesystem developers to come up with a format that uses extents ;-)

> Also when dealing with stable backports, I've done a lot of
> "cat foo.diff >> bar.patch" to fixup some patches in which I just had
> to move some parts around. Having to count lines and edit a counter
> somewhere is going to become really painful.

I almost have some new patches ready for putting the metadata after the
patch using a very bare-bones diff parser (it's actually not that bad),
I just need to fix a few corner cases that are causing breakage in the
git test suite.
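
To illustrate what "keeping track of hunk sizes" means, here is a bare-bones sketch (not the parser from the patches; diff_end_line is a hypothetical helper name). It reads the old/new line counts from each "@@" hunk header and consumes exactly that many lines, so a "-- " line that is part of a hunk (a removed line whose content was "- ") is not mistaken for the signature separator.

```shell
# Print the line number of the last line that belongs to a diff hunk.
# Assumes well-formed unified diff hunks; lines outside hunks are ignored.
diff_end_line() {
    awk '
    # Not inside a hunk: only a hunk header is interesting.
    old == 0 && new == 0 {
        if ($1 == "@@" && substr($2, 1, 1) == "-" && substr($3, 1, 1) == "+") {
            # Fields look like "-start,count" and "+start,count";
            # a missing count means 1.
            n = split(substr($2, 2), a, ","); old = (n > 1) ? a[2] + 0 : 1
            n = split(substr($3, 2), b, ","); new = (n > 1) ? b[2] + 0 : 1
        }
        next
    }
    # Inside a hunk: consume lines until both counts are used up.
    {
        c = substr($0, 1, 1)
        if (c == "-") old--
        else if (c == "+") new--
        else if (c != "\\") { old--; new-- }
        last = NR
    }
    END { print last + 0 }
    ' "$1"
}

# Sample patch tail: line 6 is a removed line, line 7 is the real separator.
sample=$(mktemp)
printf '%s\n' \
    'diff --git a/f b/f' \
    '--- a/f' \
    '+++ b/f' \
    '@@ -1,2 +1,1 @@' \
    ' keep' \
    '-- ' \
    '-- ' \
    '2.20.1' > "$sample"
diff_end_line "$sample"
```

For the sample this prints 6: the first "-- " is consumed as the second old line of the hunk, and the second "-- " falls outside any hunk.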

Vegard

2019-10-22 13:32:45

by Vegard Nossum

Subject: Re: email as a bona fide git transport

On 10/20/19 8:28 AM, Vegard Nossum wrote:
>
> On 10/20/19 5:17 AM, Willy Tarreau wrote:
>> On Fri, Oct 18, 2019 at 03:14:56PM -0400, Theodore Y. Ts'o wrote:
>>> On Fri, Oct 18, 2019 at 06:50:51PM +0200, Vegard Nossum wrote:
>>>> The problem I ran into with putting the metadata at the end was
>>>> detecting where the diff ends. A comment in 'git apply' suggested that
>>>> detecting the difference between "--" as a diff/signature separator and
>>>> as part of the diff is nontrivial in the sense that you need to
>>>> actually
>>>> do some parsing and keep track of hunk sizes.
>>>
>>> Could we cheat by having "git format-patch" add a "Diff-size" in the
>>> header which gives the number of lines in the diff so git am can just
>>> count lines to find the Trailer section?
>>
>> Be careful with this, it starts like this and ends up with non-editable
>> patches. I'd rather have git-am use best-effort detection of the end.
>
> Expect filesystem developers to come up with a format that uses extents ;-)
>
>> Also when dealing with stable backports, I've done a lot of
>> "cat foo.diff >> bar.patch" to fixup some patches in which I just had
>> to move some parts around. Having to count lines and edit a counter
>> somewhere is going to become really painful.
>
> I almost have some new patches ready for putting the metadata after the
> patch using a very bare-bones diff parser (it's actually not that bad),
> I just need to fix a few corner cases that are causing breakage in the
> git test suite.

I sent v2 of the patches (with metadata _after_ the diff) to the git
list here:

https://public-inbox.org/git/[email protected]/T/#u

As I wrote in there, we could already today start using

git am --message-id

when applying patches and this would provide something that a bot could
annotate with git notes pointing to lore/LKML/LWN/whatever. I think that
would already be a pretty nice improvement over today's situation.

Sadly, since the beginning of 2018, this was only used for a measly
~0.14% of all non-merge commits in the kernel:

$ git rev-list --count --no-merges --since='2018-01-01' \
      --grep 'Message-Id: ' linus/master
178

$ git rev-list --count --no-merges --since='2018-01-01' linus/master
130777

So how can we spread the word about --message-id and get maintainers to
actually use it? I don't suppose it's reasonable to change the 'git am'
default setting?
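
For reference, the percentage follows directly from the two counts reported by 'git rev-list'; a small sketch of the arithmetic (the variable names are only illustrative):

```shell
# Counts as reported by the two rev-list commands in this message.
with_message_id=178
non_merge_total=130777

# Share of non-merge commits carrying a Message-Id line, in percent.
share=$(LC_ALL=C awk -v n="$with_message_id" -v d="$non_merge_total" \
    'BEGIN { printf "%.2f", 100 * n / d }')
echo "${share}%"
```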

Vegard

2019-10-22 15:55:14

by Theodore Ts'o

Subject: Re: email as a bona fide git transport

On Tue, Oct 22, 2019 at 02:11:22PM +0200, Vegard Nossum wrote:
>
> As I wrote in there, we could already today start using
>
> git am --message-id
>
> when applying patches and this would provide something that a bot could
> annotate with git notes pointing to lore/LKML/LWN/whatever. I think that
> would already be a pretty nice improvement over today's situation.
>
> Sadly, since the beginning of 2018, this was only used for a measly
> ~0.14% of all non-merge commits in the kernel:
>
> $ git rev-list --count --no-merges --since='2018-01-01' \
>       --grep 'Message-Id: ' linus/master
> 178

You might also want to count commits which have a link tag with a
Message-Id:

Link: https://lore.kernel.org/r/c3438dad66a34a7d4e7509a5dd64c2326340a52a.1571647180.git.mbobrowski@mbobrowski.org

That's because some kernel developers have been using a hook script like this:

#!/bin/sh
# For .git/hooks/applypatch-msg
#
# You must have the following in .git/config:
# [am]
# messageid = true

. git-sh-setup
perl -pi -e 's|^Message-Id:\s*<?([^>]+)>?$|Link: https://lore.kernel.org/r/$1|g;' "$1"
test -x "$GIT_DIR/hooks/commit-msg" &&
exec "$GIT_DIR/hooks/commit-msg" ${1+"$@"}
:

... as we had reached rough consensus that this was the best way to
incorporate the message id (since it could be made a clickable link
in tools like gitk, for example). This rough consensus has only been
in place since around the time of the Maintainer's Summit in Lisbon,
so uptake is still probably a bit slow. I'd expect to see a lot more
of this in the next merge window, though.

- Ted

2019-10-22 17:17:57

by Vegard Nossum

Subject: Re: email as a bona fide git transport

On 10/22/19 3:53 PM, Theodore Y. Ts'o wrote:
> On Tue, Oct 22, 2019 at 02:11:22PM +0200, Vegard Nossum wrote:
>>
>> As I wrote in there, we could already today start using
>>
>> git am --message-id
>>
>> when applying patches and this would provide something that a bot could
>> annotate with git notes pointing to lore/LKML/LWN/whatever. I think that
>> would already be a pretty nice improvement over today's situation.
>>
>> Sadly, since the beginning of 2018, this was only used for a measly
>> ~0.14% of all non-merge commits in the kernel:
>>
>> $ git rev-list --count --no-merges --since='2018-01-01' \
>>       --grep 'Message-Id: ' linus/master
>> 178
>
> You might also want to count commits which have a link tag with a
> Message-Id:
>
> Link: https://lore.kernel.org/r/c3438dad66a34a7d4e7509a5dd64c2326340a52a.1571647180.git.mbobrowski@mbobrowski.org
>
> That's because some kernel developers have been using a hook script like this:
>
> #!/bin/sh
> # For .git/hooks/applypatch-msg
> #
> # You must have the following in .git/config:
> # [am]
> # messageid = true
>
> . git-sh-setup
> perl -pi -e 's|^Message-Id:\s*<?([^>]+)>?$|Link: https://lore.kernel.org/r/$1|g;' "$1"
> test -x "$GIT_DIR/hooks/commit-msg" &&
> exec "$GIT_DIR/hooks/commit-msg" ${1+"$@"}
> :
>
> ... as we had reached rough consensus that this was the best way to
> incorporate the message id (since it could be made a clickable link
> in tools like gitk, for example). This rough consensus has only been
> in place since around the time of the Maintainer's Summit in Lisbon,
> so uptake is still probably a bit slow. I'd expect to see a lot more
> of this in the next merge window, though.

Thanks, I was not aware of this!

Seems like something that should go in Documentation/maintainer/,
right?

The figure is much better, 16.7% on all non-merges since 2018-01-01.
This should help and we can maybe already do some interesting things
with git notes and lore/public-inbox.

Vegard

2019-10-22 23:25:22

by Eric Wong

Subject: Re: email as a bona fide git transport

Vegard Nossum <[email protected]> wrote:
> I sent v2 of the patches (with metadata _after_ the diff) to the git
> list here:
>
> https://public-inbox.org/git/[email protected]/T/#u
>
> As I wrote in there, we could already today start using
>
> git am --message-id
>
> when applying patches and this would provide something that a bot could
> annotate with git notes pointing to lore/LKML/LWN/whatever. I think that
> would already be a pretty nice improvement over today's situation.
>
> Sadly, since the beginning of 2018, this was only used for a measly
> ~0.14% of all non-merge commits in the kernel:

--message-id helps provide a concrete reference, yes. However,
being able to search for commit subjects in the mail archives is
already implemented via cgit filter. An example is here:

https://80x24.org/mirrors/git.git/commit/?id=8da56a484800023a545d7a7c022473f5aa9e720f

The link at "userdiff: fix some corner cases in dts regex" makes a link to:

https://public-inbox.org/git/?x=t&q=%22userdiff:+fix+some+corner+cases+in+dts+regex%22
(side note: not sure if that "x=t" to expand the whole message is good...)

That link is generated by examples/cgit-commit-filter.lua in the
public-inbox source:

https://public-inbox.org/meta/1677253/s/?b=examples/cgit-commit-filter.lua

My longer term plan is to be able to use the post-image blob OIDs
from cgit to generate a search query for public-inbox such as:

https://public-inbox.org/git/?q=dfpost:afc6b5b404+dfpost:072d58b69d+dfpost:4353b8220c+dfpost:333a625c70+dfpost:e187d356f6

Which finds all versions of the userdiff patch posted. But AFAIK
there's no easy way to get at blob OIDs from cgit to a Lua filter...
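
As a sketch of the query construction only (dfpost_query_url is a hypothetical helper, and getting the OIDs out of cgit remains the open problem), joining abbreviated post-image blob OIDs into such a search URL could look like this:

```shell
# Build a public-inbox search URL from abbreviated post-image blob OIDs.
# usage: dfpost_query_url BASE_URL OID...
dfpost_query_url() {
    base=$1
    shift
    query=
    for oid in "$@"
    do
        # Join the terms with "+", which the URL uses as the separator.
        query="${query:+$query+}dfpost:$oid"
    done
    printf '%s?q=%s\n' "$base" "$query"
}

dfpost_query_url https://public-inbox.org/git/ \
    afc6b5b404 072d58b69d 4353b8220c 333a625c70 e187d356f6
```

With the five OIDs from the example this prints the same URL as above. Outside of cgit, the post-image OIDs for a commit could be read from the raw output of something like 'git diff-tree -r --abbrev=10 <commit>'.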