I spent a lot of time trying to find an LKML archive in Maildir format
that I could use for local searches with notmuch or something, but all
the links I was able to find were dead.
I ended up just compiling one myself and I currently host it at:
https://alyptik.org/lkml.tar.xz
It's possible I'm the only weirdo who finds this kind of thing useful, but
I figured I should share it just in case I'm not.
It's about 1.1 million files; I was wondering if anyone had an idea of a
better way to host this? I've tried GitHub and GitLab, but they don't
appreciate repos with that many files, hah.
Open to suggestions, thanks!
--
Cheers,
Joey Pabalinas
On Sun, 2018-12-16 at 09:06 -1000, Joey Pabalinas wrote:
> I spent a lot of time trying to find an LKML archive in Maildir format
> that I could use for local searches with notmuch or something, but all
> the links I was able to find were dead.
You might instead use
https://www.kernel.org/lore.html
https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/git.git/
On Sun, Dec 16, 2018 at 11:17:34AM -0800, Joe Perches wrote:
> On Sun, 2018-12-16 at 09:06 -1000, Joey Pabalinas wrote:
> > I spent a lot of time trying to find an LKML archive in Maildir format
> > that I could use for local searches with notmuch or something, but all
> > the links I was able to find were dead.
>
> You might instead use
>
> https://www.kernel.org/lore.html
> https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/git.git/
That was my first attempt, but the documentation for the public-inbox
format is sort of terrible, and after a few hours trying to convert it
to Maildir I just gave up.
I ended up just slowly scraping lkml.org for a couple weeks so I
wouldn't disrupt anything and it worked fairly well. Just looking for
advice on where to host this now so others might be able to use it.
--
Cheers,
Joey Pabalinas
On Sun, Dec 16, 2018 at 02:46:49PM -0500, Konstantin Ryabitsev wrote:
> On Sun, Dec 16, 2018 at 09:06:39AM -1000, Joey Pabalinas wrote:
> > I spent a lot of time trying to find an LKML archive in Maildir format
> > that I could use for local searches with notmuch or something, but all
> > the links I was able to find were dead.
> >
> > I ended up just compiling one myself and I currently host it at:
> >
> > https://alyptik.org/lkml.tar.xz
>
> You seem to have duplicated a lot of effort that has already been done
> to compile the archive on lore.kernel.org.
Absolutely correct, haha.
>
> > It's possible I'm the only weirdo who finds this kind of thing useful, but
> > I figured I should share it just in case I'm not.
>
> The maildir format is kind of terrible for LKML, because having millions
> of messages in a single directory is very hard on the underlying FS. If
> you break it up into multiple folders, then it becomes difficult to
> search. This is the main reason why we have chosen to go with the
> public-inbox format, which solves both of these problems and allows for
> very efficient archive updating and replication using git.
>
> > It's about 1.1 million files; I was wondering if anyone had an idea of a
> > better way to host this? I've tried GitHub and GitLab, but they don't
> > appreciate repos with that many files, hah.
>
> Like I said, you seem to be going down the road we've already tried and
> rejected. :)
Yes, I had a strong suspicion I might be the only crazy person who prefers this
kind of format :)
My only comment on the public-inbox choice is that the documentation
is very sparse and erratic. A couple of other people and I just
couldn't figure out how to convert that format to Maildir or some other
format you could feed into a reader like neomutt.
Do you have any advice on how to convert those public-inbox files
correctly?
--
Cheers,
Joey Pabalinas
On Sun, Dec 16, 2018 at 09:21:35AM -1000, Joey Pabalinas wrote:
> That was my first attempt, but the documentation for the public-inbox
> format is sort of terrible,
I'm surprised you think so, because it's basically a simple file called
"m" that is updated on each commit and contains the body of the
message.
> and after a few hours trying to convert it to Maildir I just gave up.
It's as easy as something like this:
for commit in $(git rev-list master); do
    git show "$commit:m" > "maildir/new/$commit"
done
You have to do it for each of the shards to get the complete archive.
-K
On Sun, Dec 16, 2018 at 09:06:39AM -1000, Joey Pabalinas wrote:
> I spent a lot of time trying to find an LKML archive in Maildir format
> that I could use for local searches with notmuch or something, but all
> the links I was able to find were dead.
>
> I ended up just compiling one myself and I currently host it at:
>
> https://alyptik.org/lkml.tar.xz
You seem to have duplicated a lot of effort that has already been done
to compile the archive on lore.kernel.org.
> It's possible I'm the only weirdo who finds this kind of thing useful, but
> I figured I should share it just in case I'm not.
The maildir format is kind of terrible for LKML, because having millions
of messages in a single directory is very hard on the underlying FS. If
you break it up into multiple folders, then it becomes difficult to
search. This is the main reason why we have chosen to go with the
public-inbox format, which solves both of these problems and allows for
very efficient archive updating and replication using git.
> It's about 1.1 million files; I was wondering if anyone had an idea of a
> better way to host this? I've tried GitHub and GitLab, but they don't
> appreciate repos with that many files, hah.
Like I said, you seem to be going down the road we've already tried and
rejected. :)
-K
On Sun, Dec 16, 2018 at 02:55:05PM -0500, Konstantin Ryabitsev wrote:
> On Sun, Dec 16, 2018 at 09:21:35AM -1000, Joey Pabalinas wrote:
> > That was my first attempt, but the documentation for the public-inbox
> > format is sort of terrible,
>
> I'm surprised you think so, because it's basically a simple file called
> "m" that is updated on each commit and contains the body of the
> message.
>
> > and after a few hours trying to convert it to Maildir I just gave up.
>
> It's as easy as something like this:
>
> for commit in $(git rev-list master); do
>     git show "$commit:m" > "maildir/new/$commit"
> done
>
> You have to do it for each of the shards to get the complete archive.
Ah dang, I was trying to use stuff like ssoma to split it, no wonder it
didn't work. Not sure why I didn't think to try any git commands...
Well, at least now I know, ha. Thanks!
--
Cheers,
Joey Pabalinas
Hi Joey,
On Sun, Dec 16, 2018 at 09:21:35AM -1000, Joey Pabalinas wrote:
> > > I spent a lot of time trying to find an LKML archive in Maildir format
> > > that I could use for local searches with notmuch or something, but all
> > > the links I was able to find were dead.
> >
> > You might instead use
> >
> > https://www.kernel.org/lore.html
> > https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/git.git/
>
> That was my first attempt, but the documentation for the public-inbox
> format is sort of terrible, and after a few hours trying to convert it
> to Maildir I just gave up.
>
> I ended up just slowly scraping lkml.org for a couple weeks so I
> wouldn't disrupt anything and it worked fairly well. Just looking for
> advice on where to host this now so others might be able to use it.
Now you've caught my attention; first of all, there are more than 3M
messages stored in the lkml.org database, so I guess you've missed some
messages or something is really broken.
Besides, unless you figured out how to get to the raw data, you've just
scraped a rendering which discards stuff like PGP signatures etc. and has
very incomplete headers. Unless you don't care for those, of course :)
Note that I've also been toying with the lore dataset, and wrote a tiny tool
to get Maildir-like data out of it; this code is a bit of a single-use-jig
so you'll need to do some coding if you really want to use it. Attached
anyway.
All the best and enjoy,
Jasper
On Tue, Dec 18, 2018 at 09:26:27PM +0100, Jasper Spaans wrote:
> Now you've caught my attention; first of all, there are more than 3M
> messages stored in the lkml.org database, so I guess you've missed some
> messages or something is really broken.
>
> Besides, unless you figured out how to get to the raw data, you've just
> scraped a rendering which discards stuff like PGP signatures etc. and has
> very incomplete headers. Unless you don't care for those, of course :)
>
> Note that I've also been toying with the lore dataset, and wrote a tiny tool
> to get Maildir-like data out of it; this code is a bit of a single-use-jig
> so you'll need to do some coding if you really want to use it. Attached
> anyway.
Yeah, after looking closer at it last week, something here is very
weird. This is definitely far from complete.
When I have some free time I'm just going to give it another go with
the public-inbox conversion.
--
Cheers,
Joey Pabalinas
Joey Pabalinas <[email protected]> wrote:
> My only comment on the public-inbox choice is that the documentation
> is very sparse and erratic. A couple of other people and I just
> couldn't figure out how to convert that format to Maildir or some other
> format you could feed into a reader like neomutt.
Sorry, I didn't notice this before. I started making some attempts
at improving documentation (among other things, when time permits)
to public-inbox:
https://public-inbox.org/meta/[email protected]/
And without knowing anything about git or public-inbox, you can
get NNTP messages into Maildir or mboxrd pretty easily. Nothing
new to learn :)
I wrote a one-off Ruby script years ago (before public-inbox) for
converting slrnspools to Maildir (sample slrnpull.conf below).
But yeah, I wouldn't recommend 3M+ messages in a Maildir...
==> slrnspool2maildir <==
#!/usr/bin/ruby
require 'socket'
require 'fileutils'

HOSTNAME = Socket.gethostname
usage = "Usage #$0 <spooldir> <maildir>"
spooldir = ARGV[0] or abort usage
maildir = ARGV[1] or abort usage

%w(cur new tmp).each { |x| FileUtils.mkpath("#{maildir}/#{x}") }

Dir.glob("#{spooldir}/*").each do |src|
  File.file?(src) or next
  base = File.basename(src)
  # Maildir filename: timestamp + spool name + hostname; ":2," = no flags
  dest = "#{maildir}/new/#{Time.now.to_i}_#{base}_0.#{HOSTNAME}:2,"
  begin
    File.link(src, dest) # hard-link first so the message is never lost
  rescue Errno::EEXIST
    warn "#{dest} already exists"
    next
  end
  File.unlink(src) # drop the spool copy only after linking succeeded
end
__END__
==> slrnpull.conf <==
# group_name max expire headers_only
inbox.com.example.news.group.name 1000000000 1000000000 0
# usage: slrnpull -d $PWD -h news.example.com --no-post
# Wouldn't be hard to script something using Net::NNTP in Perl
# to write directly to Maildirs, either.
OK, so I understand how to clone archives from lore.kernel.org and how
to convert a git archive to a maildir (thanks, Konstantin!)
What I *don't* understand is how to effectively read this locally.
Ideally I'd like to run mutt, possibly with notmuch for indexing. But
a maildir with 3M files seems impractical. I did actually try it
(without notmuch), but it takes mutt about 5 minutes to start up. And
the maildir is about 23G, compared with 7.5G for the git archive.
Any pointers? I guess there's no mutt backend that can read a
public-inbox archive directly?
Bjorn Helgaas <[email protected]> wrote:
> OK, so I understand how to clone archives from lore.kernel.org and how
> to convert a git archive to a maildir (thanks, Konstantin!)
>
> What I *don't* understand is how to effectively read this locally.
> Ideally I'd like to run mutt, possibly with notmuch for indexing. But
> a maildir with 3M files seems impractical. I did actually try it
> (without notmuch), but it takes mutt about 5 minutes to start up. And
> the maildir is about 23G, compared with 7.5G for the git archive.
Right, relying on Maildir for long-term storage of giant archives
is not a usable solution with any general purpose FSes I know about.
git itself had the same problem with loose object scalability in
the old days and packs were invented as a result.
> Any pointers? I guess there's no mutt backend that can read a
> public-inbox archive directly?
There are mutt patches to support reading over NNTP, so that
works:
mutt -f news://$INBOX_HOST/$INBOX_NEWSGROUP
I don't think mutt handles mboxrd 100% correctly, but it's close
enough that you can download the gzipped mboxrd of a search
query and open it via "mutt -f /path/to/downloaded/mbox.gz"
curl -XPOST -OJ "$INBOX_URL/?q=$SEARCH_QUERY&x=m"
POST is required(*), and -OJ lets it use the
Content-Disposition: header for a meaningful server-generated
name, but you can also redirect the result to whatever you want.
For all messages since March 1, you could use:
SEARCH_QUERY=d:20190301..
All the supported search queries are documented in
$INBOX_URL/_/text/help/ and the search prefixes (e.g. "d:",
"s:", "b:") are modeled after what's in mairix. You'll need to
escape the queries for URIs (e.g. " " => "+", and so on).
Xapian requires date ranges to be denoted with ".." whereas
mairix uses "-" for ranges.
The main thing public-inbox search misses from mairix is support
for "-t" which grabs non-matching messages from the same thread.
I would like to support that someday, but don't have enough time
(or funding) to make it happen at the moment.
(*) to reliably avoid wasting resources from spiders/prefetchers
On Tue, Mar 5, 2019 at 5:26 PM Eric Wong <[email protected]> wrote:
> Bjorn Helgaas <[email protected]> wrote:
> > Any pointers? I guess there's no mutt backend that can read a
> > public-inbox archive directly?
>
> There are mutt patches to support reading over NNTP, so that
> works:
>
> mutt -f news://$INBOX_HOST/$INBOX_NEWSGROUP
Neomutt includes NNTP support, so I tried this:
neomutt -f news://nntp.lore.kernel.org/org.kernel.vger.linux-kernel
which worked OK but (1) I only see the most recent 1000 messages and
(2) obviously isn't reading a *local* archive. Neomutt took about 45
seconds to start up over my wimpy ISP.
I assume I could probably have a local archive and run a local NNTP
server and point neomutt at that local server. But I don't know how
full-archive searching would work there.
> I don't think mutt handles mboxrd 100% correctly, but it's close
> enough that you can download the gzipped mboxrd of a search
> query and open it via "mutt -f /path/to/downloaded/mbox.gz"
>
> curl -XPOST -OJ "$INBOX_URL/?q=$SEARCH_QUERY&x=m"
I got nothing at all with -XPOST, but this:
curl -OJ "https://lore.kernel.org/linux-pci/?q=d:20190301..&x=m"
got me the HTML source. Nothing that looks like mboxrd. I assume
this is stupid user error on my part, but even with that resolved, it
wouldn't have the nice git fetch properties of the git archive, i.e.,
incremental updates of only new stuff, would it?
I think my ideal solution would be a mutt that could read the git
archive directly, plus a notmuch index. But AFAIK, mutt can't do
that, and notmuch only works with one message per file, not with the
git archive.
Something that might work would be to use Konstantin's "git archive to
maildir" hint but shard into a bunch of smaller maildirs instead of
one big one, then have notmuch index those, and use mutt or vim with
notmuch queries instead of having it read in a maildir.
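One way to do that sharding, sketched with stand-in filenames (the bucket scheme and paths here are my own invention, not anything notmuch requires): since the converted files are named after hex commit IDs, split the big new/ directory into per-hex-digit Maildirs:

```shell
#!/bin/sh
set -e
# Demo: shard one big Maildir into up to 16 smaller ones, keyed on the
# first hex digit of each filename so no single new/ grows unboundedly.
rm -rf big shards
mkdir -p big/new
for name in 0a1b 1c2d f00d; do # stand-ins for converted messages
    echo "message $name" > "big/new/$name"
done
for f in big/new/*; do
    name=$(basename "$f")
    bucket=$(printf '%s' "$name" | cut -c1)
    mkdir -p "shards/$bucket/cur" "shards/$bucket/new" "shards/$bucket/tmp"
    mv "$f" "shards/$bucket/new/$name"
done
ls -d shards/*
```

The resulting shards/ tree could then be handed to notmuch as one database root covering many small Maildirs.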
But I feel like I must be missing the solution that's obvious to
everybody but me.
Bjorn
Bjorn Helgaas <[email protected]> wrote:
> On Tue, Mar 5, 2019 at 5:26 PM Eric Wong <[email protected]> wrote:
> > Bjorn Helgaas <[email protected]> wrote:
>
> > > Any pointers? I guess there's no mutt backend that can read a
> > > public-inbox archive directly?
> >
> > There are mutt patches to support reading over NNTP, so that
> > works:
> >
> > mutt -f news://$INBOX_HOST/$INBOX_NEWSGROUP
>
> Neomutt includes NNTP support, so I tried this:
>
> neomutt -f news://nntp.lore.kernel.org/org.kernel.vger.linux-kernel
>
> which worked OK but (1) I only see the most recent 1000 messages and
> (2) obviously isn't reading a *local* archive. Neomutt took about 45
> seconds to start up over my wimpy ISP.
>
> I assume I could probably have a local archive and run a local NNTP
> server and point neomutt at that local server. But I don't know how
> full-archive searching would work there.
Right. AFAIK there isn't a good solution for search via NNTP.
> > I don't think mutt handles mboxrd 100% correctly, but it's close
> > enough that you can download the gzipped mboxrd of a search
> > query and open it via "mutt -f /path/to/downloaded/mbox.gz"
> >
> > curl -XPOST -OJ "$INBOX_URL/?q=$SEARCH_QUERY&x=m"
>
> I got nothing at all with -XPOST, but this:
Ah, I guess nginx (or something in AWS) rejects POST without
Content-Length headers. Adding "-HContent-Length:0"
to the command-line with -XPOST works for lore.
> curl -OJ "https://lore.kernel.org/linux-pci/?q=d:20190301..&x=m"
>
> got me the HTML source. Nothing that looks like mboxrd. I assume
Right. The "x=m" requests an mbox; but it's only available via
POST requests (to prevent search engine spiders from wasting
time on non-HTML content). With the HTML output in a browser,
the "mbox.gz" button makes the POST request and allows you to
download the mbox.
> this is stupid user error on my part, but even with that resolved, it
> wouldn't have the nice git fetch properties of the git archive, i.e.,
> incremental updates of only new stuff, would it?
You could bump d:YYYYMMDD (there's also "dt:" for date-time if
you need more precision).
> I think my ideal solution would be a mutt that could read the git
> archive directly, plus a notmuch index. But AFAIK, mutt can't do
> that, and notmuch only works with one message per file, not with the
> git archive.
>
> Something that might work would be to use Konstantin's "git archive to
> maildir" hint but shard into a bunch of smaller maildirs instead of
> one big one, then have notmuch index those, and use mutt or vim with
> notmuch queries instead of having it read in a maildir.
Small Maildirs work great, but large ones fall over. I don't
think having a bunch of smaller Maildirs would help notmuch
since notmuch still needs to know each file path.
The only way I could see notmuch/Maildir working well is to keep
the overall number of messages relatively small.
One of my longer-term goals is to write a mairix-like tool in
Perl which works with public-inbox archives; but I barely have
enough time for public-inbox these days :<
mairix works with gzipped mboxes, which is great for large
archives; but the indexing falls over since it rewrites the
entire search index every time. SSDs have died as a result :<
> But I feel like I must be missing the solution that's obvious to
> everybody but me.
Nope, you're not alone :) There's not a lot of mail software
which can handle LKML-sized histories efficiently.