2002-10-26 01:51:47

by Rob Landley

Subject: The return of the return of crunch time (2.5 merge candidate list 1.6)

Changes since last time (closing down the list):

On Thursday, the list hit 30 different features proposed for 2.5 integration.
That's too much, they're obviously not all going to get in, and I'm now trying
to collate the list into something vaguely reasonable.

The -mm tree is now listed as one (nicely broken-up) patch. Anything in it,
Linus is bound to see, so it doesn't need to be tracked separately.

Some other things are low-impact enough they can go in during the stable
series. Online EXT3 resize support (resizing a mounted ext3 partition
without having to unmount it first) seems to have resolved itself
into that category (see thread at:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7680.html ).
Reiser4 is probably in this category as well, since Reiser3 went into
the 2.4 stable series and Reiser4 claims to be a separate filesystem
(like EXT2 and EXT3). Add in the fact that Hans Reiser still hasn't
produced a patch yet, and the decision's pretty easy. (If you disagree,
yell out now...)

I'm NOT going to list the post-freeze things; the 2.5 status list at
http://kernelnewbies.org/status does a fine job of that.

Most of the other "unresolved issues" are probably either in this category
or are going to have to wait for the next development series, because
nobody's piped up in support of them yet. I'm going to drop those
by Sunday. If you have a concern on that list (or that should be
on that list), time is running out.

I'm also looking for other things that can similarly be removed from
this list and pushed for integration during the next stable series.
Criteria for this: no API changes, and no impact on people who don't
actually try to use the thing.

If people familiar with these features can suggest stuff that's
deferrable, please let me know. I've been trying very hard not to make
judgement calls on these patches (not my job), but I'm certainly open
to advice.

And so:

================================= Intro ====================================

Linus returns from the Linux Lunacy Cruise after Sunday, October 27th.
The following features aim to be ready for submission to Linus by Monday,
October 28th, to be considered for inclusion (in 2.5.45) before the feature
freeze on Thursday, October 31 (Halloween).

This list is just pending features trying to get in before feature freeze.
It's primarily for features that need more testing, or might otherwise get
forgotten in the rush. If you want to know what's already gone in, or what's
being worked on for the next development cycle, or self-contained things that
might be merged during the stable series, check out:
http://kernelnewbies.org/status

Thanks to Rusty Russell and Guillaume Boissiere, whose respective 2.5 merge
candidate lists have been ruthlessly strip-mined in the process of
assembling this. And to everybody who's emailed stuff.

============================ Pending features: =============================

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

1) Andrew Morton's -mm tree. (Andrew Morton, editor.)

Andrew Morton's -mm tree collates several other projects, including:

The ext2/ext3 Extended Attributes and Access Control Lists patch from Ted Tso
(ext23-*.patch), Page Table Sharing from Daniel Phillips and Dave McCracken
(shpte-ng.patch), Andrew Morton's own deadline IO scheduler
(akpm-deadline.patch), a bunch of huge page upgrades from Richard J. Moore
(hugetlb*.patch), the Orlov allocator, Ingo's generic nonlinear mappings...

Stuff. Lots of stuff.

You can get Andrew Morton's MM tree from the following URL, including a
broken-out patches directory and a description file. (The latest version
as of this writing is -mm5.)

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.44

Issues: Did Ed Tomlinson's page table bug get fixed?

http://lists.insecure.org/lists/linux-kernel/2002/Oct/7147.html

----------------------------------------------------------------------------

2) Device mapper for Logical Volume Manager (LVM2) (LVM2 team) (in -ac)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103536883428443&w=2

Download:
http://people.sistina.com/~thornber/patches/2.5-stable/

Home page:
http://www.sistina.com/products_lvm.htm

Note: this is in the 2.5-ac tree, available at:
http://www.kernel.org/pub/linux/kernel/people/alan/

----------------------------------------------------------------------------

3) EVMS (Enterprise Volume Management System) (IBM, Contact: Kevin Corry)

Fighting with LVM2 for a place in the tree, a bigger solution to a bigger
set of problems:

Project page:
http://sourceforge.net/projects/evms

Home page:
http://evms.sourceforge.net

Download:
http://evms.sourceforge.net/patches/

Some related discussions:
http://marc.theaimsgroup.com/?t=103359686900003&r=1&w=2
http://marc.theaimsgroup.com/?t=103439913000001&r=1&w=2
http://marc.theaimsgroup.com/?w=2&r=1&s=%5Bpatch%5D+evms+core&q=t

----------------------------------------------------------------------------

4) New kernel configuration system (Roman Zippel)

Announcement:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6898.html

Code:
http://www.xs4all.nl/~zippel/lc/

Linus has actually looked fairly favorably on this one so far:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/3250.html

And an AOL for it:

http://lists.insecure.org/lists/linux-kernel/2002/Oct/8255.html

----------------------------------------------------------------------------

5) Linux Trace Toolkit (LTT) (Karim Yaghmour)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7016.html

Patch:
http://opersys.com/ftp/pub/LTT/ExtraPatches/patch-ltt-linux-2.5.44-vanilla-021022-2.2.bz2

User tools:
http://opersys.com/ftp/pub/LTT/TraceToolkit-0.9.6pre2.tgz

----------------------------------------------------------------------------

6) Kernel Probes (IBM, contact: Vamsi Krishna S)

Kprobes announcement:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528410215211&w=2

Base Kprobes Patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528425615302&w=2

KProbes->DProbes patches:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528454215523&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528454015520&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103528485415813&w=2

Official IBM download site for most recent versions (gzipped
tarballs):
http://www-124.ibm.com/linux/patches/?project_id=141

See also the DProbes Home Page:
http://oss.software.ibm.com/developerworks/opensource/linux/projects/dprobes

A good explanation of the difference between kprobes, dprobes,
and kernel hooks is here:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103532874900445&w=2

And a clarification: just kprobes is being submitted for
2.5.45, not the whole of dprobes:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103536827928012&w=2

----------------------------------------------------------------------------

7) High resolution timers (George Anzinger, etc.)

Home page:
http://high-res-timers.sourceforge.net/

Sourceforge download page for this patch:
http://sourceforge.net/projects/high-res-timers

Descriptions of each patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103557676007653&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103557677207693&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103558349714128&w=2

Linus had concerns with this one (possibly resolved?):
http://lists.insecure.org/lists/linux-kernel/2002/Oct/3463.html

----------------------------------------------------------------------------

8) Posix clocks and timers (non-highres) (George Anzinger or Jim Houston)

There are two different posix timer patches. The one from George Anzinger
is here:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103553654329827&w=2

An alternate version from Jim Houston is here:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103549000027416&w=2

----------------------------------------------------------------------------

9) Linux Kernel Crash Dumps (Matt Robinson, LKCD team)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103536576625905&w=2

Code:
http://lkcd.sourceforge.net/download/latest/

----------------------------------------------------------------------------

10) Rewrite of the console layer (James Simmons)

Home page:
http://linuxconsole.sourceforge.net/

Patch (Unknown version, but home page only has random CVS du jour link.):
http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz

Bitkeeper tree:
http://linuxconsole.bkbits.net


----------------------------------------------------------------------------

11) Kexec, launch new Linux kernel from Linux (Eric W. Biederman)

Announcement with links:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6584.html

And this thread is just too brazen not to include:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7952.html

----------------------------------------------------------------------------

12) USAGI IPv6 (Yoshifuji Hideyaki)

README:
ftp://ftp.linux-ipv6.org/pub/usagi/patch/ipsec/README.IPSEC

Patch:
ftp://ftp.linux-ipv6.org/pub/usagi/patch/ipsec/ipsec-2.5.43-ALL-03.patch.gz

----------------------------------------------------------------------------

13) MMU-less processor support (Greg Ungerer)

Announcement with lots of links:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/7027.html

----------------------------------------------------------------------------

14) sys_epoll (I.E. /dev/poll) (Davide Libenzi)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103542994232004&w=2

homepage:
http://www.xmailserver.org/linux-patches/nio-improve.html

Auto-updating URL to most recent patch:
http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-last.diff

Linus participated repeatedly in a thread on this one too, expressing
concerns which (hopefully) have been addressed. See:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6428.html

----------------------------------------------------------------------------

15) CD Recording/sgio patches (Jens Axboe)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/8060.html

Patch:
http://www.kernel.org/pub/linux/kernel/people/axboe/patches/v2.5/2.5.44/sgio-14b.diff.bz2

----------------------------------------------------------------------------

16) In-kernel module loader (Rusty Russell.)

Announce:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/6214.html

Patch:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/module-x86-18-10-2002.2.5.43.diff.gz

----------------------------------------------------------------------------

17) Unified Boot/Module parameter support (Rusty Russell)

Note: depends on in-kernel module loader.

Huge disorganized heap 'o patches with no explanation:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/

----------------------------------------------------------------------------

18) Hotplug CPU Removal (Rusty Russell)

Even bigger, more disorganized Heap 'o patches:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Hotplug/

----------------------------------------------------------------------------

19) Unlimited groups patch (Tim Hockin.)

Announce:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761319825&w=2

Patch set:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524717119443&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761819834&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761619831&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103524761519829&w=2

----------------------------------------------------------------------------

20) Initramfs (Al Viro)

Way back when, Al said:
http://www.cs.helsinki.fi/linux/linux-kernel/2001-30/0110.html

I THINK this is the most recent patch:
ftp://ftp.math.psu.edu/pub/viro/N0-initramfs-C40

And Linus recently made happy noises about the idea:
http://lists.insecure.org/lists/linux-kernel/2002/Oct/1110.html

----------------------------------------------------------------------------

21) Kernel Hooks (IBM contact: Vamsi Krishna S.)

Website:
http://www-124.ibm.com/linux/projects/kernelhooks/

Download site:
http://www-124.ibm.com/linux/patches/?patch_id=595

Posted patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103364774926440&w=2

----------------------------------------------------------------------------

22) NMI request/release interface (Corey Minyard)

He says:
> Add a request/release mechanism to the kernel (x86 only for now) for NMIs.
> ...
> I have modified the nmi watchdog to use this interface, and it
> seems to work ok. Keith Owens is copied to see if he would be
> interested in converting kdb to use this, if it gets put into the kernel.

There was a lot of back and forth, resulting in the latest patch (version 8):
http://marc.theaimsgroup.com/?l=linux-kernel&m=103555247911211&w=2

----------------------------------------------------------------------------

23) Digital Video Broadcasting Layer (LinuxTV team)

Home page:
http://www.linuxtv.org:81/dvb/

Download:
http://www.linuxtv.org:81/download/dvb/

----------------------------------------------------------------------------

24) DriverFS Topology (Matthew Dobson)

Announcement:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103523702710396&w=2

Patches:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540707113401&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757613962&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540758013984&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757513957&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=103540757813966&w=2

----------------------------------------------------------------------------

25) Advanced TCA Disk Hotswap (Steven Dake)

Announcement of most recent patch, with links:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103558466315221&w=2

Steven's comments:

> This is a generic feature that provides good hotswap support for SCSI
> and FibreChannel disk devices. The entire SCSI layer has been properly
> analyzed to provide correct locking and a complete RAMFS filesystem is
> available to control the kernel disk hotswap operations.
>
> Both Alan Cox and Greg KH have looked at the patch for 2.4 and suggested
> if I ported to 2.5 and made some changes (as I have in the latest port)
> this feature would be a good candidate for the 2.5 kernel.
>
> A thread discussing Advanced TCA hotswap (of which this patch is one
> part) can be found at:
> http://marc.theaimsgroup.com/?t=103462115700001&r=1&w=2

----------------------------------------------------------------------------

26) Mobile IPV6 (contact: Antti Tuominen)

Antti Tuominen says:

> We've been working on an implementation of Mobility Support in IPv6
> specification, called MIPL Mobile IPv6 for Linux. We are now trying
> to get it included in the kernel. Mobile IPv6 is an integral part of
> the IPv6 protocol.
>
> We've had discussion with Alexey Kuznetsov and Dave Miller. Dave says
> he does not know enough about IPv6, and trusts Alexey on this one.
> Alexey requested the patch to be split, which we did, and we are
> currently waiting for additional comments whether he is going to
> recommend inclusion.
>
> This project has nothing to do with USAGI IPv6 Project (though they do
> merge our code from time to time). However, we would benefit from
> having IPSec support for IPv6 in the kernel.
>
> MIPL Mobile IPv6 for Linux Project site:
> http://www.mipl.mediapoli.com/
>
> Patches:
> http://www.mipl.mediapoli.com/patches/

----------------------------------------------------------------------------

27) NUMA Scheduler Upgrade

Erich Focht and Michael Hohnbaum have two different NUMA scheduler
patches.

Michael has a stripped down NUMA scheduler, which he says was created
because the full Node Affine NUMA Scheduler didn't look like it would
be ready for 2.5. He talks about it here, with links to patches:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103548635122591&w=2

Meanwhile, Erich Focht says the full Node Affine NUMA Scheduler is
indeed ready for 2.5, and already in use at customer sites. He makes
his case here, with links to patches, home page, LWN review, etc:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103549657202782&w=2

Here's Erich's scheduler's home page:
http://home.arcor.de/efocht/sched/


======================== Unresolved issues: =========================

1) hyperthread-aware scheduler
2) connection tracking optimizations.

No URLs to patch. Anybody want to come out in favor of these
with an announcement and pointer to a version being suggested
for inclusion?

3) IPSEC (David Miller, Alexey)
4) New CryptoAPI (James Morris)

David S. Miller said:

> No URLs, being coded as I type this :-)
>
> Some of the ipv4 infrastructure is in 2.5.44

Note: this may conflict with Yoshifuji Hideyaki's IPv6 IPsec stuff. If not,
I'd like to collate or clarify the entries. USAGI IPv6 is in the first
section and this isn't because I have a URL to an existing patch for
USAGI, and don't for this. I have no idea how much overlap there is
between these projects, and whether they're considered parts of the
same project or submitted individually...

6) 32bit dev_t

Alan Cox said:

> The big one missing is 32bit dev_t. That's the killer item we have left.

But did not provide a URL to a patch. Presumably, it's in his tree and
is capable of being extracted out of it, so I guess it's already in
good hands? (I dunno, ask him.)

Then Dan Kegel pointed out:

> One possible page to quote for 32 bit dev_t:
> http://lwn.net/Articles/11583/

--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffeinated jello. Well why not?


2002-10-26 02:29:04

by James Morris

Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6) (crypto api info)

On Fri, 25 Oct 2002, Rob Landley wrote:

> 4) New CryptoAPI (James Morris)

This is currently available for review at:

http://samba.org/~jamesm/crypto/

The patch there tracks Dave Miller's tree.

There's not much documentation yet, so here's a brief rundown:

The API takes page vectors (scatterlists) as arguments, and works directly
on pages. This is required to support a clean IPsec implementation (which
is being worked on by Dave Miller and Alexey Kuznetsov), and is also a
requirement from Linus. In some cases (e.g. ECB mode ciphers), this will
allow for pages to be encrypted in place with no copying.

At the lowest level are algorithms, which register dynamically with the
API.

'Transforms' are user-instantiated objects, which maintain state, handle all
of the implementation logic (e.g. manipulating page vectors), provide an
abstraction to the underlying algorithms, and handle common logical
operations (e.g. cipher modes, HMAC for digests). However, at the user
level they are very simple.

Conceptually, the API layering looks like this:

  [transform api]  (user interface)
  [transform ops]  (per-type logic glue e.g. cipher.c, digest.c)
  [algorithm api]  (for registering algorithms)

The idea is to make the user interface and algorithm registration API
very simple, while hiding the core logic from both. Many good ideas
from existing APIs such as Cryptoapi and Nettle have been adapted for this.

The API currently supports three types of transforms: Ciphers, Digests and
Compressors. The compression algorithms especially seem to be performing
very well so far.

An asynchronous scheduling interface is in planning but not yet
implemented, as we need to further analyze the requirements of all of
the possible hardware scenarios (e.g. IPsec NIC offload).

Here's an example of how to use the API:

    #include <linux/crypto.h>

    struct scatterlist sg[2];
    char result[128];
    struct crypto_tfm *tfm;

    /* Allocate a transform object for the MD5 digest algorithm. */
    tfm = crypto_alloc_tfm(CRYPTO_ALG_MD5);
    if (tfm == NULL)
            fail();

    /* ... set up the scatterlists ... */

    /* Digest both scatterlist entries, then write out the result. */
    crypto_digest_init(tfm);
    crypto_digest_update(tfm, sg, 2);
    crypto_digest_final(tfm, result);

    crypto_free_tfm(tfm);


Many real examples are available in the regression test module (tcrypt.c).
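
The three layers described above can be modelled in ordinary user-space C.
What follows is only a sketch under stated assumptions, not the kernel code:
every name in it (toy_frag, toy_alg, toy_tfm, toy_register_alg,
toy_tfm_init, toy_digest, toy_fnv1a) is hypothetical, and the registered
"digest" is the non-cryptographic 32-bit FNV-1a hash, used purely as a
placeholder algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One input fragment; a stand-in for a kernel scatterlist entry. */
struct toy_frag {
    const void *buf;
    size_t len;
};

/* [algorithm api] What an algorithm supplies when it registers. */
struct toy_alg {
    const char *name;
    size_t ctx_size;                 /* per-transform state, in bytes */
    void (*init)(void *ctx);
    void (*update)(void *ctx, const uint8_t *data, size_t len);
    void (*final)(void *ctx, uint8_t *out);
};

#define TOY_MAX_ALGS 8
static const struct toy_alg *toy_algs[TOY_MAX_ALGS];
static size_t toy_nalgs;

/* Add an algorithm to the registry; returns -1 when the table is full. */
int toy_register_alg(const struct toy_alg *alg)
{
    if (toy_nalgs == TOY_MAX_ALGS)
        return -1;
    toy_algs[toy_nalgs++] = alg;
    return 0;
}

/* [transform api] The user-instantiated object that carries the state. */
struct toy_tfm {
    const struct toy_alg *alg;
    uint8_t ctx[64];
};

/* Bind a transform to a registered algorithm by name; -1 if unknown. */
int toy_tfm_init(struct toy_tfm *tfm, const char *name)
{
    for (size_t i = 0; i < toy_nalgs; i++) {
        if (strcmp(toy_algs[i]->name, name) == 0) {
            assert(toy_algs[i]->ctx_size <= sizeof(tfm->ctx));
            tfm->alg = toy_algs[i];
            return 0;
        }
    }
    return -1;
}

/* [transform ops] Digest glue: walks the fragments so that neither the
 * caller nor the algorithm has to know about the fragment list. */
void toy_digest(struct toy_tfm *tfm, const struct toy_frag *sg, size_t nsg,
                uint8_t *out)
{
    tfm->alg->init(tfm->ctx);
    for (size_t i = 0; i < nsg; i++)
        tfm->alg->update(tfm->ctx, sg[i].buf, sg[i].len);
    tfm->alg->final(tfm->ctx, out);
}

/* Placeholder algorithm: 32-bit FNV-1a. NOT a cryptographic digest. */
static void fnv_init(void *ctx)
{
    uint32_t h = 2166136261u;
    memcpy(ctx, &h, sizeof h);
}

static void fnv_update(void *ctx, const uint8_t *data, size_t len)
{
    uint32_t h;
    memcpy(&h, ctx, sizeof h);
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    memcpy(ctx, &h, sizeof h);
}

static void fnv_final(void *ctx, uint8_t *out)
{
    memcpy(out, ctx, sizeof(uint32_t));   /* writes 4 bytes */
}

const struct toy_alg toy_fnv1a = {
    .name = "fnv1a",
    .ctx_size = sizeof(uint32_t),
    .init = fnv_init,
    .update = fnv_update,
    .final = fnv_final,
};
```

A caller would register &toy_fnv1a once, bind a toy_tfm to "fnv1a", and
pass any number of fragments to toy_digest. Because the glue layer walks
the fragments, hashing "hello " and "world" as two fragments gives the same
4-byte result as hashing "hello world" as one.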


- James
--
James Morris
<[email protected]>


2002-10-26 02:36:14

by Hans Reiser

Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

Rob Landley wrote:

>Reiser4 is probably in this category as well, since Reiser3 went into
>the 2.4 stable series and Reiser4 claims to be a separate filesystem
>(like EXT2 and EXT3). Add in the fact that Hans Reiser still hasn't
>produced a patch yet, and the decision's pretty easy. (If you disagree,
>yell out now...)
>
We will probably release a "very beta" not intended for inclusion on the
27th, and ship a patch for inclusion on Halloween before midnight in
some time zone.

In the version we will ship on Sunday, reads are only 50% faster than
ext2/3 for reading the linux kernel source tree. We have an old kernel
in which reads are 105% faster when reading one copy of the linux kernel
tree. The delay until Halloween is so that we can figure out why reads
were faster in the old kernel, and hopefully make them 105% faster in
the newest version of the code.

Not sure you want to ship a 3.0 without it. It is 50-150% faster than
V3, which makes it a significant competitive advantage. I forget how
much faster writes are, something well over 100% faster, and the newest
version is faster yet.

How do I put it. I'm the last straggler coming back from the hunt, and
I've got what looks like it might be a wooly mammoth on my shoulders,
and my tribesmen are complaining that I'm late for dinner. How about
helping me by cutting down a tree for the roasting spit instead? Think
thoughts of the poor hungry Microsoft tribe eating NTFS.

Think thoughts of Microsoft suits going into some corporate board room
explaining that Windows is worth paying for because of the value add,
and some guy in sandals in the back suggests that the company could
replace all its 15k rpm SCSI hard drives with 5400rpm IDE drives, and
Linux would still be much faster than using NTFS.

Think thoughts of Microsoft's OFS catching up to ext2 at long last
(surely it will, with all the money they have spent to hire people), and
then discovering Windows still offers negative value add filesystem
performance-wise.

Oh, and it has features too, not just performance....

Hans

2002-10-26 07:47:12

by Andi Kleen

Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

Rob Landley <[email protected]> writes:

I have some patches to offer better-than-second resolution for stat.
That is needed for parallel make on MP systems, because otherwise it
cannot detect changes made by rules that need less than a second to
execute. With CPUs being as fast as they are, and still getting faster, it
is becoming a bigger problem. You don't hit it that easily with gcc, because
compile times are shrinking more slowly than CPUs are speeding up, so a
compile usually still takes longer than a second, but some people use make
rules with other compilers or other commands.

I see it in the same category as "necessary changes" similar to 32bit dev_t.

Linux already has several filesystems in tree that support ns, or at least
sub-second, resolution in their underlying formats (XFS, JFS, NFSv3, VFAT).

Patches available on request or older versions from
ftp://ftp.firstfloor.org/pub/ak/v2.5/nsec*
They don't actually add ns resolution, but jiffies resolution, which
is 1ms on 2.5 and should be good enough for now. It reuses reserved fields
in struct stat and doesn't need any user interface changes.
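
For illustration, here is how a caller can read a sub-second modification
time today. This is a sketch that assumes the st_mtim field later
standardized in POSIX.1-2008; it is not necessarily the field layout used by
the patch described here, and file_mtime is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/* Fetch a file's modification time, including the sub-second part.
 * Returns 0 on success and -1 if stat() fails. */
int file_mtime(const char *path, struct timespec *out)
{
    struct stat st;

    assert(path != NULL && out != NULL);
    if (stat(path, &st) != 0)
        return -1;
    *out = st.st_mtim;   /* tv_sec plus tv_nsec in [0, 999999999] */
    return 0;
}
```

A program that only cares about whole seconds can keep reading tv_sec and
ignore tv_nsec, which is why exposing the extra field does not force a
change on existing callers.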

It requires editing a lot of file systems in a straightforward way,
so it would be better done before the stable series starts.

There are some minor compatibility issues with fs that only support
second timestamps like ext2/ext3, see nsec.notes in the ftp directory
or past threads on that on the list.

-Andi

2002-10-26 08:09:59

by Andreas Dilger

Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

On Oct 26, 2002 09:53 +0200, Andi Kleen wrote:
> Patches available on request or older versions from
> ftp://ftp.firstfloor.org/pub/ak/v2.5/nsec*
> They don't actually add ns resolution, but jiffies resolution, which
> is 1ms on 2.5 and should be good enough for now. It reuses reserved fields
> in struct stat and doesn't need any user interface changes.
>
> It requires editing of a lot of file systems in a straightforward way,
> so should be better done before the stable series starts.
>
> There are some minor compatibility issues with fs that only support
> second timestamps like ext2/ext3, see nsec.notes in the ftp directory
> or past threads on that on the list.

Just as an FYI - this is "in the pipes" for ext2/ext3 also. The (very
basic) support for variable-sized inodes that Ted has already submitted
is the groundwork for it (among other things). We will then have
space to add usec timestamps to ext2/ext3 inodes for people who choose to
have larger inodes (new filesystems only, to start with), and when we get
more efficient EA storage we will also be able to store the "large inode
fields" in EAs for existing filesystems.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-10-26 15:05:10

by Eric W. Biederman

Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

Rob Landley <[email protected]> writes:

> I'm also looking for other things that can similarly be removed from
> this list and pushed for integration during the next stable series.
> Criteria for this: no API changes, and no impact on people who don't
> actually try to use the thing.
>
> If people familiar with these features can suggest stuff that's
> deferrable, please let me know. I've been trying very hard not to make
> judgement calls on these patches (not my job), but I'm certainly open
> to advice.

> 11) Kexec, launch new Linux kernel from Linux (Eric W. Biederman)
>
> Announcement with links:
> http://lists.insecure.org/lists/linux-kernel/2002/Oct/6584.html
>
> And this thread is just too brazen not to include:
> http://lists.insecure.org/lists/linux-kernel/2002/Oct/7952.html

sys_kexec introduces no new APIs to the rest of the kernel and is
fairly self-contained, making it non-intrusive enough that by that
criterion it may be deferred.

Currently the device driver support is lacking. sys_kexec calls the
reboot notifier call chain, and device_shutdown to shut devices down
cleanly. Due to a bug fix/cleanup that went into 2.5.44,
device_shutdown is neutered and does nothing.

So far, with all of the review sys_kexec has gotten, not one bug has
been found in its core. However, actually using it is problematic.
In the 2.5.44 kexec to 2.5.44 case quite a few devices cannot
reinitialize when the new kernel comes up.

The proof of concept of what sys_kexec can do is Etherboot
(http://www.etherboot.org). Etherboot contains real hardware drivers, often
adapted from the Linux kernel drivers. It is quite possible to boot
DOS from Etherboot, and I can quite definitely run all of setup.S.
However, when I attempt this from sys_kexec, in a number of significant
cases I cannot even reliably execute the BIOS calls in setup.S after
the kernel has run, though most of them can reliably be executed.

So the remaining work with sys_kexec is to track down why it is less
reliable than Etherboot. A few cases, like being loaded from loadlin,
where the BIOS interrupt table has hooks to code that is no longer
running, are quite explainable. The rest of the failures need more
investigation.

All of the hardware driver stabilization work for sys_kexec can be
postponed until after the feature freeze. And on that note I plan
on removing the few driver fixes in my current patch and posting a
stripped down version later today.

Having sys_kexec in the kernel makes what I am doing more explainable,
and makes people think a little differently about device_shutdown, and
the reboot notifier call chain. And with sys_kexec in the kernel
people mutating the internal kernel interfaces will be encouraged to
take sys_kexec into account.

My point is that while the sys_kexec patch is not especially intrusive
in and of itself, I am fairly certain usage of it can be stabilized
more easily in the kernel than outside of it.

Unless something comes up the plan for today is to incorporate the
very minor changes that have been suggested (Makefile, Config.in,
Config.help type), to split out the driver fixes I currently have
into separate patches, and post just a bare bones sys_kexec patch,
ready for inclusion in 2.5.

After the feature freeze I have on the todo list to look at porting
sys_kexec to the Itanium and Hammer, as well as building debugging
tools and in general debugging sys_kexec so it is generally useful.

Eric

2002-10-27 06:55:09

by Andi Kleen

Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

[cc'ed back to linux-kernel in case other people are interested in
the rationale again]

On Sat, Oct 26, 2002 at 03:09:06PM -0400, Andrew Pimlott wrote:
> On Sat, Oct 26, 2002 at 09:53:26AM +0200, Andi Kleen wrote:
> > I have some patches to offer better than second resolution for stat.
> > That is needed for parallel make on MP systems, because otherwise it
> > cannot detect changes that need less than a second to execute.
>
> Would you mind spelling out the problem case? It's usually not a
> big deal, because when a target and dependency have the same
> timestamp, make considers the target to be newer. Is there a
> subtlety with parallel make that I am not seeing? (And does it
> really depend on MP?)

I assume you mean 'older', not 'newer'?

Any default action is wrong in some case when a rule can take less
than a second; there is no replacement for an accurate time stamp.
Either the rule gets rebuilt too often (= annoying for the user because
it's slow) or not often enough (= broken result).

Parallel make makes it worse because parallel make processes need
the timestamp to communicate - make cannot know from internal data
if it has already run a rule or not. On MP systems the parallelism
is higher, making the problem show more often.


> > There are some minor compatibility issues with fs that only support
> > second timestamps like ext2/ext3, see nsec.notes in the ftp directory
> > or past threads on that on the list.
>
> I really feel strongly that you should not export resolution finer
> than what the filesystem can store. There is too much risk of
> breakage (especially given the late date of submission), and if (as
> you said) all common filesystems will be able to store sub-second
> timestamps soon, this shouldn't be a significant drawback. If this
> requires a new hook into the filesystem, so be it.

You have to export in some unit and it is convenient to use the most
fine-grained available (ns). This matches what other Unixes like
Solaris do too. The program can always choose to ignore the ns part
(which most will do, at least initially) or even round more.

What happens currently in my patch is that the inode in memory stores jiffies
resolution. As long as you don't run out of inode cache and need to
flush/reload an inode you always have the best resolution.

When an inode is flushed on an old fs with only second resolution the
subsecond part is truncated. This has the drawback that an inode
timestamp can jump backwards on reload as seen by user space.
Another way would be to round on flush, but that also has some problems:
for example you can get timestamps which are ahead of the current
wall clock. Neither of them is ideal. Rounding properly requires
hooks.
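
The two flush policies can be compared with a small sketch. It assumes
"round" means round to the nearest whole second; the helper names
flush_truncate and flush_round are hypothetical and do not come from the
patch.

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Truncate on flush: drop the sub-second part. The stored value is never
 * later than the in-memory one, so a reload can appear to jump backwards. */
struct timespec flush_truncate(struct timespec t)
{
    assert(t.tv_nsec >= 0 && t.tv_nsec < NSEC_PER_SEC);
    t.tv_nsec = 0;
    return t;
}

/* Round on flush: go to the nearest second. The stored value can be later
 * than the in-memory one, and so possibly ahead of the wall clock. */
struct timespec flush_round(struct timespec t)
{
    assert(t.tv_nsec >= 0 && t.tv_nsec < NSEC_PER_SEC);
    if (t.tv_nsec >= NSEC_PER_SEC / 2)
        t.tv_sec += 1;
    t.tv_nsec = 0;
    return t;
}
```

For an in-memory timestamp of 100.7 s, truncation stores 100 s (0.7 s in
the past), while rounding stores 101 s, which is still in the future if the
flush happens at, say, 100.8 s.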

In my current patchkit I just chose to truncate because that was the
easiest and the other more complicated solutions didn't offer any
compeling advantage. One can hope that nanosecond aware applications
know how to deal with these problems. One possibility would be to
do the same as Solaris here so that ns aware apps stay portable,
but I don't have access to a Solaris 8 system and cannot test what
they do. It's a kind of arbitrary choice. Also I don't see it as a big
problem.

Long term of course all the file systems should support fine grained
timestamps properly so that these issues do not occur anymore. I hope no new
file systems will make the same mistake of limiting the timestamp to second
resolution.

-Andi

2002-10-27 15:29:06

by Andrew Pimlott

[permalink] [raw]
Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

On Sun, Oct 27, 2002 at 08:01:25AM +0100, Andi Kleen wrote:
> On Sat, Oct 26, 2002 at 03:09:06PM -0400, Andrew Pimlott wrote:
> > Would you mind spelling out the problem case? It's usually not a
> > big deal, because when a target and dependency have the same
> > timestamp, make considers the target to be newer.
>
> I assume you mean 'older', not 'newer'?

No (but maybe I phrased it badly):

% cat Makefile
foo: bar
	echo did it
% touch foo bar
% ls --full-time foo bar
-rw-r--r-- 1 pimlott pimlott 0 Sun Oct 27 09:36:26 2002 bar
-rw-r--r-- 1 pimlott pimlott 0 Sun Oct 27 09:36:26 2002 foo
% make
make: `foo' is up to date.

Ie, foo is considered newer.
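
The tie-breaking rule shown in that transcript can be written down as a
minimal model. This is a sketch of the comparison only, not GNU make's
actual code; `needs_rebuild` is a name invented here.

```python
def needs_rebuild(target_mtime, dep_mtimes):
    """Model of make's decision for one target.

    The target is rebuilt if it does not exist (None) or if any
    dependency is strictly newer. Equal timestamps do not trigger
    a rebuild.
    """
    if target_mtime is None:
        return True
    return any(dep > target_mtime for dep in dep_mtimes)

# foo and bar touched in the same second: foo counts as up to date.
print(needs_rebuild(1035729386, [1035729386]))
# bar touched one second after foo: foo is rebuilt.
print(needs_rebuild(1035729386, [1035729387]))
```

The strict comparison is what makes a tie count in the target's favour.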

> Any default action is wrong in some case when a rule can take less
> than a second,

I'm sure there is a case where this is true, but my imagination and
googling failed to provide one. Even the messages to the GNU make
mailing list when Paul Eggert implemented nanosecond support didn't
include a specific rationale.

> there is no replacement for an accurate time stamp.

While I agree, I thought that a concrete example might help persuade
others. (I think I've even run into instances where second
resolution was a real problem, I just can't recall them.)

> > I really feel strongly that you should not export resolution finer
> > than what the filesystem can store. There is too much risk of
> > breakage (especially given the late date of submission), and if (as
> > you said) all common filesystems will be able to store sub-second
> > timestamps soon, this shouldn't be a significant drawback. If this
> > requires a new hook into the filesystem, so be it.
>
> You have to export in some unit and it is convenient to use the most
> finegrained available (ns). This matches what other Unixes like
> Solaris do too. The program can always choose to ignore the ns part
> (which most will do, at least initially) or even round more.
>
> What happens currently in my patch is that the inode in memory stores jiffies
> resolution. As long as you don't run out of inode cache and need to
> flush/reload an inode you always have the best resolution.
>
> When an inode is flushed on an old fs with only second resolution the
> subsecond part is truncated. This has the drawback that an inode
> timestamp can jump backwards on reload as seen by user space.

Example problem case (assuming a fs that stores only seconds, and a
make that uses nanoseconds):

- I run the "save and build" command while editing foo.c at T = 0.1.
- foo.o is built at T = 0.2.
- I do some read-only operations on foo.c (eg, checkin), such that
foo.o gets flushed but foo.c stays in memory.
- I build again. foo.o is reloaded and has timestamp T = 0, and so
gets spuriously rebuilt.
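
The four steps can be replayed in a few lines. This is a sketch under
the stated assumptions (seconds-only storage, truncation on flush, a
make that compares nanoseconds); the helper name is hypothetical and
times are integer nanoseconds.

```python
NSEC_PER_SEC = 1_000_000_000

def flush_and_reload(mtime_ns):
    """Model a seconds-only filesystem: the sub-second part is lost on disk."""
    return mtime_ns - mtime_ns % NSEC_PER_SEC

foo_c = 100_000_000   # saved at T = 0.1 s, inode stays in the cache
foo_o = 200_000_000   # built at T = 0.2 s

rebuild_before = foo_c > foo_o      # is the dependency newer than the target?
foo_o = flush_and_reload(foo_o)     # only foo.o's inode is evicted and re-read
rebuild_after = foo_c > foo_o
print(rebuild_before, rebuild_after)
```

Nothing in the model depends on how much wall-clock time passes between
the flush and the second build.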

> Another way would be to round on flush, but that also has some problems :-
> for example you can get timestamps which are ahead of the current
> wall clock.

Only if the flush is less than a second after the write, right?
How likely is that in Linux?

I tend to prefer the proposal to set the nanosecond field to 10^9-1.
At least my scenario above doesn't happen.
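
A sketch of that proposal, applied to the same scenario. The function
name is invented for illustration, and this only models the comparison
make would perform, not any kernel code.

```python
NSEC_PER_SEC = 1_000_000_000

def reload_with_max_nsec(stored_seconds):
    """Reload a seconds-only timestamp with the nanosecond field set to 10^9 - 1."""
    return stored_seconds * NSEC_PER_SEC + (NSEC_PER_SEC - 1)

foo_c = 100_000_000               # T = 0.1 s, still cached at full resolution
foo_o = reload_with_max_nsec(0)   # built at T = 0.2 s, stored on disk as second 0

spurious_rebuild = foo_c > foo_o
print(foo_o, spurious_rebuild)
```

The reloaded value is later than the real write time of 0.2 s, so in
this model it can still appear ahead of the clock if it is read back
within that same second.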

Andrew

2002-10-27 20:05:47

by Rob Landley

[permalink] [raw]
Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

On Friday 25 October 2002 21:42, Hans Reiser wrote:
> Rob Landley wrote:
> >Reiser4 is probably in this category as well, since Reiser3 went into
> >the 2.4 stable series and Reiser4 claims to be a seperate filesystem
> >(like EXT2 and EXT3). Add in the fact that Hans Reiser still hasn't
> >produced a patch yet, and the decision's pretty easy. (If you disagree,
> >yell out now...)
>
> We will probably release a "very beta" not intended for inclusion on the
> 27th, and ship a patch for inclusion on Halloween before midnight in
> some time zone.
...
> Not sure you want to ship a 3.0 without it.

What I want is irrelevant, it's Linus's call. I'm just doing a little
volunteer secretarial work until his return.

I'll put out one more version of the list this evening, and possibly one more
on Monday if Linus hasn't replied to it by then and there are significant
objections. But that's it.

> How do I put it. I'm the last straggler coming back from the hunt, and
> I've got what looks like it might be a wooly mammoth on my shoulders,
> and my tribesmen are complaining that I'm late for dinner.

Well you are. :)

Nice mammoth, though.

> Think thoughts of the poor hungry Microsoft tribe...

What, as in "glad we're not them"? (I suppose thanksgiving is coming up. But
the holiday in question is Halloween. Celebration of sweet tooth, not
generic gluttony. We Americans put enough effort into this area we need
specific holidays for the aspects of it, dont'cha know. :)

> Think thoughts of Microsoft...
...
> Think thoughts of Microsoft's...

And spoil my appetite for dinner?

I haven't personally been motivated by the potential for Microsoft to make any
sort of technical achievement since... At least 1996, probably earlier. I'm
not thinking about Microsoft, I'm thinking about Linux.

> Oh, and it has features too, not just performance....

I'm all for it. I just can't put a patch on the list that doesn't exist yet.
(I suppose the entry could say "go bug Hans for a patch"... :)

I'll think of something...

> Hans

Rob

--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffeinated jello. Well why not?

2002-10-28 00:21:04

by Hans Reiser

[permalink] [raw]
Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)



Rob Landley wrote:

>>Oh, and it has features too, not just performance....
>
>I'm all for it. I just can't put a patch on the list that doesn't exist yet.
>(I suppose the entry could say "go bug Hans for a patch"... :)
>
>I'll think of something...
>
>>Hans
>
>Rob

Thanks. The guys will hopefully put a very beta patch (for developers
only) on our download page sometime today (I am getting on an airplane
to the USA, and leaving this to them), and a release appropriate for
acceptance as an experimental filesystem will ship Halloween day.

We want to make one last disk format change before it goes into the
kernel. ;-) Yes, I know, plugins make disk format changes easy, but I
still don't want there to be plugins that were only used for one week if
I can avoid it, and our holding a final comprehensive disk format review
on Friday found something better changed.

Read performance got worse, and write performance got better. We'll
have updated benchmarks on our website by the end of the day.

Hans


2002-10-28 03:38:30

by Rob Landley

[permalink] [raw]
Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

On Sunday 27 October 2002 09:20, Andrew Pimlott wrote:

> Example problem case (assuming a fs that stores only seconds, and a
> make that uses nanoseconds):
>
> - I run the "save and build" command while editing foo.c at T = 0.1.
> - foo.o is built at T = 0.2.
> - I do some read-only operations on foo.c (eg, checkin), such that
> foo.o gets flushed but foo.c stays in memory.
> - I build again. foo.o is reloaded and has timestamp T = 0, and so
> gets spuriously rebuilt.

If your system, and your disks, are so fast that they can not only finish the
build in under a second but can also flush the cache and reload it from disk
in under a second, then:

A) the spurious rebuild is still a tiny fraction of a second.
B) You're seeing a penalty for using a filesystem that's too old for your
setup. This is a configuration problem in userspace.
C) How would having ALL times rounded to a second be an improvement?

Rob

--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffeinated jello. Well why not?

2002-10-28 04:02:41

by Andi Kleen

[permalink] [raw]
Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

> A) the spurious rebuild is still a tiny fraction of a second.

It can trigger other rebuilds which can take much longer.

-Andi

2002-10-28 04:10:37

by Andrew Pimlott

[permalink] [raw]
Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

On Sun, Oct 27, 2002 at 12:57:46PM -0500, Rob Landley wrote:
> On Sunday 27 October 2002 09:20, Andrew Pimlott wrote:
>
> > Example problem case (assuming a fs that stores only seconds, and a
> > make that uses nanoseconds):
> >
> > - I run the "save and build" command while editing foo.c at T = 0.1.
> > - foo.o is built at T = 0.2.
> > - I do some read-only operations on foo.c (eg, checkin), such that
> > foo.o gets flushed but foo.c stays in memory.
> > - I build again. foo.o is reloaded and has timestamp T = 0, and so
> > gets spuriously rebuilt.
>
> If your system, and your disks, are so fast that they can not only finish the
> build in under a second but can also flush the cache and reload it from disk
> in under a second

That is not required. The requirement is that, when the last step
happens (which can be any time in the future), (the inode of) foo.o
has been flushed, and foo.c hasn't. Step 3 argues that this is
plausible.

> C) How would having ALL times rounded to a second be an improvement?

foo.c and foo.o would both have timestamps of 0. make considers
the target foo.o newer in this case, so will not rebuild it.

Andrew

2002-10-28 04:23:46

by Andi Kleen

[permalink] [raw]
Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

On Sun, Oct 27, 2002 at 10:20:38AM -0500, Andrew Pimlott wrote:
>
> I'm sure there is a case where this is true, but my imagination and
> googling failed to provide one. Even the messages to the GNU make

foo: bar
	action1 <something that takes less than a second>

frob: foo
	action2 <something that takes a long time>


action1 is executed. foo and bar have the same time stamp. action2
is executed.


make runs again. Default rule sees foo.mtime == bar.mtime and starts
action1 and action2 again. action2 takes a long time. But it's unnecessary,
because bar has not really changed.


> Example problem case (assuming a fs that stores only seconds, and a
> make that uses nanoseconds):
>
> - I run the "save and build" command while editing foo.c at T = 0.1.
> - foo.o is built at T = 0.2.
> - I do some read-only operations on foo.c (eg, checkin), such that
> foo.o gets flushed but foo.c stays in memory.
> - I build again. foo.o is reloaded and has timestamp T = 0, and so
> gets spuriously rebuilt.

Yes, when your file system has only second resolution then you can get
spurious rebuilds if your inodes get flushed. There is no way my patch
can fix that. While some of the cases may be avoided by better
rounding, it would be better to handle such heuristics in user space
if you really wanted to be clever. Or just make sure you have enough
ram.

The point of my patchkit is to allow the file systems
that support better resolution to handle it properly. Other filesystems
are not worse than before when they flush inodes (and better off when
they keep everything in ram for your build because then they will enjoy
full time resolution).

My notes about possible problems with older fs were really not about
make, but about other programs that could see inconsistencies.

It's a fairly obscure case because the inode has to be flushed
and reloaded in less than a second (so not likely to trigger
often in practice)


>
> > Another way would be to round on flush, but that also has some problems :-
> > for example you can get timestamps which are ahead of the current
> > wall clock.
>
> Only if the flush is less than a second after the write, right?
> How likely is that in Linux?

Not very, but could happen in extreme cases.


>
> I tend to prefer the proposal to set the nanosecond field to 10^9-1.
> At least my scenario above doesn't happen.

If you really wanted that I would recommend changing make.
When all nanosecond parts are 0 it is reasonable for make to assume that
the fs doesn't support finegrained resolution. But I'm not sure it's
worth it.
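
One way such a heuristic could look in make is sketched below. This is
one reading of the suggestion, not an existing make feature: if either
timestamp has a zero nanosecond part, the comparison falls back to whole
seconds. The function name is hypothetical.

```python
NSEC_PER_SEC = 1_000_000_000

def needs_rebuild(target_ns, dep_ns):
    """Compare timestamps, falling back to seconds if either looks truncated."""
    if target_ns % NSEC_PER_SEC == 0 or dep_ns % NSEC_PER_SEC == 0:
        # A zero nanosecond part is taken as a sign of seconds-only storage.
        return dep_ns // NSEC_PER_SEC > target_ns // NSEC_PER_SEC
    return dep_ns > target_ns

# foo.o reloaded as T = 0, foo.c cached at T = 0.1 s: no spurious rebuild.
print(needs_rebuild(0, 100_000_000))
# Both cached at full resolution, dependency genuinely newer: rebuild.
print(needs_rebuild(100_000_000, 200_000_000))
```

A timestamp that genuinely lands on a whole second is treated as
truncated too, so the fallback is a guess and not a reliable test.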

-Andi

2002-10-28 04:25:49

by Andi Kleen

[permalink] [raw]
Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

> > C) How would having ALL times rounded to a second be an improvement?
>
> foo.c and foo.o would both have timestamps of 0. make considers
> the target foo.o newer in this case, so will not rebuild it.

But other stuff could break because it sees mtime > gettimeofday
(strictly make could trigger a "your clock is warping" warning)

It's a tradeoff; every option has some disadvantages. The simple truncation
wins here because it's the simplest (KISS).

-Andi

2002-10-28 06:07:10

by Andrew Pimlott

[permalink] [raw]
Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

On Mon, Oct 28, 2002 at 05:30:04AM +0100, Andi Kleen wrote:
> On Sun, Oct 27, 2002 at 10:20:38AM -0500, Andrew Pimlott wrote:
> >
> > I'm sure there is a case where this is true, but my imagination and
> > googling failed to provide one. Even the messages to the GNU make
>
> foo: bar
> action1 <something that takes less than a second>
>
> frob: foo
> action2 <something that takes a long time>
>
>
> action1 is executed. foo and bar have the same time stamp. action2
> is executed.

Try it:

% cat Makefile
foo: bar
	touch foo
frob: foo
	sleep 10
	touch frob
% rm foo bar frob
% touch bar
% make frob
touch foo
sleep 10
touch frob
% make frob
make: `frob' is up to date.

No problem with this case.

> make runs again. Default rule sees foo.mtime == bar.mtime and starts
> action1 and action2 again.

make is not that broken. (Well, according to one post I googled, it
was in 1970, but it was noticed and fixed, and the fixed behavior
has long been standardized.)

> > Example problem case (assuming a fs that stores only seconds, and a
> > make that uses nanoseconds):
> >
> > - I run the "save and build" command while editing foo.c at T = 0.1.
> > - foo.o is built at T = 0.2.
> > - I do some read-only operations on foo.c (eg, checkin), such that
> > foo.o gets flushed but foo.c stays in memory.
> > - I build again. foo.o is reloaded and has timestamp T = 0, and so
> > gets spuriously rebuilt.
>
> Yes, when your file system has only second resolution then you can get
> spurious rebuilds if your inodes get flushed. There is no way my patch
> can fix that.

I grant that second-resolution timestamps are broken. But you seem
to misunderstand how make works--the current problem is not that
severe. Whereas your change introduces a different problem that (in
my estimation) is more likely to appear, and will cause more pain.

I'm saying you're replacing a problem (bad granularity) that

- is well known
- is intuitive
- doesn't cause severe problems in practice (or at least, nobody
has provided an example)

with one (timestamps jumping at unpredictable times) that

- is obscure
- requires knowledge of kernel internals to understand
- will bite people (I claim, and have provided a concrete
example)
- will be wickedly hard to reproduce and diagnose

> The point of my patchkit is to allow the file systems
> who support better resolution to handle it properly.

If that is the point, why not leave the behavior unchanged for other
filesystems? (Other than that it would be a bit more work.)
Doesn't it make sense, on general principles, to be conservative?

> It's a fairly obscure case because the inode has to be flushed
> and reloaded in less than a second (so not likely to trigger
> often in practice)

If that were true, I would agree that it's probably not an issue in
practice. But unless I misunderstand, in the example I gave, the
flush and reload of foo.o can happen any time between the first and
second builds, which could be arbitrarily far apart. So I believe
it's a fairly plausible scenario.

Anyway, this isn't the biggest deal in the world. Maybe I'm wrong
and nobody will ever notice. But it doesn't seem like a good risk
to take.

Andrew

2002-11-08 21:30:57

by Pavel Machek

[permalink] [raw]
Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

Hi!

> The point of my patchkit is to allow the file systems
> who support better resolution to handle it properly. Other filesystems
> are not worse than before when they flush inodes (and better off when
> they keep everything in ram for your build because then they will enjoy
> full time resolution)

What about always rounding down even when inode is
in memory? That is both simple and consistent.

> If you really wanted that I would recommend to change make.
> When all nanosecond parts are 0 it is reasonable for make to assume that
> the fs doesn't support finegrained resolution. But I'm not sure it's
> worth it.

That's a really ugly heuristic. What about filling
nanosecond part with ~0 when unavailable?

Pavel

2002-11-09 01:10:13

by Andrew Pimlott

[permalink] [raw]
Subject: Re: The return of the return of crunch time (2.5 merge candidate list 1.6)

On Tue, Jan 15, 2002 at 12:44:28PM -0500, Pavel Machek wrote:
> > The point of my patchkit is to allow the file systems
> > who support better resolution to handle it properly. Other filesystems
> > are not worse than before when they flush inodes (and better off when
> > they keep everything in ram for your build because then they will enjoy
> > full time resolution)
>
> What about always rounding down even when inode is
> in memory? That is both simple and consistent.

I assume you mean, for filesystems that don't support high
resolution. That is what I think should be done as well. People
have gotten along with second resolution all these years, so it's no
great tragedy to make them continue; plus, it seems like the common
filesystems will support high resolution soon anyway.

Paul Eggert listed lots of programs that could be broken if
timestamps jump around. They could all implement heuristic
work-arounds, but that would be 1) a miracle and 2) a waste of
effort.

The only issue (as far as I can tell) is knowing when to round.
Hard-coding a list of filesystems seems reasonable (though even that
could be wrong if I load a newer filesystem module). Ideally,
though, you would add a simple hook into the filesystem that asks it
to round the timestamp for you. This would also accommodate
filesystems that don't store seconds or nanoseconds, but some other
resolution.
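
A rough sketch of what such a hook could look like, written as a
user-space model. Every name here (`Filesystem`, `granularity_ns`,
`round_timestamp`, `set_mtime`) is hypothetical and does not correspond
to any existing kernel interface; rounding is modelled as truncation to
the granularity the filesystem reports.

```python
from dataclasses import dataclass

NSEC_PER_SEC = 1_000_000_000

@dataclass
class Filesystem:
    name: str
    granularity_ns: int   # smallest timestamp step the on-disk format stores

    def round_timestamp(self, ts_ns):
        """Hook: reduce a timestamp to what this filesystem can store."""
        return ts_ns - ts_ns % self.granularity_ns

def set_mtime(fs, ts_ns):
    """Generic code asks the filesystem to round before caching the value."""
    return fs.round_timestamp(ts_ns)

ts = 3 * NSEC_PER_SEC + 200_000_000       # T = 3.2 s

seconds_only = Filesystem("seconds-only", NSEC_PER_SEC)
two_second = Filesystem("two-second", 2 * NSEC_PER_SEC)
nanosecond = Filesystem("nanosecond", 1)

for fs in (seconds_only, two_second, nanosecond):
    print(fs.name, set_mtime(fs, ts))
```

Because the in-memory value is already reduced to the stored
resolution, a later flush and reload cannot change what user space
sees, and a filesystem with an unusual resolution only has to report
its own granularity.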

Andrew