Message-ID: <48AEBBD6.6010707@novell.com>
Date: Fri, 22 Aug 2008 09:15:02 -0400
From: Gregory Haskins <ghaskins@novell.com>
User-Agent: Thunderbird 2.0.0.16 (X11/20080720)
MIME-Version: 1.0
To: Esben Nielsen <nielsen.esben@googlemail.com>
CC: mingo@elte.hu, paulmck@linux.vnet.ibm.com, peterz@infradead.org,
       tglx@linutronix.de, rostedt@goodmis.org, linux-kernel@vger.kernel.org,
       linux-rt-users@vger.kernel.org, gregory.haskins@gmail.com,
       David.Holmes@sun.com, jkacur@gmail.com
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface
References: <20080815202408.668.23736.stgit@dev.haskins.net>	 <20080815202823.668.26199.stgit@dev.haskins.net> <59929f7a0808220555r5a0b972cl5db047f74d7cabec@mail.gmail.com>
In-Reply-To: <59929f7a0808220555r5a0b972cl5db047f74d7cabec@mail.gmail.com>
OpenPGP: id=D8195319
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature";
 boundary="------------enig158CA6191FCD10A12A9890FD"
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 12584
Lines: 333

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enig158CA6191FCD10A12A9890FD
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hi Esben,
  Thank you for the review.  Comments inline.

Esben Nielsen wrote:
> Disclaimer: I am no longer actively involved and I must admit I might
> have lost out on much of
> what have been going on since I contributed to the PI system 2 years
> ago. But I allow myself to comment
> anyway.
>
> On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <ghaskins@novell.com>=
 wrote:
>  =20
>> The kernel currently addresses priority-inversion through priority-
>> inheritence.  However, all of the priority-inheritence logic is
>> integrated into the Real-Time Mutex infrastructure.  This causes a few=

>> problems:
>>
>>  1) This tightly coupled relationship makes it difficult to extend to
>>    other areas of the kernel (for instance, pi-aware wait-queues may
>>    be desirable).
>>  2) Enhancing the rtmutex infrastructure becomes challenging because
>>    there is no seperation between the locking code, and the pi-code.
>>
>> This patch aims to rectify these shortcomings by designing a stand-alo=
ne
>> pi framework which can then be used to replace the rtmutex-specific
>> version.  The goal of this framework is to provide similar functionali=
ty
>> to the existing subsystem, but with sole focus on PI and the
>> relationships between objects that can boost priority, and the objects=

>> that get boosted.
>>    =20
>
> This is really a good idea. When I had time (2 years ago) to actively
> work on these problem
> I also came to the conclusion that PI should be more general than just
> the rtmutex. Preemptive RCU
> was the example which drove it.
>
> But I do disagree that general objects should get boosted: The end
> targets are always tasks. The objects might
> be boosted as intermediate steps, but priority end the only applies to =
tasks.
>  =20
Actually I fully agree with you here.  Its probably just poor wording on
my part, but this is exactly what happens.  We may "boost" arbitrary
objects on the way to boosting a task...but the intermediate objects are
just there to help find our way to the proper tasks.  Ultimately
everything ends up at the scheduler eventually ;)

> I also have a few comments to the actual design:
>
>  =20
>> ....
>> +
>> +Multiple sinks per Node:
>> +
>> +We allow multiple sinks to be associated with a node.  This is a slig=
ht departure from the previous implementation which had the notion of onl=
y a single sink (i.e. "task->pi_blocked_on").  The reason why we added th=
e ability to add more than one sink was not to change the default chainin=
g model (I.e. multiple boost targets), but rather to add a flexible notif=
ication mechanism that is peripheral to the chain, which are informally c=
alled "leaf sinks".
>> +
>> +Leaf-sinks are boostable objects that do not perpetuate a chain per s=
e.  Rather, they act as endpoints to a priority boosting.  Ultimately, ev=
ery chain ends with a leaf-sink, which presumably will act on the new pri=
ority information.  However, there may be any number of leaf-sinks along =
a chain as well.  Each one will act on its localized priority in its own =
implementation specific way.  For instance, a task_struct pi-leaf may cha=
nge the priority of the task and reschedule it if necessary.  Whereas an =
rwlock leaf-sink may boost a list of reader-owners.
>>    =20
>
> This is bad from a RT point of view: You have a hard time determininig
> the number of sinks per node. An rw-lock could have an arbitrary
> number of readers (is supposed to really). Therefore
> you have no chance of knowing how long the boost/deboost operation
> will take. And you also know for how long the boosted tasks stay
> boosted. If there can be an arbitrary number of
> such tasks you can no longer be deterministic.
>  =20

While you may have a valid concern about what rwlocks can do to
determinism, not that we already have PI enabled rwlocks before my
patch, so I am not increasing nor decreasing determinism in this
regard.  That being said, Steven Rostedt (author of the pi-rwlocks,
CC'd) has facilities to manage this (such as limiting the number of
readers to num_online_cpus) which this design would retain.  Long story
short, I do not believe I have made anything worse here, so this is a
different discussion if you are still concerned.


>  =20
>> ...
>> +
>> +#define MAX_PI_DEPENDENCIES 5
>>    =20
>
>
> WHAT??? There is a finite lock depth defined. I know we did that
> originally but it wasn't hardcoded (as far as I remember) and
> it was certainly not as low as 5.
>  =20

Note that this is simply in reference to how many direct sinks you can
link to a node, not how long the resulting chain can grow.  The chain
depth is actually completely unconstrained by the design.  I chose "5"
here because typically we need 1 sink for the next link in the chain,
and 1 sink for local notifications.  The other 3 are there for head-room
(we often hit 3-4 as we transition between nodes (add one node -> delete
another, etc).

You are not the first to comment about this, however, so it makes me
realize it is not very clear ;)  I will comment the code better.


> Remember: PI is used by the user space futeces as well!
>  =20

Yes, and on a slight tangent from your point, this incidentally is
actually a problem in the design such that I need to respin at least a
v5.  My current design uses recursion against the sink->update()
methods, which Peter Zijlstra pointed out would blow up with large
userpspace chains.  My next version will forgo the recursion in favor of
an iterative method more reminiscent of the original design.

>  =20
>> ....
>> +/*
>> + * _pi_node_update - update the chain
>> + *
>> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale =
entries
>> + * that need to propagate up the chain.  This is a step-wise process =
where we
>> + * have to be careful about locking and preemption.  By trying MAX_PI=
_DEPs
>> + * times, we guarantee that this update routine is an effective barri=
er...
>> + * all modifications made prior to the call to this barrier will have=
 completed.
>> + *
>> + * Deadlock avoidance: This node may participate in a chain of nodes =
which
>> + * form a graph of arbitrary structure.  While the graph should techn=
ically
>> + * never close on itself barring any bugs, we still want to protect a=
gainst
>> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockd=
ep
>> + * from detecting this potential).  To do this, we employ a dual-lock=
ing
>> + * scheme where we can carefully control the order.  That is: node->l=
ock
>> + * protects most of the node's internal state, but it will never be h=
eld
>> + * across a chain update.  sinkref->lock, on the other hand, can be h=
eld
>> + * across a boost/deboost, and also guarantees proper execution order=
=2E Also
>> + * note that no locks are held across an sink->update.
>> + */
>> +static int
>> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
>> +{
>> +       struct pi_node    *node =3D node_of(sink);
>> +       struct pi_sinkref *sinkref;
>> +       unsigned long      iflags;
>> +       int                count =3D 0;
>> +       int                i;
>> +       int                pprio;
>> +       struct updater     updaters[MAX_PI_DEPENDENCIES];
>> +
>> +       spin_lock_irqsave(&node->lock, iflags);
>> +
>> +       pprio =3D node->prio;
>> +
>> +       if (!plist_head_empty(&node->srcs))
>> +               node->prio =3D plist_first(&node->srcs)->prio;
>> +       else
>> +               node->prio =3D MAX_PRIO;
>> +
>> +       list_for_each_entry(sinkref, &node->sinks, list) {
>> +               /*
>> +                * If the priority is changing, or if this is a
>> +                * BOOST/DEBOOST, we consider this sink "stale"
>> +                */
>> +               if (pprio !=3D node->prio
>> +                   || sinkref->state !=3D pi_state_boosted) {
>> +                       struct updater *iter =3D &updaters[count++];
>>    =20
>
> What prevents count from overrun?
>  =20

The node->sinks list will never have more than MAX_PI_DEPs in it, by desi=
gn.

>  =20
>> +
>> +                       BUG_ON(!atomic_read(&sinkref->sink->refs));
>> +                       _pi_sink_get(sinkref);
>> +
>> +                       iter->update  =3D 1;
>> +                       iter->sinkref =3D sinkref;
>> +                       iter->sink     =3D sinkref->sink;
>> +               }
>> +       }
>> +
>> +       spin_unlock(&node->lock);
>> +
>> +       for (i =3D 0; i < count; ++i) {
>> +               struct updater    *iter =3D &updaters[i];
>> +               unsigned int       lflags =3D PI_FLAG_DEFER_UPDATE;
>> +               struct pi_sink    *sink;
>> +
>> +               sinkref =3D iter->sinkref;
>> +               sink =3D iter->sink;
>> +
>> +               spin_lock(&sinkref->lock);
>> +
>> +               switch (sinkref->state) {
>> +               case pi_state_boost:
>> +                       sinkref->state =3D pi_state_boosted;
>> +                       /* Fall through */
>> +               case pi_state_boosted:
>> +                       sink->ops->boost(sink, &sinkref->src, lflags);=

>> +                       break;
>> +               case pi_state_deboost:
>> +                       sink->ops->deboost(sink, &sinkref->src, lflags=
);
>> +                       sinkref->state =3D pi_state_free;
>> +
>> +                       /*
>> +                        * drop the ref that we took when the sinkref
>> +                        * was allocated.  We still hold a ref from
>> +                        * above.
>> +                        */
>> +                       _pi_sink_put_all(node, sinkref);
>> +                       break;
>> +               case pi_state_free:
>> +                       iter->update =3D 0;
>> +                       break;
>> +               default:
>> +                       panic("illegal sinkref type: %d", sinkref->sta=
te);
>> +               }
>> +
>> +               spin_unlock(&sinkref->lock);
>> +
>> +               /*
>> +                * We will drop the sinkref reference while still hold=
ing the
>> +                * preempt/irqs off so that the memory is returned syn=
chronously
>> +                * to the system.
>> +                */
>> +               _pi_sink_put_local(node, sinkref);
>> +       }
>> +
>> +       local_irq_restore(iflags);
>>    =20
>
> Yack! You keep interrupts off while doing the chain.

Actually, not quite.  The first pass (with interrupts off) simply sets
the new priority value at each local element (limited to 5, typically
1-2).  Short and sweet.  Its the "update" that happens next (with
interrupts/preemption enabled) that updates the chain.


>  I think my main
> contribution to the PI system 2 years ago was to do this preemptively.
> I.e. there was points in the loop where interrupts and preemption
> where turned on.
>  =20

I agree this is important, but I think you will see with further review
that this is in fact what I do too.

> Remember: It goes into user space again. An evil user could craft an
> application with a very long lock depth and keep higher priority real
> time tasks from running for an arbitrary long time (if
> no limit on the lock depth is set, which is bad because it will be too
> low in some cases.)
>
> But as I said I have had no time to watch what has actually been going
> on in the kernel for the last 2 years roughly. The said defects might
> have creeped in by other contributers already :-(
>
> Esben
>  =20

Esben,
  Your review and insight are very much appreciated.  I will be sure to
address the concerns mentioned above and CC you on the next release.

Thanks again,
-Greg


--------------enig158CA6191FCD10A12A9890FD
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org

iEYEARECAAYFAkiuu9YACgkQlOSOBdgZUxkfIwCggUxizj7J+HdH/M+/YHhysHmK
nYgAn3TLXu7EqiUhSadbsGfnK1Li/i7e
=EcR8
-----END PGP SIGNATURE-----

--------------enig158CA6191FCD10A12A9890FD--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/