LinuxLists.cc - Re: netfront for review

2007-05-02 03:37:34

Subject: Re: netfront for review

On Tue, 2007-05-01 at 17:08 -0700, Jeremy Fitzhardinge wrote:
> Add the Xen virtual network device driver.

(Herbert: there's a question for you: grep for Herbert)

OK, this is a remarkably non-trivial driver. If the v0.1 of the driver
had been in the kernel, I'm sure it would have been about 1/4 the size
and far easier to review 8(

Nonetheless, I have scoured the entire thing. It's not actually too
bad.

> struct netfront_cb {
> struct page *page;
> unsigned offset;
>};

Comment this please. This is used when assembling incoming skbs, and
may well correspond to skb_shinfo(skb)->frags[0].page, but not always it
seems.

> struct netfront_info {
...
> struct net_device *netdev;
...
> unsigned int evtchn, irq;

The netdev has an irq field already. Using it will probably help
ifconfig output, too.

> u8 mac[ETH_ALEN];

You simply copy this into netdev->dev_addr: put it on the stack instead?

> /*
> * {tx,rx}_skbs store outstanding skbuffs. The first entry in tx_skbs
> * is an index into a chain of free entries.
> */
> struct sk_buff *tx_skbs[NET_TX_RING_SIZE+1];

This screams "union" to me, since you're stuffing ints into pointers.

> #define TX_MAX_TARGET min_t(int, NET_RX_RING_SIZE, 256)

This seems not to be used here, yet it's declared in the middle of the struct?

> grant_ref_t gref_tx_head;
> grant_ref_t grant_tx_ref[NET_TX_RING_SIZE + 1];
> grant_ref_t gref_rx_head;
> grant_ref_t grant_rx_ref[NET_RX_RING_SIZE];

There seems to be a correspondence between tx_skbs[], gref_tx_head and
grant_tx_ref[] - perhaps group them together with a nice comment?
Similarly the rx side.

> +/*
> + * Implement our own carrier flag: the network stack's version causes delays
> + * when the carrier is re-enabled (in particular, dev_activate() may not
> + * immediately be called, which can cause packet loss).
> + */
> +#define netfront_carrier_on(netif) ((netif)->carrier = 1)
> +#define netfront_carrier_off(netif) ((netif)->carrier = 0)
> +#define netfront_carrier_ok(netif) ((netif)->carrier)

Well, you only call netfront_carrier_on() from one place, so it's pretty
easy to do "netif_carrier_on(); dev_activate();" there.

I don't think this is critical though.

> + id = txrsp->id;
> + skb = np->tx_skbs[id];
> + if (unlikely(gnttab_query_foreign_access(
> + np->grant_tx_ref[id]) != 0)) {
> + printk(KERN_ALERT "xennet_tx_buf_gc: warning "
> + "-- grant still in use by backend "
> + "domain.\n");
> + BUG();
> + }

I shall resist the urge to point out the can of worms that Xen opened by
trying to allow domains to directly access each others' memory.

> if ( nr_flips != 0 ) {

Style.

> +static void xennet_make_frags(struct sk_buff *skb, struct net_device *dev,
> + struct netif_tx_request *tx)

This needs a big comment. AFAICT, entries in the ring cannot cross page
boundaries. And gso means that you have to put the first (partial) page
of the packet in the ring first, with the NETTXF_extra_info flag, then
the GSO stuff, then the rest of the packet. This results in this
strange xennet_make_frags which does everything after the first packet
page (which may be only *part* of the skb head).

This also complicates xennet_get_responses(), where the loop is awkward
because it has to get the first bit, then do the loop.

> skb_queue_head_init(&rxq);
> skb_queue_head_init(&errq);
> skb_queue_head_init(&tmpq);

I'd love a comment explaining exactly what these three queues are used
for. It seems that rxq is the queue of received skbs (the result), tmpq
is parts of the current skb for multi-fragment skbs, and errq is skbs to
be discarded, which are kept around during the function because ? (we
may need to unmap the pages?)

> + /*
> + * Truesize must approximates the size of true data plus

s/approximates/approximate/ or s/must //.

> +/*
> + * Nothing to do here. Virtual interface is point-to-point and the
> + * physical interface is probably promiscuous anyway.
> + */
> +static void xennet_set_multicast_list(struct net_device *dev)
> +{
> +}

Hmm, you can just leave this as NULL then. It will fail if someone
tries to set multicast on it, but that's probably correct behaviour.

> +static int xennet_change_mtu(struct net_device *dev, int mtu)
> +{
> + int max = xennet_can_sg(dev) ? 65535 - ETH_HLEN : ETH_DATA_LEN;
> +
> + if (mtu > max)
> + return -EINVAL;
> + dev->mtu = mtu;
> + return 0;
> +}

This seems odd to me: just because a device does TSO should we really
allow huge mtu settings? Herbert?

> + /* Initialise {tx,rx}_skbs as a free chain containing every entry. */
> + for (i = 0; i <= NET_TX_RING_SIZE; i++) {
> + np->tx_skbs[i] = (void *)((unsigned long) i+1);
> + np->grant_tx_ref[i] = GRANT_INVALID_REF;
> + }
> +
> + for (i = 0; i < NET_RX_RING_SIZE; i++) {
> + np->rx_skbs[i] = NULL;
> + np->grant_rx_ref[i] = GRANT_INVALID_REF;
> + }

Yay, a comment! Boo, it's wrong! rx_skbs isn't a chain at all.

> static struct net_device * __devinit xennet_create_dev(struct xenbus_device *dev)
> {
> int i, err = 0;

I wouldn't initialize err here: you don't need it, and it just prevents
gcc from warning about uninitialized uses if someone screws up the code.

> + for (i = 0; i < ETH_ALEN; i++) {
> + mac[i] = simple_strtoul(s, &e, 16);
> + if ((s == e) || (*e != ((i == ETH_ALEN-1) ? '\0' : ':'))) {
> + kfree(macstr);
> + return -ENOENT;
> + }
> + s = e+1;
> + }

There are several places in the kernel which do this. I think I'll
write a parse_macaddr() helper. But that's an aside.

> + txs = (struct netif_tx_sring *)get_zeroed_page(GFP_KERNEL);
> + if (!txs) {
> + err = -ENOMEM;
> + xenbus_dev_fatal(dev, err, "allocating tx ring page");
> + goto fail;
> + }
> + SHARED_RING_INIT(txs);
> + FRONT_RING_INIT(&info->tx, txs, PAGE_SIZE);
> +
> + err = xenbus_grant_ring(dev, virt_to_mfn(txs));
> + if (err < 0) {
> + free_page((unsigned long)txs);
> + goto fail;
> + }
> +
> + info->tx_ring_ref = err;
> + rxs = (struct netif_rx_sring *)get_zeroed_page(GFP_KERNEL);
> + if (!rxs) {
> + err = -ENOMEM;
> + xenbus_dev_fatal(dev, err, "allocating rx ring page");
> + goto fail;

Why doesn't this (and the following) need to free txs? Higher level
magic? In general this function seems to lack cleanup.

> + err = xenbus_scanf(XBT_NIL, np->xbdev->otherend,
> + "feature-rx-copy", "%u", &feature_rx_copy);
> + if (err != 1)
> + feature_rx_copy = 0;
> + err = xenbus_scanf(XBT_NIL, np->xbdev->otherend,
> + "feature-rx-flip", "%u", &feature_rx_flip);
> + if (err != 1)
> + feature_rx_flip = 1;

This second one deserves a comment. If it doesn't set feature_rx_flip
it's *on*? Historical reasons?

> + /*
> + * Recovery procedure:
> + * NB. Freelist index entries are always going to be less than
> + * PAGE_OFFSET, whereas pointers to skbs will always be equal or
> + * greater than PAGE_OFFSET: we use this property to distinguish
> + * them.
> + */

You know, this comment would be great near that union declaration. Not
here where it's a long way from the code which uses that fact.

> +static int xennet_sysfs_addif(struct net_device *netdev)
> +{
> + int i;
> + int error = 0;

Again with the unused error initialization (plus, elsewhere it's "err").

> +static void xennet_sysfs_delif(struct net_device *netdev)
> +{
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(xennet_attrs); i++) {
> + device_remove_file(&netdev->dev, &xennet_attrs[i]);
> + }
> +}

Gratuitous braces around single line.

> + printk(KERN_INFO "Initialising virtual ethernet driver.\n");

Some mention of Xen in this printk would be good, since we're already
have multiple virtual ethernet drivers.

Phew, that's the end.

Cheers!
Rusty.

2007-05-02 03:51:44

by Herbert Xu

[permalink] [raw]

Subject: Re: netfront for review

On Wed, May 02, 2007 at 01:37:13PM +1000, Rusty Russell wrote:
>
> > +static int xennet_change_mtu(struct net_device *dev, int mtu)
> > +{
> > + int max = xennet_can_sg(dev) ? 65535 - ETH_HLEN : ETH_DATA_LEN;
> > +
> > + if (mtu > max)
> > + return -EINVAL;
> > + dev->mtu = mtu;
> > + return 0;
> > +}
>
> This seems odd to me: just because a device does TSO should we really
> allow huge mtu settings? Herbert?

Actually this has nothing to do with TSO/GSO. This driver inherently
supports arbitrary MTUs that's only limited by our network stack. If
the physical network device can't handle it you'll just get an ICMP
error back or fragmentation.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-05-02 04:18:25

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: netfront for review

Rusty Russell wrote:
> On Tue, 2007-05-01 at 17:08 -0700, Jeremy Fitzhardinge wrote:
>
>> Add the Xen virtual network device driver.
>>
>
> (Herbert: there's a question for you: grep for Herbert)
>
> OK, this is a remarkably non-trivial driver. If the v0.1 of the driver
> had been in the kernel, I'm sure it would have been about 1/4 the size
> and far easier to review 8(
>
> Nonetheless, I have scoured the entire thing. It's not actually too
> bad.
>

Lovely, thanks. I'm at the double disadvantage of not really knowing
how the network stack works or precisely how Xen does networking, but
the homework exercises ("comment this") will definitely help my education.

J

2007-05-02 04:23:18

by Rusty Russell

[permalink] [raw]

Subject: Re: netfront for review

On Wed, 2007-05-02 at 13:51 +1000, Herbert Xu wrote:
> On Wed, May 02, 2007 at 01:37:13PM +1000, Rusty Russell wrote:
> >
> > > +static int xennet_change_mtu(struct net_device *dev, int mtu)
> > > +{
> > > + int max = xennet_can_sg(dev) ? 65535 - ETH_HLEN : ETH_DATA_LEN;
> >
> > This seems odd to me: just because a device does TSO should we really
> > allow huge mtu settings? Herbert?
>
> Actually this has nothing to do with TSO/GSO. This driver inherently
> supports arbitrary MTUs that's only limited by our network stack. If
> the physical network device can't handle it you'll just get an ICMP
> error back or fragmentation.

Oops, I misread "xennet_can_sg" as a test for GSO.

Thanks for the clarification,
Rusty.

2007-05-02 19:48:12

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: netfront for review

(Gerd, Herbert: there's some questions directed to you down there.)

Rusty Russell wrote:
>> /*
>> * {tx,rx}_skbs store outstanding skbuffs. The first entry in tx_skbs
>> * is an index into a chain of free entries.
>> */
>> struct sk_buff *tx_skbs[NET_TX_RING_SIZE+1];
>>
>
> This screams "union" to me, since you're stuffing ints into pointers.
>

This was a useful cleanup, and I think it revealed a bug.

Gerd, in change 11196:b85da7cd9ea5 "front: Fix rx buffer leak when
tearing down an interface." you added a call to
"add_id_to_freelist(np->rx_skbs, id);". However, rx_skbs doesn't have
an extra entry for the list head, and there's never any corresponding
get_id_from_freelist(np->rx_skbs). What should it be?

>> grant_ref_t gref_tx_head;
>> grant_ref_t grant_tx_ref[NET_TX_RING_SIZE + 1];
>> grant_ref_t gref_rx_head;
>> grant_ref_t grant_rx_ref[NET_RX_RING_SIZE];
>>
>
> There seems to be a correspondence between tx_skbs[], gref_tx_head and
> grant_tx_ref[] - perhaps group them together with a nice comment?
> Similarly the rx side.
>

Yep, rearranged. And I added an explicit separate freelist head for
tx_skbs, so it matches the grant side (which doesn't need the +1 as a
result).

>> +/*
>> + * Implement our own carrier flag: the network stack's version causes delays
>> + * when the carrier is re-enabled (in particular, dev_activate() may not
>> + * immediately be called, which can cause packet loss).
>> + */
>> +#define netfront_carrier_on(netif) ((netif)->carrier = 1)
>> +#define netfront_carrier_off(netif) ((netif)->carrier = 0)
>> +#define netfront_carrier_ok(netif) ((netif)->carrier)
>>
>
> Well, you only call netfront_carrier_on() from one place, so it's pretty
> easy to do "netif_carrier_on(); dev_activate();" there.
>
> I don't think this is critical though.
>

Are you saying that we wouldn't need to have a private carrier flag if
we do the "netif_carrier_on(); dev_activate()" sequence instead?

>> +static void xennet_make_frags(struct sk_buff *skb, struct net_device *dev,
>> + struct netif_tx_request *tx)
>>
>
> This needs a big comment. AFAICT, entries in the ring cannot cross page
> boundaries. And gso means that you have to put the first (partial) page
> of the packet in the ring first, with the NETTXF_extra_info flag, then
> the GSO stuff, then the rest of the packet. This results in this
> strange xennet_make_frags which does everything after the first packet
> page (which may be only *part* of the skb head).
>
> This also complicates xennet_get_responses(), where the loop is awkward
> because it has to get the first bit, then do the loop.
>
>
>> skb_queue_head_init(&rxq);
>> skb_queue_head_init(&errq);
>> skb_queue_head_init(&tmpq);
>>
>
> I'd love a comment explaining exactly what these three queues are used
> for. It seems that rxq is the queue of received skbs (the result), tmpq
> is parts of the current skb for multi-fragment skbs, and errq is skbs to
> be discarded, which are kept around during the function because ? (we
> may need to unmap the pages?)
>

Um, Herbert?

>> + txs = (struct netif_tx_sring *)get_zeroed_page(GFP_KERNEL);
>> + if (!txs) {
>> + err = -ENOMEM;
>> + xenbus_dev_fatal(dev, err, "allocating tx ring page");
>> + goto fail;
>> + }
>> + SHARED_RING_INIT(txs);
>> + FRONT_RING_INIT(&info->tx, txs, PAGE_SIZE);
>> +
>> + err = xenbus_grant_ring(dev, virt_to_mfn(txs));
>> + if (err < 0) {
>> + free_page((unsigned long)txs);
>> + goto fail;
>> + }
>> +
>> + info->tx_ring_ref = err;
>> + rxs = (struct netif_rx_sring *)get_zeroed_page(GFP_KERNEL);
>> + if (!rxs) {
>> + err = -ENOMEM;
>> + xenbus_dev_fatal(dev, err, "allocating rx ring page");
>> + goto fail;
>>
>
> Why doesn't this (and the following) need to free txs? Higher level
> magic? In general this function seems to lack cleanup.
>

Yes, I need to look at this more closely. It does seem that txs and rxs
will get leaked.

>> + err = xenbus_scanf(XBT_NIL, np->xbdev->otherend,
>> + "feature-rx-copy", "%u", &feature_rx_copy);
>> + if (err != 1)
>> + feature_rx_copy = 0;
>> + err = xenbus_scanf(XBT_NIL, np->xbdev->otherend,
>> + "feature-rx-flip", "%u", &feature_rx_flip);
>> + if (err != 1)
>> + feature_rx_flip = 1;
>>
>
> This second one deserves a comment. If it doesn't set feature_rx_flip
> it's *on*? Historical reasons

Guess so. It defaults to flip. I simplified the rx_copy/flip module
parameter to a simple rx_mode=0/1, but this is preserved from the
original. My guess is that originally there was only flip, and copy was
added later.

J

2007-05-03 07:34:09

by Gerd Hoffmann

[permalink] [raw]

Subject: Re: netfront for review

Hi,

> Gerd, in change 11196:b85da7cd9ea5 "front: Fix rx buffer leak when
> tearing down an interface." you added a call to
> "add_id_to_freelist(np->rx_skbs, id);". However, rx_skbs doesn't have
> an extra entry for the list head, and there's never any corresponding
> get_id_from_freelist(np->rx_skbs). What should it be?

The function has an effect in page flipping mode only. It walks the
whole list of rx skbufs (id is the loop variable ...), checks whenever
they are handed out to the frontend driver to fill in packet data and
not returned yet, and if so reclaim them ...

>>> + err = xenbus_scanf(XBT_NIL, np->xbdev->otherend,
>>> + "feature-rx-flip", "%u", &feature_rx_flip);
>>> + if (err != 1)
>>> + feature_rx_flip = 1;
>>>
>> This second one deserves a comment. If it doesn't set feature_rx_flip
>> it's *on*? Historical reasons
>
> Guess so. It defaults to flip. I simplified the rx_copy/flip module
> parameter to a simple rx_mode=0/1, but this is preserved from the
> original. My guess is that originally there was only flip, and copy was
> added later.

Yep, page flipping is the old (pre xen 3.0.3) way, copy was added (and
made the default) later.

cheers,
Gerd

2007-05-03 14:28:00

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: netfront for review

Gerd Hoffmann wrote:
>> Gerd, in change 11196:b85da7cd9ea5 "front: Fix rx buffer leak when
>> tearing down an interface." you added a call to
>> "add_id_to_freelist(np->rx_skbs, id);". However, rx_skbs doesn't have
>> an extra entry for the list head, and there's never any corresponding
>> get_id_from_freelist(np->rx_skbs). What should it be?
>
> The function has an effect in page flipping mode only. It walks the
> whole list of rx skbufs (id is the loop variable ...), checks whenever
> they are handed out to the frontend driver to fill in packet data and
> not returned yet, and if so reclaim them ...

Yes, but why use add_id_to_freelist? rx_skbs are not being used on a
freelist anywhere else. It just means the rx_skb array gets filled with
small integers, but the rest of the code assumes they're either NULL or
an skb pointer.

J

2007-05-03 14:31:10

by Keir Fraser

[permalink] [raw]

Subject: Re: netfront for review

On 3/5/07 15:27, "Jeremy Fitzhardinge" <[email protected]> wrote:

>> The function has an effect in page flipping mode only. It walks the
>> whole list of rx skbufs (id is the loop variable ...), checks whenever
>> they are handed out to the frontend driver to fill in packet data and
>> not returned yet, and if so reclaim them ...
>
> Yes, but why use add_id_to_freelist? rx_skbs are not being used on a
> freelist anywhere else. It just means the rx_skb array gets filled with
> small integers, but the rest of the code assumes they're either NULL or
> an skb pointer.

The need for it went away when Herbert Xu made the mapping between
receive-ring slots and receive-request/response identifiers static. I think
there was a race between Gerd writing his patch, Herbert removing the need
for add_id_to_freelist, and Gerd's patch being checked in.

-- Keir

2007-05-03 14:36:45

by Gerd Hoffmann

[permalink] [raw]

Subject: Re: netfront for review

Jeremy Fitzhardinge wrote:
> Gerd Hoffmann wrote:
>>> Gerd, in change 11196:b85da7cd9ea5 "front: Fix rx buffer leak when
>>> tearing down an interface." you added a call to
>>> "add_id_to_freelist(np->rx_skbs, id);". However, rx_skbs doesn't have
>>> an extra entry for the list head, and there's never any corresponding
>>> get_id_from_freelist(np->rx_skbs). What should it be?
>> The function has an effect in page flipping mode only. It walks the
>> whole list of rx skbufs (id is the loop variable ...), checks whenever
>> they are handed out to the frontend driver to fill in packet data and
>> not returned yet, and if so reclaim them ...
>
> Yes, but why use add_id_to_freelist? rx_skbs are not being used on a
> freelist anywhere else. It just means the rx_skb array gets filled with
> small integers, but the rest of the code assumes they're either NULL or
> an skb pointer.

Hmm, good point. Have to look at the code again, it has been some time
I've written that, and it took me some time to figure how all the grant
table stuff works ...

Maybe the add_id_to_freelist() call can simply be dropped. The whole
interface is released shortly thereafter, probably thats why filling the
freelist with yunk never caused any problems ...

cheers,
Gerd

2007-05-03 15:18:08

by Christoph Hellwig

[permalink] [raw]

Subject: Re: netfront for review

On Thu, May 03, 2007 at 09:33:43AM +0200, Gerd Hoffmann wrote:
> >Guess so. It defaults to flip. I simplified the rx_copy/flip module
> >parameter to a simple rx_mode=0/1, but this is preserved from the
> >original. My guess is that originally there was only flip, and copy was
> >added later.
>
> Yep, page flipping is the old (pre xen 3.0.3) way, copy was added (and
> made the default) later.

So can we please just rip out the obsolete code completely? There's
no point adding this code anymore.

2007-05-03 15:42:50

by Gerd Hoffmann

[permalink] [raw]

Subject: Re: netfront for review

Christoph Hellwig wrote:
> On Thu, May 03, 2007 at 09:33:43AM +0200, Gerd Hoffmann wrote:
>>> Guess so. It defaults to flip. I simplified the rx_copy/flip module
>>> parameter to a simple rx_mode=0/1, but this is preserved from the
>>> original. My guess is that originally there was only flip, and copy was
>>> added later.
>> Yep, page flipping is the old (pre xen 3.0.3) way, copy was added (and
>> made the default) later.
>
> So can we please just rip out the obsolete code completely? There's
> no point adding this code anymore.

Drawback is that the guest kernel wouldn't work with older xen versions
(dom0 netback driver to be exact) any more. Probably wouldn't be a
showstopper though, given that xen 3.0.3 probably is almost one year out
by the time 2.6.22 will be released ...

2007-05-03 16:00:53

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: netfront for review

Gerd Hoffmann wrote:
> Drawback is that the guest kernel wouldn't work with older xen
> versions (dom0 netback driver to be exact) any more. Probably
> wouldn't be a showstopper though, given that xen 3.0.3 probably is
> almost one year out by the time 2.6.22 will be released ...

I don't think we've decided how backwards-compatible the pv_ops guest
should be. I've been working on the basis "as much as possible", so
long as it isn't a large burden.

J