2017-04-25 08:37:16

by Waldemar Rymarkiewicz

Subject: Network cooling device and how to control NIC speed on thermal condition

Hi,

I am not very familiar with the Linux networking architecture, so I'd
like to ask first before I start digging into the code. I'd appreciate
any feedback.

I am looking at the Linux thermal framework and at how to cool down the
system effectively when it hits a thermal condition. The existing
cooling methods cpu_cooling and clock_cooling are good. However, I want
to go further and also dynamically control the switch ports' speed
based on the thermal condition. Lower speed means less power, and less
power means a lower temperature.

Is there any in-kernel interface to configure a switch port/NIC from another driver?

Is there any power-saving mechanism, built into the networking stack,
for when a port/interface is not really used (little or low data
traffic), or is that a task for the NIC driver itself?

I was thinking of creating a net_cooling device, similar to the
cpu_cooling device, which cools down the system by scaling down the CPU
frequency. net_cooling could lower the interface speed (or tune more
parameters to achieve that). Do you think this could work from a
networking stack perspective?
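
Roughly what I have in mind, as a very rough and untested sketch on top
of the thermal core (only the cooling-device ops and registration below
are the real thermal API; the net_cooling struct and the speed-setting
helper are hypothetical placeholders):

#include <linux/err.h>
#include <linux/errno.h>
#include <linux/netdevice.h>
#include <linux/thermal.h>

/* Hypothetical per-interface state; a real driver would define this. */
struct net_cooling {
        struct net_device *dev;
        unsigned long cur_state;
};

/* Hypothetical helper, not an existing kernel API: would renegotiate
 * the link of 'dev' to the speed matching 'state'. */
static int net_cooling_set_speed(struct net_device *dev, unsigned long state);

#define NET_COOLING_MAX_STATE   2       /* e.g. 0: 1Gb/s, 1: 100Mb/s, 2: 10Mb/s */

static int net_cooling_get_max_state(struct thermal_cooling_device *cdev,
                                     unsigned long *state)
{
        *state = NET_COOLING_MAX_STATE;
        return 0;
}

static int net_cooling_get_cur_state(struct thermal_cooling_device *cdev,
                                     unsigned long *state)
{
        struct net_cooling *nc = cdev->devdata;

        *state = nc->cur_state;
        return 0;
}

static int net_cooling_set_cur_state(struct thermal_cooling_device *cdev,
                                     unsigned long state)
{
        struct net_cooling *nc = cdev->devdata;

        if (state > NET_COOLING_MAX_STATE)
                return -EINVAL;

        nc->cur_state = state;
        return net_cooling_set_speed(nc->dev, state);
}

static const struct thermal_cooling_device_ops net_cooling_ops = {
        .get_max_state  = net_cooling_get_max_state,
        .get_cur_state  = net_cooling_get_cur_state,
        .set_cur_state  = net_cooling_set_cur_state,
};

/* Registered e.g. from the MAC/PHY driver's probe. */
static struct thermal_cooling_device *
net_cooling_register(struct net_cooling *nc)
{
        return thermal_cooling_device_register("net_cooling", nc,
                                               &net_cooling_ops);
}

The thermal core would then call set_cur_state() whenever a bound trip
point asks for more or less cooling, just as it does for cpu_cooling.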

Any pointers to the code or a doc highly appreciated.

Thanks,
/Waldek


2017-04-25 13:18:11

by Andrew Lunn

Subject: Re: Network cooling device and how to control NIC speed on thermal condition

On Tue, Apr 25, 2017 at 10:36:28AM +0200, Waldemar Rymarkiewicz wrote:
> Hi,
>
> I am not very familiar with the Linux networking architecture, so
> I'd like to ask first before I start digging into the code. I'd
> appreciate any feedback.
>
> I am looking at the Linux thermal framework and at how to cool down
> the system effectively when it hits a thermal condition. The existing
> cooling methods cpu_cooling and clock_cooling are good. However, I
> want to go further and also dynamically control the switch ports'
> speed based on the thermal condition. Lower speed means less power,
> and less power means a lower temperature.
>
> Is there any in-kernel interface to configure a switch port/NIC from another driver?

Hi Waldemar

Linux models switch ports as network interfaces, so mostly there is
little difference between a NIC and a switch port. What you define for
one should work for the other. Mostly.

However, I don't think you need to be too worried about the NIC level
of the stack. You can mostly do this higher up in the stack. I would
expect there is a relationship between packets per second (PPS) and
generated heat. You might want the NIC to give you some sort of
heating coefficient, e.g. 1 PPS is ~10 uC. Given that, you want to
throttle the PPS in the generic queuing layers. This sounds like a TC
filter: you would have userspace install a TC filter, which acts as
the net_cooling device.
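
As a toy illustration of the coefficient idea (every name and number
below is made up, none of this is an existing API), the throttling
layer would just turn the remaining thermal headroom into a packet
budget:

#include <linux/kernel.h>

/* Illustration only: convert the remaining thermal headroom (in uC)
 * into a packets-per-second budget using a per-packet heating
 * coefficient reported by the NIC, e.g. "1 PPS ~ 10 uC". */
static unsigned long net_cooling_pps_budget(unsigned long headroom_uc,
                                            unsigned long uc_per_pps)
{
        if (!uc_per_pps)
                return ULONG_MAX;       /* no coefficient: do not throttle */

        return headroom_uc / uc_per_pps;
}

The TC filter would then just enforce that budget.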

This, however, does not directly work for so-called 'fastpath' traffic
in switches. Frames which ingress one switch port and egress another
switch port are mostly never seen by Linux, so a software TC filter
will not affect them. However, there is infrastructure in place to
accelerate TC filters by pushing them down into the hardware, so the
same basic concept can be used for switch fastpath traffic, but it
requires a bit more work.

Andrew

2017-04-25 13:45:29

by Alan Cox

Subject: Re: Network cooling device and how to control NIC speed on thermal condition

> I am looking at the Linux thermal framework and at how to cool down
> the system effectively when it hits a thermal condition. The existing
> cooling methods cpu_cooling and clock_cooling are good. However, I
> want to go further and also dynamically control the switch ports'
> speed based on the thermal condition. Lower speed means less power,
> and less power means a lower temperature.
>
> Is there any in-kernel interface to configure a switch port/NIC from another driver?

No, but you can always hook that kind of functionality into the thermal
daemon. However, I'd be careful with your assumptions: lower speed also
means more time active.

https://github.com/01org/thermal_daemon

For example if you run a big encoding job on an atom instead of an Intel
i7, the atom will often not only take way longer but actually use more
total power than the i7 did.

Thus it would often be far more efficient to time synchronize your
systems, batch up data on the collecting end, have the processing node
wake up on an alarm, collect data from the other node and then actually
go back into suspend.

Modern processors are generally very good in the idle state (sometimes
less so the platform around them), so trying to lower speeds may
actually be the wrong thing to do, versus, say, trying to batch up
activity so that you handle a burst and then sleep the entire platform.

It also makes sense to keep policy like that mostly in user space,
because what you do is going to be very device specific, e.g. things
like dimming the screen, lowering the wifi power, pausing some system
services, pausing battery charging, etc.

Now, at platform design time there are some interesting trade-offs
between 100Mbit and 1Gbit Ethernet, although less so than there used to
be 8)

Alan

2017-04-25 16:23:14

by Florian Fainelli

Subject: Re: Network cooling device and how to control NIC speed on thermal condition

Hello,

On 04/25/2017 01:36 AM, Waldemar Rymarkiewicz wrote:
> Hi,
>
> I am not very familiar with the Linux networking architecture, so
> I'd like to ask first before I start digging into the code. I'd
> appreciate any feedback.
>
> I am looking at the Linux thermal framework and at how to cool down
> the system effectively when it hits a thermal condition. The existing
> cooling methods cpu_cooling and clock_cooling are good. However, I
> want to go further and also dynamically control the switch ports'
> speed based on the thermal condition. Lower speed means less power,
> and less power means a lower temperature.
>
> Is there any in-kernel interface to configure a switch port/NIC from another driver?

Well, there is, mostly in the form of notifiers though. For instance,
there are lots of devices that do converged FCoE/RoCE/Ethernet and have
a two-headed set of drivers, one for normal Ethernet and another one
for RDMA/IB. To some extent, stacked devices (VLAN, bond, team, etc.)
also call back down into their lower device, but in an abstracted way,
at the net_device level of course (layering).
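
For reference, the most generic of those hooks is the netdevice
notifier chain; a minimal sketch is below (the callback body is just a
placeholder, and note that this lets another driver observe port
events, not reconfigure the port):

#include <linux/init.h>
#include <linux/netdevice.h>
#include <linux/notifier.h>

static int net_cooling_netdev_event(struct notifier_block *nb,
                                    unsigned long event, void *ptr)
{
        struct net_device *dev = netdev_notifier_info_to_dev(ptr);

        switch (event) {
        case NETDEV_UP:
        case NETDEV_DOWN:
        case NETDEV_CHANGE:
                /* react here to admin/link state changes of 'dev' */
                netdev_dbg(dev, "netdev event %lu\n", event);
                break;
        }

        return NOTIFY_DONE;
}

static struct notifier_block net_cooling_netdev_nb = {
        .notifier_call = net_cooling_netdev_event,
};

static int __init net_cooling_notifier_init(void)
{
        return register_netdevice_notifier(&net_cooling_netdev_nb);
}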

>
> Is there any power-saving mechanism, built into the networking
> stack, for when a port/interface is not really used (little or low
> data traffic), or is that a task for the NIC driver itself?

The thing we did (currently out of tree) in the Starfighter 2 switch
driver (drivers/net/dsa/bcm_sf2.c) is that any time a port is brought
up/down (a port = a network device) we recalculate the switch core
clock and also resize the buffers, and that yields a little bit of
power savings here and there. I don't recall the numbers off the top
of my head, but it was significant enough that our HW designers
convinced me into doing it ;)

>
> I was thinking of creating a net_cooling device, similar to the
> cpu_cooling device, which cools down the system by scaling down the
> CPU frequency. net_cooling could lower the interface speed (or tune
> more parameters to achieve that). Do you think this could work from a
> networking stack perspective?

This sounds like a good idea, but it could be very tricky to get right,
because even if you can somehow throttle your transmit activity (since
the host is in control), you can't do that without being disruptive to
the receive path (or not as effectively).

Unlike any kind of host-driven activity (CPU run queue, block devices,
USB, etc., and SPI, I2C and so on when not using slave-driven
interrupts), you cannot simply apply a "duty cycle" pattern where you
turn on your HW just long enough to set it up for a transfer, signal
transfer completion and go back to sleep. Networking needs to be able
to asynchronously receive packets in a way that is usually not
predictable, although it could be for very specific workloads.

Another thing is that there is still a fair amount of energy that needs
to be spent on maintaining the link, and the HW design may be entirely
clocked based on the link speed. Depending on the HW architecture
(store-and-forward, cut-through, etc.) there would still be a cost
associated with keeping the RAMs in an operational state, and so on.

You could imagine writing a queuing discipline driver that would
throttle transmission based on temperature sensors present in your NIC;
you could definitely do this in a way that is completely device-driver
agnostic by using the Linux thermal framework's trip point and
temperature notifications.
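
The thermal-framework side of that could start out as simple as the
sketch below (the zone name and the threshold are made up, and a real
implementation would rather bind to trip-point notifications through a
cooling device than poll a zone by name):

#include <linux/err.h>
#include <linux/thermal.h>
#include <linux/types.h>

/* Sketch: ask the thermal framework for the NIC temperature and decide
 * whether a (hypothetical) net_cooling qdisc should throttle. The
 * "nic-thermal" zone name and the 85 C threshold are made-up examples. */
static bool net_cooling_over_temp(void)
{
        struct thermal_zone_device *tz;
        int temp;                       /* millidegrees Celsius */

        tz = thermal_zone_get_zone_by_name("nic-thermal");
        if (IS_ERR(tz))
                return false;           /* no such zone: never throttle */

        if (thermal_zone_get_temp(tz, &temp))
                return false;           /* read failed: do not throttle */

        return temp > 85000;            /* throttle above 85 C */
}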

For reception, if you are okay with dropping some packets, you could
implement something similar, but chances are that your NIC would still
need to receive packets and fully process them before SW drops them, at
which point you have a myriad of options for how not to process
incoming traffic.

Hope this helps

>
> Any pointers to the code or a doc highly appreciated.
>
> Thanks,
> /Waldek
>


--
Florian

2017-04-28 08:05:02

by Waldemar Rymarkiewicz

Subject: Re: Network cooling device and how to control NIC speed on thermal condition

On 25 April 2017 at 15:45, Alan Cox <[email protected]> wrote:
>> I am looking at the Linux thermal framework and at how to cool down
>> the system effectively when it hits a thermal condition. The existing
>> cooling methods cpu_cooling and clock_cooling are good. However, I
>> want to go further and also dynamically control the switch ports'
>> speed based on the thermal condition. Lower speed means less power,
>> and less power means a lower temperature.
>>
>> Is there any in-kernel interface to configure a switch port/NIC from another driver?
>
> No, but you can always hook that kind of functionality into the
> thermal daemon. However, I'd be careful with your assumptions: lower
> speed also means more time active.
>
> https://github.com/01org/thermal_daemon

This is indeed one of the options, and I will consider it as well. I
would prefer, however, a generic solution in the kernel (configurable,
of course), as every network device can generate more heat at a higher
link speed.

> For example if you run a big encoding job on an atom instead of an Intel
> i7, the atom will often not only take way longer but actually use more
> total power than the i7 did.
>
> Thus it would often be far more efficient to time synchronize your
> systems, batch up data on the collecting end, have the processing node
> wake up on an alarm, collect data from the other node and then actually
> go back into suspend.

Yes, that's true under normal thermal conditions. However, if the
platform reaches the max temperature trip, we don't really care about
performance and time efficiency; we just try to avoid the critical trip
and a system shutdown by cooling the system, e.g. by lowering the CPU
frequency, limiting the USB PHY speed, the network link speed, etc.

I did a quick test to show you what I mean.

I collect the SoC temperature every few seconds. Meanwhile, I use
ethtool -s ethX speed <speed> to change the link speed and see how it
impacts the SoC temperature. My 4 PHYs and the switch are integrated
into the SoC, and I always change the link speed for all PHYs; there is
no traffic on the links for this test. Starting at 1Gb/s and then
scaling down to 100Mb/s and then to 10Mb/s, I see a significant ~10 *C
drop in temperature while the links are set to 10Mb/s.

So, throttling link speed can really help to dissipate heat
significantly when the platform is under threat.

Renegotiating the link speed costs something, I agree, and it also
impacts the user experience, but I believe such a thermal condition
will not occur often.


/Waldek

2017-04-28 11:56:40

by Andrew Lunn

Subject: Re: Network cooling device and how to control NIC speed on thermal condition

> I collect the SoC temperature every few seconds. Meanwhile, I use
> ethtool -s ethX speed <speed> to change the link speed and see how it
> impacts the SoC temperature. My 4 PHYs and the switch are integrated
> into the SoC, and I always change the link speed for all PHYs; there
> is no traffic on the links for this test. Starting at 1Gb/s and then
> scaling down to 100Mb/s and then to 10Mb/s, I see a significant
> ~10 *C drop in temperature while the links are set to 10Mb/s.

Is that a realistic test? No traffic over the network? If you are
hitting your thermal limit, to me that means one of two things:

1) The device is under very heavy load, consuming a lot of power to do
what it needs to do.

2) Your device is idle, no packets are flowing, but your thermal
design is wrong, so that it cannot dissipate enough heat.

It seems to me you are more interested in 1), but your quick test is
more about 2).

I would be more interested in doing quick tests of switching to 8Gbps,
4Gbps, 2Gbps, 1Gbps, 512Mbps, 256Mbps, ... What effect does this have
on temperature?

> So, throttling link speed can really help to dissipate heat
> significantly when the platform is under threat.
>
> Renegotiating the link speed costs something, I agree, and it also
> impacts the user experience, but I believe such a thermal condition
> will not occur often.

It is a heavy-handed approach, and you have to be careful. There are
some devices which don't work properly; e.g. if you try to negotiate
1000 half duplex, you might find the link just breaks.

Doing this via packet filtering, dropping packets, gives you much
finer-grained control and is a lot less disruptive. But it assumes that
handling packets is what is causing your heat problems, not the links
themselves.

Andrew