Lime2 instability

Started by olHelp, July 26, 2015, 10:37:45 PM

olHelp

Hello,

I am having stability problems with the Lime2. I intend to use it as a server, running a service that will put a threaded load of about 35% CPU on an A7 @ 1 GHz, with mild I/O on the storage and about 200Kb/s network load, and no planned downtime.
Connected to the board are the barrel connector, Ethernet, a USB disk for storage and the microSD card containing Linux (I tried different distributions). I am using the black enclosure.


However, the Lime2 crashes for unknown reasons after 2-11 h. There is nothing in the logs, so I guessed it is related to a bad power supply or temperature problems. Measures taken:

1) Tried a different 5 V 1 A supply (using Ubuntu and a self-compiled mainline kernel)
2) Opened the enclosure
3) Tried a different USB stick
4) Switched to a 5 V 2 A supply
5) Used cpufreq to limit the frequency to ~720 MHz and after that ~560 MHz
6) Tried a different Linux distribution on a different SD card (Arch Linux ARM)
7) Removed the USB stick and moved storage onto the SD card for testing

...however, it's still not stable.
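Step 5 above (capping the CPU frequency) can also be done by writing to the cpufreq sysfs files directly. A minimal Python sketch, assuming the kernel exposes the standard scaling_max_freq file for each core; the cap_max_freq helper and its cpu_root parameter are made up for illustration, and the demo runs against a throwaway directory rather than the real /sys:

```python
import tempfile
from pathlib import Path


def cap_max_freq(khz, cpu_root="/sys/devices/system/cpu"):
    """Write `khz` into scaling_max_freq for every cpuN found under cpu_root.

    Hypothetical helper: needs root on a real system and only works when
    the kernel has a cpufreq driver loaded. Returns the files it wrote.
    """
    written = []
    for node in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_max_freq")):
        node.write_text(f"{khz}\n")
        written.append(node)
    return written


# Demo on a fake sysfs tree so nothing on the host is touched.
fake_root = Path(tempfile.mkdtemp())
for cpu in ("cpu0", "cpu1"):
    policy = fake_root / cpu / "cpufreq"
    policy.mkdir(parents=True)
    (policy / "scaling_max_freq").write_text("960000\n")

capped = cap_max_freq(720000, cpu_root=fake_root)
for node in capped:
    print(node.relative_to(fake_root), "->", node.read_text().strip())
```

On the board itself the call would be cap_max_freq(720000) as root; the value is in kHz and the governor may round it to the nearest available operating point.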

Do you have any suggestions as to what else could be wrong, or how to diagnose the problem?
The Lime2 sits right next to an old PandaBoard that has been running rock stable for years under the same conditions.


Gerrit

What current rating is printed on your HD?

Try using the 1 A supply exclusively for the board, and give the USB disk its own external power supply or, if that is not possible, connect it through a powered USB hub.

soenke

Try connecting a serial console via UART0 (the 3 pins beside the Ethernet port) and see if there is any kernel output. If you see something like "mmcqd blocked ....", it is a half-broken controller on the SD card. I have had exactly the same issues with about 5 microSD cards (SanDisk Extreme PRO) so far.

JohnS

I'd start by seeing if there are kernel messages on the console uart as it crashes (or before that).

I like to leave the console open from another Linux system, logging the output, for example with screen -L.
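If the plain log is not enough to tell when the board went quiet, each console line can be timestamped as it arrives. A rough Python sketch of that idea; log_with_timestamps is a made-up helper, and the canned text below only stands in for the real serial stream:

```python
import io
import time


def log_with_timestamps(source, sink, clock=time.time):
    """Copy lines from `source` to `sink`, prefixing each with the local time.

    Returns the last line seen, i.e. the final console output before the
    stream ended or the board went silent.
    """
    last = None
    for line in source:
        text = line.rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        sink.write(f"[{stamp}] {text}\n")
        sink.flush()  # keep the log current even if the logging host is interrupted
        last = text
    return last


# Demo with placeholder text standing in for the serial port.
console = io.StringIO("first console line\r\nlast line before the hang\r\n")
log = io.StringIO()
last_line = log_with_timestamps(console, log)
print(log.getvalue(), end="")
```

On a real setup, source could be a file object for the USB serial adapter (for example /dev/ttyUSB0, an assumed device name) once the port has been set to the console's baud rate, and sink an ordinary log file.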

John

olHelp

Hey,

I am not even using a USB HDD, just a USB stick. Sorry if I wasn't clear.

I connected another board via UART0; however, minicom just switched to offline, and there is no message when the board stops working.
After that I moved the rootfs onto the USB stick and pulled the SD card after booting; it still stops working after a while. Same result when connecting the USB stick via a powered USB hub.

Maybe it is a temperature-related issue? The load is not really high, and without the case there should be enough airflow. Digging further...

JohnS

When does minicom do that?  Do you at least see the messages during boot?

It's not connected right if not.

John

olHelp

Quote from: JohnS on July 28, 2015, 03:30:29 PM
When does minicom do that?  Do you at least see the messages during boot?

It's not connected right if not.

John

The boot log shows up, as do kernel messages such as those for connected peripherals.

soenke

Have you tried something like stress -c 2 etc. to create different kinds of load (CPU/memory/I/O/all at once)?

olHelp

I tried stress with some random loads, mostly all at once (4 threads, maxing out the RAM, I/O on all mounted filesystems, etc.), but nothing systematic (and it did not crash).

That's a good suggestion, running stress right now.

JohnS

Something to check: all the voltages being set (dcdc2 etc) for the various parts of the board.  Wrong volts/clock (cpu speed) = liable to crash.

E.g. 1008MHz does not seem reliable except with over-voltages -- themselves a bad idea.

(A recent topic on linux-sunxi ML.)

John

olHelp

Quote from: JohnS on July 28, 2015, 07:55:54 PM
Something to check: all the voltages being set (dcdc2 etc) for the various parts of the board.  Wrong volts/clock (cpu speed) = liable to crash.

E.g. 1008MHz does not seem reliable except with over-voltages -- themselves a bad idea.

(A recent topic on linux-sunxi ML.)

John

I just ran cpufreq-ljt-stress-test as mentioned on http://linux-sunxi.org/Hardware_Reliability_Tests#Reliability_of_cpufreq_voltage.2Ffrequency_settings

and the system froze at the second core, even with the max. frequency at 528 MHz.

How can I check the specified voltages?

JohnS

Depends on the kernel: FEX and/or U-Boot and/or DT (device tree). See the linux-sunxi pages etc.

A voltmeter is useful with the chip datasheet!
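On the software side, one place to look is the kernel's regulator class in sysfs, where each regulator directory can carry a name and a microvolts file. A small Python sketch under that assumption; whether the running kernel actually populates these entries for the board's rails is not confirmed here, read_regulators is a made-up helper, and the values in the demo tree are placeholders, not measurements:

```python
import tempfile
from pathlib import Path


def read_regulators(root="/sys/class/regulator"):
    """Return {name: microvolts} for each regulator that reports a voltage."""
    readings = {}
    for entry in sorted(Path(root).glob("regulator.*")):
        name_file = entry / "name"
        uv_file = entry / "microvolts"
        # Some regulators expose no voltage at all; skip those.
        if name_file.is_file() and uv_file.is_file():
            readings[name_file.read_text().strip()] = int(uv_file.read_text())
    return readings


# Demo on a fake tree with placeholder values.
fake = Path(tempfile.mkdtemp())
(fake / "regulator.0").mkdir()
(fake / "regulator.0" / "name").write_text("regulator-dummy\n")
(fake / "regulator.1").mkdir()
(fake / "regulator.1" / "name").write_text("dcdc2\n")
(fake / "regulator.1" / "microvolts").write_text("1400000\n")

readings = read_regulators(fake)
for name, uv in readings.items():
    print(f"{name}: {uv / 1e6:.3f} V")
```

This only shows what the kernel believes it has set; a voltmeter on the rail is still the way to confirm what the chip really gets.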

John

Pawel_W

IMO the CPU overheats in this black enclosure under server workload.
I glued a heat sink bar to the CPU and memory using thermal pads
(warning - surfaces of the CPU and memory chips are not at the same level).
It helped, but the case still has no air circulation.
I'm going to connect UEXT ribbon cable, so I'll have to make a hole for it.
You can test Lime2 without the lid of this enclosure (or these yellow panels) and buy the original heat sink for the A20 CPU.
P.S.
Bad solder joints are also a possibility.

soenke

Thanks for the testing!
Maybe the next case should have some vents on the top side... The current solution involves a drill ;)

Does your A20 also overheat without a heat sink and with an open enclosure?

The temperature of the DRAM should not be an issue; normally DRAM chips don't require additional cooling.

olHelp

Currently testing a custom-built kernel with increased voltage,

(as suggested by http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/334714.html)
Quote
         operating-points = <
            /* kHz     uV */
            1008000 1450000
            912000  1425000
            864000  1350000
            720000  1250000
            528000  1150000
            312000  1100000
            144000  1050000
            >;

instead of
Quote
         operating-points = <
            /* kHz    uV */
            960000  1400000
            912000  1400000
            864000  1300000
            720000  1200000
            528000  1100000
            312000  1000000
            144000  900000
            >;

with uptime well over 12 h now, with stress running for various tests!
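To see how much each step was raised, the two tables above can be compared per frequency. A short Python sketch that does only that arithmetic; the numbers are copied from the two quotes, while parse_operating_points and voltage_deltas are names made up for this illustration:

```python
def parse_operating_points(text):
    """Parse 'kHz uV' pairs from the body of an operating-points property."""
    points = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and all(f.isdigit() for f in fields):
            points[int(fields[0])] = int(fields[1])
    return points


def voltage_deltas(original, raised):
    """Return {kHz: raised_uV - original_uV} for frequencies in both tables."""
    shared = sorted(original.keys() & raised.keys(), reverse=True)
    return {khz: raised[khz] - original[khz] for khz in shared}


# Table currently under test (increased voltages).
RAISED = """
1008000 1450000
912000  1425000
864000  1350000
720000  1250000
528000  1150000
312000  1100000
144000  1050000
"""

# Table it replaced.
ORIGINAL = """
960000  1400000
912000  1400000
864000  1300000
720000  1200000
528000  1100000
312000  1000000
144000  900000
"""

original = parse_operating_points(ORIGINAL)
raised = parse_operating_points(RAISED)
deltas = voltage_deltas(original, raised)
for khz, delta in deltas.items():
    print(f"{khz // 1000:>4} MHz: {original[khz] / 1e6:.3f} V -> "
          f"{raised[khz] / 1e6:.3f} V ({delta / 1000:+.0f} mV)")
```

The top entries (960 MHz and 1008 MHz) appear in only one table each, so they are left out of the comparison; the largest increases are at the two lowest frequencies.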

http://postimg.org/image/vtdr22v6z/