Hello,
I am having stability problems with the Lime2. I intend to use it as a server, running a service that puts a threaded load of about 35% CPU on an A7 @ 1GHz, with mild I/O on the storage and about 200Kb/s network load, and no planned downtime.
Connected to the board are the barrel connector, Ethernet, a USB disk for storage and the micro SD card containing Linux (I tried different distributions). I am using the black enclosure.
However, the Lime2 crashes for unknown reasons after 2-11h. There is nothing in the logs, so I guessed it is related to a bad power supply or temperature problems. Measures taken:
1) tried a different 5V 1A supply (using Ubuntu and a self-compiled mainline kernel)
2) opened the enclosure
3) different USB stick
4) switched to a 5V 2A supply
5) used cpufreq to limit the frequency to ~720MHz and after that ~560MHz
6) different Linux distribution on a different SD card (Arch Linux for ARM)
7) removed the USB stick and moved storage onto the SD card for testing
...however, it's still not stable.
Do you have any suggestions as to what else could be wrong, or how to diagnose the problem?
The Lime2 sits right next to an old PandaBoard that has been running rock stable for years under the same conditions.
What current rating is printed on your HD?
Try using the 1A supply exclusively for the board, and give the USB disk its own external power supply or, if that is not possible, connect it through a powered USB hub.
Try to connect a serial console via UART0 (the 3 pins beside the Ethernet port) and see if there is any kernel output. If you see something like "mmcqd blocked ....", the controller on the SD card is half-broken. I have had exactly the same issues with about 5 microSD cards (SanDisk Extreme PRO) so far.
I'd start by seeing if there are kernel messages on the console UART as it crashes (or before that).
I like to leave the console open from another Linux system, logging the output, for example with screen -L.
John
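A minimal sketch of the logging idea above, for anyone who wants timestamps next to each console line so the moment of the hang is easy to spot. It only assumes the console text arrives as an iterable of lines (a file object works); the function name log_console and the timestamp format are made up for illustration, and opening and configuring the actual serial device is not shown.

```python
import time
from typing import Callable, Iterable, TextIO


def log_console(lines: Iterable[str], out: TextIO,
                clock: Callable[[], float] = time.time) -> int:
    """Copy console lines to `out`, prefixing each with a timestamp.

    Every line is flushed right away so the last message before a hang
    is not left sitting in a buffer. Returns the number of lines written.
    """
    count = 0
    for line in lines:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        out.write(f"[{stamp}] {line.rstrip()}\n")
        out.flush()
        count += 1
    return count
```

If the serial port is already set up, passing the opened device file as `lines` and an opened log file as `out` should be enough; the last timestamp in the log then gives a rough time of the hang.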
Hey,
I am not even using a USB HDD, just a USB stick. Sorry if I wasn't clear.
I connected another board via UART0; however, minicom just switched to offline, and there is no message when the board stops working.
After that I moved the rootfs onto the USB stick and pulled the SD card after booting; it still stops working after a while. Same result with the USB stick connected via a powered USB hub.
Maybe it is a temperature-related issue? The load is not really high, and without the case there should be enough airflow. Digging further...
When does minicom do that? Do you at least see the messages during boot?
It's not connected right if not.
John
Quote from: JohnS on July 28, 2015, 03:30:29 PM
When does minicom do that? Do you at least see the messages during boot?
It's not connected right if not.
John
The boot log shows up, as do kernel messages such as those for connected peripherals.
Have you tried something like stress -c 2 etc. to create different kinds of load (CPU/memory/I/O/all at once)?
I tried stress with some random loads, mostly all at once (4 threads, maxing out the RAM, I/O on all mounted filesystems etc.), but nothing systematic (and it did not crash).
That's a good suggestion, running stress right now.
Something to check: all the voltages being set (dcdc2 etc) for the various parts of the board. Wrong volts/clock (cpu speed) = liable to crash.
E.g. 1008MHz does not seem reliable except with over-voltages -- themselves a bad idea.
(A recent topic on linux-sunxi ML.)
John
Quote from: JohnS on July 28, 2015, 07:55:54 PM
Something to check: all the voltages being set (dcdc2 etc) for the various parts of the board. Wrong volts/clock (cpu speed) = liable to crash.
E.g. 1008MHz does not seem reliable except with over-voltages -- themselves a bad idea.
(A recent topic on linux-sunxi ML.)
John
I just ran cpufreq-ljt-stress-test as mentioned on http://linux-sunxi.org/Hardware_Reliability_Tests#Reliability_of_cpufreq_voltage.2Ffrequency_settings
and the system froze at the second core, even with the max. frequency at 528MHz.
How can I check the specified voltages?
Depends on kernel. Fex and/or uboot and/or DT. See linux-sunxi pages etc.
A voltmeter is useful with the chip datasheet!
John
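On kernels that expose the regulator framework through sysfs, one way to see what the kernel thinks it has set is to read /sys/class/regulator. This is only a sketch under that assumption: which regulators show up, and what they are called, depends on the kernel and DT, and old 3.4 fex-based kernels may not expose them this way at all. The helper name read_regulators is made up. It reports what the kernel requested, so a voltmeter is still the real check.

```python
import os


def read_regulators(root="/sys/class/regulator"):
    """Return {regulator name: microvolts} for every readable regulator.

    Entries without a readable `name` or `microvolts` file are skipped.
    If two regulators share a name, the later one wins.
    """
    result = {}
    if not os.path.isdir(root):
        return result
    for entry in sorted(os.listdir(root)):
        base = os.path.join(root, entry)
        try:
            with open(os.path.join(base, "name")) as f:
                name = f.read().strip()
            with open(os.path.join(base, "microvolts")) as f:
                microvolts = int(f.read().strip())
        except (OSError, ValueError):
            continue  # no voltage reported for this entry
        result[name] = microvolts
    return result


# Print whatever the running kernel exposes (prints nothing if unsupported).
for name, microvolts in read_regulators().items():
    print(f"{name}: {microvolts / 1e6:.3f} V")
```

Comparing the value for the CPU supply against the operating-points table at the current cpufreq setting would show whether the kernel is requesting what the DT says.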
IMO the CPU overheats in this black enclosure under server workload.
I glued a heat sink bar to the CPU and memory by using thermopads
(warning - surfaces of the CPU and memory chips are not at the same level).
It helped, but the case still has no air circulation.
I'm going to connect a UEXT ribbon cable, so I'll have to make a hole for it.
You can test the Lime2 without the lid of this enclosure (or these yellow panels) and buy the original heat sink for the A20 CPU.
P.S.
Bad solder joints are also a possibility.
Thanks for the testing!
Maybe the next case should have some vents on the top side... The current solution involves a drill ;)
Does your A20 also overheat without a heat sink and with an open enclosure?
The temperature of the DRAM should not be an issue; normally DRAM chips don't require additional cooling.
Currently testing a custom-built kernel with increased voltage
(as suggested by http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/334714.html):
Quote
operating-points = <
/* kHz uV */
1008000 1450000
912000 1425000
864000 1350000
720000 1250000
528000 1150000
312000 1100000
144000 1050000
>;
instead of
Quote
operating-points = <
/* kHz uV */
960000 1400000
912000 1400000
864000 1300000
720000 1200000
528000 1100000
312000 1000000
144000 900000
>;
with uptime well over 12h now. With stress running for various tests!
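To make the difference between the two tables easier to see, here is a small sketch that parses both and prints the voltage change per frequency. The numbers are copied from the two quotes above; the function names are made up for illustration, and frequencies that appear in only one table (1008000 and 960000) are left out of the comparison.

```python
def parse_operating_points(text):
    """Parse 'kHz uV' pairs from an operating-points body into {kHz: uV}."""
    points = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and all(f.isdigit() for f in fields):
            points[int(fields[0])] = int(fields[1])
    return points


def voltage_deltas(stock, raised):
    """Return {kHz: raised uV - stock uV} for frequencies present in both."""
    common = sorted(stock.keys() & raised.keys(), reverse=True)
    return {khz: raised[khz] - stock[khz] for khz in common}


RAISED = parse_operating_points("""
    1008000 1450000
    912000 1425000
    864000 1350000
    720000 1250000
    528000 1150000
    312000 1100000
    144000 1050000
""")
STOCK = parse_operating_points("""
    960000 1400000
    912000 1400000
    864000 1300000
    720000 1200000
    528000 1100000
    312000 1000000
    144000 900000
""")

for khz, delta in voltage_deltas(STOCK, RAISED).items():
    print(f"{khz // 1000} MHz: {delta / 1000:+.0f} mV")
```

The increase is largest at the low end (+150 mV at 144 MHz, +100 mV at 312 MHz) and smallest at 912 MHz (+25 mV).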
http://postimg.org/image/vtdr22v6z/
I made some photos of my A20-Lime2 with heat sink:
http://www.fotosik.pl/pokaz_obrazek/68cbc3aa6d2c6170.html
http://www.fotosik.pl/pokaz_obrazek/a288ee10e356f77f.html
http://www.fotosik.pl/pokaz_obrazek/bc68c956f8cfc6e9.html
The heat sink is glued to the memory chips with an additional thermopad to offset the height difference between them and the CPU.
I tested the Lime2 without a heat sink, closed up in the case, and it sometimes hung under load (Android games).
Now it is much better, but without ventilation holes the heat transfer is still not sufficient for a stable 24/7 100% load, and the case gets warm.
Maybe you could also try increasing the core voltage and see if that solves your problem?
That huge heat sink seems a little bit overkill :)
Sounds like you're running it too fast already. More voltage = more heat!!
John
If it is running at standard clock speeds, I would not change it.
I don't think that increasing the max. voltage from 1.4 to 1.45 has a significant impact on the CPU temperature.
He is already using a huge heat sink, so I don't think his problem is related to the core temperature but rather to undervoltage at certain clock speeds.
The typical "standard" speed is often too fast due to being beyond the spec. Some chips work, some don't.
Many Android tabs say they run at 1008MHz but actually don't.
He appears to have a heat problem now, as he pretty much says.
John
By the way: my lime2 is over 24h uptime, with a further increase:
Quote
912000 1430000
864000 1375000
720000 1275000
528000 1175000
312000 1125000
144000 1075000
running with the case open. It survived prolonged runs of stress with multiple I/O workers on all mounted filesystems. Max frequency is 960MHz.
1.43V is out of spec isn't it?
Yours apparently works. It's to be expected that some will not, as indeed others have found.
Each owner can choose what they do; it may or may not work reliably. It may not be the problem here, but there again it may be. I'm not seeing many other suggestions.
John
Meh, I am apparently still able to crash the board, even if it's not reproducible. Given the increased voltage, it may be overheating now.
The last resort would be another high-quality power brick, even if 2.5A (by now) should be plenty for the board + USB stick.
If the CPU volts are wrong, or the frequency too high, adding current will not help.
John
Yes, I am aware of that. Over at the Arch Linux forums someone mentioned running into instabilities with a Cubieboard at default settings powered by a 2A plug, so maybe the small plugs can be unstable, or I simply tried 3 different, very low-quality power adaptors.
Will report back when kernel 4.3 is out or if I feel like buying another power supply.
Just to update this topic:
I updated the kernel to 4.3rc5, limited the max. frequency to 864MHz, used the default .dtb/voltages and got a new SD card. The board is running without attachments or case... but it still hangs.
There are two different cron jobs running: on each full hour, stress -c 2 starts with a 1000s timeout; on every half hour (8:30, 9:30, ...), stress runs with -d 2 and a 1000s timeout.
Both scripts may crash the system, but it's not reproducible. Next stop: another board :'(
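Since the hangs leave nothing in the logs, it can help to have each cron job record when it started and finished, synced to disk, so that after a reboot it is clear which load was running at the time. A rough sketch, assuming the stress tool is installed and using the same -c 2 / -d 2 and 1000s timeout as above; the function names and the log path are made up for illustration.

```python
import os
import subprocess
import time


def stress_command(kind, workers=2, timeout_s=1000):
    """Build the stress command line for a 'cpu' (-c) or 'disk' (-d) run."""
    flags = {"cpu": "-c", "disk": "-d"}
    return ["stress", flags[kind], str(workers), "--timeout", f"{timeout_s}s"]


def append_synced(path, message):
    """Append a timestamped line and fsync it so it survives a hard hang."""
    with open(path, "a") as f:
        f.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {message}\n")
        f.flush()
        os.fsync(f.fileno())


def run_logged(kind, log_path, runner=subprocess.call):
    """Log the start, run stress, then log the exit code."""
    cmd = stress_command(kind)
    append_synced(log_path, "start " + " ".join(cmd))
    code = runner(cmd)
    append_synced(log_path, f"end rc={code}")
    return code
```

The hourly job would call something like run_logged("cpu", "/var/log/stress-runs.log") and the half-hourly one run_logged("disk", ...). A "start" line with no matching "end" line points at the run that was active when the board hung.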
Your kernel may be setting any/some of the on-board supplies (dcdc2 etc) to bad values and/or setting bad RAM parameters.
It may be worth asking for help on the linux-sunxi list (or IRC). They'll want the boot (dmesg or equivalent) logs and so on.
John
Oh well, I did not mean to sound that negative. Currently downloading a20-lime2_debian_3.4.90_release_2.img. At least Arch Linux was using the mainline kernel, but there could still be something wrong in mainline. Will report back in a week or so.
Hello again,
the default Debian 3.4 seems to be pretty stable. The board is running without the case and crashes only under insane loads, with multiple stress instances and load > 12 or so. Maybe it's 100% safe with a heat sink. Now if only I could get 4.3 to be that solid.
I think 4.3 is not stable. If you haven't already, referring to linux-sunxi ML / IRC may be useful.
If you build from source you can compare the machine-dependent parts. It's probably a speed/timing issue.
John
4.3 may not be stable because it has not been released yet and things are still being fixed.
But I have been running 4.2.3 on a Lime for two weeks now and it doesn't want to hang :P, and kernel 3.4 on a Lime 2 for two months with a lot of stuff running on it.
But it's not just kernel. U-boot can also be responsible for problems.