A13-OLinuXino disappointing performance

Started by LeMury, March 25, 2016, 10:20:52 PM

LeMury

Hello forum,

I am evaluating the A13-OLinuXino board and I am a bit shocked at how slow it is.

We run the A13 board with the CPU at 1008 MHz and MBUS at 300 MHz, using a custom SPL that loads a bare-metal application.
All of the SoC's peripherals (UART, SD, codec, DMA, etc.) work fine and are stable.

However, the board barely manages to run a custom application that runs effortlessly on an STM32F4 (Olimex STM32-E407 board)!

In fact, every test app we tried ran much faster on the STM32F4 than on the A13.

We suspected that the slow performance might be caused by slow external DDR RAM access, so we ran the tests solely from the available internal SRAM.
Alas, the results were practically the same!

Any ideas anyone?

Is the A10 perhaps any better, or even much better?

JohnS

Run the tests or some other tests on the F4 and then under Linux (or Android) on the A13 so you can rule out some setting(s) you may have wrong.

John

LubOlimex

Hey,

This is completely possible - STM32-E407 might be faster for certain tasks. It doesn't have to run an operating system. The Linux OS is quite taxing on the performance.

Consider that it might be a software issue. You might need to optimize the values for core voltage and CPU clock. If the tasks are very intensive and the ambient temperature is high, consider an aluminum heatsink or some other cooler for the processor.

You might also need to optimize Linux: disable some processes, disable modules, etc.

Furthermore, A13 is the first and slowest Allwinner chip. Consider a board with the Allwinner A20, such as the A20-OLinuXino-MICRO, A20-OLinuXino-LIME2, or A20-SOM-EVB.

Best regards,
Lub/OLIMEX
Technical support and documentation manager at Olimex

LeMury

Quote from: JohnS on March 26, 2016, 05:41:01 PM
Run the tests or some other tests on the F4 and then under Linux (or Android) on the A13 so you can rule out some setting(s) you may have wrong.

John
That might be a good idea.
Ok, writing android image to SD as we speak.

LeMury

Quote from: LubOlimex on March 26, 2016, 10:01:17 PM
Hey,

This is completely possible - STM32-E407 might be faster for certain tasks. It doesn't have to run an operating system. The Linux OS is quite taxing on the performance.

Consider that it might be a software issue. You might need to optimize the values for core voltage and CPU clock. If the tasks are very intensive and the environment temperature high consider an aluminum heatsink or some sort of cooler for the processor.

You might also need to optimize the Linux - disable some processes, disable modules, etc.

Furthermore, A13 is the first and slowest Allwinner chip. Consider a board with Allwinner A20. Like A20-OLinuXino-MICRO, A20-OLinuXino-LIME2, A20-SOM-EVB.

Best regards,
Lub/OLIMEX
Hello,
1. I don't run any OS. I only do bare metal on both the STM32F4 and the sunxi A13.
2. I already raised the CPU voltage to 1400 mV and set the CPU clock to 1008 MHz.
3. The A13 CPU doesn't even get warm, so there is no cooling problem.
4. To rule out wrong DDR RAM settings, I run the tests from internal SRAM.

I think I initialized the SoC correctly, but it's of course possible that I overlooked something.
So I will try John's suggestion to run Android/Linux and do some tests.

Still, I'm puzzled that a Cortex-A8 (A13) running at 1 GHz performs ~3 times slower than a Cortex-M4F (STM32F4) running at 168 MHz!

At this point I think it's not related to the ARM core, but to very slow memory access in the A13 SoC itself.
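To separate "slow core" from "slow memory access", the same small block-processing loop can be timed on both boards. Below is a minimal sketch in C, not the actual test application from this thread: the names sum_block and time_block_sum are made up for illustration, and it assumes a working clock() from the C library. On bare metal you would have to swap clock() for a hardware timer or cycle counter of your own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Sum a block of 32-bit words. The volatile qualifier forces every word to
 * be read from memory on every pass, so the loop measures memory access
 * rather than whatever the optimizer can keep in registers. */
uint32_t sum_block(const volatile uint32_t *buf, size_t words)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < words; i++)
        acc += buf[i];
    return acc;
}

/* Run sum_block() `passes` times over the same buffer and return the
 * elapsed CPU time in seconds. The accumulated sum is written to *checksum
 * (if non-NULL) so results can be compared between the two boards.
 * Assumption: clock() is available; replace it on bare metal. */
double time_block_sum(const volatile uint32_t *buf, size_t words,
                      unsigned passes, uint32_t *checksum)
{
    assert(buf != NULL && words > 0 && passes > 0);

    uint32_t total = 0;
    clock_t start = clock();
    for (unsigned p = 0; p < passes; p++)
        total += sum_block(buf, words);
    clock_t end = clock();

    if (checksum != NULL)
        *checksum = total;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_block_sum() once on a buffer placed in internal SRAM and once on a buffer in external DRAM, with identical sizes and pass counts on the A13 and the STM32F4, gives directly comparable numbers. If both A13 figures are equally poor, the memory type is probably not the deciding factor.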

JohnS

It's going to depend a lot on what your code does in tight loops.

E.g. accessing I/O ports when maybe DMA would make sense.

Suppose you have a default very cautious setting that's overlooked, you could be crippling performance.

Or you don't compile with the right compiler flags.

Of course with more RAM you might use a different algorithm that would not fit on the smaller RAM machine.

Well, you get the idea.

For myself, generally any modern CPU (I don't count those tiny Atmel chips with hardly any RAM, sorry) is so fast that speed is not an issue at all.  (Actually, speed can be good enough on those chips but RAM vanishes fast.)

John

LeMury

Quote from: JohnS on March 27, 2016, 12:03:42 AM
It's going to depend a lot on what your code does in tight loops.

E.g. accessing I/O ports when maybe DMA would make sense.

Suppose you have a default very cautious setting that's overlooked, you could be crippling performance.

Or you don't compile with the right compiler flags.

Of course with more RAM you might use a different algorithm that would not fit on the smaller RAM machine.

Well, you get the idea.

For myself, generally any modern CPU (I don't count those tiny Atmel chips with hardly any RAM, sorry) is so fast that speed is not an issue at all.  (Actually, speed can be good enough on those chips but RAM vanishes fast.)

John
No, I'm pretty sure I use the correct compiler flags, and the tests I run don't rely on direct I/O; they use DMA, which runs correctly and as expected.
I use the same algorithms with the same memory footprint.
It's the processing of blocks of data in memory that is slow.

Either I don't initialize the SoC correctly, or the SoC simply is what it is.
These sunxi SoCs are badly documented, so the chance that I messed up, let's say, the cache init or the DRAM controller init isn't small.
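One quick way to check the cache-init suspicion on a Cortex-A8 is to read the System Control Register (SCTLR, CP15 c1) and look at the enable bits. The sketch below is a hedged illustration, not code from the SPL discussed here: the names sctlr_flags, decode_sctlr and read_sctlr are hypothetical, while the bit positions (M = bit 0, C = bit 2, Z = bit 11, I = bit 12) are the ARMv7-A ones. Note that, as far as I know, with the MMU disabled data accesses are treated as strongly-ordered, so setting the C bit alone does not make data accesses cacheable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-A SCTLR enable bits. */
#define SCTLR_M (1u << 0)   /* MMU enable               */
#define SCTLR_C (1u << 2)   /* data/unified cache enable */
#define SCTLR_Z (1u << 11)  /* branch prediction enable  */
#define SCTLR_I (1u << 12)  /* instruction cache enable  */

/* Hypothetical helper type: decoded view of the bits above. */
struct sctlr_flags {
    bool mmu;
    bool dcache;
    bool branch_pred;
    bool icache;
};

/* Decode a raw SCTLR value into individual enable flags. */
struct sctlr_flags decode_sctlr(uint32_t sctlr)
{
    struct sctlr_flags f;
    f.mmu         = (sctlr & SCTLR_M) != 0;
    f.dcache      = (sctlr & SCTLR_C) != 0;
    f.branch_pred = (sctlr & SCTLR_Z) != 0;
    f.icache      = (sctlr & SCTLR_I) != 0;
    return f;
}

#if defined(__arm__) && defined(__ARM_ARCH_7A__)
/* Read SCTLR on the target. Only compiled for ARMv7-A builds and only
 * valid in a privileged mode, which is the usual case on bare metal. */
static inline uint32_t read_sctlr(void)
{
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(value));
    return value;
}
#endif
```

Printing decode_sctlr(read_sctlr()) over the UART right before the test starts shows whether the MMU, caches and branch predictor are really on at that point. If mmu or dcache comes back false, that alone could plausibly explain a 1 GHz Cortex-A8 losing to a 168 MHz Cortex-M4F on memory-bound work.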


LeMury

OK, I played around a bit with the Debian Linux image.
I don't have a test application ready for Linux yet, but testing the Linux media player hints at the same performance issues that I encounter running bare metal.

For instance, if you play an MP3 with the default Linux media player, you'll see hardly any CPU usage. Now, while it is playing, turn on the "Meter" function, which displays the audio spectrum of the signal. CPU usage will rise to ~50%. I'm pretty sure that this is caused by the FFT algorithm needed to compute the spectrum.

Of course this is no hard proof, but it is exactly the kind of (slow) performance I notice:
as soon as the A13 has to do some serious data processing, CPU usage rises fast.

Looking at the specs of the A13, we see that the memory bus width is at most 16 bits, as opposed to 32 bits on the A10.

Again, I'm curious how much better the A10 would perform in this respect.