A20 LIME vs LIME2 - single vs dual RAM speed

Started by d.kunchev, March 22, 2017, 11:22:34 AM


d.kunchev

Hi everyone

I have a question about the LIME vs the LIME2. I see that one has a single RAM chip and the other has two. From what I understand, the two DDR3 chips share the address lines and have separate data lines. So my guess is that with a single chip the CPU's RAM bus works in 16-bit mode, and with two chips in 32-bit mode. In theory that means a single chip gives you half the RAM bandwidth.

Is that correct? Has anyone done any actual measurements / speed tests on the boards to compare? Or am I wrong and the two chips just give more space rather than speed?

Cheers

d.kunchev

Ok, apparently no one has a clue about (or interest in) this topic, so I did a bit of testing myself. I got some extremely weird results, which I would like to share with whoever lands here.

Hardware - cubieboard2 (my lime2 got fried and I haven't replaced it just yet; I will test with it too when I get a new one)
Kernel - 4.10.0, latest mainline
U-boot - 2017.03-rc2

The board has two 512MB DDR3 chips. To test how it performs with a single one, I edited board/sunxi/dram_sun5i_auto.c in the u-boot source and set density, io_width and bus_width to 4096, 16 and 16 respectively. This makes the DRAM initialization skip detection and use these values. Both u-boot and Linux then report 512MB of RAM, so I assume I have managed to disable the second DDR3 chip, although I cannot be entirely sure that this is what actually happens.
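As a sanity check on the forced values, the size the controller should end up with can be worked out from the three parameters. This is only a minimal sketch, assuming that density is the per-chip size in megabits and that the number of chips on the data bus is bus_width / io_width. The function name dram_size_mib is made up for illustration and is not taken from the u-boot source.

```python
def dram_size_mib(density_mbit, io_width, bus_width, rank_num=1):
    """Total DRAM size in MiB from per-chip density and bus geometry.

    Assumption: density is per chip, in megabits, and the chips sit side
    by side on the data bus (bus_width / io_width chips per rank).
    """
    if bus_width % io_width != 0:
        raise ValueError("bus_width must be a multiple of io_width")
    chips_per_rank = bus_width // io_width
    chip_mib = density_mbit // 8  # megabits -> mebibytes
    return chip_mib * chips_per_rank * rank_num


print(dram_size_mib(4096, 16, 16))  # forced single-chip settings
print(dram_size_mib(4096, 16, 32))  # detected two-chip settings
```

With the forced values this gives 512 MiB, which matches what u-boot and Linux report, and with a 32-bit bus it gives 1024 MiB. It does not prove the second chip is electrically idle, only that the reported size is consistent with the settings.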

Testing memory - single chip vs two chips
1. Using dd with a tmpfs (RAM-backed) filesystem mounted at /mnt, reading from /dev/zero and writing to a test file
(command: sudo mount -t tmpfs /mnt /mnt)

** Single chip tests:
dd if=/dev/zero of=/mnt/test bs=16 count=100000     -> 1.4MB/s
dd if=/dev/zero of=/mnt/test bs=64 count=100000     -> 5.3MB/s
dd if=/dev/zero of=/mnt/test bs=256 count=100000    -> 18.4MB/s
dd if=/dev/zero of=/mnt/test bs=1024 count=100000   -> 53.6MB/s
dd if=/dev/zero of=/mnt/test bs=2048 count=100000   -> 79.9MB/s

** Two chips tests:
dd if=/dev/zero of=/mnt/test bs=16 count=100000     -> 1.3MB/s
dd if=/dev/zero of=/mnt/test bs=64 count=100000     -> 4.8MB/s
dd if=/dev/zero of=/mnt/test bs=256 count=100000    -> 17.4MB/s
dd if=/dev/zero of=/mnt/test bs=1024 count=100000   -> 48.8MB/s
dd if=/dev/zero of=/mnt/test bs=2048 count=100000   -> 73.1MB/s
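
One way to read the dd figures above is to turn each throughput into an average time per block. The sketch below does that with the numbers copied from the runs, assuming dd reports decimal megabytes (1 MB = 10**6 bytes). The helper name microseconds_per_block is illustrative only.

```python
# (block size in bytes, dd throughput in MB/s) copied from the runs above
single_chip = [(16, 1.4), (64, 5.3), (256, 18.4), (1024, 53.6), (2048, 79.9)]
two_chips = [(16, 1.3), (64, 4.8), (256, 17.4), (1024, 48.8), (2048, 73.1)]


def microseconds_per_block(block_size, mb_per_s):
    """Average time per dd block in microseconds.

    Assumes 1 MB = 10**6 bytes, so 1 MB/s is exactly 1 byte per microsecond.
    """
    return block_size / mb_per_s


for label, runs in (("single", single_chip), ("two", two_chips)):
    for block_size, rate in runs:
        us = microseconds_per_block(block_size, rate)
        print(f"{label:6s} bs={block_size:<5d} {us:6.1f} us per block")
```

If the time per block grows much more slowly than the block size (128 times more data for a little over twice the time), then a fixed cost per read/write call is probably what dominates these runs, and the dd numbers may say little about RAM bandwidth. That is a guess from the arithmetic, not something measured separately.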

2. Testing with the mbw tool (installed with apt-get)

** Single chip tests:
mbw 16 | grep AVG
AVG   Method: MEMCPY   Elapsed: 0.05149   MiB: 16.00000   Copy: 310.751 MiB/s
AVG   Method: DUMB   Elapsed: 0.03077   MiB: 16.00000   Copy: 519.987 MiB/s
AVG   Method: MCBLOCK   Elapsed: 0.02800   MiB: 16.00000   Copy: 571.353 MiB/s

mbw 32 | grep AVG
AVG   Method: MEMCPY   Elapsed: 0.10302   MiB: 32.00000   Copy: 310.631 MiB/s
AVG   Method: DUMB   Elapsed: 0.06150   MiB: 32.00000   Copy: 520.344 MiB/s
AVG   Method: MCBLOCK   Elapsed: 0.05690   MiB: 32.00000   Copy: 562.425 MiB/s

mbw 128 | grep AVG
AVG   Method: MEMCPY   Elapsed: 0.41162   MiB: 128.00000   Copy: 310.966 MiB/s
AVG   Method: DUMB   Elapsed: 0.24556   MiB: 128.00000   Copy: 521.268 MiB/s
AVG   Method: MCBLOCK   Elapsed: 0.22529   MiB: 128.00000   Copy: 568.169 MiB/s


** Two chips tests:
mbw 16 | grep AVG
AVG   Method: MEMCPY   Elapsed: 0.05742   MiB: 16.00000   Copy: 278.655 MiB/s
AVG   Method: DUMB   Elapsed: 0.02418   MiB: 16.00000   Copy: 661.739 MiB/s
AVG   Method: MCBLOCK   Elapsed: 0.02767   MiB: 16.00000   Copy: 578.336 MiB/s

mbw 32 | grep AVG
AVG   Method: MEMCPY   Elapsed: 0.11365   MiB: 32.00000   Copy: 281.565 MiB/s
AVG   Method: DUMB   Elapsed: 0.04685   MiB: 32.00000   Copy: 682.983 MiB/s
AVG   Method: MCBLOCK   Elapsed: 0.05429   MiB: 32.00000   Copy: 589.418 MiB/s

mbw 128 | grep AVG
AVG   Method: MEMCPY   Elapsed: 0.45452   MiB: 128.00000   Copy: 281.618 MiB/s
AVG   Method: DUMB   Elapsed: 0.18673   MiB: 128.00000   Copy: 685.473 MiB/s
AVG   Method: MCBLOCK   Elapsed: 0.21711   MiB: 128.00000   Copy: 589.570 MiB/s
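
To compare the two configurations without eyeballing the columns, the AVG lines can be parsed and turned into percentage changes. This is a small sketch that assumes the line layout shown above. The regular expression and the parse_avg helper are written for this post and are not part of mbw. Only the 128 MiB runs are included.

```python
import re

# Matches one "AVG" summary line as printed by `mbw ... | grep AVG` above.
AVG_LINE = re.compile(
    r"AVG\s+Method:\s+(\w+)\s+Elapsed:\s+([\d.]+)\s+MiB:\s+([\d.]+)"
    r"\s+Copy:\s+([\d.]+)\s+MiB/s"
)


def parse_avg(output):
    """Map method name -> copy rate in MiB/s."""
    return {m.group(1): float(m.group(4)) for m in AVG_LINE.finditer(output)}


single_128 = """\
AVG   Method: MEMCPY   Elapsed: 0.41162   MiB: 128.00000   Copy: 310.966 MiB/s
AVG   Method: DUMB   Elapsed: 0.24556   MiB: 128.00000   Copy: 521.268 MiB/s
AVG   Method: MCBLOCK   Elapsed: 0.22529   MiB: 128.00000   Copy: 568.169 MiB/s
"""
two_128 = """\
AVG   Method: MEMCPY   Elapsed: 0.45452   MiB: 128.00000   Copy: 281.618 MiB/s
AVG   Method: DUMB   Elapsed: 0.18673   MiB: 128.00000   Copy: 685.473 MiB/s
AVG   Method: MCBLOCK   Elapsed: 0.21711   MiB: 128.00000   Copy: 589.570 MiB/s
"""

single = parse_avg(single_128)
two = parse_avg(two_128)
for method in single:
    change = (two[method] / single[method] - 1) * 100
    print(f"{method:8s} {change:+.1f}% with two chips")
```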

Just a note - these are the controller settings, according to the a10-meminfo tool
** Single chip
dram_clk          = 480
mbus_clk          = 300
dram_type         = 3
dram_rank_num     = 1
dram_chip_density = 4096
dram_io_width     = 16
dram_bus_width    = 16
dram_cas          = 9
dram_zq           = 0x7b (0x5294a00)
dram_odt_en       = 0
dram_tpr0         = 0x42d899b7
dram_tpr1         = 0xa090
dram_tpr2         = 0x22a00
dram_tpr3         = 0x0
dram_emr1         = 0x4
dram_emr2         = 0x10
dram_emr3         = 0x0
dqs_gating_delay  = 0x00000606
active_windowing  = 0
** Two chips
dram_clk          = 480
mbus_clk          = 300
dram_type         = 3
dram_rank_num     = 1
dram_chip_density = 4096
dram_io_width     = 16
dram_bus_width    = 32
dram_cas          = 9
dram_zq           = 0x7b (0x5294a00)
dram_odt_en       = 0
dram_tpr0         = 0x42d899b7
dram_tpr1         = 0xa090
dram_tpr2         = 0x22a00
dram_tpr3         = 0x0
dram_emr1         = 0x4
dram_emr2         = 0x10
dram_emr3         = 0x0
dqs_gating_delay  = 0x06060606
active_windowing  = 0

So, if anyone bothers to read the numbers, they are quite strange. In some tests (dd and the mbw MEMCPY test) the board with a single DDR3 chip actually performs about 10% faster. My expectation was that a 16-bit bus instead of a 32-bit bus would halve the performance. Instead I get this. One might notice that in the DUMB test of mbw there is a significant difference between the results: about 30% improvement with two chips. That test simply loops over a memory array and does a = b copying. Dumb indeed :) However, that is still only a 30% difference, not double...
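
For scale, the measured copy rates can be set against the theoretical peak of the bus. The sketch below assumes that dram_clk from a10-meminfo is the DDR clock in MHz with two transfers per clock, and that a copy moves each byte across the bus twice (one read, one write). It ignores caches, refresh and every other overhead, and the function names are made up for illustration.

```python
MIB = 1024 * 1024


def peak_bandwidth_mb_s(dram_clk_mhz, bus_width_bits):
    """Theoretical peak in MB/s: two transfers per clock, bus_width/8 bytes each."""
    return dram_clk_mhz * 2 * bus_width_bits // 8


def bus_utilisation(copy_mib_s, dram_clk_mhz, bus_width_bits):
    """Fraction of the peak used by a copy running at copy_mib_s.

    Assumption: every copied byte is read once and written once.
    """
    traffic_mb_s = copy_mib_s * MIB / 1e6 * 2
    return traffic_mb_s / peak_bandwidth_mb_s(dram_clk_mhz, bus_width_bits)


# MCBLOCK results for the 128 MiB runs, copied from the mbw output above
print(peak_bandwidth_mb_s(480, 16), f"{bus_utilisation(568.169, 480, 16):.0%}")
print(peak_bandwidth_mb_s(480, 32), f"{bus_utilisation(589.570, 480, 32):.0%}")
```

Under these assumptions neither configuration gets close to its peak, which would be consistent with something other than raw bus width limiting the copy speed. It is a rough estimate, not a measurement.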

There are a few possible explanations (I am not very familiar with how these things work, so I am guessing here):
- Incorrect test setup - maybe I am not actually disabling the second chip this way. I am planning to physically remove it from the board to check how that goes.
- Incorrect test - there aren't many tools for memory benchmarks on ARM, so this is what I've got. Could they be skewing the results?
- Bad board design - is it possible that with a single chip the controller skips some synchronization between the two banks, resulting in fewer delays somewhere?
- Bad understanding of how these things work in the first place. For instance, this is DDR3 memory, so it has double data rate. With a 16-bit bus it therefore delivers 32 bits per clock. So unless I operate on large amounts of RAM, reading continuously and so on, the change from 32-bit to 16-bit width doesn't really affect anything.
- All of the above and something else :)
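
On the double data rate point in the list above, a quick bit of arithmetic shows how much data one DDR3 burst carries at each bus width. The sketch assumes the standard DDR3 burst length of 8 transfers and a 64-byte CPU cache line. Both constants are assumptions of this example and were not read from the board.

```python
BURST_LENGTH = 8        # DDR3 transfers per burst (assumed, no burst chop)
CACHE_LINE_BYTES = 64   # assumed cache line size of the CPU


def bytes_per_burst(bus_width_bits):
    """Bytes delivered by one full burst on a bus of the given width."""
    return bus_width_bits // 8 * BURST_LENGTH


def bursts_per_cache_line(bus_width_bits):
    """Number of bursts needed to fill one cache line."""
    return CACHE_LINE_BYTES // bytes_per_burst(bus_width_bits)


for width in (16, 32):
    print(f"{width}-bit bus: {bytes_per_burst(width)} bytes per burst, "
          f"{bursts_per_cache_line(width)} bursts per cache line")
```

So a line fill on the 16-bit bus needs twice as many bursts as on the 32-bit bus, but whether that shows up in a benchmark depends on how much of the total time is spent on the bus at all.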

If anyone has a clue what is going on here, please feel free to drop a suggestion.

Cheers

aneox

Very interesting! Has anybody tested it with the GPU loaded?