STM32: Using External SRAM

This article is about using external SRAM with STM32. Through the FSMC (Flexible Static Memory Controller), STM32 MCUs can access external
SRAM, which should save us from running short of RAM.


This article assumes some basic knowledge of STM32 development and a few tools, including:

  1. C programming language
  2. What is SRAM
  3. The official document about STM32 FSMC
  4. Some basic tuning skills
  5. An STM32 dev board that supports FSMC and has an external SRAM on board
  6. A debugger, such as ST-Link or J-Link
  7. An IDE; I'm using System Workbench for STM32


All right, here comes a new acronym: PSRAM. When you search for PSRAM on Wikipedia, this page comes up and tells us that it is not a real SRAM but a pseudo one. JEDEC has a good explanation of PSRAM. Usually, SRAM is too expensive (compared to PSRAM) and small (its size is usually counted in KB). PSRAM, however, is far cheaper than SRAM and offers sufficient capacity. Here is my favorite one: the brief about a popular PSRAM.

Even the manual won't tell you that it is a PSRAM, but we can conclude that from its price and capacity. Fortunately, my dev board has a PSRAM of the same type built in.
A real SRAM looks like this one: 32KB for 38 RMB. Since we don't have an unlimited budget, let's assume that the MCU's built-in RAM is as fast as we need and treat it as the real SRAM. Then we can write some code to compare the performance of SRAM and PSRAM.

Some tools make your life easier

I use STM32CubeMx a lot. With its GUI we can get the board set up and some basic code generated with just a few clicks. Furthermore, if we have the following Eclipse plugins installed, coding becomes more pleasant:

  1. TM Terminal
  2. RxTx

They are for displaying log strings from the serial port.

Enough talk, let's code

The manual is boring, boring and boring, especially the part about timing. At the very beginning I was frustrated by the timing sequences and waveform figures. Soon, after some attempts, I found that our poor PSRAM, err, SRAM doesn't care about much except the data setup time.
According to the board manufacturer's manual, the board has a 1MB built-in external SRAM, an IS62WV51216B.
This SRAM, unsurprisingly, is a PSRAM, which we can easily figure out from its price and capacity.
By the way, we use HAL everywhere, so please make sure HAL support is checked in STM32CubeMx.

FSMC Configuration

Actually, even though our PSRAM supports 18-bit addressing, in my tests 1-bit addressing worked very well, so we can use 1-bit addressing throughout. The 8-bit/16-bit data bus width doesn't matter either. So we can configure our chip with 1-bit addressing and an 8-bit data bus without any pressure, and also save some IO pins, which are precious resources. Here is the FSMC initialization code generated by STM32CubeMx.

// in main.c
// hsram3 is the global SRAM_HandleTypeDef generated by CubeMx
static void MX_FSMC_Init(void)
{
  FSMC_NORSRAM_TimingTypeDef Timing;

  /** Perform the SRAM3 memory initialization sequence
  */
  hsram3.Instance = FSMC_NORSRAM_DEVICE;
  hsram3.Extended = FSMC_NORSRAM_EXTENDED_DEVICE;
  /* hsram3.Init */
  hsram3.Init.NSBank = FSMC_NORSRAM_BANK3;    // my board has PSRAM connected on bank3.
  hsram3.Init.DataAddressMux = FSMC_DATA_ADDRESS_MUX_DISABLE;    // address/data multiplexing not used
  hsram3.Init.MemoryType = FSMC_MEMORY_TYPE_SRAM;
  hsram3.Init.MemoryDataWidth = FSMC_NORSRAM_MEM_BUS_WIDTH_8;    // using 8bit for data bus
  hsram3.Init.BurstAccessMode = FSMC_BURST_ACCESS_MODE_DISABLE;    // PSRAM won't care
  hsram3.Init.WaitSignalPolarity = FSMC_WAIT_SIGNAL_POLARITY_LOW;
  hsram3.Init.WrapMode = FSMC_WRAP_MODE_DISABLE;
  hsram3.Init.WaitSignalActive = FSMC_WAIT_TIMING_BEFORE_WS;
  hsram3.Init.WriteOperation = FSMC_WRITE_OPERATION_ENABLE; // Of course, we want to write the memory
  hsram3.Init.WaitSignal = FSMC_WAIT_SIGNAL_DISABLE;    // Let FSMC manage this
  hsram3.Init.ExtendedMode = FSMC_EXTENDED_MODE_DISABLE;    // What is extended mode? keep default
  hsram3.Init.AsynchronousWait = FSMC_ASYNCHRONOUS_WAIT_DISABLE;    // FSMC won't care
  hsram3.Init.WriteBurst = FSMC_WRITE_BURST_DISABLE;    // Write burst is not supported
  /* Timing */
  Timing.AddressSetupTime = 0;    // doesn't matter
  Timing.AddressHoldTime = 0;    // doesn't matter
  Timing.DataSetupTime = 3;        // NOTE: the less, the better. 2 also worked in my test, but 1 failed.
  Timing.BusTurnAroundDuration = 0;    // doesn't matter
  Timing.CLKDivision = 0;    // doesn't care
  Timing.DataLatency = 0;    // doesn't care
  Timing.AccessMode = FSMC_ACCESS_MODE_A;

  if (HAL_SRAM_Init(&hsram3, &Timing, NULL) != HAL_OK)
  {
    Error_Handler();    // CubeMx-generated error hook; its exact name depends on the CubeMx version
  }
}
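
Choosing FSMC_NORSRAM_BANK3 also decides where the memory shows up in the address space. On the STM32F1, FSMC bank 1 starts at 0x60000000 and each of its four NEx sub-banks covers 64MB, so NE3 begins at 0x68000000, which is the address the access routines later in this article use as FSMC_BANK1_3. Here is a small sketch of that arithmetic; the helper and macro names are made up for illustration and are not part of HAL:

```c
#include <stdint.h>

/* Start of FSMC bank 1 (NOR/PSRAM/SRAM) in the STM32F1 memory map. */
#define FSMC_BANK1_BASE   0x60000000UL
/* Each of the four sub-banks (NE1..NE4) covers 64 MB. */
#define FSMC_SUBBANK_SIZE 0x04000000UL

/* Hypothetical helper: start address of sub-bank n (1..4), or 0 if n is invalid. */
static uint32_t fsmc_bank1_subbank_base(unsigned n)
{
    if (n < 1 || n > 4) {
        return 0;
    }
    return (uint32_t)(FSMC_BANK1_BASE + (n - 1) * FSMC_SUBBANK_SIZE);
}
```

For the board used here, fsmc_bank1_subbank_base(3) gives the base address of the external memory.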

IO Pin Configuration

Actually, CubeMx is a good nanny. She does a great job: we don't have to care about which pins the FSMC uses. But we can still take a look.

// in stm32f1xx_hal_msp.c
static uint32_t FSMC_Initialized = 0;

static void HAL_FSMC_MspInit(void){
  /* USER CODE BEGIN FSMC_MspInit 0 */

  /* USER CODE END FSMC_MspInit 0 */
  GPIO_InitTypeDef GPIO_InitStruct;
  if (FSMC_Initialized) {
    return;    // the pins are already configured
  }
  FSMC_Initialized = 1;
  /* Peripheral clock enable */
  __HAL_RCC_FSMC_CLK_ENABLE();

  /** FSMC GPIO Configuration
  PF0   ------> FSMC_A0        // See, 1bit addressing
  PE7   ------> FSMC_D4
  PE8   ------> FSMC_D5
  PE9   ------> FSMC_D6
  PE10   ------> FSMC_D7
  PD14   ------> FSMC_D0
  PD15   ------> FSMC_D1
  PD0   ------> FSMC_D2
  PD1   ------> FSMC_D3
  PD4   ------> FSMC_NOE
  PD5   ------> FSMC_NWE
  PG10   ------> FSMC_NE3
  */
  GPIO_InitStruct.Pin = GPIO_PIN_0;
  GPIO_InitStruct.Mode = GPIO_MODE_AF_PP;
  GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
  HAL_GPIO_Init(GPIOF, &GPIO_InitStruct);

  GPIO_InitStruct.Pin = GPIO_PIN_7|GPIO_PIN_8|GPIO_PIN_9|GPIO_PIN_10;
  GPIO_InitStruct.Mode = GPIO_MODE_AF_PP;
  GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
  HAL_GPIO_Init(GPIOE, &GPIO_InitStruct);

  GPIO_InitStruct.Pin = GPIO_PIN_14|GPIO_PIN_15|GPIO_PIN_0|GPIO_PIN_1
                          |GPIO_PIN_4|GPIO_PIN_5;
  GPIO_InitStruct.Mode = GPIO_MODE_AF_PP;
  GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
  HAL_GPIO_Init(GPIOD, &GPIO_InitStruct);

  GPIO_InitStruct.Pin = GPIO_PIN_10;
  GPIO_InitStruct.Mode = GPIO_MODE_AF_PP;
  GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
  HAL_GPIO_Init(GPIOG, &GPIO_InitStruct);

  /* USER CODE BEGIN FSMC_MspInit 1 */

  /* USER CODE END FSMC_MspInit 1 */
}

With 12 pins, we can fully control our PSRAM.

Memory Access Code

CubeMx is so sweet that she even prepares HAL versions of the SRAM read/write code for us. If we look carefully, we can find the DMA routines too. However, we won't discuss DMA or interrupts here; we will test our SRAM with plain CPU reads and writes. The following listing shows the typical memory R/W routines.

// Base address of FSMC bank 1, sub-bank 3 (NE3); assumed definition
#define FSMC_BANK1_3 0x68000000UL

// Write sram byte by byte, the normal way
void sram_write(unsigned char* pbuf, unsigned long addr, size_t size) {
    while(size--) {
        *(__IO unsigned char *)(FSMC_BANK1_3 + addr++) = *pbuf++;
    }
}
// Read byte by byte
void sram_read(unsigned char* pbuf, unsigned long addr, size_t size) {
    while(size--) {
        *pbuf++ = *(__IO unsigned char *)(FSMC_BANK1_3 + addr++);
    }
}
// Faster, word by word
// NOTE: size is the number of 16-bit words, not the number of bytes.
void sram_write_word(unsigned short* pbuf, unsigned long addr, size_t size) {
    while(size--) {
        *(__IO unsigned short *)(FSMC_BANK1_3 + addr) = *pbuf++;
        addr += sizeof(unsigned short);
    }
}
// Read word by word
void sram_read_word(unsigned short* pbuf, unsigned long addr, size_t size) {
    while(size--) {
        *pbuf++ = *(__IO unsigned short*)(FSMC_BANK1_3 + addr);
        addr += sizeof(unsigned short);
    }
}
// One step further, try double word (size is the number of 32-bit words)
void sram_write_dword(unsigned int* pbuf, unsigned long addr, size_t size) {
    while(size--) {
        *(__IO unsigned int *)(FSMC_BANK1_3 + addr) = *pbuf++;
        addr += sizeof(unsigned int);
    }
}

void sram_read_dword(unsigned int* pbuf, unsigned long addr, size_t size) {
    while(size--) {
        *pbuf++ = *(__IO unsigned int*)(FSMC_BANK1_3 + addr);
        addr += sizeof(unsigned int);
    }
}

// NOTE: the following code uses two tricks:
// 1. Loop weakening (fewer loop iterations)
// 2. Code expansion (the loop body is unrolled)
// pbuf is assumed to be 4-byte aligned in all three functions.
// Fast write, 8 bytes per loop iteration
void sram_fast_write8(unsigned char* pbuf, unsigned int addr, size_t size) {
    const size_t align = 2 * sizeof(unsigned int);

    if (size <= align) {
        sram_write(pbuf, addr, size);
        return ;
    }

    size_t remains = size & 7;
    size_t count = (size - remains) / sizeof(unsigned int);
    unsigned int* psrc= (unsigned int *)pbuf;
    __IO unsigned int* pdst = (__IO unsigned int *)(FSMC_BANK1_3 + addr);

    while (count) {
        // Write 2 ints each time
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
    }

    if (remains) {
        // Write the tail byte by byte
        sram_write((unsigned char *)psrc, addr + (size - remains), remains);
    }
}

// Fast write, 16 bytes per loop iteration
void sram_fast_write16(unsigned char* pbuf, unsigned int addr, size_t size) {
    const size_t align = 4 * sizeof(unsigned int);

    if (size <= align) {
        sram_write(pbuf, addr, size);
        return ;
    }

    size_t remains = size & 15;
    size_t count = (size - remains) / sizeof(unsigned int);
    unsigned int* psrc= (unsigned int *)pbuf;
    __IO unsigned int* pdst = (__IO unsigned int *)(FSMC_BANK1_3 + addr);

    while (count) {
        // Write 4 ints each time
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
    }

    if (remains) {
        sram_write((unsigned char *)psrc, addr + (size - remains), remains);
    }
}

// Fast write, 32 bytes per loop iteration
void sram_fast_write32(unsigned char* pbuf, unsigned int addr, size_t size) {
    const size_t align = 8 * sizeof(unsigned int);

    if (size <= align) {
        sram_write(pbuf, addr, size);
        return ;
    }

    size_t remains = size & 31;
    size_t count = (size - remains) / sizeof(unsigned int);
    unsigned int* psrc= (unsigned int *)pbuf;
    __IO unsigned int* pdst = (__IO unsigned int *)(FSMC_BANK1_3 + addr);

    while (count) {
        // Write 8 ints each time
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;

        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
    }

    if (remains) {
        sram_write((unsigned char *)psrc, addr + (size - remains), remains);
    }
}
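
The block/tail arithmetic in the fast writers is easy to get wrong, so it can help to check it on a PC before flashing the board. Below is a host-side sketch of the same unrolling idea that copies between two ordinary buffers instead of writing to FSMC memory. The function name is hypothetical, and both pointers are assumed to be 4-byte aligned:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `size` bytes: 4 words (16 bytes) per loop iteration, then the
 * remaining tail (at most 15 bytes) byte by byte. */
static void copy_unrolled16(void *dst, const void *src, size_t size)
{
    uint32_t *d = (uint32_t *)dst;
    const uint32_t *s = (const uint32_t *)src;
    size_t remains = size & 15;                          /* tail bytes */
    size_t count = (size - remains) / sizeof(uint32_t);  /* words in full blocks */

    while (count) {
        *d++ = *s++;
        *d++ = *s++;
        *d++ = *s++;
        *d++ = *s++;
        count -= 4;
    }

    /* Tail: continue where the word loop stopped. */
    uint8_t *db = (uint8_t *)d;
    const uint8_t *sb = (const uint8_t *)s;
    while (remains--) {
        *db++ = *sb++;
    }
}
```

Because count is always a multiple of 4 here, the loop ends exactly at the start of the tail, which is the property the on-target versions rely on as well.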

Testing Code

Here is the testing code. First, the constants and buffers used in the main loop:

  const unsigned int mem_size = 1 * 1024 * 1024;    // We have 1MB memory
  const unsigned int buf_size = 4 * 1024;            // read/write buffer size: 4KB, the unit we do W/R test
  const unsigned int test_loop = 16;                // Run each kind of test 16 times

  unsigned char pbuf[buf_size];    // The write buffer (source data)
  unsigned char pres[buf_size];    // The read-back buffer

The main loop looks like this:

  while (1)
  {
      LOG("-------------------- Begin test -------------------------\r\n");

      // Sanity check: write a pattern, read it back and compare
      memset(pbuf, 0xAB, sizeof(pbuf));
      sram_write(pbuf, 0, sizeof(pbuf));
      memset(pres, 0, sizeof(pres));
      sram_read(pres, 0, sizeof(pres));

      if (0 != memcmp(pbuf, pres, sizeof(pbuf))) {
          continue ;    // read-back mismatch, try again
      }

      {   // Baseline: copy inside the built-in RAM
          LOG("Built in:");
          unsigned long ticks = HAL_GetTick();

          for(unsigned int n = 0; n < test_loop; n++) {
              for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
                  memcpy(pbuf, pres, buf_size);
              }
          }

          ticks = HAL_GetTick() - ticks;
          LOG("\tWt: %lu\tWs: %lu KB/t\r\n", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      }

      {   // Byte by byte
          LOG("Byte:");
          unsigned long ticks = HAL_GetTick();

          for(unsigned int n = 0; n < test_loop; n++) {
              for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
                  sram_write(pbuf, addr, sizeof(pbuf));
              }
          }

          ticks = HAL_GetTick() - ticks;
          LOG("\tWt: %lu\tWs: %lu KB/t", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));

          ticks = HAL_GetTick();

          for(unsigned int n = 0; n < test_loop; n++) {
              for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
                  sram_read(pbuf, addr, sizeof(pbuf));
              }
          }

          ticks = HAL_GetTick() - ticks;
          LOG("\tRt: %lu\tRs: %lu KB/t\r\n", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      }

      {   // Word by word
          LOG("Word:");
          unsigned long ticks = HAL_GetTick();

          for(unsigned int n = 0; n < test_loop; n++) {
              for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
                  sram_write_word((unsigned short *)pbuf, addr, sizeof(pbuf) / sizeof(unsigned short));
              }
          }

          ticks = HAL_GetTick() - ticks;
          LOG("\tWt: %lu\tWs: %lu KB/t", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));

          ticks = HAL_GetTick();

          for(unsigned int n = 0; n < test_loop; n++) {
              for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
                  sram_read_word((unsigned short *)pbuf, addr, sizeof(pbuf) / sizeof(unsigned short));
              }
          }

          ticks = HAL_GetTick() - ticks;
          LOG("\tRt: %lu\tRs: %lu KB/t\r\n", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      }

      {   // Double word by double word
          LOG("Dword:");
          unsigned long ticks = HAL_GetTick();

          for(unsigned int n = 0; n < test_loop; n++) {
              for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
                  sram_write_dword((unsigned int *)pbuf, addr, sizeof(pbuf) / sizeof(unsigned int));
              }
          }

          ticks = HAL_GetTick() - ticks;
          LOG("\tWt: %lu\tWs: %lu KB/t", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));

          ticks = HAL_GetTick();

          for(unsigned int n = 0; n < test_loop; n++) {
              for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
                  sram_read_dword((unsigned int *)pbuf, addr, sizeof(pbuf) / sizeof(unsigned int));
              }
          }

          ticks = HAL_GetTick() - ticks;
          LOG("\tRt: %lu\tRs: %lu KB/t\r\n", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      }

      {   // Fast write, 8 bytes per loop iteration
          LOG("Fast(8B):");
          unsigned long ticks = HAL_GetTick();

          for(unsigned int n = 0; n < test_loop; n++) {
              for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
                  sram_fast_write8(pbuf, addr, sizeof(pbuf));
              }
          }
          ticks = HAL_GetTick() - ticks;

          LOG("\tWt: %lu\tWs: %lu KB/t\r\n", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      }

      {   // Fast write, 16 bytes per loop iteration
          LOG("Fast(16B):");
          unsigned long ticks = HAL_GetTick();

          for(unsigned int n = 0; n < test_loop; n++) {
              for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
                  sram_fast_write16(pbuf, addr, sizeof(pbuf));
              }
          }
          ticks = HAL_GetTick() - ticks;

          LOG("\tWt: %lu\tWs: %lu KB/t\r\n", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      }

      {   // Fast write, 32 bytes per loop iteration
          LOG("Fast(32B):");
          unsigned long ticks = HAL_GetTick();

          for(unsigned int n = 0; n < test_loop; n++) {
              for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
                  sram_fast_write32(pbuf, addr, sizeof(pbuf));
              }
          }
          ticks = HAL_GetTick() - ticks;

          LOG("\tWt: %lu\tWs: %lu KB/t\r\n", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      }

      HAL_Delay(1000); // cooling down
  } // while
  /* USER CODE END 3 */
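
One thing to keep in mind about the sanity check at the top of the loop: it fills the whole buffer with a single value (0xAB), so it would still pass if several addresses landed on the same memory cell. That is worth ruling out when only a few address lines are wired to the FSMC. Here is a minimal sketch of a check that uses data derived from the address instead. The function name is hypothetical, and calling it on the board with (volatile uint8_t *)FSMC_BANK1_3 is my assumption, not part of the original test:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill `size` bytes at `base` with a value derived from each offset, then
 * read everything back. Returns the number of mismatching bytes. */
static size_t mem_check_address_pattern(volatile uint8_t *base, size_t size)
{
    size_t errors = 0;

    for (size_t i = 0; i < size; i++) {
        /* Fold higher offset bits into the byte so that most offsets
         * sharing the same low bits still get different values. */
        base[i] = (uint8_t)(i ^ (i >> 8) ^ (i >> 16));
    }
    for (size_t i = 0; i < size; i++) {
        if (base[i] != (uint8_t)(i ^ (i >> 8) ^ (i >> 16))) {
            errors++;
        }
    }
    return errors;
}
```

The whole range is written before anything is read back, so a cell that is overwritten through an aliased address shows up as a mismatch in the second pass.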

Final Result

Running the test in debug mode, I got this result:
With Timing.DataSetupTime = 3;

-------------------- Begin test -------------------------
Built in:       Wt: 234 Ws: 4 KB/t
Byte:   Wt: 204 Ws: 4 KB/t      Rt: 409 Rs: 2 KB/t
Word:   Wt: 175 Ws: 5 KB/t      Rt: 274 Rs: 3 KB/t
Dword:  Wt: 145 Ws: 7 KB/t      Rt: 202 Rs: 5 KB/t
Fast(8B):       Wt: 95  Ws: 10 KB/t
Fast(16B):      Wt: 95  Ws: 10 KB/t
Fast(32B):      Wt: 95  Ws: 10 KB/t

Surprisingly, reading is much slower than writing: the read speed is only about 50-70% of the write speed. Another interesting thing is that the built-in RAM, which I believe is located inside the chip, is only about as fast as the byte-by-byte method. Finally, the fast method really is fast, but it has its limit: going beyond 8 bytes per loop iteration brings no further gain.

With Timing.DataSetupTime = 2;

-------------------- Begin test -------------------------
Built in:       Wt: 234 Ws: 4 KB/t
Byte:   Wt: 204 Ws: 4 KB/t      Rt: 395 Rs: 2 KB/t
Word:   Wt: 164 Ws: 6 KB/t      Rt: 259 Rs: 3 KB/t
Dword:  Wt: 134 Ws: 7 KB/t      Rt: 187 Rs: 5 KB/t
Fast(8B):       Wt: 80  Ws: 12 KB/t
Fast(16B):      Wt: 80  Ws: 12 KB/t
Fast(32B):      Wt: 80  Ws: 12 KB/t

Note that with the latter configuration, the fast write improves by 2 KB/tick.
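
The "KB/t" columns come from integer arithmetic, so they are truncated: 95 ticks per 1MB pass is really about 10.8 KB/tick, but it is printed as 10. The sketch below repeats the formula from the LOG() calls so the figures can be checked by hand. The function name is hypothetical, and `loops` is assumed to be non-zero:

```c
#include <stdint.h>

/* Same integer arithmetic as the LOG() calls: KB moved per tick, given the
 * total ticks measured for `loops` passes over `mem_size` bytes. */
static uint32_t kb_per_tick(uint32_t mem_size, uint32_t total_ticks, uint32_t loops)
{
    uint32_t denom = total_ticks * 1024u / loops;
    return denom ? mem_size / denom : 0;   /* avoid dividing by zero */
}
```

Assuming HAL's default tick of 1ms, a figure in KB/tick reads roughly as MB/s.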


The external "P"SRAM is fast enough. In most cases we can use it as another memory resource. Some applications, such as driving a color LCD, can use the external PSRAM as a double buffer to avoid lag. Furthermore, we can probably run programs from the external PSRAM and have more fun.

Good luck!



The part above says that the timing is not important, but that is not true. At that time, the author (that is, me) hadn't read the documents carefully.

Here is how to get the timing value:

First, for a 72MHz AHB clock, one HCLK cycle takes 1/72MHz ~= 13.89ns. For a 36MHz AHB clock, one HCLK cycle takes 27.78ns. From here on, t_hclk denotes the duration of one HCLK cycle.

Address Setup Time

For the IS64C256AL, the Address Setup Time to Write End is 25ns. That is 2t_hclk at 72MHz, or 1t_hclk at 36MHz, and so on.

Data Setup Time

According to this document, in the STM32F10xCDE FSMC asynchronous timings table, the Data to FSMC_NEx high setup time + FSMC_NEx low to FSMC_A valid value is 2t_hclk + 25ns.

For a 72MHz AHB clock, t_hclk is 13.89ns, so 2t_hclk is 2 * 13.89ns. Then one gets:

(2t_hclk + 25ns) / 13.89ns ~= 3.8, which rounds up to 4t_hclk

For a 36MHz AHB clock, one HCLK cycle takes 1/36MHz ~= 27.78ns, so the timing is:

(2t_hclk + 25ns) / 27.78ns ~= 2.9, which rounds up to 3t_hclk
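
Both calculations come down to rounding a time in nanoseconds up to whole HCLK cycles. Here is a small sketch of that rounding in integer arithmetic; the function names are made up for illustration, and the 25ns figure is the one quoted above:

```c
#include <stdint.h>

/* Smallest number of HCLK cycles that lasts at least `ns` nanoseconds. */
static uint32_t ns_to_hclk(uint32_t ns, uint32_t hclk_hz)
{
    const uint64_t ns_per_s = 1000000000ull;
    /* Ceiling division of (ns * hclk_hz) by one second in nanoseconds. */
    return (uint32_t)(((uint64_t)ns * hclk_hz + ns_per_s - 1) / ns_per_s);
}

/* Data setup time as derived in the text: 2 HCLK cycles plus 25 ns. */
static uint32_t data_setup_hclk(uint32_t hclk_hz)
{
    return 2u + ns_to_hclk(25u, hclk_hz);
}
```

With these, data_setup_hclk(72000000) gives 4 and data_setup_hclk(36000000) gives 3, matching the values above.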

Bus Turn Around Time

According to Table 110 (FSMC_BTRx bit fields), BUSTURN is the time from NEx high to NEx low, in units of HCLK. And the BUSTURN value usually is:


Then, according to the ISSI document, Taw (address setup time to write end) is 25ns minimum, which is about 2t_hclk at 72MHz or 1t_hclk at 36MHz.

So the timing configuration under 72MHz should look like this:

  • Address setup time in HCLK clock cycles: 2
  • Data setup time in HCLK clock cycles: 4
  • Bus turn around time in HCLK clock cycles: 5
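
These three values end up in the FSMC_BTRx register of the bank, where ADDSET occupies bits 3:0, DATAST bits 15:8 and BUSTURN bits 19:16. The sketch below packs just those three fields to show the layout. The function and macro names are hypothetical, the other fields (ADDHLD, CLKDIV, DATLAT, ACCMOD) are left at zero, and in a real project HAL_SRAM_Init() does this work for you:

```c
#include <stdint.h>

/* Field positions in FSMC_BTRx for the three asynchronous timing fields. */
#define BTR_ADDSET_POS    0u   /* 4 bits wide */
#define BTR_DATAST_POS    8u   /* 8 bits wide */
#define BTR_BUSTURN_POS  16u   /* 4 bits wide */

/* Pack address setup, data setup and bus turnaround (all in HCLK cycles). */
static uint32_t fsmc_btr_pack(uint32_t addset, uint32_t datast, uint32_t busturn)
{
    return ((addset  & 0x0Fu) << BTR_ADDSET_POS)
         | ((datast  & 0xFFu) << BTR_DATAST_POS)
         | ((busturn & 0x0Fu) << BTR_BUSTURN_POS);
}
```

The field widths also explain the largest possible settings: 15 for address setup, 255 for data setup and 15 for bus turnaround.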

Other Parameters

In the timing structure set up in MX_FSMC_Init(), the extra parameters AddressHoldTime, CLKDivision and DataLatency are not used for asynchronous SRAM.

Performance Test

The test is compiled in release mode, reads/writes 64K bytes, and records the time each operation takes, in ticks.

  • Round 0:

    Using the default setting.

    AddressSetupTime: 15
    DataSetupTime: 255
    BusTurnAroundDuration: 15

    The result is:

    Read: 253 ticks
    Write: 264 ticks
  • Round 1:

    Using the setting from this blog, assuming the clock is 72MHz.

    AddressSetupTime: 2
    DataSetupTime: 4
    BusTurnAroundDuration: 5

    The result is:

    Read: 13 ticks
    Write: 23 ticks
  • Round 2:
    Using the setting from this blog, assuming the clock is 36MHz, which only affects the address setup time:

    AddressSetupTime: 1

    The result is:

    Read: 12 ticks
    Write: 22 ticks
  • Round 3:
    Just change AddressSetupTime to 0:

    AddressSetupTime: 0

    The result is:

    Read: 11 ticks
    Write: 21 ticks

    An improvement of 1 tick. Should one say "Amazing"?

  • Round X:

    Let's push the device to its limit. As far as my tests went, the following combinations work. They are listed here FYI:

    Address Setup Time, Data Setup Time, Bus Turn Around Time => Writing Perf, Reading Perf

    0, 2, 4 => 9w, 20r
    0, 1, 4 => 8w, 19r
    0, 1, 3 => 8w, 19r
    0, 1, 2 => 8w, 19r
    0, 1, 1 => 8w, 19r

  • Conclusion

    Setting up the FSMC timing is important if you need speed. Also, don't forget to set the AHB clock as high as possible. For example, with the last setting in Round X, running at 36MHz gives 13 ticks for writing and 24 ticks for reading. Expressed as a ratio, 72MHz is about 1.6 times faster than 36MHz in writing and 1.26 times faster in reading.

By the way, the speed should be:

  • Writing Speed:

    (64 * 1024) / 8 * 1000 / (1024 * 1024) = 7.8125MB/s
  • Read Speed:

    (64 * 1024) / 19 * 1000 / (1024 * 1024) = 3.29MB/s
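
The conversion from ticks to MB/s can be written as a small helper. This is a sketch with a hypothetical function name; it assumes a tick is 1ms (HAL's default) and that 1MB is 1024 * 1024 bytes:

```c
/* Throughput in MB/s (1 MB = 1024 * 1024 bytes) for `bytes` moved in
 * `ticks` milliseconds. Returns 0 if no time was measured. */
static double mb_per_second(unsigned long bytes, unsigned long ticks)
{
    if (ticks == 0) {
        return 0.0;
    }
    return (double)bytes / (double)ticks * 1000.0 / (1024.0 * 1024.0);
}
```

For the 64K byte test, 8 ticks for writing and 19 ticks for reading are the inputs behind the two figures above.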

Further More

Can Address Setup Time be 0?

In the ISSI document, Tsa (Address Setup Time) is 0.

Can Data Setup Time be 0?

In the ISSI document, Tsd (Data Setup Time) is 20ns minimum, which is at least 1t_hclk.

Can Bus Turn Around Duration be 0?

No, it must be at least 1.

