CUBE HAL is 122% larger than Register Access Code!!!

THIS IS A VERY HIGHLY OPINIONATED ARTICLE. ASIDE FROM THE GRAPH AND NUMBERS, WHICH SPEAK FOR THEMSELVES, THE REST OF THE CONTENT IS MY PERSONAL TASTE AND BELIEFS. FEEL FREE TO LEAVE A COMMENT AND SHARE YOUR OPINION.

AS MENTIONED BY ONE OF MY INSTAGRAM FOLLOWERS (@EMBEDADDY), SOME OF THESE NUMBERS SPECIFICALLY FOR CUBE HAL CAN BE IMPROVED WITH OPTIMIZED IDE AND COMPILER SETTINGS, SO KEEP THAT IN MIND!     

In this highly biased article I will examine a very simple program written three different ways: with the STM32 Cube HAL, with the STM32 LL API, and with direct register access. The program configures three peripherals (GPIO, SPI and UART), and inevitably the RCC, which always has to be configured. For the sake of simplicity the peripherals use default or common settings where warranted, such as a 9600 baud rate for the UART and similar defaults for the SPI clock phase and polarity. The main loop simply toggles the LED, sends “hello!” through the UART and sends a byte through SPI. I will examine the code size of the final program with the intention of selling you on the idea of using LL. There, I said it, sue me. But anyway, get in losers, we’re going coding!!!

The main.c file consists of all three programs, conditionally compiled with #ifdef statements. So basically something like this:

//here i only un-comment one to compile the code I want
#define USE_HAL
// #define USE_LL
// #define USE_REG
#ifdef USE_HAL
...all HAL code here gets compiled
#endif
#ifdef USE_LL
...all LL code here gets compiled
#endif
#ifdef USE_REG
...all REG code here gets compiled
#endif
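
One thing this layout makes easy is forgetting to comment a line back out and ending up with two variants (or none) selected. Here is a minimal sketch of a guard for that, not something from my actual main.c. The `active_variant()` helper is a hypothetical name added only so there is something to call.

```c
/* Pick exactly one implementation by leaving a single line un-commented. */
#define USE_HAL
/* #define USE_LL  */
/* #define USE_REG */

/* defined() evaluates to 1 or 0 inside #if, so the sum counts the enabled variants. */
#if (defined(USE_HAL) + defined(USE_LL) + defined(USE_REG)) != 1
#error "Define exactly one of USE_HAL, USE_LL or USE_REG"
#endif

/* Hypothetical helper: reports which variant was compiled in. */
#if defined(USE_HAL)
const char *active_variant(void) { return "HAL"; }
#elif defined(USE_LL)
const char *active_variant(void) { return "LL"; }
#else
const char *active_variant(void) { return "REG"; }
#endif
```

With the guard in place, selecting two variants at once stops the build with a readable message instead of a wall of duplicate `main` errors.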

The code is posted at the very bottom in its entirety. The graph below shows the difference in overall code size of the entire application.


Let's crunch some numbers given the graph above.
  • HAL code is 122% larger than direct register access code.
  • LL API code is 44% larger than direct register access code.
  • HAL code is 53% larger than LL API code.
The jump from register access code to HAL is significant, and even the jump from LL to HAL is larger than the jump from register to LL. The images below help show where the code size increase is coming from.
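
As a sanity check, the three percentages should agree with each other: if HAL is 122% larger than register code and LL is 44% larger, then HAL over LL is (1 + 1.22) / (1 + 0.44) - 1, which comes out to roughly 54%. That is within rounding of the 53% quoted above. A small sketch of the arithmetic, where `implied_increase` is just a name I made up for this example:

```c
/* Relative increase of A over B, given how much larger each one is
   than a common baseline (as fractions: 1.22 means "122% larger"). */
double implied_increase(double a_over_base, double b_over_base)
{
    return (1.0 + a_over_base) / (1.0 + b_over_base) - 1.0;
}
```

Calling `implied_increase(1.22, 0.44)` gives the HAL versus LL figure from the two "versus register" figures alone, without needing the raw byte counts.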

Register vs Cube HAL : 122% increase

The image below is a capture of the Embedded Memory Explorer feature of VisualGDB in the Visual Studio IDE. The compilation before this capture was the register version of the application, followed by a compilation of the Cube HAL version. The red text is the size increase from one compilation to the next, in other words the size increase when going from the register access code to the Cube HAL code. As you can see, the largest chunk of the code size increase comes from Cube's implementation of the RCC driver.

If you want to save yourself 2 KB, just access the RCC yourself to enable peripherals; it's literally 1 line of code to do so. And if you want to learn how to change the core clock via the registers, I have tutorials for that too.
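
To show what that one line looks like, here is a sketch that enables the GPIOC clock, the same write the register listing at the bottom uses for the LED. So that it compiles on a PC, I have swapped the device header for a tiny stand-in: `FakeRCC` and `fake_rcc` are hypothetical and exist only for this example. On the target you would include "stm32f1xx.h" and delete them.

```c
#include <stdint.h>

/* Host-side stand-in for the device header (hypothetical, example only).
   Only the one register used here is modelled. */
typedef struct { volatile uint32_t APB2ENR; } FakeRCC;
static FakeRCC fake_rcc;
#define RCC                (&fake_rcc)
#define RCC_APB2ENR_IOPCEN (1UL << 4)   /* GPIOC clock enable bit on STM32F1 */

void enable_gpioc_clock(void)
{
    RCC->APB2ENR |= RCC_APB2ENR_IOPCEN;   /* the one line */
}
```

More peripherals on the same bus can be OR-ed into the same write, which is what `initSPI()` in the register listing does.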

I was wondering why there would be a jump in size for the core_cm3.h file. It is a result of HAL calling functions in that file to access the NVIC and set up the SysTick timer, which it uses for its delay function. The memory viewer credits the increase in memory usage to that file since it contains the NVIC and SysTick functions.
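
For anyone who has not looked at how a tick-based delay works, here is a rough sketch of the idea, not HAL's actual source. A millisecond counter is bumped from the SysTick interrupt and the delay spins until enough ticks have passed. Since a PC has no SysTick interrupt, `get_tick()` here fakes the passing of time by advancing the counter on every read. All the names are made up for this example.

```c
#include <stdint.h>

static volatile uint32_t tick_ms;   /* millisecond counter */

/* On the target this body would live in SysTick_Handler(). */
void tick_isr(void) { tick_ms++; }

/* Host-only simulation: every read pretends one millisecond has passed. */
uint32_t get_tick(void) { tick_isr(); return tick_ms; }

void delay_ms(uint32_t ms)
{
    uint32_t start = get_tick();
    /* Unsigned subtraction keeps working when the counter wraps around. */
    while ((uint32_t)(get_tick() - start) < ms) { }
}
```

The counter lives in SRAM and the interrupt handler and NVIC setup live in flash, so a delay built this way costs a little of both.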

Also note the SRAM increase going from register code to Cube HAL. You will notice in the next image that with the LL API there is no SRAM increase over the register access code.




Register vs LL API : 44% Increase

And here we have the memory usage increase when going from register access to the LL API. In this case the largest culprit seems to be the GPIO driver, which makes sense because its init structure gets used extensively in setting up all the pins for the peripherals. That I understand, but HAL's massive RCC module I do not understand. Also note there is no increase in SRAM usage over the register access code.




At the time of this writing there are few tutorials on how to use STM’s Low Level drivers (LL), so I will be coming out with a series on that shortly... which really means eventually. One thing I am noticing is that people confuse the term “Low Level” with register level code. Yes, register level code is low level, but “Low Level” is the actual name of a HAL offered by STM, just as Cube is the name of another HAL offered by STM. In fact, Cube HAL relies heavily on the LL drivers. After all the asserts and error checking inside a Cube function, you will often find that it ultimately calls an LL function.
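
To make the layering concrete, here is a toy version of the pattern. It is not code from the Cube or LL sources. The thin function is a single read-modify-write, and the thick one checks its arguments and then hands off to the thin one. `FakePort`, `thin_toggle`, `thick_toggle` and `Status` are all hypothetical names, and the port is a plain struct so the sketch runs on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a GPIO port: just the output data register. */
typedef struct { volatile uint32_t ODR; } FakePort;

typedef enum { STATUS_OK = 0, STATUS_ERROR = 1 } Status;

/* Thin layer: one inline read-modify-write, no checks, no state. */
static inline void thin_toggle(FakePort *port, uint32_t pin_mask)
{
    port->ODR ^= pin_mask;
}

/* Thick layer: validates its arguments, then delegates to the thin layer. */
Status thick_toggle(FakePort *port, uint32_t pin_mask)
{
    assert(port != 0);                       /* debug-build parameter check */
    if (pin_mask == 0u || pin_mask > 0xFFFFu)
        return STATUS_ERROR;                 /* only 16 pins per port */
    thin_toggle(port, pin_mask);
    return STATUS_OK;
}
```

Every check in the thick layer is extra flash, which is the trade the rest of this article is about.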

By using the LL drivers on their own you remove an entire layer of abstraction and bloat. (If you appreciate the error checking, full asserts and ease of use, then it's not bloat.)

So why would I, being a strong advocate for register level code, now opt for an abstraction layer? Well, I am not opposed to hardware abstraction. Most of my tutorials show register level code for learning purposes, so that you can perhaps design your own abstraction layer with as much or as little bloat as you want.

Simple peripherals like GPIO, I2C, USART, SPI, DMA, TIMERS, CRC do not necessarily need an abstraction layer because they are simple enough to configure directly at the register level. For learning purposes these peripherals are a great way to get intimate with the hardware.

Once you start using things like an RTOS, Ethernet, USB and even CAN, you will cry your eyes out trying to manage the hundreds of settings that need to be configured properly to get an error-free implementation. At this point an abstraction layer and/or a well-written library is pretty much a must. For those applications LL is not enough, and Cube HAL is alright if and when you can get it working.

The clear benefit I see with the LL API versus Cube HAL is that LL still requires you to know what you are doing. That is why this article is biased: knowing what you are doing is important to ME and should be important to any real engineer or student. Cube HAL, on the other hand, hides too much, and by doing so it bloats the code a lot more. You end up with bigger than necessary code size and lower efficiency, and at the end of the day you have no clue how it works.

Saving a few KB of code space is not such a huge deal. The tragedy is when you call yourself a programmer or engineer and have no idea what the code is doing or why the HAL generated code does not work. Debugging is a nightmare because you do not know how the hardware is supposed to be configured in the first place, so how can you spot a bug?

Should your boss or company ask you not to use Cube HAL and to write very specific and efficient code, because they have opted for the cheapest MCU with the smallest memory to keep cost down, you will not know where to start, because you have no idea how to get anything going without calling a pre-written init function that some other smart fella/gal wrote.

One thing I like about using LL is that I do not have to dig into the reference manual looking for bit locations. With all the programming I do, it is impossible to remember the location of bits or exact register names. LL helps with this, since you can just initialize a structure and have it configure the desired peripheral (this is also my preferred method for personal driver development). Together with code completion and the naming conventions used in the LL API, it is easy to find the desired configuration settings.
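
Here is roughly what I mean by the structure approach, as a minimal sketch rather than the LL implementation. You fill in a small struct and a single init function works out which register and which bits to touch. It follows the STM32F1 layout, where every pin gets a 4 bit field (CNF in the upper two bits, MODE in the lower two), pins 0 to 7 live in CRL and pins 8 to 15 in CRH. `FakeGpio`, `PinConfig` and `pin_init` are hypothetical names, and the port is a plain struct so the sketch runs on a PC.

```c
#include <stdint.h>

/* Host-side stand-in for an STM32F1 GPIO port (hypothetical, example only). */
typedef struct { volatile uint32_t CRL; volatile uint32_t CRH; } FakeGpio;

typedef struct {
    uint8_t pin;    /* 0..15 */
    uint8_t mode;   /* MODE[1:0]: 0 = input, 1 = 10 MHz, 2 = 2 MHz, 3 = 50 MHz output */
    uint8_t cnf;    /* CNF[1:0]: for outputs, 0 = push-pull, 1 = open-drain */
} PinConfig;

void pin_init(FakeGpio *port, const PinConfig *cfg)
{
    /* Pins 0-7 are configured in CRL, pins 8-15 in CRH. */
    volatile uint32_t *reg = (cfg->pin < 8u) ? &port->CRL : &port->CRH;
    uint32_t shift = (uint32_t)(cfg->pin % 8u) * 4u;
    uint32_t field = ((uint32_t)(cfg->cnf & 3u) << 2) | (uint32_t)(cfg->mode & 3u);

    /* Clear the pin's 4 bit field, then write the new CNF/MODE value. */
    *reg = (*reg & ~((uint32_t)0xFu << shift)) | (field << shift);
}
```

Setting up the LED on PC13 as a push-pull output then becomes `PinConfig led = { 13, 3, 0 }; pin_init(&port, &led);`, which is the same result as the two CRH lines in `initDebugLed()` at the bottom, without looking up where the bits for pin 13 sit.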

It is my opinion that the little bit of extra code size added by LL is worth not having to dig through the reference manual, and at least LL does not promote ignorance of the hardware. However, Cube HAL does have its uses and target users as well.



#define USE_HAL
//#define USE_LL
//#define USE_REG
#ifdef USE_HAL
#include <stm32f1xx_hal.h>
#include <stm32_hal_legacy.h>
#include "CL_CONFIG.h"
#include "CL_printMsg.h"
#define LED0_Pin GPIO_PIN_13
#define LED0_GPIO_Port GPIOC
SPI_HandleTypeDef hspi1;
UART_HandleTypeDef huart1;
void SystemClock_Config(void);
static void MX_GPIO_Init(void);
static void MX_SPI1_Init(void);
static void MX_USART1_UART_Init(void);
void Error_Handler(void);
void SysTick_Handler(void)
{
HAL_IncTick();
HAL_SYSTICK_IRQHandler();
}
int main(void)
{
HAL_Init();

SystemClock_Config();
MX_GPIO_Init();
MX_SPI1_Init();
MX_USART1_UART_Init();

__GPIOC_CLK_ENABLE();
GPIO_InitTypeDef GPIO_InitStructure;
GPIO_InitStructure.Pin = GPIO_PIN_13;
GPIO_InitStructure.Mode = GPIO_MODE_OUTPUT_PP;
GPIO_InitStructure.Speed = GPIO_SPEED_FREQ_HIGH;
GPIO_InitStructure.Pull = GPIO_NOPULL;
HAL_GPIO_Init(GPIOC, &GPIO_InitStructure);
for (;;)
{

HAL_GPIO_TogglePin(GPIOC, GPIO_PIN_13);
uint8_t spiByte = 0x65; //HAL_SPI_Transmit takes a pointer to the data, not the value itself
HAL_SPI_Transmit(&hspi1, &spiByte, 1, HAL_MAX_DELAY);
CL_printMsg("hello!");
HAL_Delay(500);


}
}
void SystemClock_Config(void)
{
RCC_OscInitTypeDef RCC_OscInitStruct = { 0 };
RCC_ClkInitTypeDef RCC_ClkInitStruct = { 0 };

RCC_OscInitStruct.OscillatorType = RCC_OSCILLATORTYPE_HSE;
RCC_OscInitStruct.HSEState = RCC_HSE_ON;
RCC_OscInitStruct.HSEPredivValue = RCC_HSE_PREDIV_DIV1;
RCC_OscInitStruct.HSIState = RCC_HSI_ON;
RCC_OscInitStruct.PLL.PLLState = RCC_PLL_ON;
RCC_OscInitStruct.PLL.PLLSource = RCC_PLLSOURCE_HSE;
RCC_OscInitStruct.PLL.PLLMUL = RCC_PLL_MUL9;
if (HAL_RCC_OscConfig(&RCC_OscInitStruct) != HAL_OK)
{
Error_Handler();
}
RCC_ClkInitStruct.ClockType = RCC_CLOCKTYPE_HCLK | RCC_CLOCKTYPE_SYSCLK
                            | RCC_CLOCKTYPE_PCLK1 | RCC_CLOCKTYPE_PCLK2;
RCC_ClkInitStruct.SYSCLKSource = RCC_SYSCLKSOURCE_PLLCLK;
RCC_ClkInitStruct.AHBCLKDivider = RCC_SYSCLK_DIV1;
RCC_ClkInitStruct.APB1CLKDivider = RCC_HCLK_DIV2;
RCC_ClkInitStruct.APB2CLKDivider = RCC_HCLK_DIV1;
if (HAL_RCC_ClockConfig(&RCC_ClkInitStruct, FLASH_LATENCY_2) != HAL_OK)
{
Error_Handler();
}
}

static void MX_SPI1_Init(void)
{

hspi1.Instance = SPI1;
hspi1.Init.Mode = SPI_MODE_MASTER;
hspi1.Init.Direction = SPI_DIRECTION_2LINES;
hspi1.Init.DataSize = SPI_DATASIZE_8BIT;
hspi1.Init.CLKPolarity = SPI_POLARITY_LOW;
hspi1.Init.CLKPhase = SPI_PHASE_1EDGE;
hspi1.Init.NSS = SPI_NSS_SOFT;
hspi1.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_8;
hspi1.Init.FirstBit = SPI_FIRSTBIT_MSB;
hspi1.Init.TIMode = SPI_TIMODE_DISABLE;
hspi1.Init.CRCCalculation = SPI_CRCCALCULATION_DISABLE;
hspi1.Init.CRCPolynomial = 10;
if (HAL_SPI_Init(&hspi1) != HAL_OK)
{
Error_Handler();
}

}

static void MX_USART1_UART_Init(void)
{

huart1.Instance = USART1;
huart1.Init.BaudRate = 115200;
huart1.Init.WordLength = UART_WORDLENGTH_8B;
huart1.Init.StopBits = UART_STOPBITS_1;
huart1.Init.Parity = UART_PARITY_NONE;
huart1.Init.Mode = UART_MODE_TX_RX;
huart1.Init.HwFlowCtl = UART_HWCONTROL_NONE;
huart1.Init.OverSampling = UART_OVERSAMPLING_16;
if (HAL_UART_Init(&huart1) != HAL_OK)
{
Error_Handler();
}

}

static void MX_GPIO_Init(void)
{
GPIO_InitTypeDef GPIO_InitStruct = { 0 };

__HAL_RCC_GPIOC_CLK_ENABLE();
__HAL_RCC_GPIOD_CLK_ENABLE();
__HAL_RCC_GPIOA_CLK_ENABLE();

HAL_GPIO_WritePin(LED0_GPIO_Port, LED0_Pin, GPIO_PIN_RESET);

GPIO_InitStruct.Pin = LED0_Pin;
GPIO_InitStruct.Mode = GPIO_MODE_OUTPUT_PP;
GPIO_InitStruct.Pull = GPIO_NOPULL;
GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_LOW;
HAL_GPIO_Init(LED0_GPIO_Port, &GPIO_InitStruct);
}
void Error_Handler(void)
{

}
#endif
 
#ifdef USE_LL
#include <stm32f1xx_ll_bus.h>
#include <stm32f1xx_ll_gpio.h>
#include <stm32f1xx_ll_utils.h>
#include <stm32f1xx_ll_spi.h>
#include <stm32f1xx_ll_usart.h>
#include <stm32f1xx_ll_exti.h>
#include <stm32f1xx_ll_system.h>
//------------| COMM LIBS |----------
#include "CL_CONFIG.h"
#include "CL_delay.h"
#include "CL_systemClockUpdate.h"
#include "CL_printMsg.h"






#define NRF_PINS_CLOCK_ENABLE() (RCC->APB2ENR |= RCC_APB2ENR_IOPAEN )  //given the pins are on GPIOA
uint8_t NRFSTATUS = 0x00;
uint8_t tx_data_buff[32];
uint8_t rx_data_buff[32];
uint8_t multibyte_buff[10] = { 0 };
uint8_t flag = 0x55;

void init_pins(void);
void init_spi1(void);

void spiSend(uint8_t  data);
void blinkLed(void);
void printRegister(uint8_t reg);
int main(void)
{

setSysClockTo72();
CL_delay_init();
init_pins();
init_spi1();

for(;  ;)
{
LL_GPIO_TogglePin(GPIOC, LL_GPIO_PIN_13);
LL_SPI_TransmitData8(SPI1, 0x65);
CL_printMsg("hello!");
delayMS(500);
}

}
void init_pins(void)
{
//clock enable GPIOA
LL_APB2_GRP1_EnableClock(LL_APB2_GRP1_PERIPH_GPIOA);

LL_GPIO_InitTypeDef pins;
LL_GPIO_StructInit(&pins);

// CE pin as output : active high/ normally low
// CSN chip select for spi active low / normally high
pins.Pin = LL_GPIO_PIN_3 | LL_GPIO_PIN_2 ;
pins.Mode = LL_GPIO_MODE_OUTPUT;
pins.OutputType = LL_GPIO_OUTPUT_PUSHPULL;
pins.Speed = LL_GPIO_SPEED_FREQ_HIGH;
LL_GPIO_Init(GPIOA, &pins);  // CE & CSN on same port only need this once

//LED pin
pins.Pin   = LL_GPIO_PIN_13;
LL_GPIO_Init(GPIOC, &pins);

// IRQ pin as input with interrupt enabled
pins.Pin = LL_GPIO_PIN_4;
pins.Mode = LL_GPIO_MODE_INPUT;
pins.Pull = LL_GPIO_PULL_UP;
LL_GPIO_Init(GPIOA, &pins);

LL_EXTI_InitTypeDef myEXTI = { 0 };
LL_EXTI_StructInit(&myEXTI);
myEXTI.Line_0_31 = LL_EXTI_LINE_4;
myEXTI.LineCommand = ENABLE;
myEXTI.Mode = LL_EXTI_MODE_IT;
myEXTI.Trigger = LL_EXTI_TRIGGER_FALLING;
LL_EXTI_Init(&myEXTI);

LL_GPIO_SetOutputPin(GPIOA, LL_GPIO_PIN_3);
LL_GPIO_ResetOutputPin(GPIOA, LL_GPIO_PIN_2);
NVIC_EnableIRQ(EXTI4_IRQn); //enable IRQ on Pin 4
}
void init_spi1(void)
{
// CLOCK  [ Alt Function ] [ GPIOA ] [ SPI1 ]

LL_APB2_GRP1_EnableClock(LL_APB2_GRP1_PERIPH_SPI1);
// GPIO [ PA5:SCK:output:push ] [ PA6:MISO:input:float/pullup ] [ PA7:MOSI:output:push ]
LL_GPIO_InitTypeDef spiGPIO;
LL_GPIO_StructInit(&spiGPIO);

spiGPIO.Pin = LL_GPIO_PIN_5 | LL_GPIO_PIN_7;
spiGPIO.Mode = LL_GPIO_MODE_ALTERNATE;
spiGPIO.OutputType = LL_GPIO_OUTPUT_PUSHPULL;
spiGPIO.Speed = LL_GPIO_SPEED_FREQ_MEDIUM;

LL_GPIO_Init(GPIOA, &spiGPIO);

spiGPIO.Pin = LL_GPIO_PIN_6;
spiGPIO.Mode = LL_GPIO_MODE_FLOATING;
spiGPIO.Pull = LL_GPIO_PULL_UP;
LL_GPIO_Init(GPIOA, &spiGPIO);

// SPI
LL_SPI_InitTypeDef mySPI;
LL_SPI_StructInit(&mySPI);

mySPI.Mode = LL_SPI_MODE_MASTER;
mySPI.NSS = LL_SPI_NSS_SOFT;
mySPI.BaudRate = LL_SPI_BAUDRATEPRESCALER_DIV32;
LL_SPI_Init(SPI1, &mySPI);

LL_SPI_Enable(SPI1);

}
void init_UART(void)
{
LL_USART_InitTypeDef uart;
LL_USART_StructInit(&uart);
LL_USART_Enable(USART1);

 
}


#endif
#ifdef USE_REG
#include "stm32f1xx.h"
//------------| COMM LIBS |----------
#include "CL_CONFIG.h"
#include "CL_systemClockUpdate.h"
#include "CL_printMsg.h"
void initSPI(void);
void initDebugLed(void);
uint8_t spiSend(uint8_t data);
int main(void)
{
setSysClockTo72();
CL_delay_init();
CL_printMsg_init_Default(false);
initDebugLed();
initSPI();

for (;;)
{

GPIOC->ODR ^= GPIO_ODR_ODR13;
spiSend(0x65);
CL_printMsg("hello");
delayMS(500);


}

}
void initSPI(void)
{
//clocks and pin configs related to spi
RCC->APB2ENR |= RCC_APB2ENR_IOPAEN  | RCC_APB2ENR_SPI1EN | RCC_APB2ENR_AFIOEN;
GPIOA->CRL &= ~(GPIO_CRL_MODE5 | GPIO_CRL_MODE7 | GPIO_CRL_CNF5 | GPIO_CRL_CNF7 | GPIO_CRL_CNF6 | GPIO_CRL_MODE6);
GPIOA->CRL |= (GPIO_CRL_CNF5_1 | GPIO_CRL_MODE5 | GPIO_CRL_CNF7_1 | GPIO_CRL_MODE7 |  GPIO_CRL_CNF6_0);

//setup spi
SPI1->CR1 |= (SPI_CR1_SSM | SPI_CR1_BR_2 | SPI_CR1_MSTR);
SPI1->CR2 |= (SPI_CR2_SSOE);
SPI1->CR1 |= SPI_CR1_SPE;


}
void initDebugLed(void)
{
RCC->APB2ENR |= RCC_APB2ENR_IOPCEN;
GPIOC->CRH &= ~(GPIO_CRH_CNF13 | GPIO_CRH_MODE13);
GPIOC->CRH |= (GPIO_CRH_MODE13);


}
uint8_t spiSend(uint8_t data)
{
uint8_t volatile i = 0; //initialized so the increment below does not read an indeterminate value
while ((SPI1->SR & SPI_SR_TXE) != SPI_SR_TXE) 
{
i++;
}
SPI1->DR  = data;
return SPI1->DR ;

}
#endif


Comments

  1. Nice article; it explains some performance issues I was observing in a piece of code. Thank you.

  2. Nice work, Edwin! Keep going. Make more articles and videos with CMSIS (no HAL) approach as this is the only way to understand how an STM32 works. Thank you!

    Replies
    1. Thanks for the comment, and yes, I do have more articles coming soon. I have just been a bit busy in my personal life, but I will add more content here. Glad to see someone out there is reading.

  3. Awesome as always. Can you do more articles? CAN bus would be very nice.
