Basically the startup code needs to clear memory from _edata to _end. In the
past it's been done with a fairly naive copy loop. This changes the code to
just call memset and let memset figure out a sensible way to handle the
operation given the size and alignment requirements.
I don't have performance data on this. I cobbled it together some time ago in
response to seeing some of the GCC tests with larger .bss sections taking an
insane amount of time to just get from _start to main. With the fixes to the
H8 decoder in the simulator it may not matter nearly as much anymore.
This has been in my tester for months. Naturally it does not cause any
regressions in the H8 port.