I can't imagine doing anything embedded / real-time with C++. I don't constantly count every little cycle with the D-Lev SW running on Hive, but there is a strict budget when 8 threads have to equally share 180MHz worth of cycles, interrupted at 48kHz.
You don't need to use a hydraulic drill where poking a hole with a needle will do, even if you bought that new big toolbox.
C++ compiler-generated executables, from tests I've seen, are not inherently slower than C ones. (They can even be faster where equivalent C code tries to replicate concepts that are language features in C++ - because the C++ compiler "gets" what you want to do and has optimizations for it, while the C compiler doesn't.)
What's slow is using parts of the standard library that are prone to dynamic memory allocation.
The exception mechanism - same thing (don't use it).
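A minimal illustration of the allocation point (hypothetical sizes; on GCC/Clang you'd typically also disable exceptions wholesale with -fno-exceptions):

#include <array>
#include <vector>

std::array<int, 64> samples{};   // fixed capacity, no heap - fine for bare metal
// std::vector<int> v(64);       // heap allocation under the hood - avoid on tight targets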
When you do something that needs to be as fast as possible, virtual functions and virtual inheritance might also not be ideal - but their cost tends to be overdramatized.
When big objects are passed by value, that used to be slow(er than necessary) because of copying. Since the introduction of move semantics, copies can be avoided more easily. (For your own types it takes some extra effort; standard library types tend to have it built in. If you don't want dynamic allocation at all, those library types are probably off the table anyway - although I haven't looked that deeply into it.)
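A minimal sketch of that "extra effort" for your own types - a hypothetical Buffer class, not from any real project:

#include <cstddef>

struct Buffer
{
    std::size_t size = 0;
    int*        data = nullptr;

    explicit Buffer(std::size_t n) : size(n), data(new int[n]()) {}
    ~Buffer() { delete[] data; }

    Buffer(const Buffer&) = delete;            // forbid accidental deep copies
    Buffer& operator=(const Buffer&) = delete;

    Buffer(Buffer&& other) noexcept            // move: steal the pointer, no copying
        : size(other.size), data(other.data)
    {
        other.size = 0;
        other.data = nullptr;
    }
};

Buffer MakeBuffer() { return Buffer(256); }    // returned by move (or elided), never copied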
You don't even need to use anything OO-related to reap some benefits of current C++. You can stay procedural but use function templates (including the readily available ones in the library, like subsets of those in <algorithm>, <limits>, <type_traits>, etc.).
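For instance, a hypothetical little helper in that style - procedural, type-safe, one definition for all arithmetic types:

#include <type_traits>

template <typename T>
constexpr T Clamp(T v, T lo, T hi)
{
    static_assert(std::is_arithmetic<T>::value, "arithmetic types only");
    return v < lo ? lo : (v > hi ? hi : v);
}

static_assert(Clamp(5, 0, 3) == 3, "works at compile time too");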
With recent C++ versions, I decidedly switched to C++ over C for embedded bare-metal projects, exactly because it lets me code better for three parties at once: the product, the maintainer(s), and the author of the code (in the latter case, by reducing the tedium).
The aforementioned subset of standard library headers offering template machinery for compile-time work, in conjunction with the ever-mightier "constexpr" keyword, lets you perform all sorts of computations at compile time and make their results part of the image that lands in program memory - right there in the project code, so everyone sees what's going on. Not some magic hard-coded tables computed elsewhere.
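For example, a minimal sketch (C++17 or later assumed; the table and its contents are made up) of a lookup table computed entirely by the compiler, landing in program memory as initialized data with no boot-time fill loop:

#include <array>
#include <cstdint>

constexpr std::array<std::uint16_t, 256> MakeSquares()
{
    std::array<std::uint16_t, 256> t{};
    for (unsigned i = 0; i < t.size(); ++i)
        t[i] = static_cast<std::uint16_t>(i * i);
    return t;
}

constexpr auto kSquares = MakeSquares();
static_assert(kSquares[15] == 225, "table computed at compile time");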
This is far less painful and bug-inviting than macros. (And try using a macro that does something interesting within another macro... whaa whaa whaaaaa... You'd need multiple passes - some people rely on external tools for that. How nice when the language you are supposedly using actually lets you express it.)
You can (more safely) use things that would otherwise be considered "dirty", by putting code next to them that performs checks as complex or exhaustive as you need, testing the assumptions the code makes. Like two places in the code relying on the same order of something - implicitly mapping things by position so as not to spend extra memory doing it explicitly - which is prone to breed bugs when any part of it gets altered later.
Or you assemble some data structure that is to lie in program memory without any run-time or boot-time overhead, and you want to check its consistency somehow, because somewhat repetitive lines of assignments let errors slip through easily. Just code those checks in a constexpr function and put a call to it in a static_assert. If it fails, you get a compile-time error. None of this needs to be executed on the target - it won't even be code on the target, wasting space, as long as nobody calls it at run-time.
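A minimal sketch of the idea, with hypothetical names - the point is the constexpr check function called from a static_assert:

// Two tables that implicitly map to each other by position.
enum Wave { kSine, kSaw, kSquare, kWaveCount };

constexpr const char* kWaveNames[] = { "sine", "saw", "square" };

// Consistency check, evaluated entirely by the compiler:
constexpr bool WaveTablesConsistent()
{
    return sizeof(kWaveNames) / sizeof(kWaveNames[0]) == kWaveCount;
}
static_assert(WaveTablesConsistent(), "kWaveNames out of sync with enum Wave");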
This increases compile times - still preferable to the time wasted debugging bugs born of some foul-ish optimization.
And with multithreaded compilation on an 8-core machine with 16 hyperthreads or so, and the code lying on an SSD, who notices that anyway at typical MCU project sizes? Especially if you use forward declarations and references in headers where possible, so that not every change anywhere triggers a recompile chain reaction.
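A small illustration of the header trick, with hypothetical classes:

// widget.h
class Driver;                    // forward declaration - no #include "driver.h" needed

class Widget
{
public:
    explicit Widget(Driver& d) : drv(d) {}
private:
    Driver& drv;                 // a reference doesn't require the complete type
};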
I was mildly amused when I recently tried how far constexpr goes these days, and something like this compiled and worked as... let's say suspected:
constexpr unsigned NBITS = Log2( NextPOT(someEnumCount) ); // someEnumCount: any compile-time constant (an enum count, a table size, ...)
struct alignas(uint8_t) SomeStruct
{
    uint8_t A : NBITS, B : NBITS, C : NBITS;
};
I.e. a bitfield where the bit widths are not literals, but something you computed at compile-time.
(Not useful when you (ab)use a bitfield for hardware register mapping, but sometimes you just want to store stuff as compactly as possible, in a way that evolving code can track automatically.)
Something that's dependent on other parts of the code with their own constants - and you really want one place that is the "master" of it all, the place where you turn the screws to adjust something, while all dependent parts follow automatically.
One can do such things with macros, but far less extensively (multipass!) and with no type safety.
Log2 and NextPOT are constexpr function templates in my library. GCC has __builtin_log2 variants, which are all floating-point of one size or another - not good when you happen to call such a function at run-time for a change.
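For illustration, a sketch of what such helpers might look like (hypothetical code, not the actual library; C++14 or later for the loops in constexpr):

#include <type_traits>

// Floor of log2, usable at compile time and at run time.
template <typename T>
constexpr T Log2(T v)
{
    static_assert(std::is_unsigned<T>::value, "unsigned integral types only");
    T r = 0;
    while (v >>= 1)
        ++r;
    return r;
}

// Smallest power of two >= v.
template <typename T>
constexpr T NextPOT(T v)
{
    T p = 1;
    while (p < v)
        p <<= 1;
    return p;
}

static_assert(NextPOT(5u) == 8u, "");
static_assert(Log2(8u) == 3u, "");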
Oh, and using the <algorithm> header - things like std::min and std::max instead of stupid surprise macros for the same purpose.
And all the code duplication and maintenance worries from the C alternatives that aren't macros: functions working with exactly one type (or, if nothing needs to be stored compactly, one biggest-type-of-the-family function called with smaller compatible arguments... which isn't always good either).
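The classic macro surprise, for reference (made-up snippet):

#include <algorithm>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int Demo(int x, int y)
{
    int m1 = MIN(x++, y);     // x can be incremented twice - the "surprise"
    int m2 = std::min(x, y);  // type-checked, each argument evaluated exactly once
    return m1 + m2;
}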
It's actually pretty amazing the 48kHz DSP you can accomplish with a 33x33=65 bit multiplier ALU running at 180 MIPS. Assembly really helps in squeezing it all out IMO (and I'm a control freak, so there's that).
I didn't have to use asm on the ARM Cortex-Ms, but if it seemed to make sense I wouldn't hesitate to use it on MCU projects - especially for stuff likely to be reusable across such projects.
I could only imagine doing that for something in an ISR running at a rate that makes the MCU sweat, when using a bigger MCU isn't an option for some reason. Just not likely to happen currently. At a previous job in consumer electronics (in the widest sense), where every cent per unit counts, that would have been more likely.
Or some inner loops that are speed-critical. But not the overall logic of the program, which eats only a few % of the processing.
OK, your application is basically all data processing. But it probably also has places in the code that are executed far less frequently than your hot spots. And if you optimize a group of code spots that in total eats 5% of the processing and manage to make it eat 4% - I guess there are places where the time is better spent. Let the compiler do the rest.