DenseMatrix should not use Array and parallel_for
DenseMatrix has been written to handle quite small dense matrices and to be compatible with Vector.
This is actually a very bad idea!
Small DenseMatrix or small Vector objects are a very poor fit for the allocator of Kokkos::View: code that creates lots of DenseMatrix instances (for instance one per node) becomes extremely expensive, essentially because of the allocator itself.
One should get rid of the embedded Array in DenseMatrix and either create a new "small vector" class or find an easy way to replace Kokkos::View's allocator.
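As a rough illustration of the first option, here is a minimal sketch of what a "small vector" and a matching small matrix could look like. The names `SmallVector` and `SmallMatrix`, the compile-time sizes and the row-major layout are all assumptions made for this example, not existing code. The point is only that the storage is a `std::array` held inside the object, so creating one instance per node never goes through an allocator. In a Kokkos code base the member functions would presumably also need to be marked `KOKKOS_INLINE_FUNCTION` to be callable from device kernels; that is left out here to keep the sketch standalone.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size vector: the values live inside the object itself,
// so construction and destruction never call a heap or View allocator.
template <std::size_t N, typename T = double>
class SmallVector
{
  std::array<T, N> m_values{};   // value-initialized, i.e. zero for arithmetic T

 public:
  T& operator[](std::size_t i)
  {
    assert(i < N);
    return m_values[i];
  }

  const T& operator[](std::size_t i) const
  {
    assert(i < N);
    return m_values[i];
  }
};

// Hypothetical fixed-size dense matrix with row-major in-object storage.
template <std::size_t M, std::size_t N, typename T = double>
class SmallMatrix
{
  std::array<T, M * N> m_values{};   // value-initialized, i.e. zero for arithmetic T

 public:
  T& operator()(std::size_t i, std::size_t j)
  {
    assert(i < M && j < N);
    return m_values[i * N + j];
  }

  const T& operator()(std::size_t i, std::size_t j) const
  {
    assert(i < M && j < N);
    return m_values[i * N + j];
  }
};

// Matrix-vector product; the result is again a stack object.
template <std::size_t M, std::size_t N, typename T>
SmallVector<M, T> operator*(const SmallMatrix<M, N, T>& A, const SmallVector<N, T>& x)
{
  SmallVector<M, T> y;   // starts at zero
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      y[i] += A(i, j) * x[j];
    }
  }
  return y;
}
```

With such types, a per-node matrix would be declared as, say, `SmallMatrix<3, 3> A;` and used like the current DenseMatrix through `A(i, j)`. The trade-off is that sizes must be known at compile time, which may or may not cover every present use of DenseMatrix.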