Hybrid parallel CFEASolver and CMeshSolver by pcarruscag · Pull Request #843

Hybrid parallel CFEASolver and CMeshSolver by pcarruscag · Pull Request #843 · su2code/SU2

Thank you for the thorough review @rsanfer! I'll reply to your main questions and some of the smaller ones here to centralize things.

Just a request, if it's possible that you add one or two test cases so the implementation is safe onwards (and, of course, so I can play around with the new features a little bit ).

The testcases are the same, no changes there other than the one optional option introduced above. When the hybrid stuff covers most of the code I would add an entire build configuration e.g. BaseMPIOMP and corresponding testcase suite.

Should this just run "out of the box" with a working installation of OpenMP in any machine, or is there anything else fancy needed?

I would leave it to the community to decide what the defaults should be, probably for a lot of new users that don't run on clusters just calling SU2_CFD and not having to worry about mpi would be nice (a lot of the issues on CFD online are mpi related).

Is the previous behaviour exactly kept, or are there any modifications in the basic, non OpenMP version of code? (Not that I mind, just curious).

Other than the algorithmic changes (but mathematically equivalent) introduced to limiters and gradients in #834, yes.

What's the advantage of having one numerics term per thread?

It is a requirement, we need to write data into numerics before using them, multiple threads cannot write to the same location (i.e. the internal structures of CNumerics) therefore one per thread is required.

...Also, I think I missed the point where the numerics container is extended beyond MAX_TERMS

The allocation of space for one numerics per thread is done above in line 1995 of my 21 Dec 2019 comment: ...MAX_TERMS*omp_get_max_threads()....
The instantiation of one numerics per thread is then done by executing the rest of the preprocessing in parallel and instead of using XYZ_TERM using XYZ_TERM+offset where offset = thread_id * MAX_TERMS.
I think someone mentioned this (maybe Tim) that we could revisit the ownership relations of the numerics classes, i.e. allocate them as members of their respective solvers, which if we do, we can think of having a purpose built container that automates the per-thread creation and access.

Why are they redefined each time inside the loop?
Is this for efficiency reasons?

Referring to variables being declared inside loops. One stylist reason is that declaring everything at the top of a function is the C way of doing things, the C++ people whose books/blogs I've read and talks I've watched, recommend keeping namespaces (the inside of the loop being one) as clean as possible.
The only reason not to do this is if you explicitly want re-use, in the case of trivial types this does not improve efficiency, and in the context of OpenMP code it can create issues. Just like we need one numerics per thread, if we declare variables outside a parallel loop the default OpenMP behaviour is to consider them shared, and concurrent writes to shared locations = gdb and many bad words xD.
EDIT: I should mention here that if the parallel region is started before the variable declarations they become local and all is well, with the exception of class members, those will be shared most of the time (this is where const correctness can give some peace of mind).

Also, just an additional (hopefully constructive) comment: I find all of these developments great, and I honestly think that you are doing an amazing job on performance and overall code improvement. However, as a non-C++-master myself, I'm just a little concerned of whether some advanced programming may become an entrance barrier to new additions to the community.

As I wrote in the preamble of #789:
"But please participate even if you never heard of these topics, your opinion about readability and "developability" of the code is important! I think the code-style should be accessible to people starting a PhD (after they read a bit about C++...)."
I try to encapsulate and hide the tricky bits as much as possible to make the code as readable as possible, whether I am succeeding or not is for the community to decide, in all these PR's I've been pointing to the areas I think are trickier, if someone, anyone, feels they are absolutely incomprehensible please say something... either here, or trough slack, or by email (I think it shows in the commits) (I understand not everyone is keen on github exposure).

I'm aware that you have been doing very well at documenting the code and the various PRs, but I'd say we should try to find an strategy to ease the learning curve on potential new developers (maybe some developer tutorials? a collection of the comments/discussions on the PRs moved to the wiki? a list of links/useful resources?).

I agree with documentation of broad design decisions, that is the intent of #789, and developer tutorials (how to implement a new X) once we are content with the restructurings, otherwise they will quickly go outdated... or actually...
We should probably first think about the answers to "how to implement a new X" and restructure/refactor as a function of that.
Based on previous efforts of maintaining wiki's updated while code is being developed, I much prefer this github style where you can clearly tell what version of the code the comments refer to. A collection of comments/discussions organized by topic and linked to a feature is somewhat what I had in mind when I opened a "big PR" (#824) with little branches such as this one, I can try to complete that with a list of links/useful resources, references as it were, good idea!