Add shared_task<T> and shared_lazy_task<T> classes

Some scenarios require multiple consumers to wait on the result of a single task,
e.g. when you want to pass a prerequisite task into multiple sub-tasks that each need to await it.

The task<T> and lazy_task<T> classes are move-only and support only a single awaiting coroutine at a time.

This issue proposes adding shared_task<T> and shared_lazy_task<T> classes that support copy construction and copy assignment with reference-counting semantics, and that allow multiple coroutines to await the result concurrently.

It should be possible to implement both classes in a lock-free fashion using std::atomic pointers.
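
As a rough illustration of the lock-free idea, here is a minimal sketch of a waiter list built on a single atomic pointer. All names (`waiter`, `shared_state`, `try_add_waiter`, `set_ready`) are hypothetical and not part of any existing API, and a `std::function` callback stands in for the coroutine handle that a real implementation would resume. The pointer is either null (not ready, no waiters), the head of a linked list of waiters, or a sentinel meaning "result is ready".

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical types for illustration only.
struct waiter {
    explicit waiter(std::function<void()> r) : resume(std::move(r)) {}
    std::function<void()> resume;  // stands in for a coroutine handle
    waiter* next = nullptr;
};

class shared_state {
public:
    // Returns true if the waiter was queued (the caller should suspend),
    // false if the result is already available (the caller continues).
    bool try_add_waiter(waiter* w) noexcept {
        assert(w != nullptr);
        void* old = head_.load(std::memory_order_acquire);
        do {
            if (old == ready_tag()) return false;
            w->next = static_cast<waiter*>(old);
        } while (!head_.compare_exchange_weak(
            old, w, std::memory_order_acq_rel, std::memory_order_acquire));
        return true;
    }

    // Called by the producer after the result has been stored.
    // Publishes the "ready" sentinel and resumes every queued waiter.
    void set_ready() {
        void* old = head_.exchange(ready_tag(), std::memory_order_acq_rel);
        if (old == ready_tag()) return;  // already completed
        auto* w = static_cast<waiter*>(old);
        while (w != nullptr) {
            waiter* next = w->next;  // read first; resume may invalidate w
            w->resume();
            w = next;
        }
    }

    bool is_ready() const noexcept {
        return head_.load(std::memory_order_acquire) == ready_tag();
    }

private:
    // Address of a static object, so it can never equal a real waiter.
    static void* ready_tag() noexcept {
        static char tag;
        return &tag;
    }

    std::atomic<void*> head_{nullptr};
};
```

In this sketch an awaiting coroutine would call `try_add_waiter` from its suspend step and only suspend when it returns true, while the task's completion path would call `set_ready`. Reference counting of the shared state and storage of the result value are left out.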