Add functions to compute, plot, store the local hazard exceedence intensity and RP maps (new branch) by ValentinGebhart ยท Pull Request #898 ยท CLIMADA-project/climada_python

@ValentinGebhart

Changes proposed in this PR (new version of old PR #857):

  • addition of new function to compute local return period maps local_return_period and _loc_return_period for given hazard intensities, giving out a geodataframe
  • addition of function subplots_from_gdf in util.plot that plots different subplots from the columns of a gdf

Earlier parts removed (commented out in code) until agreement:

  • addition of new functions to store the hazard maps that are generated in the Hazard.plot_rp_intensity() function as raster files write_raster_local_exceedance_inten. (do we also need as netCDF files write_netcdf_local_exceedance_inten?)
  • addition of new function to store these local return period maps for given hazard intensities as raster files write_raster_local_return_periods or as NetCDF files write_netcdf_local_return_periods.

This PR fixes #854
(new version of old PR #857)

PR Author Checklist

PR Reviewer Checklist

ValentinGebhart

@ValentinGebhart

chahank

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updated clean version of this PR! I think this will be a quite useful set of methods.

I have 2 general points that itch me at the moment.

  • I am not sure it makes sense that the plotting and saving methods always call and recalculate the values. Maybe we should rather store the values as a new attribute. Do you know how much computation this costs for large datasets? @emanuel-schmid @peanutfun : what do you think, would it be ok if the method local_return_period adds an attribute to the hazard object?
  • I find the method local_return_period and the method _loc_return_period quite hard to read at the moment. I cannot say exactly why, but probably a mix of not always clear variable names, hidden changing of variables (like the return_periods), and quite some loops.

I thus suggest to a) wait for @emanuel-schmid and @peanutfun opinion about the attribute b) rewrite the code to avoid updating the return periods in the back.

@peanutfun

I am not sure it makes sense that the plotting and saving methods always call and recalculate the values. Maybe we should rather store the values as a new attribute. Do you know how much computation this costs for large datasets? @emanuel-schmid @peanutfun : what do you think, would it be ok if the method local_return_period adds an attribute to the hazard object?

I agree that the values should not be calculated in the plotting method. Instead, the user should insert the data to plot into the appropriate function.

I very much dislike adding new attributes after the class initialization. I would return the data to the user, who should take care of "bookkeeping" themselves. They can then plug the values into a plotting function, store them separately, etc.

@chahank

I agree that the values should not be calculated in the plotting method. Instead, the user should insert the data to plot into the appropriate function.

I very much dislike adding new attributes after the class initialization. I would return the data to the user, who should take care of "bookkeeping" themselves. They can then plug the values into a plotting function, store them separately, etc.

I also agree with you that adding new attributes after the class initialization is a bad habit. Here the point is that plotting the intensity return period on the map requires the centroids, and thus having it in the hazard object is convenient... not sure what the best solution is.

@ValentinGebhart

I very much dislike adding new attributes after the class initialization. I would return the data to the user, who should take care of "bookkeeping" themselves. They can then plug the values into a plotting function, store them separately, etc.

I also agree with you that adding new attributes after the class initialization is a bad habit. Here the point is that plotting the intensity return period on the map requires the centroids, and thus having it in the hazard object is convenient... not sure what the best solution is.

One way of returning data to the user without losing the centroids could be to return a geodataframe?

@peanutfun

@ValentinGebhart Great idea! Returning a GeoDataFrame with the return period information and the centroids geometry should contain all data necessary for the plot.

Users can then use the geopandas functions for plotting and writing. Maybe this solves all issues? ๐Ÿ˜…

@ValentinGebhart

@emanuel-schmid

@ValentinGebhart I agree the idea is not bad - but returning a GeoDataFrame is still somewhat fuzzy. A cleaner solution would be to return a tuple (intensity return period, centroids geometry) or a class with two attributes, intensity_return_period and geometry.
This seems cleaner. Ideally any requirement should be explicit in a method's signature and not hidden away in an object that could contain anything at all, like a dict or a dataframe.

@chahank

@emanuel-schmid I strongly disagree with this take as is. The proposed outputs (tuples or custom class) maybe would be cleaner from a programming point of view, but it would make the output mostly useless from a user perspective. I do not see the problem with giving back a geodataframe. Can you maybe explain why this would be a bad idea?

Besides, we already have a precedent with the method impact_at_reg that returns a DataFrame :D

@emanuel-schmid

Yeah. Agreed. Although I must stress that the principle holds - in this particular case a dataframe seems alrightish. ๐Ÿ˜

@chahank

Yeah. Agreed. Although I must stress that the principle holds - in this particular case a dataframe seems alrightish. ๐Ÿ˜

Indeed. I think there is a general question on how climada objects should return extra information. We had the questions about Centroids and things like pixel sizes, Impacts and regional values, or here Hazards and return period.

@ValentinGebhart

@ValentinGebhart

In the new commit, local_return_period outputs a GeoDataFrame and gdf_meta (named tuple with plotting metadata information) and I add a function subplots_from_gdf to util.plot that plots subplots from the different columns of a gdf (using the gdf_meta information). I thought it would be nice to have this plot function as general as possible (so it could plot return periods and exceedance frequencies and more), but if this is too messy, we can also write a more specific "plot_return_period" function.

After discussion with Lukas, I create the meta data of the gdf as a named tuple, such that it is clear which attributes gdf_meta must have to correctly plot.

I removed the write functions (and their tests) because we might want to generalize them to general gdf. I made a backup if we change our minds.

@ValentinGebhart

@ValentinGebhart

After the changes (see above comment) a new review is requested.

@ValentinGebhart

chahank

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this simplified version! It makes it much clearer.

There are a few main points to discuss:
a) is it computing what it should? The test looks incorrect to me, but I probably have mixed up something. Can you please enlighten me? Did it myself.
b) I am not sure whether the GdfMeta is a good idea.
c) I am not sure whether the chunking is working as intented (and whether it is needed at all.)

Comment on lines +460 to +461

threshold_intensities : np.array
Hazard intensities to consider.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the name threshold_intensities rather confusing, and the docstring does not really help. Can you be a bit more verbatim please?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about "test_intensities"? I agree, I'll update the docstrings

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm I think the word test would be quite confusing actually. Maybe just intensities? Or even avoid the plural and use intensity? Or keep threshold_intensities but with a clearer docstring.

Comment on lines +471 to +472

LOGGER.warning("The Hazard's frequency unit is %s and not %s which "
"most likely leads to incorrect results",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most likely leads to incorrect results does not tell me much. What is meant by that? Why would having different units of frequencies lead to incorrect results? What one gets is simply return periods in the unit of the frequency.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I could just use the inverse of the frequency attribute. I now added using a if elif statement that can read years, months, and weeks. I don't know how to make this more generic and easier because the frequency attribute is just a string and not a "physical" frequency unit that can be converted eg to years.

#create gdf meta data
gdf_meta = GdfMeta(
name = 'Return Period',
unit = 'Years',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the units of the frequency (or rather its inverse)?

Comment on lines +880 to +881

# Create GdfMeta class (as a named tuple) for GeoDataFrame meta data
GdfMeta = namedtuple('GdfMeta', ['name', 'unit', 'col_name', 'col_unit'])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm I am not sure whether this is a good idea. Maybe it is, maybe not.

@emanuel-schmid What do you think? Should we add some nametuple in the utility plot for a specific case of geopandas plotting?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add more details: this is called very generically GdfMeta, but refers to only one very specific data set of geodataframes.

col_name is already in the dataframe, col_unit and unit are rather confusing, and name is a bit unclear to me too.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested this to pass information to the plot that is not explicitly contained in the GeoDataFrame. GdfMeta can be renamed, of course. But I think a namedtuple makes more sense than a dict, because you can be sure which keys/attributes are defined. col_name specifies the name of the column to plot. But I agree that the names can be adjusted and made clearer.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of passing some meta data for the plotting. I am not sure that this is the perfect way this way :D At first I thought that GdfMeta some new property of GeoDataFrames. Also, is the current proposal general enough for a generic geopandas plotting method in CLIMADA? Or should it then be made more specific.

And, namedtuple makes sense indeed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ValentinGebhart, could we maybe find another way to put this information into the GeoDataFrame? @chahank and I had the following ideas when discussing this in person:

  • The col_name to plot can be a separate parameter, as users generally have to state which functions to plot in GeoDataFrame.plot() and similar functions
  • The col_unit can be added to the column name in parentheses or brackets. The plot function then reads this information is brackets/parentheses are found, otherwise it does not display a unit.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the naming is not optimal. The included information should be: what quantity (+ units) is represented by the values in the gdf, what quantity (+ units) is represented by the values in the column labels. The rows always correspond to centroids. Maybe instead of name we could use quantity or value_type?

Otherwise, the plotting function should be "generic", in the sense that it produces one subplot per column of a generic gdf, using the values to color the different points, and using the gdf_meta varibale for figure and axis label. Of course, it is still using the geo_im_from_array util function, and I'm sure there are other ways to plot a gdf with several columns.
Let me know what you think

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peanutfun sorry I didn't see your suggestion in time. We can definitely add the labels and units as parameter(s) to the plot function. Then the user has to know about what their gdf represents without us helping by providing the information, which I guess is ok but is a bit more work to the user. If we put the as defaults to the plot function, the plot function is not generic anymore.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically along the same lines as @peanutfun
I'd suggest to have two parameters instead of gdf_meta: GdfMeta,
colbar_name: str and title_subplots: lambda.

To get the same result as implemented now when calling

subplots_from_gdf(
    gdf=gdf, 
    gdf_meta=GdfMeta(name='Return Period', unit='Years', col_name='Threshold Intensity', col_unit='m/s')
)

you would

subplots_from_gdf(
    gdf=gdf, 
    colbar_name="Return Period (Years)", 
    title_subplots=lambda column: f"Threshold Intensity: {column} m/s",
)

Seems better readable and more flexible

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consequently, I'd let the local_return_period return a triple

Returns
-------
gdf : GeoDataFrame
label : str
    GeoDataFrame label, for reporting and plotting
column_label : lambda
    Column label, for reporting and plotting

in case the user is not 100% sure what they did and is happy to get this kind of conformation. (Why not?)

@chahank

In the new commit, local_return_period outputs a GeoDataFrame and gdf_meta (named tuple with plotting metadata information) and I add a function subplots_from_gdf to util.plot that plots subplots from the different columns of a gdf (using the gdf_meta information). I thought it would be nice to have this plot function as general as possible (so it could plot return periods and exceedance frequencies and more), but if this is too messy, we can also write a more specific "plot_return_period" function.

I like the generality! I only fear that the GdfMeta might make it not generic.

After discussion with Lukas, I create the meta data of the gdf as a named tuple, such that it is clear which attributes gdf_meta must have to correctly plot.

I removed the write functions (and their tests) because we might want to generalize them to general gdf. I made a backup if we change our minds.

Sounds good!

chahank

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +35 to +37

from climada.test import get_test_file

HAZ_TEST_TC :Path = get_test_file('test_tc_florida')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not part of this PR.

@ValentinGebhart

Good work! Please finalize the cosmetics: the linter complains about trailing white spaces to be removed and some indention that is incorrect. There is one file that is modified by mistake I think.

Also, don't forget to add your name to the list of authors: https://github.com/CLIMADA-project/climada_python/blob/feature/write_haz_rp_maps2/AUTHORS.md

Thanks for the review(s)! I now understood what the trailing white space means :) For me this looks ready to merge, let me know if there is something else to do.

@chahank

This is good, feel free to merge!

This was referenced

Jul 16, 2024