
6 glm for cue #7

Merged
hrlai merged 19 commits into main from 6-glm-for-cue on Mar 27, 2025

Conversation

@hrlai
Collaborator

@hrlai hrlai commented Feb 20, 2025

This PR addresses #6 (stems from ImperialCollegeLondon/virtual_ecosystem#746), which is to set up a generalised linear regression to estimate parameters for temperature-dependent microbial carbon use efficiency (CUE) that does not predict CUE out of bounds (i.e., it stays between 0 and 1).

There are two aspects to review: (1) model choice and (2) folder structure of this repo going forward.

Model choice:

  • @davidorme suggested that we might look at multiple approaches; any further thoughts? Just within a beta-distribution GLM, there are multiple link function choices (logit, probit and cloglog), but I think you had something else in mind?
  • @jacobcook1995 I'm not sure when we should consider this done and update the soil constants. Probably after a few discussions here?
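(Illustrative sketch only, not repo code.) All three candidate inverse link functions map an unbounded linear predictor onto (0, 1), which is exactly what keeps the fitted CUE in bounds:

```python
import math

# Inverse link functions for a beta-distribution GLM. Each maps the
# unbounded linear predictor eta onto the open interval (0, 1).

def inv_logit(eta):
    """Inverse of the logit link (the logistic / expit function)."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    """Inverse of the probit link: the standard normal CDF, via erf."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):
    """Inverse of the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))

# All three stay strictly inside (0, 1) for any finite eta
for f in (inv_logit, inv_probit, inv_cloglog):
    assert 0.0 < f(-3.0) < f(3.0) < 1.0
```

The choice between them mostly changes tail behaviour: cloglog is asymmetric, while logit and probit are symmetric around 0.5.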

Folder structure:

  • This is where more people could chime in
  • For data, I stored it in a data directory, but it wasn't committed because the csv files etc. are gitignored. How do we envision data download? Do we always include the URL in the code for others to manually download it...?
  • For code, I set it up like code/soil/cue for my case, so the code directory is a bit like the models directory in VE. Not sure if this is best.
  • @arne-exe I could use your review too; @jacobcook1995 told me that you're working on some RStudio notes for this repo? I'm writing from an RStudio user's perspective too.
  • @vgro I have created a bib directory to address Bibliography #1 and followed the same filename as virtual_ecosystem. The bibliography is supposed to include data sources and refs used in the html reports that I create with RMarkdown, which leads to:
  • Do we want to always include a report with the data analyses? Initially I was going to set up just an .R script, but decided to trial .Rmd to also generate a report at the end. The report is in html, intended for anyone to jump right in without having to worry about the details. But the html file would easily go >500 kb (which is the lintr limit), and mine was 900 kb due to figures, so I didn't commit it.

That's all off the top of my head for now. Looking forward to pinning down a folder structure for this repo :)

@hrlai hrlai linked an issue Feb 20, 2025 that may be closed by this pull request
@hrlai hrlai requested a review from arne-exe February 20, 2025 07:05
@jacobcook1995
Collaborator

jacobcook1995 commented Feb 20, 2025

  • @davidorme suggested that we might look at multiple approaches; any further thoughts? Just within a beta-distribution GLM, there are multiple link function choices (logit, probit and cloglog), but I think you had something else in mind?

I guess it does make sense to try different distributions on this side, but the issue would be that we don't currently have any way to utilise alternative functional implementations within the virtual_ecosystem code. Unless I figure out a way of implementing CUE as a generalised linear model, I would need a different function for each distribution.

@davidorme
Collaborator

We could do it cheaply with a config option that chose the function. This might be a thing where we hard coded two competing theoretical approaches rather than having to build a registry. But one option for now is fine!

Also, you don't need to go anywhere near the GLM itself. The parameterisation does that, and you just need to use an appropriate expression to turn the parameters into predictions.

@jacobcook1995
Collaborator

jacobcook1995 commented Feb 20, 2025

Ahh I meant that I would have to have the mechanistic model be written in such a way that it could accept arbitrary GLM parameters, which a mechanistic representation of a particular distribution wouldn't do (e.g. there's no parameterisation by which you can get y = mx + c to represent an exponential decay process)

But yes for now, I would probably only want to implement a specific distribution, because the alternative is not a scalable approach to soil model parameterisation

Collaborator

@davidorme davidorme left a comment


I've chatted to Rob and @jacobcook1995 about notebook formats. We think it is probably going to be better to use Jupyter notebooks rather than RMarkdown. The main reason is that these outputs render better on GitHub.

I've uploaded a couple of files. One is MyST Markdown; the only real advantage there is that GitHub knows it is Markdown and tries to render it. It doesn't know that .Rmd is Markdown. The .ipynb format is rendered properly though, and includes the binary data for the images. That makes them a bit bulky, but they are also much easier to read and review.

None of this is nailed down yet, but Jupyter is what we'd use for Python notebooks and having the same framework and proper rendering is a real advantage.

@hrlai
Collaborator Author

hrlai commented Feb 20, 2025

Ahh I meant that I would have to have the mechanistic model be written in such a way that it could accept arbitrary GLM parameters, which a mechanistic representation of a particular distribution wouldn't do (e.g. there's no parameterisation by which you can get y = mx + c to represent an exponential decay process)

Good point. In addition to the config option @davidorme suggested, a simpler (but less general?) approach is to add an (inverse) link function to a calculate_* function in the soil module. I don't know Python well, but here's a crude draft. Following the nomenclature from some R packages and using calculate_carbon_use_efficiency() as an example, there could be another argument, transform, which is an inverse link function to back-transform the output to the observation scale depending on the GLM:

return transform(reference_cue - cue_with_temperature * (soil_temp - cue_reference_temp))

And transform would be inverse-logit for logit link (my current implementation), exp for log-link, and we would also include an identity link (no transformation) for Gaussian / normal GLM.

We do not need to worry about generalising this if we only pick one GLM for now, so these are more like notes for the future.
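A rough Python sketch of what this could look like (the signature here is hypothetical, simplified from whatever calculate_carbon_use_efficiency() actually takes in virtual_ecosystem):

```python
import math

def inv_logit(x):
    """Inverse-logit (expit): back-transform from log-odds to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def identity(x):
    """Identity link: no transformation (Gaussian / normal GLM)."""
    return x

def calculate_carbon_use_efficiency(
    soil_temp, reference_cue, cue_with_temperature, cue_reference_temp,
    transform=inv_logit,
):
    """Linear predictor on the link scale, back-transformed by `transform`.

    With transform=inv_logit the result is guaranteed to lie in (0, 1);
    math.exp would give a log link, identity the current linear model.
    """
    eta = reference_cue - cue_with_temperature * (soil_temp - cue_reference_temp)
    return transform(eta)
```

With the default inverse-logit, the returned CUE stays strictly inside (0, 1) for any temperature; swapping transform switches the GLM family without touching the linear predictor.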

@hrlai
Collaborator Author

hrlai commented Feb 20, 2025

better to use Jupyter notebooks rather than RMarkdown. The main reason is that these outputs render better on GitHub.

Thanks @davidorme, I'm all for integration, so happy to favour Jupyter over RMarkdown. That said, would it be possible to do a bit of both by building an auto-conversion into the git workflow? There seems to be a jupytext --to ipynb example.Rmd command. Did you convert the RMarkdown manually or did you do it this way? The conversion I got looks a bit different from your .ipynb file.

It would be ideal if anyone on the data team like me could stick to the RStudio routine but still deliver Jupyter notebooks as an end product. But if this is too clunky then I'm still happy to switch over 😄

@jacobcook1995
Collaborator

Ahh I meant that I would have to have the mechanistic model be written in such a way that it could accept arbitrary GLM parameters, which a mechanistic representation of a particular distribution wouldn't do (e.g. there's no parameterisation by which you can get y = mx + c to represent an exponential decay process)

Good point. In addition to the config option @davidorme suggested, a simpler (but less general?) approach is to add an (inverse) link function to a calculate_* function in the soil module. I don't know Python well, but here's a crude draft. Following the nomenclature from some R packages and using calculate_carbon_use_efficiency() as an example, there could be another argument, transform, which is an inverse link function to back-transform the output to the observation scale depending on the GLM:

return transform(reference_cue - cue_with_temperature * (soil_temp - cue_reference_temp))

And transform would be inverse-logit for logit link (my current implementation), exp for log-link, and we would also include an identity link (no transformation) for Gaussian / normal GLM.

We do not need to worry about generalising this if we only pick one GLM for now, so these are more like notes for the future.

Yes, these are definitely useful notes for the future! For mainly academic-background reasons (only just learning what GLMs are 🫠), I'm pretty strongly in favour of trying to use solely mechanistically derived process representations for the soil model. But something Rob has talked about before is trying to implement alternative empirically derived representations, so down the road we could split the soil model into two (e.g. soil_mechanistic and soil_empirical) and compare performance (my guess is that which one works out best would be pretty metric dependent).

@jacobcook1995
Collaborator

jacobcook1995 commented Feb 21, 2025

On the data side I think using @qiao2019 is pretty much a no-brainer (vs data drawn from a single system at only 3 temperatures!). The lack of tropical study sites is obviously an issue, but I think this is something we are going to have to learn to generally accept (both the tropics and soils are generally understudied, in combo it's really bad).

Looking at the plotted distribution, I do prefer your model to the linear models shown. Staying within the range of 0.2 < CUE < 0.8 for a wide range of temperatures is good in my book, because values outside that have never felt biologically reasonable to me. Can cells really bring their respiration down to as little as 5% of total carbon intake, with the other 95% being assimilated as biomass? (Okay, there are literally data points saying that they can, but I remain sceptical.)

If we want to implement your model + parameterisation, I would need to change the functional form of calculate_carbon_use_efficiency(). I'm happy to do this but I need to check what form would fit with the statistical model used here, would it be something like

cue = 1 / (1 + exp(-(cue_ref + (temperature - reference_temperature))))?

(Obviously cue_ref would need to be renamed as it's no longer the carbon use efficiency at the reference temperature)

@hrlai
Collaborator Author

hrlai commented Feb 21, 2025

split the soil model into two (e.g. soil_mechanistic and soil_empirical)

Agreed, and now I can see more clearly where you were coming from. I think our recent discussion is quite relevant in this regard, i.e., on one hand we have Arrhenius equations that represent the mechanistic / process-based part, and on the other hand we have the CUE line, which is a curve-fitting "empirical" model.

At some point, it would be great for the data team to chat about our "default" approach / model choice:

  • empirical, curve-fitting, phenomenological, vs.
  • process-based, mechanistic

would it be something like

cue = 1 / (1 + exp(-(cue_ref + (temperature - reference_temperature))))?

(Obviously cue_ref would need to be renamed as it's no longer the carbon use efficiency at the reference temperature)

Yeap, it would be what Python calls expit and what R tends to call the inverse-logit or logistic function. But the slope is missing:

cue = 1 / (1 + exp(-(cue_ref + cue_T * (temperature - reference_temperature))))

For renaming there are a few options:

  • cue_ref could be the log-odds of CUE at the reference temperature
  • cue_T is the slope of the log-odds of CUE with respect to temperature...
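To make this concrete, here's a minimal illustration of the corrected expression (parameter names follow this thread, not necessarily the final soil-constant names):

```python
import math

def cue_expit(temperature, cue_ref, cue_t, reference_temperature):
    """CUE from a logit-link GLM.

    cue_ref is the log-odds of CUE at the reference temperature and
    cue_t the slope of the log-odds per degree of temperature.
    """
    eta = cue_ref + cue_t * (temperature - reference_temperature)
    return 1.0 / (1.0 + math.exp(-eta))  # expit / inverse-logit

# Bounded in (0, 1) regardless of temperature, unlike a plain linear fit
for t in (-40.0, 0.0, 25.0, 60.0):
    assert 0.0 < cue_expit(t, cue_ref=0.5, cue_t=-0.05, reference_temperature=15.0) < 1.0
```

Note that at temperature == reference_temperature this collapses to expit(cue_ref), which is why cue_ref is the log-odds (not the value) of CUE at the reference temperature.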

@arne-exe
Collaborator

Hey Hao Ran, below are some comments:

  • Folder structure: Please have a look at the "data repository structure" document on the VE notebook (go to VE Sharepoint > click Notebook > go to Data repository > click "data repository structure"). There are some useful notes there regarding the folder structure.
  • Script structure: I noticed your script has some metadata (title, author, etc.) and is well organised using different sections. It would be good to discuss with everyone whether we want a common structure that is similar across all of the scripts.
  • Data download: The abovementioned notebook doc has some good info on this. I also tried out a method where we can facilitate the data download within the R script (using RSelenium). If we were to include this in our scripts we could extract the datasets straight from Zenodo. I'll create a different issue on this with some more details.

@vgro
Collaborator

vgro commented Mar 3, 2025

Good question regarding the bibliography. First, the file looks fine to me :-)
Second, I am not sure what the best way forward is. For the virtual_ecosystem repo, we set up an online zotero account that we can all access and modify. The idea was that we always upload new references there and create a new bib file to replace the old one (to keep the same naming style etc). I am not sure how consistently this is still used and if it would make sense to give everyone access to the online account, or if we can cross-link references somehow like Jacob said (I am not familiar with this).

@davidorme
Collaborator

For the virtual_ecosystem repo, we set up an online zotero account that we can all access and modify. The idea was that we always upload new references there and create a new bib file to replace the old one (to keep the same naming style etc). I am not sure how consistently this is still used and if it would make sense to give everyone access to the online account, or if we can cross-link references somehow like Jacob said (I am not familiar with this).

We use the same for pyrealm. It is a bit convoluted because Zotero online doesn't export .bib files. So we have a group library (which is online) and you can open that locally to create updated .bib files within the repo and upload changes. There are a few local settings we need to synchronise (citation key style and also exclude some fields from the group library - notably the local file paths, which otherwise change constantly for no good reason).

@hrlai
Collaborator Author

hrlai commented Mar 12, 2025

Alright, I have added more YAML metadata to the Rmd script following the template. This is almost there, but when I try to convert the Rmd file using jupytext, the output ipynb files always fail the pre-commit checks (see my attempt in the usage_notes metadata). Any ideas, @davidorme?

@davidorme davidorme mentioned this pull request Mar 13, 2025
@davidorme
Collaborator

@hrlai I think we park the notebook conversion for this issue. There are a few things that make the Jupyter and RMarkdown formats not a straight swap (like bibliography handling for one thing!).

I think we probably do want to converge on one format - and from the developer side we'd probably prefer Jupyter - but we have tools to automate conversion so we can solve this later - getting some format of the code in is more important right now.

@hrlai
Collaborator Author

hrlai commented Mar 15, 2025

No worries @davidorme. I'm almost there, but one problem: gitignore currently prevents the data files (e.g., csv) from being pushed. How would you like us to proceed? @annarallings ?

@hrlai hrlai requested a review from annarallings March 15, 2025 07:35
@davidorme
Collaborator

davidorme commented Mar 15, 2025

@hrlai and @annarallings We don't really want any data files stored in the Git part of the repo. CSV files aren't so bad, but it is better if we treat all data files the same. So all data files should be archived via GLOBUS.

What I would ideally like to do (Anna and I have chatted about this but there is no decision yet) is that all data outputs live in the data directory. So a script might read in data from data/primary but then write out to data/derived. We probably want the subdirectories within data/derived to mirror the directories within analysis. The outputs section of the script metadata should then allow us to trace the source of all files within data/derived.
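As a sketch of that mirroring convention (the helper and directory names are hypothetical, just following the layout described above):

```python
from pathlib import Path

def derived_output_path(script_path, filename,
                        analysis_root="analysis", data_root="data"):
    """Map a script under analysis/... to the mirrored output directory
    under data/derived/... (hypothetical helper, not repo code)."""
    script = Path(script_path)
    # Subdirectory of the script relative to the analysis root, e.g. soil/cue
    rel = script.parent.relative_to(analysis_root)
    return Path(data_root) / "derived" / rel / filename

# e.g. a script at analysis/soil/cue/fit_glm.Rmd writes its output to
# data/derived/soil/cue/cue_predictions.csv
out = derived_output_path("analysis/soil/cue/fit_glm.Rmd", "cue_predictions.csv")
assert out.as_posix() == "data/derived/soil/cue/cue_predictions.csv"
```

That way the script metadata only needs to record filenames, and the provenance of everything under data/derived can be traced back mechanically.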

@nickwctan
Collaborator

@hrlai I think we park the notebook conversion for this issue. There are a few things that make the Jupyter and RMarkdown formats not a straight swap (like bibliography handling for one thing!).

I think we probably do want to converge on one format - and from the developer side we'd probably prefer Jupyter - but we have tools to automate conversion so we can solve this later - getting some format of the code in is more important right now.

So, just to clarify from what I understand here and from our last meeting, we would like to have the output in ipynb format? @davidorme, wouldn't it be easier to work directly in Jupyter in ipynb format? However, I only see an Rmd template on the repo.

I have been trying to convert the Rmd file into ipynb in R, but with no success. One of the trials, with the rmarkdown package:

# load libraries
library(rmarkdown)

# set directory
setwd("file_path")

# Specify the input and output file names
input_file <- "R_markdown_template.Rmd"  # Replace with your Rmd file
output_file <- "R_markdown_template.ipynb"  # Desired output file name

# Convert Rmd to ipynb
rmarkdown::render(input_file, output_format = "ipynb", output_file = output_file)

I got an error:
Error in eval(xfun::parse_only(name)) : object 'ipynb' not found

@hrlai
Collaborator Author

hrlai commented Mar 15, 2025

So a script output might read in data from data/primary but then write out to data/derived

Right, but just to clarify, any csv files that live inside these folders are still ignored? (that sounds good to me)

In that case, this PR is ready for a final review. @annarallings maybe you could be the one to do the approving review?

@davidorme
Collaborator

davidorme commented Mar 15, 2025

@sphinxdrake This probably isn't the right location for this conversation, but ipynb is not a rendering of an RMarkdown format. It's a Jupyter notebook format (like .Rmd is an R notebook format). There are two advantages:

  • We already use Jupyter notebooks widely in the project.
  • It so happens that the .ipynb format doesn't need a separate rendered file. It can store the outputs of running the code within the file and GitHub knows how to display it.

But...

  • We'd use that instead of RMarkdown so you don't get to use your existing RMarkdown skills.
  • It's then going to be easier to work within VSCode rather than RStudio.

There is a tool (jupytext) which can do the bulk of the conversion to and from these formats, so for the moment, use whatever file you like (R script, Rmd notebook, etc) and we'll look at pulling everything together down the line.

@annarallings
Collaborator

@hrlai and @annarallings We don't really want any data files stored in the Git part of the repo. CSV files aren't so bad, but it is better if we treat all data files the same. So all data files should be archived via GLOBUS.

What I would ideally like to do (Anna and I have chatted about this but there is no decision yet) is that all data outputs live in the data directory. So a script might read in data from data/primary but then write out to data/derived. We probably want the subdirectories within data/derived to mirror the directories within analysis. The outputs section of the script metadata should then allow us to trace the source of all files within data/derived.

I think we should store our csv files on Git for the time being as we develop our workflows. It will be really difficult to check work and run outputs unless the csvs are collocated in this repo. Our data is still relatively small and easy to manipulate. Unless there are privacy concerns for the data, I suggest we go ahead with primary and derived data in the repo.

@davidorme
Copy link
Copy Markdown
Collaborator

Unless there are privacy concerns for the data, I suggest we go ahead with primary and derived data in the repo.

CSV files are OK, but binary data files are really not (including XLSX). This can break a repository quite quickly. I completely get the short-term goal of getting data available, but we already have Excel files over in #7 which shouldn't go in here. This does mean we need GLOBUS up and running more cleanly.

In the use case where a PR only has CSV data, I'm OK with it for now, for small files, but it's a really slippery slope!

@hrlai
Collaborator Author

hrlai commented Mar 25, 2025

@annarallings and @davidorme , I removed csv from gitignore and have uploaded the original data as a csv.

This is not a 100% reproducible process because the original data were actually an Excel file, so I had to convert it to csv to push it. Previously, I read the xlsx file directly in R. I think the eventual solution will be something like GLOBUS. I agree that this is only a temporary fix.

I think this PR is ready for a final review? When you switch to this branch on your computer, you should have the input data now :)

Collaborator

@annarallings annarallings left a comment


LGTM

@hrlai hrlai merged commit 3c96206 into main Mar 27, 2025
2 checks passed


Development

Successfully merging this pull request may close these issues.

GLM for carbon use efficiency to avoid fitted line out of bound

7 participants