
6 glm for cue #7

Merged
hrlai merged 19 commits into main from 6-glm-for-cue on Mar 27, 2025

Conversation

@hrlai
Collaborator

@hrlai hrlai commented Feb 20, 2025

This PR addresses #6 (stems from ImperialCollegeLondon/virtual_ecosystem#746), which is to set up a generalised linear regression to estimate parameters for temperature-dependent microbial carbon use efficiency (CUE) that does not predict CUE out of bounds (i.e., it stays between 0 and 1).

There are two aspects to review: (1) model choice and (2) folder structure of this repo going forward.

Model choice:

  • @davidorme suggested that we might look at multiple approaches; any further thoughts? Just within a beta-distribution GLM, there are multiple link function choices (logit, probit and cloglog), but I think you had something else in mind?
  • @jacobcook1995 I'm not sure when we should consider this done and update the soil constants. Probably after a few discussions here?
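(Illustrative sketch only, not repo code.) All three candidate inverse link functions map an unbounded linear predictor onto (0, 1), which is exactly what keeps the fitted CUE in bounds:

```python
import math

# Inverse link functions for a beta-distribution GLM. Each maps the
# unbounded linear predictor eta onto the open interval (0, 1).

def inv_logit(eta):
    """Inverse of the logit link (the logistic / expit function)."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    """Inverse of the probit link: the standard normal CDF, via erf."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):
    """Inverse of the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))

# All three stay strictly inside (0, 1) for any finite eta
for f in (inv_logit, inv_probit, inv_cloglog):
    assert 0.0 < f(-3.0) < f(3.0) < 1.0
```

The choice between them mostly changes tail behaviour: cloglog is asymmetric, while logit and probit are symmetric around 0.5.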

Folder structure:

  • This is where more people could chime in
  • For data, I stored it in a data directory, but it wasn't committed because the csv files etc. are gitignored. How do we envision data download? Do we always include the URL in the code for others to manually download it...?
  • For code, I set it up like code/soil/cue for my case, so the code directory is a bit like the models directory in VE. Not sure if this is best.
  • @arne-exe I could use your review too; @jacobcook1995 told me that you're working on some RStudio notes for this repo? I'm writing from an RStudio user's perspective too.
  • @vgro I have created a bib directory to address Bibliography #1 and followed the same filename as virtual_ecosystem. The bibliography is supposed to include data sources and refs used in the html reports that I create with RMarkdown, which leads to:
  • Do we want to always include a report with the data analyses? Initially I was going to set up just an .R script, but decided to trial .Rmd to also generate a report at the end. The report is in html, intended for anyone to jump right in without having to worry about the details. But the html file would easily go >500 kb (which is the lintr limit), and mine was 900 kb due to figures, so I didn't commit it.

That's all off the top of my head for now. Looking forward to pinning down a folder structure for this repo :)

@hrlai hrlai linked an issue Feb 20, 2025 that may be closed by this pull request
@hrlai hrlai requested a review from arne-exe February 20, 2025 07:05
@jacobcook1995
Collaborator

jacobcook1995 commented Feb 20, 2025

  • @davidorme suggested that we might look at multiple approaches; any further thoughts? Just within a beta-distribution GLM, there are multiple link function choices (logit, probit and cloglog), but I think you had something else in mind?

I guess it does make sense to try different distributions on this side, but the issue would be that we don't currently have any way to utilise alternative functional implementations within the virtual_ecosystem code. Unless I figure out a way of implementing CUE as a generalised linear model, I would need a different function for each distribution.

@davidorme
Collaborator

We could do it cheaply with a config option that chose the function. This might be a thing where we hard coded two competing theoretical approaches rather than having to build a registry. But one option for now is fine!

Also, you don't need to go anywhere near the GLM itself. The parameterisation does that, and you just need to use an appropriate expression to turn the parameters into predictions.

@jacobcook1995
Collaborator

jacobcook1995 commented Feb 20, 2025

Ahh I meant that I would have to have the mechanistic model be written in such a way that it could accept arbitrary GLM parameters, which a mechanistic representation of a particular distribution wouldn't do (e.g. there's no parameterisation by which you can get y = mx + c to represent an exponential decay process)

But yes for now, I would probably only want to implement a specific distribution, because the alternative is not a scalable approach to soil model parameterisation

Collaborator

@davidorme davidorme left a comment


I've chatted to Rob and @jacobcook1995 about notebook formats. We think it is probably going to be better to use Jupyter notebooks rather than RMarkdown. The main reason is that these outputs render better on GitHub.

I've uploaded a couple of files. One is MyST Markdown; the only real advantage there is that GitHub knows it is Markdown and tries to render it. It doesn't know that .Rmd is Markdown. The .ipynb format is rendered properly though, and includes the binary data for the images. That makes them a bit bulky, but they are also much easier to read and review.

None of this is nailed down yet, but Jupyter is what we'd use for Python notebooks and having the same framework and proper rendering is a real advantage.

@hrlai
Collaborator Author

hrlai commented Feb 20, 2025

Ahh I meant that I would have to have the mechanistic model be written in such a way that it could accept arbitrary GLM parameters, which a mechanistic representation of a particular distribution wouldn't do (e.g. there's no parameterisation by which you can get y = mx + c to represent an exponential decay process)

Good point. In addition to the config option @davidorme suggested, a simpler (but less general?) approach is to add an (inverse) link function to a calculate_* function in the soil module. I don't know Python well, but here's a crude draft. Following the nomenclature from some R packages and using calculate_carbon_use_efficiency() as an example, there could be another argument, transform, which is an inverse link function to back-transform the output to the observation scale depending on the GLM:

return transform(reference_cue - cue_with_temperature * (soil_temp - cue_reference_temp))

And transform would be inverse-logit for logit link (my current implementation), exp for log-link, and we would also include an identity link (no transformation) for Gaussian / normal GLM.

We do not need to worry about generalising this if we only pick one GLM for now, so these are more like notes for the future.
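A rough Python sketch of what this could look like (the signature here is hypothetical, simplified from whatever calculate_carbon_use_efficiency() actually takes in virtual_ecosystem):

```python
import math

def inv_logit(x):
    """Inverse-logit (expit): back-transform from log-odds to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def identity(x):
    """Identity link: no transformation (Gaussian / normal GLM)."""
    return x

def calculate_carbon_use_efficiency(
    soil_temp, reference_cue, cue_with_temperature, cue_reference_temp,
    transform=inv_logit,
):
    """Linear predictor on the link scale, back-transformed by `transform`.

    With transform=inv_logit the result is guaranteed to lie in (0, 1);
    math.exp would give a log link, identity the current linear model.
    """
    eta = reference_cue - cue_with_temperature * (soil_temp - cue_reference_temp)
    return transform(eta)
```

With the default inverse-logit, the returned CUE stays strictly inside (0, 1) for any temperature; swapping transform switches the GLM family without touching the linear predictor.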

@hrlai
Collaborator Author

hrlai commented Feb 20, 2025

better to use Jupyter notebooks rather than RMarkdown. The main reason is that these outputs render better on GitHub.

Thanks @davidorme, I'm all for integration, so happy to favour Jupyter over RMarkdown. That said, would it be possible to do a bit of both by building an auto-conversion into the git workflow? There seems to be a jupytext --to ipynb example.Rmd command. Did you convert the RMarkdown manually or did you do it this way? The conversion I got looks a bit different from your .ipynb file.

It would be ideal if anyone on the data team like me could stick to the RStudio routine but still deliver Jupyter notebooks as an end product. But if this is too clunky then I'm still happy to switch over 😄

@jacobcook1995
Collaborator

Ahh I meant that I would have to have the mechanistic model be written in such a way that it could accept arbitrary GLM parameters, which a mechanistic representation of a particular distribution wouldn't do (e.g. there's no parameterisation by which you can get y = mx + c to represent an exponential decay process)

Good point. In addition to the config option @davidorme suggested, a simpler (but less general?) approach is to add an (inverse) link function to a calculate_* function in the soil module. I don't know Python well, but here's a crude draft. Following the nomenclature from some R packages and using calculate_carbon_use_efficiency() as an example, there could be another argument, transform, which is an inverse link function to back-transform the output to the observation scale depending on the GLM:

return transform(reference_cue - cue_with_temperature * (soil_temp - cue_reference_temp))

And transform would be inverse-logit for logit link (my current implementation), exp for log-link, and we would also include an identity link (no transformation) for Gaussian / normal GLM.

We do not need to worry about generalising this if we only pick one GLM for now, so these are more like notes for the future.

Yes, these are definitely useful notes for the future! For mainly academic-background reasons (only just learning what GLMs are 🫠), I'm pretty strongly in favour of trying to use solely mechanistically derived process representations for the soil model. But something Rob has talked about before is trying to implement alternative empirically derived representations, so down the road we could split the soil model into two (e.g. soil_mechanistic and soil_empirical) and compare performance (my guess is that which one works out best would be pretty metric dependent).

@jacobcook1995
Collaborator

jacobcook1995 commented Feb 21, 2025

On the data side I think using @qiao2019 is pretty much a no-brainer (vs data drawn from a single system at only 3 temperatures!). The lack of tropical study sites is obviously an issue, but I think this is something we are going to have to learn to generally accept (both the tropics and soils are generally understudied, in combo it's really bad).

Looking at the plotted distribution, I do prefer your model to the linear models shown. Staying within the range of 0.2 < CUE < 0.8 for a wide range of temperatures is good in my book, because values outside that have never felt biologically reasonable to me. Can cells really bring their respiration down to as little as 5% of total carbon intake, with the other 95% being assimilated as biomass? (Okay, there are literally data points saying that they can, but I remain sceptical.)

If we want to implement your model + parameterisation, I would need to change the functional form of calculate_carbon_use_efficiency(). I'm happy to do this but I need to check what form would fit with the statistical model used here, would it be something like

cue = 1 / (1 + exp(-(cue_ref + (temperature - reference_temperature))))?

(Obviously cue_ref would need to be renamed as it's no longer the carbon use efficiency at the reference temperature)

@hrlai
Collaborator Author

hrlai commented Feb 21, 2025

split the soil model into two (e.g. soil_mechanistic and soil_empirical)

Agreed, and now I can see more clearly where you were coming from. I think our recent discussion is quite relevant in this regard, i.e., on one hand we have Arrhenius equations that represent the mechanistic / process-based part, and on the other hand we have the CUE line, which is a curve-fitting "empirical" model.

At some point, it would be great for the data team to chat about our "default" approach / model choice:

  • empirical, curve-fitting, phenomenological, vs.
  • process-based, mechanistic

would it be something like

cue = 1 / (1 + exp(-(cue_ref + (temperature - reference_temperature))))?

(Obviously cue_ref would need to be renamed as it's no longer the carbon use efficiency at the reference temperature)

Yeap, it would be what Python calls expit and what R tends to call the inverse-logit or logistic function. But the slope is missing:

cue = 1 / (1 + exp(-(cue_ref + cue_T * (temperature - reference_temperature))))

For renaming there are a few options:

  • cue_ref could be the log-odds of CUE at the reference temperature
  • cue_T is the slope of the log-odds of CUE with respect to temperature...
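To make this concrete, here's a minimal illustration of the corrected expression (parameter names follow this thread, not necessarily the final soil-constant names):

```python
import math

def cue_expit(temperature, cue_ref, cue_t, reference_temperature):
    """CUE from a logit-link GLM.

    cue_ref is the log-odds of CUE at the reference temperature and
    cue_t the slope of the log-odds per degree of temperature.
    """
    eta = cue_ref + cue_t * (temperature - reference_temperature)
    return 1.0 / (1.0 + math.exp(-eta))  # expit / inverse-logit

# Bounded in (0, 1) regardless of temperature, unlike a plain linear fit
for t in (-40.0, 0.0, 25.0, 60.0):
    assert 0.0 < cue_expit(t, cue_ref=0.5, cue_t=-0.05, reference_temperature=15.0) < 1.0
```

Note that at temperature == reference_temperature this collapses to expit(cue_ref), which is why cue_ref is the log-odds (not the value) of CUE at the reference temperature.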

@arne-exe
Collaborator

Hey Hao Ran, below are some comments:

  • Folder structure: Please have a look at the "data repository structure" document on the VE notebook (go to VE Sharepoint > click Notebook > go to Data repository > click "data repository structure"). There are some useful notes there regarding the folder structure.
  • Script structure: I noticed your script has some metadata (title, author, etc.) and is well organised using different sections. It would be good to discuss with everyone whether we want a common structure that is similar across all of the scripts.
  • Data download: The abovementioned notebook doc has some good info on this. I also tried out a method where we can facilitate the data download within the R script (using RSelenium). If we were to include this in our scripts we could extract the datasets straight from Zenodo. I'll create a different issue on this with some more details.

@vgro
Collaborator

vgro commented Mar 3, 2025

Good question regarding the bibliography. First, the file looks fine to me :-)
Second, I am not sure what the best way forward is. For the virtual_ecosystem repo, we set up an online zotero account that we can all access and modify. The idea was that we always upload new references there and create a new bib file to replace the old one (to keep the same naming style etc). I am not sure how consistently this is still used and if it would make sense to give everyone access to the online account, or if we can cross-link references somehow like Jacob said (I am not familiar with this).

@davidorme
Collaborator

For the virtual_ecosystem repo, we set up an online zotero account that we can all access and modify. The idea was that we always upload new references there and create a new bib file to replace the old one (to keep the same naming style etc). I am not sure how consistently this is still used and if it would make sense to give everyone access to the online account, or if we can cross-link references somehow like Jacob said (I am not familiar with this).

We use the same for pyrealm. It is a bit convoluted because Zotero online doesn't export .bib files. So we have a group library (which is online) and you can open that locally to create updated .bib files within the repo and upload changes. There are a few local settings we need to synchronise (citation key style and also exclude some fields from the group library - notably the local file paths, which otherwise change constantly for no good reason).

@hrlai
Collaborator Author

hrlai commented Mar 12, 2025

Alright, I have added more YAML metadata to the Rmd script following the template. This is almost there, but when I try to convert the Rmd file using jupytext, the output ipynb files always fail the pre-commit checks (see my attempt in the usage_notes metadata). Any ideas, @davidorme?

@davidorme davidorme mentioned this pull request Mar 13, 2025
@davidorme
Collaborator

@hrlai I think we park the notebook conversion for this issue. There are a few things that make the Jupyter and RMarkdown formats not a straight swap (like bibliography handling for one thing!).

I think we probably do want to converge on one format - and from the developer side we'd probably prefer Jupyter - but we have tools to automate conversion so we can solve this later - getting some format of the code in is more important right now.

@hrlai
Collaborator Author

hrlai commented Mar 15, 2025

No worries @davidorme. I'm almost there, but one problem: gitignore currently prevents the data files (e.g., csv) from being pushed. How would you like us to proceed? @annarallings ?

@hrlai hrlai requested a review from annarallings March 15, 2025 07:35
@davidorme
Collaborator

davidorme commented Mar 15, 2025

@hrlai and @annarallings We don't really want any data files stored in the Git part of the repo. CSV files aren't so bad, but it is better if we treat all data files the same. So all data files should be archived via GLOBUS.

What I would ideally like to do (Anna and I have chatted about this but there is no decision yet) is that all data outputs live in the data directory. So a script might read in data from data/primary but then write out to data/derived. We probably want the subdirectories within data/derived to mirror the directories within analysis. The outputs section of the script metadata should then allow us to trace the source of all files within data/derived.
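As a sketch of that mirroring convention (the helper and directory names are hypothetical, just following the layout described above):

```python
from pathlib import Path

def derived_output_path(script_path, filename,
                        analysis_root="analysis", data_root="data"):
    """Map a script under analysis/... to the mirrored output directory
    under data/derived/... (hypothetical helper, not repo code)."""
    script = Path(script_path)
    # Subdirectory of the script relative to the analysis root, e.g. soil/cue
    rel = script.parent.relative_to(analysis_root)
    return Path(data_root) / "derived" / rel / filename

# e.g. a script at analysis/soil/cue/fit_glm.Rmd writes its output to
# data/derived/soil/cue/cue_predictions.csv
out = derived_output_path("analysis/soil/cue/fit_glm.Rmd", "cue_predictions.csv")
assert out.as_posix() == "data/derived/soil/cue/cue_predictions.csv"
```

That way the script metadata only needs to record filenames, and the provenance of everything under data/derived can be traced back mechanically.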

@nickwctan
Collaborator

@hrlai I think we park the notebook conversion for this issue. There are a few things that make the Jupyter and RMarkdown formats not a straight swap (like bibliography handling for one thing!).

I think we probably do want to converge on one format - and from the developer side we'd probably prefer Jupyter - but we have tools to automate conversion so we can solve this later - getting some format of the code in is more important right now.

So, just to clarify from what I understand here and from our last meeting, we would like to have the output in ipynb format? @davidorme, wouldn't it be easier to work directly in Jupyter in ipynb format? However, I only see an Rmd template on the repo.

I have been trying to convert the Rmd file into ipynb in R, but with no success. One of the trials, with the rmarkdown package:

# load libraries
library(rmarkdown)

# set directory
setwd("file_path")

# Specify the input and output file names
input_file <- "R_markdown_template.Rmd"  # Replace with your Rmd file
output_file <- "R_markdown_template.ipynb"  # Desired output file name

# Convert Rmd to ipynb
rmarkdown::render(input_file, output_format = "ipynb", output_file = output_file)

I got an error:
Error in eval(xfun::parse_only(name)) : object 'ipynb' not found

@hrlai
Collaborator Author

hrlai commented Mar 15, 2025

So a script output might read in data from data/primary but then write out to data/derived

Right, but just to clarify, any csv files that live inside these folders are still ignored? (that sounds good to me)

In that case, this PR is ready for a final review. @annarallings maybe you could be the one to do the approving review?

@davidorme
Collaborator

davidorme commented Mar 15, 2025

@sphinxdrake This probably isn't the right location for this conversation, but ipynb is not a rendering of an RMarkdown format. It's a Jupyter notebook format (like .Rmd is an R notebook format). There are two advantages:

  • We already use Jupyter notebooks widely in the project.
  • It so happens that the .ipynb format doesn't need a separate rendered file. It can store the outputs of running the code within the file and GitHub knows how to display it.

But...

  • We'd use that instead of RMarkdown so you don't get to use your existing RMarkdown skills.
  • It's then going to be easier to work within VSCode rather than RStudio.

There is a tool (jupytext) which can do the bulk of the conversion to and from these formats, so for the moment, use whatever file you like (R script, Rmd notebook, etc) and we'll look at pulling everything together down the line.

@annarallings
Collaborator

@hrlai and @annarallings We don't really want any data files stored in the Git part of the repo. CSV files aren't so bad, but it is better if we treat all data files the same. So all data files should be archived via GLOBUS.

What I would ideally like to do (Anna and I have chatted about this but there is no decision yet) is that all data outputs live in the data directory. So a script might read in data from data/primary but then write out to data/derived. We probably want the subdirectories within data/derived to mirror the directories within analysis. The outputs section of the script metadata should then allow us to trace the source of all files within data/derived.

I think we should store our csv files on Git for the time being as we develop our workflows. It will be really difficult to check work and run outputs unless the csvs are collocated in this repo. Our data is still relatively small and easy to manipulate. Unless there are privacy concerns for the data, I suggest we go ahead with primary and derived data in the repo.

@davidorme
Copy link
Copy Markdown
Collaborator

Unless there are privacy concerns for the data, I suggest we go ahead with primary and derived data in the repo.

CSV files are OK, but binary data files are really not (including XLSX). This can break a repository quite quickly. I completely get the short-term goal of getting data available, but we already have Excel files over in #7 which shouldn't go in here. This does mean we need GLOBUS up and running more cleanly.

In the use case where a PR only has CSV data, I'm OK with it for now, for small files, but it's a really slippery slope!

@hrlai
Collaborator Author

hrlai commented Mar 25, 2025

@annarallings and @davidorme , I removed csv from gitignore and have uploaded the original data as a csv.

This is not a 100% reproducible process because the original data were actually an Excel file, so I had to convert it to csv to push it. Previously, I read the xlsx file directly in R. I think the eventual solution will be something like GLOBUS. I agree that this is only a temporary fix.

I think this PR is ready for a final review? When you switch to this branch on your computer, you should have the input data now :)

Collaborator

@annarallings annarallings left a comment


LGTM

@hrlai hrlai merged commit 3c96206 into main Mar 27, 2025
2 checks passed


Development

Successfully merging this pull request may close these issues.

GLM for carbon use efficiency to avoid fitted line out of bound

7 participants