I guess it does make sense to try different distributions on this side, but the issue would be that we don't currently have any way to utilise alternative functional implementations within the |
We could do it cheaply with a config option that chooses the function. This might be a case where we hard-code two competing theoretical approaches rather than having to build a registry. But one option for now is fine! Also, you don't need to go anywhere near the GLM itself. The parameterisation does that, and you just need to use an appropriate expression to turn the parameters into predictions.
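A minimal sketch of the config-option idea, assuming two hard-coded forms (a plain linear predictor and its inverse-logit transform). The function name `predict_cue`, the option values `"linear"` and `"logistic"`, and the parameter keys `intercept` and `slope` are all hypothetical, chosen only for illustration:

```python
import math


def predict_cue(temp, params, cue_function="logistic"):
    """Turn fitted parameters into a CUE prediction at temperature `temp`.

    `cue_function` stands in for a config option; both forms are hard-coded
    here instead of being looked up in a registry (names are hypothetical).
    """
    linear_predictor = params["intercept"] + params["slope"] * temp
    if cue_function == "linear":
        # Linear form: can leave the [0, 1] range.
        return linear_predictor
    if cue_function == "logistic":
        # Inverse-logit form: always strictly between 0 and 1.
        return 1.0 / (1.0 + math.exp(-linear_predictor))
    raise ValueError(f"Unknown cue_function: {cue_function!r}")


# Hypothetical config and parameter values, for illustration only.
config = {"cue_function": "logistic"}
cue = predict_cue(25.0, {"intercept": 0.5, "slope": -0.02}, config["cue_function"])
```

The model code only needs the fitted parameters and the chosen expression; it never touches the GLM fitting itself.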
Ahh I meant that I would have to have the mechanistic model be written in such a way that it could accept arbitrary GLM parameters, which a mechanistic representation of a particular distribution wouldn't do (e.g. there's no parameterisation by which you can get But yes for now, I would probably only want to implement a specific distribution, because the alternative is not a scalable approach to soil model parameterisation |
davidorme
left a comment
I've chatted to Rob and @jacobcook1995 about notebook formats. We think it is probably going to be better to use Jupyter notebooks rather than RMarkdown. The main reason is that these outputs render better on GitHub.
I've uploaded a couple of files. One is MyST Markdown - the only real advantage there is that GitHub knows it is Markdown and tries to render it. It doesn't know that .Rmd is Markdown. The .ipynb format is rendered properly though and includes the binary data for the images. That makes the files a bit bulky but they are also much easier to read and review.
None of this is nailed down yet, but Jupyter is what we'd use for Python notebooks and having the same framework and proper rendering is a real advantage.
Good point. In addition to the config option @davidorme suggested, a simpler (but less general?) approach is to add a (inverse) link function to a And We do not need to worry about generalising this if we only pick one GLM for now, so these are more like notes for the future. |
Thanks @davidorme, I'm all for integration so happy to favour Jupyter over RMarkdown. That said, would it be possible to do a bit of both by building an auto-conversion into git's workflow? There seems to be It would be ideal if anyone on the data team like me could stick to the RStudio routine but still deliver Jupyter notebooks as an end product. But if this is too clunky then I'm still happy to switch over 😄
Yes, these are definitely useful notes for the future! For mainly academic background reasons (only just learning what GLMs are 🫠), I'm pretty strongly in favour of trying to use solely mechanistically derived process representations for the soil model. But something Rob has talked about before is trying to implement alternative empirically derived representations, so down the road we could split the soil model into two (e.g.
On the data side I think using @qiao2019 is pretty much a no-brainer (vs data drawn from a single system at only 3 temperatures!). The lack of tropical study sites is obviously an issue, but I think this is something we are going to have to learn to generally accept (both the tropics and soils are generally understudied, in combo it's really bad). Looking at the plotted distribution I do prefer your model to the linear models shown. Staying with the range of If we want to implement your model + parameterisation, I would need to change the functional form of
(Obviously |
Agreed, and now I can more clearly see where you were coming from. I think our recent discussion is quite relevant in this regard, i.e., on one hand we have Arrhenius equations that represent the mechanistic / process-based part, and on the other hand we have the CUE line, which is a curve-fitting "empirical" model. At some point, it would be great for the data team to chat about our "default" approach / model choice:
Yeap, it would be what Python calls expit and what R tends to call inverse-logit or logistic. But the slope is missing: For renaming there are a few options:
Hey Hao Ran, some comments below:
Good question regarding the bibliography. First, the file looks fine to me :-) |
We use the same for |
Alright, I have added more YAML metadata to the Rmd script following the template. This is almost there, but when I tried to convert the Rmd file using |
@hrlai I think we should park the notebook conversion for this issue. There are a few things that make the Jupyter and RMarkdown formats not a straight swap (like bibliography handling for one thing!). I think we probably do want to converge on one format - and from the developer side we'd probably prefer Jupyter - but we have tools to automate conversion so we can solve this later - getting the code in, in some format, is more important right now.
No worries @davidorme. I'm almost there, with one problem: gitignore currently prevents the data files (e.g., csv) from being pushed. How would you like us to proceed? @annarallings ?
@hrlai and @annarallings We don't really want any data files stored in the Git part of the repo - CSV files aren't so bad but it is better if we treat all data files the same. So all data files should be archived via GLOBUS. What I would ideally like to do - Anna and I have chatted about this but there is no decision yet - is that all data outputs live in the
So, just to clarify what I understand from here and our last meeting: we would like to have the output in ipynb format? @davidorme, wouldn't it be easier to work directly in Jupyter in ipynb format? However, I only see an Rmd template in the repo. I have been trying to convert the Rmd file into ipynb in R but with no success. One of the trials with markdown package: I got an error:
Right, but just to clarify, any csv files that live inside these folders are still ignored? (That sounds good to me.) In that case, this PR is ready for a final review. @annarallings maybe you could be the one to do the approving review?
@sphinxdrake This probably isn't the right location for this conversation, but
But...
There is a tool ( |
I think we should store our csv files on Git for the time being as we develop our workflows. It will be really difficult to check work and run outputs unless the csvs are collocated in this repo. Our data is still relatively small and easy to manipulate. Unless there are privacy concerns for the data, I suggest we go ahead with primary and derived data in the repo. |
CSV files are OK - but binary data files are really not (including XLSX). This can break a repository really quite quickly. I completely get the short term goal of getting data available, but we already have Excel files over in #7 which shouldn't go in here. This does need the GLOBUS up and running more cleanly. In the use case that a PR only has CSV data, I'm ok with it for now, for small files, but it's a really slippery path! |
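To make the interim rule concrete, here is a rough sketch of a check that could be run over files before committing. The function name `allowed_in_git`, the suffix list and the 1 MB size threshold are all assumptions made for illustration; the thread only says "small files" and does not fix a limit:

```python
from pathlib import Path

# Assumed list of binary data formats to keep out of Git (XLSX is the one named above).
BINARY_DATA_SUFFIXES = {".xlsx", ".xls"}
# Hypothetical cut-off for a "small" CSV; no figure has been agreed.
MAX_CSV_BYTES = 1_000_000


def allowed_in_git(path, size_bytes):
    """Return True if a file fits the interim rule: small CSVs yes, binary data no."""
    suffix = Path(path).suffix.lower()
    if suffix in BINARY_DATA_SUFFIXES:
        return False
    if suffix == ".csv":
        return size_bytes <= MAX_CSV_BYTES
    # Non-data files (code, notebooks, docs) are outside this rule.
    return True
```

Anything that fails the check would go to GLOBUS instead of the Git history.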
…r testing now (related to #20)
@annarallings and @davidorme , I removed csv from gitignore and have uploaded the original data as a csv. This is not a 100% reproducible process because the original data were actually an Excel file, so I had to convert it to csv to push it. Previously, I have directly read the xlsx file in R. I think the eventual solution will be something like Globus. I agree that this will only be a temporary fix. I think this PR is ready for a final review? When you switch to this branch on your computer, you should have the input data now :) |
This PR addresses #6 (stems from ImperialCollegeLondon/virtual_ecosystem#746), which is to set up a generalised linear regression to estimate parameters for temperature-dependent microbial carbon use efficiency (CUE) that does not predict CUE out of bounds (i.e. predictions stay between 0 and 1).
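As a hedged illustration of why such a model stays in bounds, assume a logit link and a single temperature covariate (the symbols $\beta_0$, $\beta_1$ and $T$ are notation introduced here, not taken from the PR):

$$
\mathrm{CUE}(T) = \operatorname{logit}^{-1}(\beta_0 + \beta_1 T) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 T)}}
$$

Because $e^{-(\beta_0 + \beta_1 T)} > 0$ for any finite $\beta_0$, $\beta_1$ and $T$, the denominator is always greater than 1, so $0 < \mathrm{CUE}(T) < 1$. A plain linear model $\beta_0 + \beta_1 T$ has no such guarantee.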
There are two aspects to review: (1) model choice and (2) folder structure of this repo going forward.
Model choice:
Folder structure:
- `data` directory, but it wasn't committed because the csv files etc. are gitignored. How do we envision data download? Do we always include the URL in the code for others to manually download it...?
- `code/soil/cue` for my case, so the `code` directory is a bit like the `models` directory in VE. Not sure if this is best.
- `bib` directory to address Bibliography #1 and followed the same filename as `virtual_ecosystem`. The bibliography is supposed to include data sources and refs used in html reports that I create with RMarkdown, which leads to:
- `.R` script, but decided to trial with `.Rmd` to also generate a report at the end. The report is in html, intended for anyone to jump right in without having to worry about the details. But the html file would easily go >500 kb (which is the lintr limit), and mine was 900 kb due to figures, so I didn't commit it.

That's all for now off the top of my head. Looking forward to pinning down a folder structure for this repo :)