Conversation

@anna-parker
Contributor

@anna-parker anna-parker commented Nov 28, 2025

resolves #5572

Follow up PR with fastaId to fastaIds change: #5583

Screenshot

fastaIds field added to template for CCHF:
image
image
Not added for ebola or EVs:
image
image

PR Checklist

  • Make PR with same changes in PPX -> after docs are approved
  • [x] Add fastaIds to commonMetadata fields in config? -> fastaIds field does not exist for single segmented organisms and thus should not be added here as this breaks Loculus
  • Ensure metadata template downloads are correct

🚀 Preview: https://multipath-docs.loculus.org


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Member

@theosanderson theosanderson left a comment


Burning the midnight oil! Looks good but a few suggestions. Basically we should distinguish fastaId which is the name of a Loculus field, and "the fasta header" which is the thing it is used to link to (or maybe the "ID portion of the FASTA header" or the "FASTA entry's ID", but not the fastaId).

(I also wonder if we should call the column fastaIds since it will only be used when we expect the possibility of multiple entries?)

@anna-parker
Contributor Author

(I also wonder if we should call the column fastaIds since it will only be used when we expect the possibility of multiple entries?)

Good point! @corneliusroemer do you have any thoughts? I will otherwise make a followup PR into our "multi path branch" with that update

@anna-parker anna-parker changed the title update docs feat(docs, website): multi path - update submission docs and templates with correct fields Nov 28, 2025
@anna-parker
Contributor Author

@corneliusroemer and @theosanderson I tried to take both your suggestions into account - let me know what you think!

… (non-head) (#5584)

🚀 Preview: Add `preview` label to enable
@anna-parker anna-parker mentioned this pull request Dec 1, 2025
3 tasks
Member

@theosanderson theosanderson left a comment


lgtm

resolves #5570

@theosanderson I am seeing a failing integration test but I think it is
just flaky, let me know if you think it is an actual bug!:
#5426

### PR Checklist
- [ ] All necessary documentation has been adapted.
- [ ] The implemented feature is covered by appropriate, automated
tests.
- [ ] Any manual testing that has been done is documented (i.e. what
exactly was tested?)

🚀 Preview: https://rename-fastaids.loculus.org

---------

Co-authored-by: Theo Sanderson <[email protected]>
@anna-parker
Contributor Author

@corneliusroemer I will merge this now as it is an overall improvement - if you have any improvements please make a PR and I'm happy to review :-)

@anna-parker anna-parker merged commit 3f54e76 into edit-page-anya Dec 1, 2025
42 checks passed
@anna-parker anna-parker deleted the multipath-docs branch December 1, 2025 12:33
anna-parker added a commit that referenced this pull request Dec 1, 2025
…s with correct fields (#5561)

resolves #5572

Follow up PR with `fastaId` to `fastaIds` change:
#5583

### Screenshot
fastaIds field added to template for CCHF:
<img width="1672" height="1138" alt="image"
src="https://github.com/user-attachments/assets/0dcc1be8-2f01-4205-a819-84ea8055fc5f"
/>
<img width="1836" height="520" alt="image"
src="https://github.com/user-attachments/assets/b41678ca-17c5-41ef-a409-86288f18d124"
/>
Not added for ebola or EVs:
<img width="1728" height="644" alt="image"
src="https://github.com/user-attachments/assets/ebcfa88a-e64f-47b1-9fea-e3dfd6bd21a5"
/>
<img width="1836" height="520" alt="image"
src="https://github.com/user-attachments/assets/a77e47ee-8f06-4a29-9eeb-68863ed3dbd0"
/>


### PR Checklist
- [ ] Make PR with same changes in PPX -> after docs are approved
- ~[x] Add fastaIds to commonMetadata fields in config?~ -> fastaIds
field does not exist for single segmented organisms and thus should not
be added here as this breaks Loculus
- [x] Ensure metadata template downloads are correct

🚀 Preview: https://multipath-docs.loculus.org

---------

Co-authored-by: Theo Sanderson <[email protected]>
anna-parker added a commit that referenced this pull request Dec 1, 2025
…s with correct fields (#5561)

anna-parker added a commit that referenced this pull request Dec 2, 2025
…s with correct fields (#5561)

anna-parker added a commit that referenced this pull request Dec 3, 2025
…s with correct fields (#5561)

anna-parker added a commit that referenced this pull request Dec 4, 2025
…s with correct fields (#5561)

anna-parker added a commit that referenced this pull request Dec 5, 2025
… refactor multi segment submission in backend and edit page and have prepro assign segments (#5382)

resolves #4999 #4708,
#4734,
#5511

partially resolves
#5392,
#5185 (comment)

includes work done in
#5398 and
#5402

This PR additionally fixes submission, subtype assignment and search for
EVs and other multi-path organisms.

### BREAKING CHANGES

When users submit to multi-segmented organisms and want to group
multiple segments under one metadata entry, they are now required to add
an additional `fastaIds` column containing a space-separated list of the
`fastaId`s (FASTA header IDs) of the respective sequences. If no
`fastaIds` column is supplied, the `submissionId` is used instead
and the backend assumes that (as in the single-segmented case) there
is a one-to-one mapping from metadata `submissionId` to `fastaId`.
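
The rule above can be sketched as follows. This is only an illustration, assuming metadata rows are already parsed into dictionaries; the function name `fasta_ids_for_entry` is hypothetical and this is not the backend's actual implementation. Treating an empty `fastaIds` cell like a missing column is also an assumption.

```python
def fasta_ids_for_entry(row: dict[str, str]) -> list[str]:
    """Return the FASTA header IDs that belong to one metadata row.

    A non-empty `fastaIds` value is read as a space-separated list.
    Otherwise the `submissionId` is used, i.e. a one-to-one mapping
    between metadata entry and sequence is assumed.
    """
    raw = (row.get("fastaIds") or "").strip()
    if raw:
        return raw.split()
    # Assumption: an empty cell falls back the same way as a missing column.
    return [row["submissionId"]]


multi = {"submissionId": "sample1", "fastaIds": "sample1_L sample1_M sample1_S"}
single = {"submissionId": "sample2"}
print(fasta_ids_for_entry(multi))
print(fasta_ids_for_entry(single))
```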

This new submission structure was voted for in microbioinfo:
https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399
and discussed in
https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6
(and in other meetings)

Nextclade sort (uses a minimizer index for fast local alignment) or
nextclade align (full sequence alignment to reference) will be used to
assign segments/subtypes for all multi-segmented and multi-pathogen
sequences (this is also done in ingest for grouping segments):
```
segment_classification_method: "minimizer" or "align"
minimizer_url: <url_to_minimizer_index_used_by_nextclade_sort>
```
For organisms without a nextclade dataset we still allow the fasta
headers to be used to determine the segment/subtype - entries must have
the format `<submissionId>_<segmentName>` (as in the current setup).
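
For the header-based fallback, a rough sketch of how `<submissionId>_<segmentName>` could be split is shown below. The helper `split_header` is hypothetical; matching against the list of known segment names (rather than splitting on the first underscore) is an assumption made so that submission IDs containing underscores still work.

```python
def split_header(fasta_id: str, segment_names: list[str]) -> tuple[str, str]:
    """Split `<submissionId>_<segmentName>` into (submissionId, segmentName).

    The suffix is compared against the known segment names, so a
    submission ID may itself contain underscores.
    """
    for segment in segment_names:
        suffix = f"_{segment}"
        # Require a non-empty submission ID in front of the suffix.
        if fasta_id.endswith(suffix) and len(fasta_id) > len(suffix):
            return fasta_id[: -len(suffix)], segment
    raise ValueError(f"{fasta_id!r} does not end with a known segment name")


print(split_header("my_sample_L", ["L", "M", "S"]))
```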

As preprocessing now assigns segments, it returns a map from the
segment (or subtype) to the fastaId in the processedData. The map is
called `sequenceNameToFastaId`. This allows us to surface the segment
assignment on the edit page.
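
To illustrate the shape of `sequenceNameToFastaId`, here is a small sketch that inverts a per-sequence assignment into a segment-to-fastaId map. The function name and the choice to raise when two sequences land on the same segment are assumptions for illustration, not a description of the real preprocessing code.

```python
def build_sequence_name_to_fasta_id(assignments: dict[str, str]) -> dict[str, str]:
    """Invert a fastaId -> segment assignment into segment -> fastaId."""
    result: dict[str, str] = {}
    for fasta_id, segment in assignments.items():
        if segment in result:
            # Assumption: two sequences for one segment is treated as an error.
            raise ValueError(
                f"Segment {segment!r} assigned to both "
                f"{result[segment]!r} and {fasta_id!r}"
            )
        result[segment] = fasta_id
    return result


# Hypothetical classification result for one CCHF entry.
assigned = {"contig_a": "L", "contig_b": "M", "contig_c": "S"}
print(build_sequence_name_to_fasta_id(assigned))
```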

### Nextclade Preprocessing pipeline config changes

Instead of having a dictionary for the nextclade datasets and servers we
make `nucleotideSequences` a dictionary where each item includes all
information required to run nextclade. I.e. we change from:
```
nextclade_dataset_name:
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```
to: 
```
nextclade_sequence_and_datasets: 
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> 
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name and name are used> 
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
    genes: [RdRp]
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
    genes: [GPC]
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
    genes: [NP]
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
segment_classification_method: <optional, default for multi segmented viruses is align - if you assign segments in ingest for grouping use the same option here as you use there e.g. "minimizer" or "align">
minimizer_url: <optional, url_to_minimizer_index_used_by_nextclade_sort>
```
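
A possible way to convert an existing config into the new layout is sketched below. The function `migrate_nextclade_config` is hypothetical. Because the old `genes` list is flat, the per-segment gene assignment cannot be derived automatically and is passed in explicitly here (an assumption of this sketch). Optional keys such as `nextclade_dataset_tag` are not handled.

```python
def migrate_nextclade_config(
    old: dict, genes_by_segment: dict[str, list[str]]
) -> dict:
    """Convert the old per-organism layout into `nextclade_sequence_and_datasets`.

    Organism-level keys such as `nextclade_dataset_server` are kept as-is;
    the old `nextclade_dataset_name` mapping and flat `genes` list are
    replaced by one list item per segment.
    """
    new = {
        key: value
        for key, value in old.items()
        if key not in ("nextclade_dataset_name", "genes")
    }
    new["nextclade_sequence_and_datasets"] = [
        {
            "name": segment,
            "nextclade_dataset_name": dataset,
            "genes": genes_by_segment.get(segment, []),
        }
        for segment, dataset in old["nextclade_dataset_name"].items()
    ]
    return new


old_config = {
    "nextclade_dataset_name": {
        "L": "nextstrain/cchfv/linked/L",
        "M": "nextstrain/cchfv/linked/M",
        "S": "nextstrain/cchfv/linked/S",
    },
    "genes": ["RdRp", "GPC", "NP"],
}
genes = {"L": ["RdRp"], "M": ["GPC"], "S": ["NP"]}
print(migrate_nextclade_config(old_config, genes))
```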

### Ingest Pipeline Config changes

`minimizer_index` is renamed to `minimizer_url` for consistency (the key
can be used in both ingest and preprocessing, and both should use the
same value).

### Optional additional Config changes

Limit the number of sequences the backend will accept per submission by
setting `maxSequencesPerEntry` (this should be added for multi-segmented
organisms):
```
submissionDataTypes: &defaultSubmissionDataTypes
  consensusSequences: true
  maxSequencesPerEntry: 1
```
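
As a rough illustration of what such a limit means for a single metadata entry, the check could look like the sketch below. The function `check_sequence_limit` and its error message are hypothetical; the actual backend validation may differ.

```python
from typing import Optional


def check_sequence_limit(
    submission_id: str,
    fasta_ids: list[str],
    max_sequences_per_entry: Optional[int],
) -> None:
    """Raise if an entry references more sequences than the configured limit.

    `max_sequences_per_entry` mirrors the `maxSequencesPerEntry` config
    value; `None` means no limit is configured.
    """
    if max_sequences_per_entry is None:
        return
    if len(fasta_ids) > max_sequences_per_entry:
        raise ValueError(
            f"Entry {submission_id!r} has {len(fasta_ids)} sequences, "
            f"but at most {max_sequences_per_entry} are allowed"
        )


check_sequence_limit("sample1", ["sample1"], 1)  # within the limit, no error
```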

### Testing

You can use pathoplexus/example_data#16 and
pathoplexus/dev_example_data#2 for testing.

### PR Checklist
- [x] Update values.schema.json and other READMEs
- [x] add fastaId to commonMetadata (ensure it is downloaded in
templates): #5561
- [x] Fix how genes are returned (will cause a config update):
#5563
- [x] Improve prepro code (less duplication and more tests):
#5554
- [x] ingest EVs as single segmented to ensure search works:
#5511
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using:
https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the
test folder were submitted with the same fastaHeader as the
submissionId -> this succeeded; additionally, the submission of CCHF with
a fastaID column in the metadata was tested (also in the folder above),
as was the revision of a segment
- [x] Have preprocessing send back a segment: fastaHeader mapping
- ~add integration testing for full EV submission user journey~ -> will
be done in a later PR
- [x] improve CCHF minimizer (some segments are again not assigned)
- [x] discuss if the originalData dictionary should be migrated
(persistent DB has segmentName as key, now we have fastaHeader as key)
-> decided against
- [x] update PPX docs with new multi-segment submission format -> test
PR here: pathoplexus/pathoplexus#759
- [x] update example data for demo

🚀 Preview: https://edit-page-anya.loculus.org

---------

Co-authored-by: Cornelius Roemer <[email protected]>
Co-authored-by: Fabian Engelniederhammer <[email protected]>
Co-authored-by: Theo Sanderson <[email protected]>

Labels

preview Triggers a deployment to argocd
