Hyperparameters optimisation #65

@wiktorolszowy

Hi! Thanks a lot for this package. I am interested in how best to choose values of the hyperparameters. There are five of them that seem particularly relevant:

  1. d: the number of hash functions, used to initialize the LSH forest data structure, by default 128.
  2. l: the number of prefix trees, used to initialize the LSH forest data structure, by default 8.
  3. k: the number of nearest neighbors used to create the k-nearest neighbor graph, by default 10.
  4. $k_c$: the scalar by which k is multiplied before querying the LSH forest, by default 10.
  5. p: the size of the nodes, which affects the magnitude of their repelling force, by default 1/65.

The first two parameters are from tmap.LSHForest and their default values are defined here. The remaining parameters are from tmap.layout_from_lsh_forest and their default values are defined here.
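For concreteness, here is a minimal sketch of where I understand these five values to enter the API. It assumes that $k_c$ corresponds to the `kc` attribute and p to the `node_size` attribute of `tmap.LayoutConfiguration`, and that the input is a list of binary fingerprints. The helper names `resolve_params` and `layout_with` are hypothetical and not part of tmap.

```python
# Defaults as listed above. Mapping k_c -> kc and p -> node_size is an assumption.
DEFAULTS = {"d": 128, "l": 8, "k": 10, "kc": 10, "node_size": 1 / 65}


def resolve_params(**overrides):
    """Merge user overrides into the defaults, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown hyperparameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def layout_with(fingerprints, **overrides):
    """Index binary fingerprints and compute a layout (hypothetical helper)."""
    import tmap as tm  # imported lazily so the defaults can be inspected without tmap

    params = resolve_params(**overrides)

    # d and l are used when the MinHash encoder and the LSH forest are created.
    enc = tm.Minhash(params["d"])
    lf = tm.LSHForest(params["d"], params["l"])
    vecs = [tm.VectorUchar(fp) for fp in fingerprints]
    lf.batch_add(enc.batch_from_binary_array(vecs))
    lf.index()

    # k, kc and node_size are set on the layout configuration.
    cfg = tm.LayoutConfiguration()
    cfg.k = params["k"]
    cfg.kc = params["kc"]
    cfg.node_size = params["node_size"]

    # Returns node coordinates (x, y) and the edge lists (s, t) of the tree.
    x, y, s, t, _ = tm.layout_from_lsh_forest(lf, cfg)
    return x, y, s, t
```

With this, a single value can be varied while the others stay at their defaults, e.g. `layout_with(fps, node_size=1 / 30)`.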

From the supplement (https://ndownloader.figstatic.com/files/21710592) it seems that p is particularly important (cf. Figures S1, S2, S3, and S7). I often see tmap visualizations that are too sparse: some branches are very long, while others are very short (e.g., those with the leaves).

The paper and its analysis of the hyperparameters are already 4 years old. Has anyone who has used this tool extensively experimented with these hyperparameters and developed rules of thumb for tuning them, especially p? For example, rules that depend on the number of data points, and perhaps also on the approximate number of suspected clusters.
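To compare settings in a sweep, "very long next to very short branches" would need to be turned into a number. One possible proxy, which is purely my assumption and not a metric from the paper, is the coefficient of variation of the edge lengths in the final layout. The sketch below assumes the layout is given as coordinate lists `x`, `y` and edge lists `s`, `t` (source and target node indices); `edge_length_cv` is a hypothetical helper name.

```python
import math
import statistics


def edge_length_cv(x, y, s, t):
    """Coefficient of variation of edge lengths; 0 means all edges are equally long."""
    lengths = [math.hypot(x[a] - x[b], y[a] - y[b]) for a, b in zip(s, t)]
    mean = statistics.fmean(lengths)
    if mean == 0:
        return 0.0
    # Population standard deviation, since all edges of the tree are included.
    return statistics.pstdev(lengths) / mean
```

A lower value would indicate more uniform edge lengths. Whether that also corresponds to a visually better plot is something I have not verified.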
