Skip to content

kchu25/ShaneGPUCountMinSketch.jl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ShaneGPUCountMinSketch.jl

Stable Dev Build Status Coverage

Overview

ShaneGPUCountMinSketch.jl is a Julia package for fast, GPU-accelerated motif counting using the Count-Min Sketch probabilistic data structure. It is designed for large-scale sequence analysis, efficiently counting motif instances in sparse codes by leveraging CUDA-enabled GPUs.

Mathematical Background

We define a configuration $c_K$ as a $(2K+1)$-tuple for $K = 1,2,3,\ldots$:

$$ c_K = \big( c[1],; c[2],; c[3],; \ldots,; c[2K+1] \big) $$

  • The components $c[2k+1]$ (for $1 \leq k \leq K$) represent filter indices.
  • The components $c[2k]$ (for $1 \leq k \leq K-1$) represent the distances (e.g., number of nucleotides) between filters $c[2(k-1)+1]$ and $c[2k+1]$.

This structure allows flexible motif representation, encoding both filter indices and the distances between them.

Features

  • High-performance motif counting on GPU
  • Probabilistic counting with bounded error using Count-Min Sketch
  • Flexible configuration for motif structure and filter length
  • Easy integration with Julia data structures

Installation

using Pkg
Pkg.add("ShaneGPUCountMinSketch")

Quick Start

using ShaneGPUCountMinSketch

# Prepare your data as a Dict{Int, Vector{CartesianIndex{2}}}
nz_dict = Dict(
        1 => [CartesianIndex(5, 2), CartesianIndex(10, 3)],
        2 => [CartesianIndex(7, 1)]
)
num_fils = 2      # Number of filters in configuration
fil_len = 10      # Filter length

configs = obtain_enriched_configurations(nz_dict, num_fils, fil_len; min_count=2)

API

obtain_enriched_configurations

obtain_enriched_configurations(
        nz_dict::Dict{Int, Vector{CartesianIndex{2}}},
        num_fils::Int, 
        fil_len::Int;
        min_count=default_min_count
) -> Set{Tuple{Int, ...}}
  • nz_dict: Dictionary mapping sequence indices to vectors of (position, filter index) pairs.
  • num_fils: Number of filters in the configuration (motif size).
  • fil_len: Length of each filter.
  • min_count: Minimum count threshold for a configuration to be considered enriched.

Returns a set of integer tuples, each representing a unique motif configuration.

Notes

  • This package uses a probabilistic data structure (Count-Min Sketch), so results may vary slightly between runs. In practice, the error is negligible and theoretically bounded.

License

MIT License. See LICENSE for details.

About

A much faster way to enumerate combinations of patterns

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages