# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Project FINDIT (Facilitating Intra-Departmental Navigation of Data and Information Transfer) is a data management system for the Department of Psychology and Biobehavioral Science (PBS). The project consists of R-based "finders" (crawlers) that scan directories, extract metadata, and perform data management tasks.

## Architecture

The project contains two main components:

### 1. File Crawler (`File Crawler/file_crawler.Rmd`)
- Recursively scans directories to create a complete file inventory
- Extracts metadata: file paths, names, sizes, modified times, file types
- For SAS datasets (`.sas7bdat`), calculates unique patient identifier (MRN) counts from the first column
- Uses parallel processing to handle large file sets efficiently
- Outputs to: `../data/processed_data/file_inventory_updated.csv`

### 2. Participant First Finder (`Participant First Finder/participant_first_findr.Rmd`)
- Searches for specific patient MRNs across multiple file formats
- Supports: `.sas7bdat`, `.sas`, `.csv`, `.xlsx`
- Assumes MRN is in the first column of each data file
- Uses parallel processing for performance optimization
- Configured to scan: `../data/raw_data/[folder]`
- Reports file-level metadata plus MRN occurrence counts
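The per-format search could be dispatched on file extension roughly as below. This is a sketch, not the script's actual code: the function name `count_mrn_hits` is invented, and the `.sas` program-file case (plain text rather than a dataset) is omitted for brevity:

```r
# Hypothetical dispatcher: read the first column of a data file based on
# its extension, then count occurrences of the target MRN
count_mrn_hits <- function(path, mrn) {
  ext <- tools::file_ext(path)
  first_col <- switch(tolower(ext),
    "sas7bdat" = haven::read_sas(path)[[1]],
    "csv"      = data.table::fread(path, select = 1)[[1]],
    "xlsx"     = readxl::read_excel(path)[[1]],
    NULL  # unmatched extensions fall through to NULL
  )
  if (is.null(first_col)) return(0L)
  # Compare as character so "2759" matches numeric and text columns alike
  sum(as.character(first_col) == as.character(mrn), na.rm = TRUE)
}
```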

## Key Technical Details

### R Package Dependencies
Both scripts require:
- `tidyverse` - data manipulation and transformation
- `haven` - reading SAS files (`.sas7bdat`, `.sas`)
- `tools` - file extension detection via `file_ext()`
- `data.table` - high-performance data operations
- `parallel` - multi-core processing
- `readr` - CSV writing (file_crawler only)
- `readxl` - Excel file reading (participant_first_findr only)

### Parallel Processing Pattern
Both finders use identical parallel processing architecture:
```r
num_cores <- detectCores() - 1                           # leave one core free
cl <- makeCluster(num_cores)                             # spin up worker processes
clusterEvalQ(cl, { library(...) })                       # load packages on each worker
clusterExport(cl, c("all_files", "process_file", ...))   # ship objects to workers
file_results <- parLapply(cl, all_files, process_file)   # process files in parallel
stopCluster(cl)                                          # release the workers
```

### Data Processing Workflow
1. List all files recursively with `list.files(path, full.names = TRUE, recursive = TRUE)`
2. Build metadata inventory using `file.info()` and `file_ext()`
3. Process files in parallel based on file type
4. Consolidate results into single dataframe
5. Output structured CSV
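Steps 1 and 2 can be sketched as below; `start_path` and the inventory column names are illustrative, not necessarily the scripts' exact choices:

```r
library(tools)

start_path <- "."  # placeholder; the scripts use paths under ../data/

# Step 1: recursive file listing
all_files <- list.files(start_path, full.names = TRUE, recursive = TRUE)

# Step 2: metadata inventory from file.info() and file_ext()
info <- file.info(all_files)
inventory <- data.frame(
  file_path = all_files,
  file_name = basename(all_files),
  size      = info$size,
  mtime     = info$mtime,
  file_type = file_ext(all_files)
)
```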

### File Size Classification
Files are classified with a 100 MB byte threshold (`100000000`), even though the output labels claim a 1 GB cutoff:
```r
mutate(file_size = ifelse(size > 100000000, "greater than 1 GB", "less than 1GB"))
```
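If the 100 MB cutoff is intentional, a label that matches the threshold avoids the mismatch. This is a suggested rewrite with toy data, not the repository's current code:

```r
library(dplyr)

# Toy data: one file under and one over the 100 MB (1e8 byte) threshold
files <- data.frame(size = c(5e7, 2e8))

# Suggested: make the label agree with the byte threshold
files <- files %>%
  mutate(file_size = ifelse(size > 1e8, "greater than 100 MB", "100 MB or less"))
```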

## Running the Finders

Both tools are R Markdown files meant to be executed chunk-by-chunk in RStudio or via `knitr`:

```r
# In R console
rmarkdown::render("File Crawler/file_crawler.Rmd")
rmarkdown::render("Participant First Finder/participant_first_findr.Rmd")
```

For participant_first_findr, update the MRN and data path before running:
```r
specific_mrn <- "2759" # Line 68: Change to target MRN
start_path = "../data/raw_data/[folder]" # Line 20: Update folder path
```

## Development Notes

- Both scripts assume the first column of data files contains MRN/patient identifiers
- File paths are relative to the R Markdown file locations
- Output directory is created automatically if it doesn't exist
- The parallel processing uses all available cores minus one, leaving one core free for the operating system
- SAS file reading may fail if SAS files are corrupted or use incompatible encoding
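One way to keep a single unreadable SAS file from aborting an entire crawl is to wrap the read in `tryCatch`; the helper name `safe_read_sas` is illustrative and not part of the scripts:

```r
# Sketch: guard SAS reads so one corrupted or oddly-encoded file
# is skipped with a warning instead of stopping the whole run
safe_read_sas <- function(path) {
  tryCatch(
    haven::read_sas(path),
    error = function(e) {
      warning(sprintf("Skipping %s: %s", path, conditionMessage(e)))
      NULL  # caller should drop NULL results before consolidating
    }
  )
}
```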