# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Project FINDIT (Facilitating Intra-Departmental Navigation of Data and Information Transfer) is a data management system for the Department of Psychology and Biobehavioral Science (PBS). The project consists of R-based "finders" (crawlers) that scan directories, extract metadata, and perform data management tasks.

## Architecture

The project contains two main components:

### 1. File Crawler (`File Crawler/file_crawler.Rmd`)
- Recursively scans directories to create a complete file inventory
- Extracts metadata: file paths, names, sizes, modified times, file types
- For SAS datasets (`.sas7bdat`), calculates unique patient identifier (MRN) counts from the first column
- Uses parallel processing to handle large file sets efficiently
- Outputs to: `../data/processed_data/file_inventory_updated.csv`

### 2. Participant First Finder (`Participant First Finder/participant_first_findr.Rmd`)
- Searches for specific patient MRNs across multiple file formats
- Supports: `.sas7bdat`, `.sas`, `.csv`, `.xlsx`
- Assumes MRN is in the first column of each data file
- Uses parallel processing for performance optimization
- Configured to scan: `../data/raw_data/[folder]`
- Reports file-level metadata plus MRN occurrence counts
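The per-format search could be dispatched on file extension roughly as below. This is a sketch, not the script's actual code: the function name `count_mrn_hits` is invented, and the `.sas` program-file case (plain text rather than a dataset) is omitted for brevity:

```r
# Hypothetical dispatcher: read the first column of a data file based on
# its extension, then count occurrences of the target MRN
count_mrn_hits <- function(path, mrn) {
  ext <- tools::file_ext(path)
  first_col <- switch(tolower(ext),
    "sas7bdat" = haven::read_sas(path)[[1]],
    "csv"      = data.table::fread(path, select = 1)[[1]],
    "xlsx"     = readxl::read_excel(path)[[1]],
    NULL  # unmatched extensions fall through to NULL
  )
  if (is.null(first_col)) return(0L)
  # Compare as character so "2759" matches numeric and text columns alike
  sum(as.character(first_col) == as.character(mrn), na.rm = TRUE)
}
```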

## Key Technical Details

### R Package Dependencies
Both scripts require:
- `tidyverse` - data manipulation and transformation
- `haven` - reading SAS files (`.sas7bdat`, `.sas`)
- `tools` - file extension detection via `file_ext()`
- `data.table` - high-performance data operations
- `parallel` - multi-core processing
- `readr` - CSV writing (file_crawler only)
- `readxl` - Excel file reading (participant_first_findr only)

### Parallel Processing Pattern
Both finders use identical parallel processing architecture:
```r
num_cores <- detectCores() - 1                           # leave one core free
cl <- makeCluster(num_cores)                             # spin up worker processes
clusterEvalQ(cl, { library(...) })                       # load packages on each worker
clusterExport(cl, c("all_files", "process_file", ...))   # ship objects to workers
file_results <- parLapply(cl, all_files, process_file)   # process files in parallel
stopCluster(cl)                                          # release the workers
```

### Data Processing Workflow
1. List all files recursively with `list.files(path, full.names = TRUE, recursive = TRUE)`
2. Build metadata inventory using `file.info()` and `file_ext()`
3. Process files in parallel based on file type
4. Consolidate results into single dataframe
5. Output structured CSV
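Steps 1 and 2 can be sketched as below; `start_path` and the inventory column names are illustrative, not necessarily the scripts' exact choices:

```r
library(tools)

start_path <- "."  # placeholder; the scripts use paths under ../data/

# Step 1: recursive file listing
all_files <- list.files(start_path, full.names = TRUE, recursive = TRUE)

# Step 2: metadata inventory from file.info() and file_ext()
info <- file.info(all_files)
inventory <- data.frame(
  file_path = all_files,
  file_name = basename(all_files),
  size      = info$size,
  mtime     = info$mtime,
  file_type = file_ext(all_files)
)
```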

### File Size Classification
Files are classified with a 100 MB byte threshold (`100000000`), even though the output labels claim a 1 GB cutoff:
```r
mutate(file_size = ifelse(size > 100000000, "greater than 1 GB", "less than 1GB"))
```
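If the 100 MB cutoff is intentional, a label that matches the threshold avoids the mismatch. This is a suggested rewrite with toy data, not the repository's current code:

```r
library(dplyr)

# Toy data: one file under and one over the 100 MB (1e8 byte) threshold
files <- data.frame(size = c(5e7, 2e8))

# Suggested: make the label agree with the byte threshold
files <- files %>%
  mutate(file_size = ifelse(size > 1e8, "greater than 100 MB", "100 MB or less"))
```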

## Running the Finders

Both tools are R Markdown files meant to be executed chunk-by-chunk in RStudio or via `knitr`:

```r
# In R console
rmarkdown::render("File Crawler/file_crawler.Rmd")
rmarkdown::render("Participant First Finder/participant_first_findr.Rmd")
```

For participant_first_findr, update the MRN and data path before running:
```r
specific_mrn <- "2759" # Line 68: Change to target MRN
start_path = "../data/raw_data/[folder]" # Line 20: Update folder path
```

## Development Notes

- Both scripts assume the first column of data files contains MRN/patient identifiers
- File paths are relative to the R Markdown file locations
- Output directory is created automatically if it doesn't exist
- The parallel processing uses all available cores minus one, leaving one core free for the operating system
- SAS file reading may fail if SAS files are corrupted or use incompatible encoding
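One way to keep a single unreadable SAS file from aborting an entire crawl is to wrap the read in `tryCatch`; the helper name `safe_read_sas` is illustrative and not part of the scripts:

```r
# Sketch: guard SAS reads so one corrupted or oddly-encoded file
# is skipped with a warning instead of stopping the whole run
safe_read_sas <- function(path) {
  tryCatch(
    haven::read_sas(path),
    error = function(e) {
      warning(sprintf("Skipping %s: %s", path, conditionMessage(e)))
      NULL  # caller should drop NULL results before consolidating
    }
  )
}
```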