Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 39 additions & 0 deletions .claude/rules/csv_reader_rules.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
---
paths:
- "include/internal/csv_reader.hpp"
- "include/internal/csv_reader.cpp"
- "include/internal/csv_reader_iterator.cpp"
---

# CSVReader Iterator Rules

## CRITICAL CONSTRAINT: DO NOT Cache All RawCSVDataPtr Chunks

### The Streaming Architecture
- Library parses 50+ GB CSV files with bounded memory (e.g., 8GB RAM)
- Previous data chunks are freed as iterator advances
- `CSVReader::iterator` is `std::input_iterator_tag` by design (single-pass)

### When You See Heap-Use-After-Free with std::max_element or ForwardIterator Algorithms

**❌ WRONG FIX (defeats streaming)**:
```cpp
class iterator {
std::vector<RawCSVDataPtr> _all_data_chunks; // NO!
// Result: 50GB CSV requires 50GB RAM
};
```

**✅ CORRECT FIXES**:
1. **Document limitation**: ForwardIterator algorithms not supported on CSVReader::iterator
2. **Fix test**: Copy to vector first: `std::vector<CSVRow> rows(reader.begin(), reader.end());`
3. **Update README**: Show vector-based workaround for std::max_element

### Valid RawCSVDataPtr Usage
- ✅ `CSVRow::iterator` CAN have RawCSVDataPtr member (single row scope, negligible memory)
- ❌ `CSVReader::iterator` CANNOT cache all chunks (file scope, unbounded memory)

### Root Cause Analysis
- Small test files (< chunk size) fit in one chunk → ForwardIterator algorithms appear to work
- Large files span multiple chunks → heap-use-after-free when algorithm accesses freed chunk
- Solution: Document as unsupported OR provide vector workaround, NOT cache all chunks
117 changes: 117 additions & 0 deletions .github/workflows/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,117 @@
# GitHub Actions Memory Checking Setup

This directory contains GitHub Actions workflows for comprehensive memory and thread safety testing.

## Workflows

### 🧵 sanitizers.yml - Memory & Thread Sanitizers (PRIMARY)
**Runs:** On every push and pull request
**Sanitizers Included:**
- **ThreadSanitizer (TSan)** - Detects data races and thread safety issues (CRITICAL for csv-parser)
- **AddressSanitizer (ASan)** - Detects memory errors: use-after-free, buffer overflows, memory leaks
- **UndefinedBehaviorSanitizer (UBSan)** - Catches undefined behavior: signed overflow, type mismatches

**Config:**
- Runs on Ubuntu with GCC
- Tests C++17 and C++20 standards
- Debug builds for better diagnostics
- Timeout: 20 minutes per configuration
- Artifacts: Upload logs on failure

**Key Features:**
- Matrix testing: 3 sanitizers × 2 C++ standards = 6 parallel jobs
- Fail-fast disabled to see all results
- Environment variables configured for halt-on-error behavior

### 💾 valgrind.yml - Valgrind Memory Profiler
**Runs:** On every push and pull request
**Uses:** Valgrind with full leak checking

**Config:**
- Runs on Ubuntu with GCC
- Debug build with -O1 for balance
- Excludes multi-threaded tests (Valgrind + threads = slow)
- Timeout: 60 minutes (Valgrind is slower)

### 🔍 codeql.yml - Static Analysis (GitHub CodeQL)
**Runs:** On every push and pull request + weekly schedule
**Analysis:**
- Deep static analysis for security and quality issues
- Integrates with GitHub Security tab

## Testing Recommendations

### Local Testing Before Push
```bash
# Test with ThreadSanitizer (most critical)
cmake -B build/tsan -DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_CXX_FLAGS="-fsanitize=thread -g"
cmake --build build/tsan
cd build/tsan && ctest --output-on-failure && cd ../..

# Test with AddressSanitizer
cmake -B build/asan -DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g"
cmake --build build/asan
cd build/asan && ctest --output-on-failure && cd ../..
```

### CI/CD Pipeline Order
1. **Primary:** ThreadSanitizer - Catches data races in threading model
2. **Primary:** AddressSanitizer - Catches memory corruption
3. **Secondary:** UBSanitizer - Catches subtle undefined behavior
4. **Nightly:** Valgrind - Comprehensive memory profiling
5. **Continuous:** CodeQL - Static analysis

## Known Considerations

### ThreadSanitizer (TSan)
- **Why CRITICAL:** csv-parser uses worker threads + ThreadSafeDeque + double_quote_fields lazy init
- May report false positives in STL (benign race detection)
- Slower execution due to extra instrumentation
- Catches real issues in PR #282 exception propagation and issue #278 move semantics

### AddressSanitizer (ASan)
- **Why Important:** Catches CSVFieldList memory issues like issue #278
- Cannot run simultaneously with TSan (different memory models)
- Better performance than TSan for memory safety

### Valgrind
- Slower than sanitizers but more mature tool
- Useful for final verification before releases
- Runs on push (not PR) to avoid GitHub Actions timeout

### CodeQL
- No runtime overhead
- Integrated into GitHub Security tab
- Good for catching security issues and code quality

## Interpreting Results

### Sanitizer Failures
Look for:
```
ERROR: ThreadSanitizer: data race detected
ERROR: AddressSanitizer: heap-buffer-overflow
ERROR: UndefinedBehaviorSanitizer: signed integer overflow
```

### Action Items
1. Check artifacts uploaded on test failure
2. Look at the specific sanitizer output in GitHub Actions logs
3. Focus on ThreadSanitizer results first (threading complexity)
4. Correlate with codebase changes to identify root cause

## History

- **PR #282:** ThreadSanitizer would catch exception propagation issues
- **Issue #278:** AddressSanitizer catches CSVFieldList move semantics bug
- **PR #237:** TSan catches double_quote_fields concurrency todo
- **v2.3.0:** ThreadSafeDeque prevents std::vector reallocation race

## Reference

- [ThreadSanitizer Documentation](https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual)
- [AddressSanitizer Documentation](https://github.com/google/sanitizers/wiki/AddressSanitizer)
- [Valgrind User Manual](https://valgrind.org/docs/manual/)
- [CodeQL Query Help](https://codeql.github.io/codeql-query-help/)
2 changes: 1 addition & 1 deletion .github/workflows/cmake-multi-platform.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ name: CMake on multiple platforms

on:
push:
branches: [ "master", "memory-fix-csvfieldlist" ]
branches: [ "master", "feb-15-2026" ]
pull_request:
branches: [ "master" ]

Expand Down
58 changes: 58 additions & 0 deletions .github/workflows/codeql.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
name: CodeQL Analysis

on:
push:
branches: [ "master", "feb-15-2026" ]
pull_request:
branches: [ "master" ]
schedule:
- cron: '0 2 * * 0' # Weekly on Sunday at 2 AM UTC

jobs:
analyze:
name: CodeQL C++
runs-on: ubuntu-latest
permissions:
actions: read
contents: read
security-events: write

strategy:
fail-fast: false
matrix:
language: ['cpp']

steps:
- name: Checkout repository and submodules
uses: actions/checkout@v4
with:
submodules: recursive

- name: Initialize CodeQL
uses: github/codeql-action/init@v3
with:
language: ${{ matrix.language }}
queries: security-and-quality

- name: Install dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake

- name: Configure CMake
run: |
mkdir -p build
cd build
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DCSV_CXX_STANDARD=17 \
-DCMAKE_CXX_COMPILER=g++ \
-DCMAKE_C_COMPILER=gcc

- name: Build for CodeQL analysis
run: cmake --build build --config Release

- name: Run CodeQL analysis
uses: github/codeql-action/analyze@v3
with:
category: "/language:${{ matrix.language }}"
111 changes: 111 additions & 0 deletions .github/workflows/sanitizers.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
name: Memory and Thread Sanitizers

on:
push:
branches: [ "master", "feb-15-2026" ]
pull_request:
branches: [ "master" ]

jobs:
sanitizers:
runs-on: ubuntu-latest

strategy:
fail-fast: false
matrix:
sanitizer: [address, thread, undefined]
cxx_standard: [17, 20]
include:
- sanitizer: address
flag: "-fsanitize=address -fno-omit-frame-pointer"
name: "AddressSanitizer"
- sanitizer: thread
flag: "-fsanitize=thread"
name: "ThreadSanitizer"
- sanitizer: undefined
flag: "-fsanitize=undefined -fno-omit-frame-pointer"
name: "UndefinedBehaviorSanitizer"

env:
ASAN_OPTIONS: "detect_leaks=1:halt_on_error=0"
TSAN_OPTIONS: "halt_on_error=1"
UBSAN_OPTIONS: "halt_on_error=1"

steps:
- name: Checkout repository and submodules
uses: actions/checkout@v4
with:
submodules: recursive

- name: Install dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake

- name: Configure CMake with ${{ matrix.name }}
run: |
mkdir -p build
cd build
cmake .. \
-DCMAKE_BUILD_TYPE=Debug \
-DCSV_CXX_STANDARD=${{ matrix.cxx_standard }} \
-DCMAKE_CXX_COMPILER=g++ \
-DCMAKE_C_COMPILER=gcc \
-DCMAKE_CXX_FLAGS="${{ matrix.flag }} -g" \
-DCMAKE_C_FLAGS="${{ matrix.flag }} -g"

- name: Build with ${{ matrix.name }}
run: cmake --build build --config Debug

- name: Test with ${{ matrix.name }}
working-directory: build
run: ctest --output-on-failure -V
timeout-minutes: 20

- name: Upload sanitizer logs
if: failure()
uses: actions/upload-artifact@v4
with:
name: sanitizer-logs-${{ matrix.sanitizer }}-std${{ matrix.cxx_standard }}
path: build/Testing/

valgrind:
Comment on lines +11 to +72

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI 2 months ago

In general, the fix is to explicitly set a restrictive permissions block in the workflow, ideally at the top level so it applies to all jobs, and only broaden it if/when a specific job needs more. For this workflow, the jobs only need to read repository contents and interact with artifacts; they do not need to write to code, PRs, or issues. A suitable minimal configuration is permissions: contents: read, which still allows actions/checkout to function, and does not interfere with actions/upload-artifact (which does not use GITHUB_TOKEN).

The single best fix without changing functionality is to add a top-level permissions section right under the name: (and before on:) that sets contents: read. This will apply to both sanitizers and valgrind jobs, satisfying CodeQL and enforcing least privilege. No other code, steps, or jobs need modifications, and no imports or external libraries are involved since this is a YAML configuration change only.

Specifically, in .github/workflows/sanitizers.yml, insert:

permissions:
  contents: read

between line 1 (name: Memory and Thread Sanitizers) and line 3 (on:). No other file changes are required.

Suggested changeset 1
.github/workflows/sanitizers.yml

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/sanitizers.yml b/.github/workflows/sanitizers.yml
--- a/.github/workflows/sanitizers.yml
+++ b/.github/workflows/sanitizers.yml
@@ -1,4 +1,6 @@
 name: Memory and Thread Sanitizers
+permissions:
+  contents: read
 
 on:
   push:
EOF
@@ -1,4 +1,6 @@
name: Memory and Thread Sanitizers
permissions:
contents: read

on:
push:
Copilot is powered by AI and may make mistakes. Always verify output.
runs-on: ubuntu-latest

steps:
- name: Checkout repository and submodules
uses: actions/checkout@v4
with:
submodules: recursive

- name: Install dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake valgrind

- name: Configure CMake
run: |
mkdir -p build
cd build
cmake .. \
-DCMAKE_BUILD_TYPE=Debug \
-DCSV_CXX_STANDARD=17 \
-DCMAKE_CXX_COMPILER=g++ \
-DCMAKE_C_COMPILER=gcc \
-DCMAKE_CXX_FLAGS="-g -O1" \
-DCMAKE_C_FLAGS="-g -O1"

- name: Build
run: cmake --build build --config Debug

- name: Test with Valgrind
working-directory: build
run: ctest --output-on-failure
timeout-minutes: 60

- name: Upload Valgrind results
if: failure()
uses: actions/upload-artifact@v4
with:
name: valgrind-results
path: build/Testing/
Comment on lines +73 to +111

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI 2 months ago

In general, to fix this issue you should explicitly declare permissions for the GITHUB_TOKEN either at the workflow root (applying to all jobs) or for each job individually, granting only the minimal scopes needed. For a build-and-test workflow that only checks out the repo and uploads artifacts, contents: read is typically sufficient.

The best, least-invasive fix here is to add a single root-level permissions block right under the name: (or under on:) in .github/workflows/sanitizers.yml. This will apply to both sanitizers and valgrind jobs and ensures that the GITHUB_TOKEN has read-only access to repository contents and no write scopes. No existing steps need modification, and no new dependencies or imports are required, because GitHub Actions interprets the permissions key natively.

Concretely:

  • Edit .github/workflows/sanitizers.yml.
  • Insert:
permissions:
  contents: read

after the name: Memory and Thread Sanitizers line (or equivalently between on: and jobs:), ensuring indentation matches YAML rules (root-level, no extra spaces). This documents and enforces least-privilege access for the workflow.

Suggested changeset 1
.github/workflows/sanitizers.yml

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/sanitizers.yml b/.github/workflows/sanitizers.yml
--- a/.github/workflows/sanitizers.yml
+++ b/.github/workflows/sanitizers.yml
@@ -1,5 +1,8 @@
 name: Memory and Thread Sanitizers
 
+permissions:
+  contents: read
+
 on:
   push:
     branches: [ "master", "feb-15-2026" ]
EOF
@@ -1,5 +1,8 @@
name: Memory and Thread Sanitizers

permissions:
contents: read

on:
push:
branches: [ "master", "feb-15-2026" ]
Copilot is powered by AI and may make mistakes. Always verify output.
7 changes: 6 additions & 1 deletion CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ else()
endif(CSV_CXX_STANDARD)

option(BUILD_PYTHON "Build Python Binding" OFF)
option(ENABLE_CODE_COVERAGE "Enable code coverage instrumentation" OFF)

message("Building CSV library using C++${CMAKE_CXX_STANDARD}")

Expand All @@ -29,7 +30,11 @@ if(MSVC)
else()
# Ignore Visual Studio pragma regions
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-unknown-pragmas")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} --coverage -Og")

if(ENABLE_CODE_COVERAGE)
message("Code coverage instrumentation enabled")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} --coverage -Og")
endif()
endif(MSVC)

set(CSV_ROOT_DIR ${CMAKE_CURRENT_LIST_DIR})
Expand Down
Loading
Loading