-
Notifications
You must be signed in to change notification settings - Fork 195
Version 2.4.1: CSV Parser Bulletproofing #287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
4aa9ae9
95797eb
718eadc
27282ed
1507866
6a31bdc
3f02253
6604cac
a78686d
bcd9432
384e753
a412c83
ba4c42e
483911d
c82889d
23f8999
3596f1a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| --- | ||
| paths: | ||
| - "include/internal/csv_reader.hpp" | ||
| - "include/internal/csv_reader.cpp" | ||
| - "include/internal/csv_reader_iterator.cpp" | ||
| --- | ||
|
|
||
| # CSVReader Iterator Rules | ||
|
|
||
| ## CRITICAL CONSTRAINT: DO NOT Cache All RawCSVDataPtr Chunks | ||
|
|
||
| ### The Streaming Architecture | ||
| - Library parses 50+ GB CSV files with bounded memory (e.g., 8GB RAM) | ||
| - Previous data chunks are freed as iterator advances | ||
| - `CSVReader::iterator` is `std::input_iterator_tag` by design (single-pass) | ||
|
|
||
| ### When You See Heap-Use-After-Free with std::max_element or ForwardIterator Algorithms | ||
|
|
||
| **❌ WRONG FIX (defeats streaming)**: | ||
| ```cpp | ||
| class iterator { | ||
| std::vector<RawCSVDataPtr> _all_data_chunks; // NO! | ||
| // Result: 50GB CSV requires 50GB RAM | ||
| }; | ||
| ``` | ||
|
|
||
| **✅ CORRECT FIXES**: | ||
| 1. **Document limitation**: ForwardIterator algorithms not supported on CSVReader::iterator | ||
| 2. **Fix test**: Copy to vector first: `std::vector<CSVRow> rows(reader.begin(), reader.end());` | ||
| 3. **Update README**: Show vector-based workaround for std::max_element | ||
|
|
||
| ### Valid RawCSVDataPtr Usage | ||
| - ✅ `CSVRow::iterator` CAN have RawCSVDataPtr member (single row scope, negligible memory) | ||
| - ❌ `CSVReader::iterator` CANNOT cache all chunks (file scope, unbounded memory) | ||
|
|
||
| ### Root Cause Analysis | ||
| - Small test files (< chunk size) fit in one chunk → ForwardIterator algorithms appear to work | ||
| - Large files span multiple chunks → heap-use-after-free when algorithm accesses freed chunk | ||
| - Solution: Document as unsupported OR provide vector workaround, NOT cache all chunks |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,117 @@ | ||
| # GitHub Actions Memory Checking Setup | ||
|
|
||
| This directory contains GitHub Actions workflows for comprehensive memory and thread safety testing. | ||
|
|
||
| ## Workflows | ||
|
|
||
| ### 🧵 sanitizers.yml - Memory & Thread Sanitizers (PRIMARY) | ||
| **Runs:** On every push and pull request | ||
| **Sanitizers Included:** | ||
| - **ThreadSanitizer (TSan)** - Detects data races and thread safety issues (CRITICAL for csv-parser) | ||
| - **AddressSanitizer (ASan)** - Detects memory errors: use-after-free, buffer overflows, memory leaks | ||
| - **UndefinedBehaviorSanitizer (UBSan)** - Catches undefined behavior: signed overflow, type mismatches | ||
|
|
||
| **Config:** | ||
| - Runs on Ubuntu with GCC | ||
| - Tests C++17 and C++20 standards | ||
| - Debug builds for better diagnostics | ||
| - Timeout: 20 minutes per configuration | ||
| - Artifacts: Upload logs on failure | ||
|
|
||
| **Key Features:** | ||
| - Matrix testing: 3 sanitizers × 2 C++ standards = 6 parallel jobs | ||
| - Fail-fast disabled to see all results | ||
| - Environment variables configured for halt-on-error behavior | ||
|
|
||
| ### 💾 valgrind.yml - Valgrind Memory Profiler | ||
| **Runs:** On every push and pull request | ||
| **Uses:** Valgrind with full leak checking | ||
|
|
||
| **Config:** | ||
| - Runs on Ubuntu with GCC | ||
| - Debug build with -O1 for balance | ||
| - Excludes multi-threaded tests (Valgrind + threads = slow) | ||
| - Timeout: 60 minutes (Valgrind is slower) | ||
|
|
||
| ### 🔍 codeql.yml - Static Analysis (GitHub CodeQL) | ||
| **Runs:** On every push and pull request + weekly schedule | ||
| **Analysis:** | ||
| - Deep static analysis for security and quality issues | ||
| - Integrates with GitHub Security tab | ||
|
|
||
| ## Testing Recommendations | ||
|
|
||
| ### Local Testing Before Push | ||
| ```bash | ||
| # Test with ThreadSanitizer (most critical) | ||
| cmake -B build/tsan -DCMAKE_BUILD_TYPE=Debug \ | ||
| -DCMAKE_CXX_FLAGS="-fsanitize=thread -g" | ||
| cmake --build build/tsan | ||
| cd build/tsan && ctest --output-on-failure && cd ../.. | ||
|
|
||
| # Test with AddressSanitizer | ||
| cmake -B build/asan -DCMAKE_BUILD_TYPE=Debug \ | ||
| -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g" | ||
| cmake --build build/asan | ||
| cd build/asan && ctest --output-on-failure && cd ../.. | ||
| ``` | ||
|
|
||
| ### CI/CD Pipeline Order | ||
| 1. **Primary:** ThreadSanitizer - Catches data races in threading model | ||
| 2. **Primary:** AddressSanitizer - Catches memory corruption | ||
| 3. **Secondary:** UBSanitizer - Catches subtle undefined behavior | ||
| 4. **Nightly:** Valgrind - Comprehensive memory profiling | ||
| 5. **Continuous:** CodeQL - Static analysis | ||
|
|
||
| ## Known Considerations | ||
|
|
||
| ### ThreadSanitizer (TSan) | ||
| - **Why CRITICAL:** csv-parser uses worker threads + ThreadSafeDeque + double_quote_fields lazy init | ||
| - May report false positives in STL (benign race detection) | ||
| - Slower execution due to extra instrumentation | ||
| - Catches real issues in PR #282 exception propagation and issue #278 move semantics | ||
|
|
||
| ### AddressSanitizer (ASan) | ||
| - **Why Important:** Catches CSVFieldList memory issues like issue #278 | ||
| - Cannot run simultaneously with TSan (different memory models) | ||
| - Better performance than TSan for memory safety | ||
|
|
||
| ### Valgrind | ||
| - Slower than sanitizers but more mature tool | ||
| - Useful for final verification before releases | ||
| - Runs on push (not PR) to avoid GitHub Actions timeout | ||
|
|
||
| ### CodeQL | ||
| - No runtime overhead | ||
| - Integrated into GitHub Security tab | ||
| - Good for catching security issues and code quality | ||
|
|
||
| ## Interpreting Results | ||
|
|
||
| ### Sanitizer Failures | ||
| Look for: | ||
| ``` | ||
| ERROR: ThreadSanitizer: data race detected | ||
| ERROR: AddressSanitizer: heap-buffer-overflow | ||
| ERROR: UndefinedBehaviorSanitizer: signed integer overflow | ||
| ``` | ||
|
|
||
| ### Action Items | ||
| 1. Check artifacts uploaded on test failure | ||
| 2. Look at the specific sanitizer output in GitHub Actions logs | ||
| 3. Focus on ThreadSanitizer results first (threading complexity) | ||
| 4. Correlate with codebase changes to identify root cause | ||
|
|
||
| ## History | ||
|
|
||
| - **PR #282:** ThreadSanitizer would catch exception propagation issues | ||
| - **Issue #278:** AddressSanitizer catches CSVFieldList move semantics bug | ||
| - **PR #237:** TSan catches double_quote_fields concurrency todo | ||
| - **v2.3.0:** ThreadSafeDeque prevents std::vector reallocation race | ||
|
|
||
| ## Reference | ||
|
|
||
| - [ThreadSanitizer Documentation](https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual) | ||
| - [AddressSanitizer Documentation](https://github.com/google/sanitizers/wiki/AddressSanitizer) | ||
| - [Valgrind User Manual](https://valgrind.org/docs/manual/) | ||
| - [CodeQL Query Help](https://codeql.github.io/codeql-query-help/) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| name: CodeQL Analysis | ||
|
|
||
| on: | ||
| push: | ||
| branches: [ "master", "feb-15-2026" ] | ||
| pull_request: | ||
| branches: [ "master" ] | ||
| schedule: | ||
| - cron: '0 2 * * 0' # Weekly on Sunday at 2 AM UTC | ||
|
|
||
| jobs: | ||
| analyze: | ||
| name: CodeQL C++ | ||
| runs-on: ubuntu-latest | ||
| permissions: | ||
| actions: read | ||
| contents: read | ||
| security-events: write | ||
|
|
||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| language: ['cpp'] | ||
|
|
||
| steps: | ||
| - name: Checkout repository and submodules | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| submodules: recursive | ||
|
|
||
| - name: Initialize CodeQL | ||
| uses: github/codeql-action/init@v3 | ||
| with: | ||
| language: ${{ matrix.language }} | ||
| queries: security-and-quality | ||
|
|
||
| - name: Install dependencies | ||
| run: | | ||
| sudo apt-get update | ||
| sudo apt-get install -y build-essential cmake | ||
|
|
||
| - name: Configure CMake | ||
| run: | | ||
| mkdir -p build | ||
| cd build | ||
| cmake .. \ | ||
| -DCMAKE_BUILD_TYPE=Release \ | ||
| -DCSV_CXX_STANDARD=17 \ | ||
| -DCMAKE_CXX_COMPILER=g++ \ | ||
| -DCMAKE_C_COMPILER=gcc | ||
|
|
||
| - name: Build for CodeQL analysis | ||
| run: cmake --build build --config Release | ||
|
|
||
| - name: Run CodeQL analysis | ||
| uses: github/codeql-action/analyze@v3 | ||
| with: | ||
| category: "/language:${{ matrix.language }}" |
| Original file line number | Diff line number | Diff line change | |||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,111 @@ | |||||||||||||||||||||||||||||
| name: Memory and Thread Sanitizers | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| on: | |||||||||||||||||||||||||||||
| push: | |||||||||||||||||||||||||||||
| branches: [ "master", "feb-15-2026" ] | |||||||||||||||||||||||||||||
| pull_request: | |||||||||||||||||||||||||||||
| branches: [ "master" ] | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| jobs: | |||||||||||||||||||||||||||||
| sanitizers: | |||||||||||||||||||||||||||||
| runs-on: ubuntu-latest | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| strategy: | |||||||||||||||||||||||||||||
| fail-fast: false | |||||||||||||||||||||||||||||
| matrix: | |||||||||||||||||||||||||||||
| sanitizer: [address, thread, undefined] | |||||||||||||||||||||||||||||
| cxx_standard: [17, 20] | |||||||||||||||||||||||||||||
| include: | |||||||||||||||||||||||||||||
| - sanitizer: address | |||||||||||||||||||||||||||||
| flag: "-fsanitize=address -fno-omit-frame-pointer" | |||||||||||||||||||||||||||||
| name: "AddressSanitizer" | |||||||||||||||||||||||||||||
| - sanitizer: thread | |||||||||||||||||||||||||||||
| flag: "-fsanitize=thread" | |||||||||||||||||||||||||||||
| name: "ThreadSanitizer" | |||||||||||||||||||||||||||||
| - sanitizer: undefined | |||||||||||||||||||||||||||||
| flag: "-fsanitize=undefined -fno-omit-frame-pointer" | |||||||||||||||||||||||||||||
| name: "UndefinedBehaviorSanitizer" | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| env: | |||||||||||||||||||||||||||||
| ASAN_OPTIONS: "detect_leaks=1:halt_on_error=0" | |||||||||||||||||||||||||||||
| TSAN_OPTIONS: "halt_on_error=1" | |||||||||||||||||||||||||||||
| UBSAN_OPTIONS: "halt_on_error=1" | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| steps: | |||||||||||||||||||||||||||||
| - name: Checkout repository and submodules | |||||||||||||||||||||||||||||
| uses: actions/checkout@v4 | |||||||||||||||||||||||||||||
| with: | |||||||||||||||||||||||||||||
| submodules: recursive | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| - name: Install dependencies | |||||||||||||||||||||||||||||
| run: | | |||||||||||||||||||||||||||||
| sudo apt-get update | |||||||||||||||||||||||||||||
| sudo apt-get install -y build-essential cmake | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| - name: Configure CMake with ${{ matrix.name }} | |||||||||||||||||||||||||||||
| run: | | |||||||||||||||||||||||||||||
| mkdir -p build | |||||||||||||||||||||||||||||
| cd build | |||||||||||||||||||||||||||||
| cmake .. \ | |||||||||||||||||||||||||||||
| -DCMAKE_BUILD_TYPE=Debug \ | |||||||||||||||||||||||||||||
| -DCSV_CXX_STANDARD=${{ matrix.cxx_standard }} \ | |||||||||||||||||||||||||||||
| -DCMAKE_CXX_COMPILER=g++ \ | |||||||||||||||||||||||||||||
| -DCMAKE_C_COMPILER=gcc \ | |||||||||||||||||||||||||||||
| -DCMAKE_CXX_FLAGS="${{ matrix.flag }} -g" \ | |||||||||||||||||||||||||||||
| -DCMAKE_C_FLAGS="${{ matrix.flag }} -g" | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| - name: Build with ${{ matrix.name }} | |||||||||||||||||||||||||||||
| run: cmake --build build --config Debug | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| - name: Test with ${{ matrix.name }} | |||||||||||||||||||||||||||||
| working-directory: build | |||||||||||||||||||||||||||||
| run: ctest --output-on-failure -V | |||||||||||||||||||||||||||||
| timeout-minutes: 20 | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| - name: Upload sanitizer logs | |||||||||||||||||||||||||||||
| if: failure() | |||||||||||||||||||||||||||||
| uses: actions/upload-artifact@v4 | |||||||||||||||||||||||||||||
| with: | |||||||||||||||||||||||||||||
| name: sanitizer-logs-${{ matrix.sanitizer }}-std${{ matrix.cxx_standard }} | |||||||||||||||||||||||||||||
| path: build/Testing/ | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| valgrind: | |||||||||||||||||||||||||||||
| runs-on: ubuntu-latest | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| steps: | |||||||||||||||||||||||||||||
| - name: Checkout repository and submodules | |||||||||||||||||||||||||||||
| uses: actions/checkout@v4 | |||||||||||||||||||||||||||||
| with: | |||||||||||||||||||||||||||||
| submodules: recursive | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| - name: Install dependencies | |||||||||||||||||||||||||||||
| run: | | |||||||||||||||||||||||||||||
| sudo apt-get update | |||||||||||||||||||||||||||||
| sudo apt-get install -y build-essential cmake valgrind | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| - name: Configure CMake | |||||||||||||||||||||||||||||
| run: | | |||||||||||||||||||||||||||||
| mkdir -p build | |||||||||||||||||||||||||||||
| cd build | |||||||||||||||||||||||||||||
| cmake .. \ | |||||||||||||||||||||||||||||
| -DCMAKE_BUILD_TYPE=Debug \ | |||||||||||||||||||||||||||||
| -DCSV_CXX_STANDARD=17 \ | |||||||||||||||||||||||||||||
| -DCMAKE_CXX_COMPILER=g++ \ | |||||||||||||||||||||||||||||
| -DCMAKE_C_COMPILER=gcc \ | |||||||||||||||||||||||||||||
| -DCMAKE_CXX_FLAGS="-g -O1" \ | |||||||||||||||||||||||||||||
| -DCMAKE_C_FLAGS="-g -O1" | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| - name: Build | |||||||||||||||||||||||||||||
| run: cmake --build build --config Debug | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| - name: Test with Valgrind | |||||||||||||||||||||||||||||
| working-directory: build | |||||||||||||||||||||||||||||
| run: ctest --output-on-failure | |||||||||||||||||||||||||||||
| timeout-minutes: 60 | |||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||
| - name: Upload Valgrind results | |||||||||||||||||||||||||||||
| if: failure() | |||||||||||||||||||||||||||||
| uses: actions/upload-artifact@v4 | |||||||||||||||||||||||||||||
| with: | |||||||||||||||||||||||||||||
| name: valgrind-results | |||||||||||||||||||||||||||||
| path: build/Testing/ | |||||||||||||||||||||||||||||
|
Comment on lines
+73
to
+111
Check warningCode scanning / CodeQL Workflow does not contain permissions Medium
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
Copilot AutofixAI 2 months ago In general, to fix this issue you should explicitly declare The best, least-invasive fix here is to add a single root-level Concretely:
permissions:
contents: readafter the
Suggested changeset
1
.github/workflows/sanitizers.yml
Copilot is powered by AI and may make mistakes. Always verify output.
Refresh and try again.
|
|||||||||||||||||||||||||||||
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Copilot Autofix
AI 2 months ago
In general, the fix is to explicitly set a restrictive
permissionsblock in the workflow, ideally at the top level so it applies to all jobs, and only broaden it if/when a specific job needs more. For this workflow, the jobs only need to read repository contents and interact with artifacts; they do not need to write to code, PRs, or issues. A suitable minimal configuration ispermissions: contents: read, which still allowsactions/checkoutto function, and does not interfere withactions/upload-artifact(which does not useGITHUB_TOKEN).The single best fix without changing functionality is to add a top-level
permissionssection right under thename:(and beforeon:) that setscontents: read. This will apply to bothsanitizersandvalgrindjobs, satisfying CodeQL and enforcing least privilege. No other code, steps, or jobs need modifications, and no imports or external libraries are involved since this is a YAML configuration change only.Specifically, in
.github/workflows/sanitizers.yml, insert:between line 1 (
name: Memory and Thread Sanitizers) and line 3 (on:). No other file changes are required.