feat: multi-format upload support and dataset profiling endpoint#223

Open
24f2000777 wants to merge 4 commits into c2siorg:main from
24f2000777:feat/multi-format-upload-and-profiling

Conversation

@24f2000777

Summary

Fixes the hardcoded CSV-only limitation throughout DataLoom and adds a
new dataset profiling endpoint.

Changes

Multi-format upload support

  • Adds xlsx, xls, json, parquet, and tsv support
  • Format-aware reader/writer maps in pandas_helpers.py
  • Fixes the _copy path construction bug in file_service.py (previously a
    string .replace(".csv", "_copy.csv"), which breaks for non-CSV files and
    corrupts paths containing multiple dots)
  • Fixes a type inconsistency: get_original_path now accepts Path | str
  • Backward-compatible: read_csv_safe and save_csv_safe still work
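The format-aware dispatch and the stem-based copy path described above can be sketched roughly as follows. This is a minimal sketch, not the actual implementation: READERS, read_file_safe, and copy_path are illustrative names and may not match what pandas_helpers.py and file_service.py actually define.

```python
from pathlib import Path
import pandas as pd

# Hypothetical extension-to-reader map; the real map in pandas_helpers.py
# may differ in name and contents.
READERS = {
    ".csv": pd.read_csv,
    ".tsv": lambda p, **kw: pd.read_csv(p, sep="\t", **kw),
    ".xlsx": pd.read_excel,   # requires openpyxl
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,  # requires pyarrow
}

def read_file_safe(path):
    """Dispatch on the (lowercased) extension; accepts Path | str."""
    path = Path(path)
    ext = path.suffix.lower()
    if ext not in READERS:
        raise ValueError(f"Unsupported format: {ext}")
    return READERS[ext](path)

def copy_path(path):
    """Build the _copy path from the stem rather than string .replace(),
    so 'data.v2.parquet' becomes 'data.v2_copy.parquet' instead of a
    corrupted name, and non-CSV extensions are preserved."""
    path = Path(path)
    return path.with_name(f"{path.stem}_copy{path.suffix}")
```

The key point is that `Path.stem` splits only on the final suffix, which is what makes the copy-path construction safe for filenames with multiple dots.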

Dataset profiling endpoint

  • New GET /projects/{id}/profile endpoint
  • Returns per-column dtype, null count, null %, unique count
  • Returns numeric summary statistics (mean, std, min, p25, p50, p75, max)
  • Returns composite data quality score (0–100) penalising nulls and
    duplicate rows
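The profile and quality-score computation might look something like the sketch below. The function names match the commit list (get_column_profile, compute_quality_score), but the internals here are assumptions, including the exact penalty formula for the quality score.

```python
import pandas as pd

def get_column_profile(df: pd.DataFrame) -> dict:
    """Per-column dtype, null count, null %, unique count, and (for numeric
    columns) summary statistics."""
    profile = {}
    n = len(df)
    for col in df.columns:
        s = df[col]
        info = {
            "dtype": str(s.dtype),
            "null_count": int(s.isna().sum()),
            "null_pct": round(float(s.isna().mean() * 100), 2) if n else 0.0,
            "unique_count": int(s.nunique()),
        }
        if pd.api.types.is_numeric_dtype(s):
            desc = s.describe()  # mean, std, min, 25%, 50%, 75%, max
            info["stats"] = {k: float(desc[k]) for k in
                             ("mean", "std", "min", "25%", "50%", "75%", "max")}
        profile[col] = info
    return profile

def compute_quality_score(df: pd.DataFrame) -> float:
    """Composite 0-100 score penalising nulls and duplicate rows.
    The multiplicative penalty here is one plausible choice, not
    necessarily the one the PR implements."""
    if df.empty:
        return 0.0
    null_ratio = float(df.isna().mean().mean())   # fraction of null cells
    dup_ratio = float(df.duplicated().mean())     # fraction of duplicate rows
    return round(100.0 * (1 - null_ratio) * (1 - dup_ratio), 2)
```

A clean frame with no nulls and no duplicates scores 100; each defect class scales the score down independently, which keeps the result in the stated 0–100 range.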

Config

  • allowed_extensions updated to include all new formats
  • Added openpyxl and pyarrow as dependencies
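The extension check itself is small; a plausible config.py fragment is sketched below. The `allowed_extensions` name comes from the PR, but the container type and the `is_allowed` helper are illustrative assumptions.

```python
from pathlib import Path

# Hypothetical config fragment; the actual shape in config.py may differ.
allowed_extensions = {".csv", ".tsv", ".xlsx", ".xls", ".json", ".parquet"}

def is_allowed(filename: str) -> bool:
    # Lowercasing the suffix gives the case-insensitive matching
    # that the new tests cover.
    return Path(filename).suffix.lower() in allowed_extensions
```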

Tests

  • 28 tests passing, 0 failures
  • New tests cover xlsx, xls, json, tsv, parquet upload acceptance
  • New tests cover unsupported format rejection, case-insensitive extensions
  • New tests cover profile endpoint keys, quality score range, numeric stats,
    duplicate detection, 404 on missing project

Related

Part of GSoC 2026 DataLoom proposal.

- Add xlsx, xls, json, parquet, tsv upload support
- Fix hardcoded .csv path assumptions in file_service.py
- Fix _copy path construction bug using Path stem instead of string replace
- Fix get_original_path type inconsistency (now accepts Path or str)
- Add read_file_safe and save_file_safe with format-aware reader/writer maps
- Keep read_csv_safe and save_csv_safe as backward-compatible aliases
- Add GET /projects/{id}/profile endpoint with column stats and quality score
- Add compute_quality_score and get_column_profile to pandas_helpers
- Register profiling router in main.py
- Update allowed_extensions in config.py for all new formats
- Add openpyxl and pyarrow dependencies
- Add 12 new tests, 28 passing total
@github-actions

github-actions bot commented Mar 23, 2026

PR Review

Squash

Your PR has 4 commits. Please squash into a single commit.

How to fix

git fetch origin
git rebase -i origin/main   # mark all but first commit as "squash"
git push --force-with-lease

This comment updates automatically on each push.

@24f2000777
Author

Hi @OshanMudannayake — I'm Akshit Garg, a 3rd-year CS + BS Data Science student at IIT Madras, planning to submit a GSoC 2026 proposal for DataLoom. This PR adds multi-format upload support and a profiling endpoint. I'd love any feedback on the approach before I finalize my proposal, and I'm happy to iterate on anything!
