feat: multi-format upload support and dataset profiling endpoint #223
Open
24f2000777 wants to merge 4 commits into c2siorg:main from
- Add xlsx, xls, json, parquet, tsv upload support
- Fix hardcoded .csv path assumptions in file_service.py
- Fix _copy path construction bug using Path stem instead of string replace
- Fix get_original_path type inconsistency (now accepts Path or str)
- Add read_file_safe and save_file_safe with format-aware reader/writer maps
- Keep read_csv_safe and save_csv_safe as backward-compatible aliases
- Add GET /projects/{id}/profile endpoint with column stats and quality score
- Add compute_quality_score and get_column_profile to pandas_helpers
- Register profiling router in main.py
- Update allowed_extensions in config.py for all new formats
- Add openpyxl and pyarrow dependencies
- Add 12 new tests, 28 passing total
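To make the `_copy` path fix in the list above concrete, here is a minimal sketch contrasting the string-replace approach with one built on the path stem. The function names `copy_path_naive`, `copy_path` and `original_path` are hypothetical and chosen for illustration; the actual helpers in file_service.py may look different.

```python
from pathlib import Path
from typing import Union


def copy_path_naive(original: str) -> str:
    # String replace: only meaningful for ".csv" names, and it rewrites
    # every ".csv" occurrence in the path, not just the extension.
    return original.replace(".csv", "_copy.csv")


def copy_path(original: Union[Path, str]) -> Path:
    # Build "<stem>_copy<suffix>" from the final path component only,
    # so any extension works and earlier dots are left untouched.
    p = Path(original)
    return p.with_name(f"{p.stem}_copy{p.suffix}")


def original_path(copy: Union[Path, str]) -> Path:
    # Hypothetical inverse: strip one trailing "_copy" from the stem.
    # Accepts Path or str, mirroring the type fix mentioned in the list.
    p = Path(copy)
    stem = p.stem[: -len("_copy")] if p.stem.endswith("_copy") else p.stem
    return p.with_name(f"{stem}{p.suffix}")


if __name__ == "__main__":
    # Non-CSV input: the naive version returns the original path unchanged.
    print(copy_path_naive("uploads/data.xlsx"))
    print(copy_path("uploads/data.xlsx"))
    # Multiple dots: the naive version rewrites both ".csv" occurrences.
    print(copy_path_naive("uploads/my.csv.backup.csv"))
    print(copy_path("uploads/my.csv.backup.csv"))
```

With the naive version a non-CSV upload maps to itself, so a "copy" would silently point at the original file; the stem-based version avoids that regardless of extension.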
PR Review
Squash
Your PR has 4 commits. Please squash into a single commit.
How to fix:
git fetch origin
git rebase -i origin/main # mark all but first commit as "squash"
git push --force-with-lease
This comment updates automatically on each push.
Author
Hi @OshanMudannayake, I'm Akshit Garg, 3rd year CS + BS Data Science at IIT Madras, planning to submit a GSoC 2026 proposal for DataLoom. This PR adds multi-format upload support and a profiling endpoint. Would love any feedback on the approach before I finalize my proposal. Happy to iterate on anything!
Summary
Fixes the hardcoded CSV-only limitation throughout DataLoom and adds a
new dataset profiling endpoint.
Changes
Multi-format upload support
- [...] pandas_helpers.py
- Fixed the _copy path construction bug in file_service.py (was using string .replace(".csv", "_copy.csv"), which breaks for non-CSV files and corrupts paths with multiple dots)
- get_original_path now accepts Path | str
- read_csv_safe and save_csv_safe still work
Dataset profiling endpoint
- GET /projects/{id}/profile endpoint
- [...] duplicate rows
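As a rough illustration of what per-column stats and a quality score could look like, here is a sketch using pandas. The scoring formula (share of non-missing cells multiplied by share of non-duplicate rows) is an assumption made for this example, not the formula used by the PR's compute_quality_score, and the function names `column_profile` and `quality_score` are hypothetical.

```python
import pandas as pd


def column_profile(df: pd.DataFrame) -> dict:
    # Collect basic per-column statistics; numeric columns also get
    # min, max and mean.
    profile = {}
    for name in df.columns:
        col = df[name]
        stats = {
            "dtype": str(col.dtype),
            "missing": int(col.isna().sum()),
            "unique": int(col.nunique(dropna=True)),
        }
        if pd.api.types.is_numeric_dtype(col):
            stats["min"] = float(col.min())
            stats["max"] = float(col.max())
            stats["mean"] = float(col.mean())
        profile[name] = stats
    return profile


def quality_score(df: pd.DataFrame) -> float:
    # Assumed formula: 100 * (1 - missing-cell ratio) * (1 - duplicate-row ratio).
    if df.empty:
        return 0.0
    missing_ratio = float(df.isna().to_numpy().mean())
    duplicate_ratio = float(df.duplicated().mean())
    return 100.0 * (1.0 - missing_ratio) * (1.0 - duplicate_ratio)


if __name__ == "__main__":
    frame = pd.DataFrame({"a": [1, None], "b": ["x", "y"]})
    print(column_profile(frame))
    print(quality_score(frame))
```

A multiplicative score penalises a dataset that is both sparse and repetitive more than one with only a single problem; a weighted sum would be an equally reasonable choice.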
Config
- allowed_extensions updated to include all new formats
- Added openpyxl and pyarrow as dependencies
Tests
- [...] duplicate detection, 404 on missing project
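The extension allow-list and the format-aware reader/writer maps described in this PR could fit together roughly as follows. This is a sketch under assumptions: `ALLOWED_EXTENSIONS`, `READERS`, `WRITERS`, `read_any_format` and `save_any_format` are hypothetical names for illustration, and the PR's read_file_safe and save_file_safe may be structured differently.

```python
from pathlib import Path
from typing import Callable, Dict, Union

import pandas as pd

# Hypothetical allow-list; the PR keeps its own in config.py.
ALLOWED_EXTENSIONS = {".csv", ".tsv", ".json", ".xlsx", ".xls", ".parquet"}

# Map each extension to a pandas reader.
READERS: Dict[str, Callable[..., pd.DataFrame]] = {
    ".csv": pd.read_csv,
    ".tsv": lambda path, **kw: pd.read_csv(path, sep="\t", **kw),
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,       # requires openpyxl
    ".xls": pd.read_excel,        # requires an .xls-capable engine to be installed
    ".parquet": pd.read_parquet,  # requires pyarrow (or another parquet engine)
}

# Map each writable extension to a function taking (DataFrame, path).
WRITERS: Dict[str, Callable[[pd.DataFrame, Path], None]] = {
    ".csv": lambda df, p: df.to_csv(p, index=False),
    ".tsv": lambda df, p: df.to_csv(p, sep="\t", index=False),
    ".json": lambda df, p: df.to_json(p, orient="records"),
    ".xlsx": lambda df, p: df.to_excel(p, index=False),
    ".parquet": lambda df, p: df.to_parquet(p, index=False),
}


def read_any_format(path: Union[Path, str]) -> pd.DataFrame:
    # Dispatch on the lower-cased suffix; reject anything not allow-listed.
    p = Path(path)
    ext = p.suffix.lower()
    if ext not in ALLOWED_EXTENSIONS or ext not in READERS:
        raise ValueError(f"Unsupported file type: {ext!r}")
    return READERS[ext](p)


def save_any_format(df: pd.DataFrame, path: Union[Path, str]) -> None:
    # Same dispatch for writing; formats without a writer are rejected.
    p = Path(path)
    ext = p.suffix.lower()
    if ext not in ALLOWED_EXTENSIONS or ext not in WRITERS:
        raise ValueError(f"Unsupported file type: {ext!r}")
    WRITERS[ext](df, p)


def read_csv_compat(path: Union[Path, str]) -> pd.DataFrame:
    # Backward-compatible alias in the spirit of keeping read_csv_safe working.
    return read_any_format(path)
```

Keeping the dispatch in dictionaries means supporting another format is a one-line change per map, and the old CSV-specific entry points can remain as thin aliases.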
Related
Part of GSoC 2026 DataLoom proposal.