Pull Request Overview
This PR adds support for additional file formats (.txt and .md) to the upload functionality and implements robust encoding detection for file processing. The changes enhance the system's ability to handle text files with various character encodings.
- Added support for .txt and .md file formats in the frontend
- Implemented encoding detection using chardet with multiple fallback encodings
- Refactored file processing to use encoding detection instead of hardcoded UTF-8
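For readers unfamiliar with the pattern, the detect-then-fall-back decoding described above could look roughly like the sketch below. It is not the PR's actual `decode_bytes_with_detection`. The function name `decode_with_fallbacks`, the `detected` parameter, and the fallback list are assumptions made for illustration, and the detector (chardet in this PR) is left out so the example depends only on the standard library.

```python
from typing import Optional, Sequence

# Assumed fallback order; the PR's real list is not shown in this review.
# "utf-8-sig" decodes UTF-8 with or without a byte-order mark.
FALLBACK_ENCODINGS: Sequence[str] = ("utf-8-sig", "cp1252", "latin-1")


def decode_with_fallbacks(
    data: bytes,
    detected: Optional[str] = None,
    fallbacks: Sequence[str] = FALLBACK_ENCODINGS,
) -> str:
    """Try the detected encoding first, then each fallback in order."""
    candidates = ([detected] if detected else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers encoding names Python does not know.
            continue
    # Only reached if a caller passes fallbacks that can all fail;
    # "latin-1" in the default list accepts any byte sequence.
    return data.decode("utf-8", errors="replace")
```

A caller would pass whatever the detector reports as `detected` and leave it as `None` when detection is unavailable.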
Reviewed Changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| frontend/index.html | Updated UI text to show support for .txt, .md, and .json files |
| backend.py | Added encoding detection functions and updated file processing logic to handle various encodings |
backend.py
Outdated
| else: | ||
| # Unknown extension: attempt text decode | ||
| text = decode_bytes_with_detection(content_bytes) | ||
| corpus_data.append({ | ||
| "title": file.filename, | ||
| "text": text | ||
| }) |
Files with unknown extensions are automatically treated as text files. Consider adding validation to reject unsupported file types or log a warning when processing unknown extensions to avoid processing binary files as text.
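One possible shape for that validation is sketched below. The names `ALLOWED_EXTENSIONS` and `is_supported_upload` are hypothetical and do not come from the PR; the allowed set simply mirrors the formats the frontend advertises.

```python
import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed to mirror the formats listed in frontend/index.html.
ALLOWED_EXTENSIONS = {".txt", ".md", ".json"}


def is_supported_upload(filename: Optional[str]) -> bool:
    """Return True for allowed extensions; log a warning and return False otherwise."""
    ext = os.path.splitext((filename or "").lower())[1]
    if ext in ALLOWED_EXTENSIONS:
        return True
    logger.warning("Skipping unsupported upload %r (extension %r)", filename, ext)
    return False
```

The upload loop could call this before decoding and skip the file when it returns `False`, so binary content never reaches the text decoder.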
backend.py
Outdated
| content = f.read() | ||
| # Process file content using encoding detection | ||
| filename_lower = (file.filename or "").lower() | ||
| if filename_lower.endswith('.txt'): |
The code handles .txt files, but the frontend now also advertises support for .md files. Handle .md in the file processing logic as well so the backend matches the frontend.
| if filename_lower.endswith('.txt'): | |
| if filename_lower.endswith('.txt') or filename_lower.endswith('.md'): |
Pull Request Overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| def _detect_encoding_from_bytes(data: bytes) -> Optional[str]: | ||
| """Detect encoding using chardet if available; return lower-cased encoding name or None.""" | ||
| try: | ||
| import chardet # type: ignore |
Move the chardet import to the top of the file with the other imports instead of importing it inside the function. Python caches modules after the first import, so the repeated-import cost is small, but a module-level import (still guarded by try/except if chardet is meant to be optional) makes the dependency more visible.
| if filename_lower.endswith(('.txt', '.md')): | ||
| text = decode_bytes_with_detection(content_bytes) | ||
| corpus_data.append({ | ||
| "title": file.filename, | ||
| "text": content | ||
| "text": text | ||
| }) | ||
| elif file.filename.endswith('.json'): | ||
| processed_count += 1 | ||
| elif filename_lower.endswith('.json'): |
The file extension check is duplicated: the code first tests ext not in allowed_extensions and then branches on filename_lower.endswith(). Consider one consistent approach throughout, such as always using the extracted ext variable.
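One way to rely on a single extracted extension is a lookup table, sketched here. `HANDLER_BY_EXTENSION` and `handler_for` are hypothetical names, and the string handler kinds stand in for whatever processing functions backend.py actually uses.

```python
import os
from typing import Optional

# Hypothetical mapping from extension to handler kind.
HANDLER_BY_EXTENSION = {
    ".txt": "text",
    ".md": "text",
    ".json": "json",
}


def handler_for(filename: Optional[str]) -> Optional[str]:
    """Return the handler kind for a filename, or None if the extension is unsupported."""
    ext = os.path.splitext((filename or "").lower())[1]
    return HANDLER_BY_EXTENSION.get(ext)
```

Because the allowed set and the branching both come from the same table, adding a format means adding one entry rather than touching two checks.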
| else: | ||
| corpus_data.append(data_obj) | ||
| processed_count += 1 | ||
| except Exception: |
Consider catching more specific exceptions (e.g., json.JSONDecodeError, UnicodeDecodeError) instead of the broad Exception to provide better error handling and debugging information.
| except Exception: | |
| except (json.JSONDecodeError, UnicodeDecodeError): |
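For context, narrower handling around the JSON branch might look like the sketch below. `parse_json_upload` is a hypothetical helper, and decoding as strict UTF-8 is an assumption made to keep the example self-contained; the PR itself decodes with detection.

```python
import json
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def parse_json_upload(content_bytes: bytes, filename: str) -> Optional[Any]:
    """Parse uploaded bytes as JSON; log the specific failure and return None on error."""
    try:
        return json.loads(content_bytes.decode("utf-8"))
    except UnicodeDecodeError as exc:
        logger.warning("Could not decode %s as UTF-8: %s", filename, exc)
    except json.JSONDecodeError as exc:
        # JSONDecodeError carries the position of the syntax error.
        logger.warning(
            "Invalid JSON in %s at line %d, column %d", filename, exc.lineno, exc.colno
        )
    return None
```

Unexpected errors (for example a bug in later processing) still propagate instead of being silently swallowed, which is the main benefit over a bare `except Exception`.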