Upload encode detect #101

Merged
siyuan-youtu merged 3 commits into main from upload_encode_detect
Oct 10, 2025
Conversation

@siyuan-youtu
Contributor

No description provided.

@siyuan-youtu siyuan-youtu requested a review from Copilot October 10, 2025 09:13
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds support for additional file formats (.txt and .md) to the upload functionality and implements robust encoding detection for file processing. The changes enhance the system's ability to handle text files with various character encodings.

  • Added support for .txt and .md file formats in the frontend
  • Implemented encoding detection using chardet with multiple fallback encodings
  • Refactored file processing to use encoding detection instead of hardcoded UTF-8
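
The body of `decode_bytes_with_detection` is not shown in this conversation, so the following is only a minimal sketch of the "detected guess first, then fallbacks" idea the overview describes. The function name `decode_with_fallbacks`, the `guess` parameter, and the fallback order are assumptions made for illustration, not the PR's actual code.

```python
from typing import Iterable, Optional

# Hypothetical fallback order; the PR does not list the encodings it tries.
FALLBACK_ENCODINGS = ("utf-8-sig", "utf-8", "gb18030", "latin-1")


def decode_with_fallbacks(
    data: bytes,
    guess: Optional[str] = None,
    fallbacks: Iterable[str] = FALLBACK_ENCODINGS,
) -> str:
    """Try the detector's guess first, then each fallback encoding in order."""
    candidates = ([guess] if guess else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers encoding names Python does not recognise.
            continue
    # Last resort: never raise, substitute undecodable bytes instead.
    return data.decode("utf-8", errors="replace")
```

In a setup like the PR's, `guess` would presumably come from chardet. Note that `latin-1` accepts any byte sequence, so placing it last means the `errors="replace"` branch is only reached when a caller supplies a stricter fallback list.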

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| frontend/index.html | Updated UI text to show support for .txt, .md, and .json files |
| backend.py | Added encoding detection functions and updated file processing logic to handle various encodings |

backend.py Outdated
Comment on lines +316 to +322
else:
    # Unknown extension: attempt text decode
    text = decode_bytes_with_detection(content_bytes)
    corpus_data.append({
        "title": file.filename,
        "text": text
    })

Copilot AI Oct 10, 2025

Files with unknown extensions are automatically treated as text files. Consider adding validation to reject unsupported file types or log a warning when processing unknown extensions to avoid processing binary files as text.
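
One way to act on this suggestion is an explicit allow-list check before any decoding. This is a rough sketch only: `ALLOWED_EXTENSIONS`, `is_supported_upload`, and `looks_binary` are hypothetical names, and the allow-list simply mirrors the formats the frontend advertises.

```python
import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed allow-list, mirroring the formats the frontend advertises.
ALLOWED_EXTENSIONS = {".txt", ".md", ".json"}


def is_supported_upload(filename: Optional[str]) -> bool:
    """Return True only for allow-listed extensions; log a warning otherwise."""
    ext = os.path.splitext((filename or "").lower())[1]
    if ext in ALLOWED_EXTENSIONS:
        return True
    logger.warning("Skipping %r: unsupported extension %r", filename, ext)
    return False


def looks_binary(data: bytes, sample_size: int = 1024) -> bool:
    """Crude heuristic: a NUL byte in the first bytes suggests binary content."""
    return b"\x00" in data[:sample_size]
```

The NUL-byte check is a heuristic, not a guarantee: UTF-16 text also contains NUL bytes, so it would need to be applied after, or alongside, encoding detection.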

backend.py Outdated
content = f.read()
# Process file content using encoding detection
filename_lower = (file.filename or "").lower()
if filename_lower.endswith('.txt'):

Copilot AI Oct 10, 2025

The code checks for .txt files but the frontend now advertises support for .md files. Add handling for .md files in the file processing logic to match the frontend capabilities.

Suggested change
- if filename_lower.endswith('.txt'):
+ if filename_lower.endswith('.txt') or filename_lower.endswith('.md'):

@siyuan-youtu siyuan-youtu requested a review from Copilot October 10, 2025 09:33
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.


def _detect_encoding_from_bytes(data: bytes) -> Optional[str]:
    """Detect encoding using chardet if available; return lower-cased encoding name or None."""
    try:
        import chardet  # type: ignore

Copilot AI Oct 10, 2025

The chardet import should be moved to the top of the file with other imports rather than being imported inside the function. This avoids repeated import overhead and makes dependencies more visible.
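
If chardet is meant to stay optional, as the docstring's "if available" implies, the import can still move to module level behind a guard. The sketch below assumes that intent; `detect_encoding` is a stand-in name, and only `chardet.detect` is a real library call.

```python
from typing import Optional

try:
    import chardet  # optional third-party dependency
except ImportError:
    chardet = None  # detection is skipped when chardet is not installed


def detect_encoding(data: bytes) -> Optional[str]:
    """Return chardet's lower-cased guess, or None if unavailable or unsure."""
    if chardet is None or not data:
        return None
    encoding = chardet.detect(data).get("encoding")
    return encoding.lower() if encoding else None
```

The guard runs once at import time, so the dependency is visible at the top of the file and the function body no longer needs its own `try`/`import`.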

Comment on lines +307 to +314
  if filename_lower.endswith(('.txt', '.md')):
      text = decode_bytes_with_detection(content_bytes)
      corpus_data.append({
          "title": file.filename,
-         "text": content
+         "text": text
      })
- elif file.filename.endswith('.json'):
+     processed_count += 1
+ elif filename_lower.endswith('.json'):

Copilot AI Oct 10, 2025

The file extension checking logic is duplicated - first checking ext not in allowed_extensions then using filename_lower.endswith(). Consider using a consistent approach throughout, such as always using the extracted ext variable.
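
A possible way to keep the extension logic in one place is to extract the extension once and look up a handler by it, so the allow-list and the processing branches cannot drift apart. This is a sketch under assumptions: `HANDLERS`, `process_upload`, and the two helper functions are hypothetical, and the JSON handling (a list is used as-is, a single object is wrapped) is a guess at the PR's behavior.

```python
import json
import os
from typing import Callable, Dict, List


def _text_to_docs(title: str, text: str) -> List[dict]:
    """Wrap plain text in the title/text record shape used by the corpus."""
    return [{"title": title, "text": text}]


def _json_to_docs(title: str, text: str) -> List[dict]:
    """Assumed behavior: a JSON list is used as-is, a single object is wrapped."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


# One table drives both the allow-list check and the processing branch.
HANDLERS: Dict[str, Callable[[str, str], List[dict]]] = {
    ".txt": _text_to_docs,
    ".md": _text_to_docs,
    ".json": _json_to_docs,
}


def process_upload(filename: str, text: str) -> List[dict]:
    """Dispatch on the extracted extension; reject anything not in HANDLERS."""
    ext = os.path.splitext(filename.lower())[1]
    handler = HANDLERS.get(ext)
    if handler is None:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    return handler(filename, text)
```

Adding a new format then means adding one entry to `HANDLERS` rather than touching both a membership check and an `endswith` chain.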

else:
corpus_data.append(data_obj)
processed_count += 1
except Exception:

Copilot AI Oct 10, 2025

Consider catching more specific exceptions (e.g., json.JSONDecodeError, UnicodeDecodeError) instead of the broad Exception to provide better error handling and debugging information.

Suggested change
- except Exception:
+ except (json.JSONDecodeError, UnicodeDecodeError):
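
To illustrate what narrower handlers buy, here is a small sketch that reports each failure mode separately. The function `load_json_docs` and its return convention (an empty list on bad input) are assumptions for illustration; the surrounding `try` block in backend.py is not shown in this conversation.

```python
import json
import logging
from typing import List

logger = logging.getLogger(__name__)


def load_json_docs(raw: bytes, filename: str) -> List[dict]:
    """Parse uploaded JSON bytes; return [] and log the cause on bad input."""
    try:
        data = json.loads(raw.decode("utf-8"))
    except UnicodeDecodeError as exc:
        logger.warning("%s: not valid UTF-8 (%s)", filename, exc)
        return []
    except json.JSONDecodeError as exc:
        logger.warning(
            "%s: invalid JSON at line %d, column %d", filename, exc.lineno, exc.colno
        )
        return []
    return data if isinstance(data, list) else [data]
```

Both exception types subclass `ValueError`, so unrelated failures such as `KeyError` or `MemoryError` still propagate instead of being silently swallowed.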

@siyuan-youtu siyuan-youtu merged commit 69ffde7 into main Oct 10, 2025
1 check passed
@siyuan-youtu siyuan-youtu deleted the upload_encode_detect branch October 10, 2025 11:16
2 participants