Upload encode detect by siyuan-youtu · Pull Request #101 · TencentCloudADP/youtu-graphrag

siyuan-youtu · 2025-10-10T09:11:44Z

No description provided.

Copilot

Pull Request Overview

This PR adds support for additional file formats (.txt and .md) to the upload functionality and implements robust encoding detection for file processing. The changes enhance the system's ability to handle text files with various character encodings.

Added support for .txt and .md file formats in the frontend
Implemented encoding detection using chardet with multiple fallback encodings
Refactored file processing to use encoding detection instead of hardcoded UTF-8
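
The "detected encoding first, then fallbacks" approach summarized above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `decode_with_fallbacks`, the `detected` parameter, and the fallback list are all assumptions, since the page does not show the PR's actual fallback order.

```python
from typing import Optional, Sequence

# Hypothetical fallback order; the PR's real list is not shown on this page.
# "utf-8-sig" also decodes plain UTF-8 and strips a leading BOM if present.
FALLBACK_ENCODINGS: Sequence[str] = ("utf-8-sig", "gb18030", "latin-1")


def decode_with_fallbacks(data: bytes, detected: Optional[str] = None) -> str:
    """Decode bytes, trying a detected encoding first, then fixed fallbacks."""
    candidates = [detected] if detected else []
    candidates.extend(FALLBACK_ENCODINGS)
    for encoding in candidates:
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Safety net in case the fallback list is changed to drop latin-1.
    return data.decode("utf-8", errors="replace")
```

The `detected` argument would typically come from a detector such as chardet. Catching `LookupError` matters because a detector can return an encoding name that Python has no codec for.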

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.

File	Description
frontend/index.html	Updated UI text to show support for .txt, .md, and .json files
backend.py	Added encoding detection functions and updated file processing logic to handle various encodings


Copilot · 2025-10-10T09:14:23Z

backend.py

+            else:
+                # Unknown extension: attempt text decode
+                text = decode_bytes_with_detection(content_bytes)
+                corpus_data.append({
+                    "title": file.filename,
+                    "text": text
+                })


Files with unknown extensions are automatically treated as text files. Consider adding validation to reject unsupported file types or log a warning when processing unknown extensions to avoid processing binary files as text.
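
One way to act on this comment is to check the extension up front and log a warning for anything unsupported. The sketch below is illustrative only: `is_supported_upload` and `ALLOWED_EXTENSIONS` are hypothetical names, and the set of extensions is assumed from the frontend text mentioned in the overview.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed from the frontend text (.txt, .md, .json); not taken from backend.py.
ALLOWED_EXTENSIONS = {".txt", ".md", ".json"}


def is_supported_upload(filename: Optional[str]) -> bool:
    """Return True for supported extensions; log a warning otherwise."""
    ext = Path(filename or "").suffix.lower()
    if ext in ALLOWED_EXTENSIONS:
        return True
    logger.warning("Skipping unsupported upload %r (extension %r)", filename, ext)
    return False
```

A caller could skip the file or return an HTTP error when this returns False, rather than decoding arbitrary bytes as text.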

Copilot · 2025-10-10T09:14:24Z

backend.py

-                    content = f.read()
+            # Process file content using encoding detection
+            filename_lower = (file.filename or "").lower()
+            if filename_lower.endswith('.txt'):


The code checks for .txt files but the frontend now advertises support for .md files. Add handling for .md files in the file processing logic to match the frontend capabilities.

Suggested change

-            if filename_lower.endswith('.txt'):
+            if filename_lower.endswith('.txt') or filename_lower.endswith('.md'):

Copilot

Pull Request Overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.


Copilot · 2025-10-10T09:35:05Z

backend.py

+def _detect_encoding_from_bytes(data: bytes) -> Optional[str]:
+    """Detect encoding using chardet if available; return lower-cased encoding name or None."""
+    try:
+        import chardet  # type: ignore


The chardet import should be moved to the top of the file with other imports rather than being imported inside the function. This avoids repeated import overhead and makes dependencies more visible.
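
A module-level import that still tolerates a missing chardet might look like the sketch below. The function name here is illustrative and the body is an assumption about how such a helper could work, not the PR's implementation. Note that Python caches imported modules in `sys.modules`, so the cost of a repeated function-level import is small; the stronger argument for the change is dependency visibility.

```python
from typing import Optional

try:
    import chardet  # third-party; imported once at module load
except ImportError:  # keep the module importable when chardet is absent
    chardet = None  # type: ignore[assignment]


def detect_encoding_from_bytes(data: bytes) -> Optional[str]:
    """Return chardet's lower-cased encoding guess, or None if unavailable."""
    if chardet is None or not data:
        return None
    encoding = chardet.detect(data).get("encoding")
    return encoding.lower() if encoding else None
```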

Copilot · 2025-10-10T09:35:05Z

backend.py

+            if filename_lower.endswith(('.txt', '.md')):
+                text = decode_bytes_with_detection(content_bytes)
                corpus_data.append({
                    "title": file.filename,
-                    "text": content
+                    "text": text
                })
-            elif file.filename.endswith('.json'):
+                processed_count += 1
+            elif filename_lower.endswith('.json'):


The file extension check is duplicated: the code first tests ext not in allowed_extensions and then branches with filename_lower.endswith(). Consider using one approach throughout, such as always using the extracted ext variable.
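
A sketch of what "always using the extracted ext variable" could look like: extract the extension once and use it for both validation and dispatch. The names `classify_upload` and `HANDLER_BY_EXTENSION` are hypothetical, and the mapping is assumed from the extensions mentioned in this PR.

```python
import os
from typing import Optional

# Hypothetical mapping from extension to processing branch.
HANDLER_BY_EXTENSION = {
    ".txt": "text",
    ".md": "text",
    ".json": "json",
}


def classify_upload(filename: Optional[str]) -> str:
    """Extract the extension once, then validate and dispatch on it."""
    ext = os.path.splitext((filename or "").lower())[1]
    # A single lookup replaces both the allow-list check and the endswith() chain.
    return HANDLER_BY_EXTENSION.get(ext, "unsupported")
```

With a single table, adding a new file type means touching one place instead of keeping an allow-list and a chain of `endswith()` checks in sync.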

Copilot · 2025-10-10T09:35:05Z

backend.py

+                    else:
+                        corpus_data.append(data_obj)
+                    processed_count += 1
+                except Exception:


Consider catching more specific exceptions (e.g., json.JSONDecodeError, UnicodeDecodeError) instead of the broad Exception to provide better error handling and debugging information.

Suggested change

-                except Exception:
+                except (json.JSONDecodeError, UnicodeDecodeError):
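
To illustrate why narrower exceptions help with debugging, the sketch below handles the two failure modes separately and logs what went wrong. It is a standalone example under assumptions: `parse_json_upload` is a hypothetical helper, it decodes as strict UTF-8 for simplicity, and the list-versus-object handling is inferred from the diff hunk above.

```python
import json
import logging
from typing import Any, List

logger = logging.getLogger(__name__)


def parse_json_upload(content_bytes: bytes, filename: str) -> List[Any]:
    """Parse an uploaded JSON file into a list of records; [] on failure."""
    try:
        data_obj = json.loads(content_bytes.decode("utf-8"))
    except UnicodeDecodeError as exc:
        logger.warning("%s: not valid UTF-8 at byte %d", filename, exc.start)
        return []
    except json.JSONDecodeError as exc:
        logger.warning("%s: invalid JSON at line %d, column %d",
                       filename, exc.lineno, exc.colno)
        return []
    # A top-level list contributes all its items; a single object contributes one.
    return data_obj if isinstance(data_obj, list) else [data_obj]
```

Any other exception type now propagates instead of being silently swallowed, which makes genuine bugs easier to spot.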

siyuan-youtu added 2 commits October 10, 2025 16:50

feat:update file encode detect

cee335e

feat:add support file types

93d5b1a

siyuan-youtu requested a review from Copilot October 10, 2025 09:13

Copilot AI reviewed Oct 10, 2025


feat:fix file type support

ba571f3

siyuan-youtu requested a review from Copilot October 10, 2025 09:33

Copilot AI reviewed Oct 10, 2025


siyuan-youtu merged commit 69ffde7 into main Oct 10, 2025
1 check passed

siyuan-youtu deleted the upload_encode_detect branch October 10, 2025 11:16

Upload encode detect #101
siyuan-youtu merged 3 commits into main from upload_encode_detect
