[Bug]: Chinese font setup in MineruParser.parse_text_file never works #24

@Glinte

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • I believe this is a legitimate bug, not just a question or feature request.

Describe the bug

...because the code initializes UnicodeCIDFont with font names it does not support, and no one noticed the failure because every exception is suppressed. You used ["SimSun", "SimHei", "Microsoft YaHei"] on Windows and ["STSong-Light", "STHeiti"] on macOS, but UnicodeCIDFont only accepts one of the 6 hardcoded face names listed in defaultUnicodeEncodings. Of the names tried, only STSong-Light is in that table, so none of the Windows candidates can ever be registered. From a quick read of the UnicodeCIDFont class docstring, I believe CIDFont should be used instead.

# Try to register a font that supports Chinese characters
try:
    # Try to use system fonts that support Chinese
    import platform
    system = platform.system()
    if system == "Windows":
        # Try common Windows fonts
        for font_name in ["SimSun", "SimHei", "Microsoft YaHei"]:
            try:
                from reportlab.pdfbase.cidfonts import (
                    UnicodeCIDFont,
                )
                pdfmetrics.registerFont(UnicodeCIDFont(font_name))
                normal_style.fontName = font_name
                heading_style.fontName = font_name
                break
            except Exception:
                continue
    elif system == "Darwin":  # macOS
        for font_name in ["STSong-Light", "STHeiti"]:
            try:
                from reportlab.pdfbase.cidfonts import (
                    UnicodeCIDFont,
                )
                pdfmetrics.registerFont(UnicodeCIDFont(font_name))
                normal_style.fontName = font_name
                heading_style.fontName = font_name
                break
            except Exception:
                continue
except Exception:
    pass  # Use default fonts if Chinese font setup fails
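
The loop above discards every exception, so a rejected font name leaves no trace. Below is a minimal sketch of how the failures could be surfaced instead. The names try_fonts and fake_register are hypothetical and not part of the project or of reportlab; fake_register only imitates the KeyError that UnicodeCIDFont raises for an unknown face.

```python
import logging

logger = logging.getLogger(__name__)


def try_fonts(candidates, register):
    """Call register(name) for each candidate until one succeeds.

    Returns (chosen_name_or_None, failures), where failures is a list of
    (name, exception) pairs so the caller can see why a font was skipped.
    """
    failures = []
    for name in candidates:
        try:
            register(name)
        except KeyError as exc:  # raised by UnicodeCIDFont for unknown faces
            logger.warning("Font %r rejected: %s", name, exc)
            failures.append((name, exc))
            continue
        return name, failures
    return None, failures


def fake_register(name):
    """Hypothetical stand-in for the UnicodeCIDFont lookup (illustration only)."""
    if name != "STSong-Light":
        raise KeyError("don't know anything about CID font %s" % name)


chosen, failures = try_fonts(["SimSun", "SimHei", "Microsoft YaHei"], fake_register)
```

Catching only KeyError, rather than Exception, keeps unrelated errors visible, and the returned failures list lets the caller decide whether falling back to the default font is acceptable.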

https://github.com/eduardocereto/reportlab/blob/98758940eeae30db80bbc9c555e42b8c89b86be8/src/reportlab/pdfbase/cidfonts.py#L390-L395

class UnicodeCIDFont(CIDFont):
    def __init__(self, face, isVertical=False, isHalfWidth=False):
        #pass
        try:
            lang, defaultEncoding = defaultUnicodeEncodings[face]
        except KeyError:
            raise KeyError("don't know anything about CID font %s" % face)

https://github.com/eduardocereto/reportlab/blob/98758940eeae30db80bbc9c555e42b8c89b86be8/src/reportlab/pdfbase/_cidfontdata.py#L130-L141

defaultUnicodeEncodings = {
    #we ddefine a default Unicode encoding for each face name;
    #this should be the most commonly used horizontal unicode encoding;
    #also define a 3-letter language code.
    'HeiseiMin-W3': ('jpn','UniJIS-UCS2-H'),
    'HeiseiKakuGo-W5': ('jpn','UniJIS-UCS2-H'),
    'STSong-Light': ('chs', 'UniGB-UCS2-H'),
    'MSung-Light': ('cht', 'UniGB-UCS2-H'),
    #'MHei-Medium': ('cht', 'UniGB-UCS2-H'),
    'HYSMyeongJo-Medium': ('kor', 'UniKS-UCS2-H'),
    'HYGothic-Medium': ('kor','UniKS-UCS2-H'),
    }
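
Given that table, one possible guard is to check candidate names against the known faces before registering anything. This is a sketch under assumptions, not the project's fix: SUPPORTED_CID_FACES, pick_cid_face and register_cid_font are hypothetical names, and the face list is copied from the revision of defaultUnicodeEncodings quoted above, so it may differ in other reportlab versions.

```python
import logging

logger = logging.getLogger(__name__)

# Face names copied from the defaultUnicodeEncodings table quoted above.
SUPPORTED_CID_FACES = (
    "HeiseiMin-W3",
    "HeiseiKakuGo-W5",
    "STSong-Light",
    "MSung-Light",
    "HYSMyeongJo-Medium",
    "HYGothic-Medium",
)


def pick_cid_face(candidates, supported=SUPPORTED_CID_FACES):
    """Return the first candidate that is a known CID face, else None."""
    for name in candidates:
        if name in supported:
            return name
        logger.warning("Skipping unsupported CID font %r", name)
    return None


def register_cid_font(candidates=("STSong-Light",)):
    """Register the first supported face with reportlab and return its name.

    Returns None if no candidate is supported or reportlab is not installed.
    """
    face = pick_cid_face(candidates)
    if face is None:
        return None
    try:
        from reportlab.pdfbase import pdfmetrics
        from reportlab.pdfbase.cidfonts import UnicodeCIDFont
    except ImportError:
        logger.warning("reportlab is not installed; keeping default fonts")
        return None
    pdfmetrics.registerFont(UnicodeCIDFont(face))
    return face
```

With this check, the Windows candidates ["SimSun", "SimHei", "Microsoft YaHei"] are rejected with a logged warning instead of failing silently, while ["STSong-Light", "STHeiti"] resolves to STSong-Light.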
