Preprocessing of header blocks #45

@jojennin

Thanks for sharing and maintaining the repository. I find the code very readable and great for extracting text from HTML.

It seems that the current version (3.0.0) does not process header blocks as described in the documentation:

The preprocessing looks for short header blocks which precede good blocks and at the same time there is no more than MAX_HEADING_DISTANCE characters between the header block and the good block.

The code for this appears to be implemented on line 314 of core.py. It searches for good blocks that come after the short header block by checking their class_type attribute. At that point in the code, however, the class_type of the subsequent paragraphs has only been initialized and not yet copied from cf_class, so the search never finds a good block within max_heading_distance characters.
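To make the suspected mechanism concrete, here is a minimal sketch of it. It is not the code from core.py: the `Paragraph` class, the function name `promote_headings_lazy_copy`, the distance value of 200, and the assumption that `cf_class` is copied only as each paragraph is visited are all hypothetical simplifications based on my reading.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    # Hypothetical stand-in for the library's paragraph object.
    text: str
    heading: bool
    cf_class: str          # context-free class, already computed
    class_type: str = ""   # initialized only; filled in during revision


def promote_headings_lazy_copy(paragraphs, max_heading_distance=200):
    """Heading pass where cf_class is copied only as each paragraph is visited."""
    for i, paragraph in enumerate(paragraphs):
        paragraph.class_type = paragraph.cf_class  # assumed: copy happens here
        if not (paragraph.heading and paragraph.class_type == "short"):
            continue
        distance = 0
        for following in paragraphs[i + 1:]:
            if distance > max_heading_distance:
                break
            # `following` has not been visited yet, so class_type is still "".
            if following.class_type == "good":
                paragraph.class_type = "neargood"
                break
            distance += len(following.text)
    return [p.class_type for p in paragraphs]


paragraphs = [
    Paragraph("Section title", heading=True, cf_class="short"),
    Paragraph("Byline", heading=False, cf_class="short"),
    Paragraph("A long paragraph of body text. " * 10, heading=False, cf_class="good"),
]
result = promote_headings_lazy_copy(paragraphs)
print(result)  # the heading stays "short" instead of becoming "neargood"
```

In this model the heading is never promoted, because the look-ahead compares an empty `class_type` against "good".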

Here is a simple dummy HTML file that I created to show how a short block that sits between a short header and a good block is classified as "bad". Had the header block been labeled "neargood" in the preprocessing step, the short block would have been labeled "good".

Is the current behavior, which differs from the header preprocessing described above, intended? If so, I think the documentation should be updated to reflect the change. Otherwise, a simple loop over the paragraphs that copies cf_class to class_type before the header search (as was done in version 2.2.0) should suffice to fix the issue.
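A rough sketch of the fix I have in mind, using simplified stand-in objects rather than the library's own classes. The function name `revise_headings`, the `SimpleNamespace` paragraphs, and the distance value of 200 are assumptions for illustration only.

```python
from types import SimpleNamespace


def revise_headings(paragraphs, max_heading_distance=200):
    # Copy the context-free classes up front so the look-ahead can see them.
    for paragraph in paragraphs:
        paragraph.class_type = paragraph.cf_class

    for i, paragraph in enumerate(paragraphs):
        if not (paragraph.heading and paragraph.class_type == "short"):
            continue
        distance = 0
        for following in paragraphs[i + 1:]:
            if distance > max_heading_distance:
                break
            if following.class_type == "good":
                paragraph.class_type = "neargood"
                break
            distance += len(following.text)
    return [p.class_type for p in paragraphs]


# Hypothetical paragraphs: a short heading, a short block, then a good block.
sample = [
    SimpleNamespace(text="Section title", heading=True, cf_class="short", class_type=""),
    SimpleNamespace(text="Byline", heading=False, cf_class="short", class_type=""),
    SimpleNamespace(text="Body text. " * 30, heading=False, cf_class="good", class_type=""),
]
fixed = revise_headings(sample)
print(fixed)
```

With the copy done first, the heading in this toy input is relabeled "neargood", which is the precondition for the short block after it to be revised to "good" later on.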
