Preprocessing of header blocks

Thanks for sharing and maintaining the repository. I find the code very readable and great for extracting text from HTML.

It seems that the current version (3.0.0) does not process header blocks as described in the documentation:

> The preprocessing looks for short header blocks which precede good blocks and at the same time there is no more than MAX_HEADING_DISTANCE characters between the header block and the good block.

The code for this appears to be implemented on [line 314 of core.py](https://github.com/miso-belica/jusText/blob/main/justext/core.py#L314) and works by searching for good blocks that come after the short header block, checking each paragraph's `class_type` attribute. However, at this point in the code, the `class_type` of the subsequent paragraphs has only been initialized and not yet copied from `cf_class`. As a result, the search within `max_heading_distance` characters never finds a good block.

[Here is a simple dummy HTML file](https://gist.github.com/jojennin/ccbbae383b7cdb99c0284203d5d83bd0) that I created as an example: a short block that sits between a short header and a good block is classified as "bad". However, had the header block been labeled "neargood" in the preprocessing step, the short block would have been labeled "good".

Is the current behavior, which differs from the preprocessing described above, intended? If so, I think the documentation should be updated to reflect the change. Otherwise, a simple loop over the paragraphs that copies `cf_class` to `class_type` before the header search (as was done in version 2.2.0) should suffice to fix the issue.
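
To make the ordering problem concrete, here is a minimal sketch of the behavior as I understand it. It is a simplified model, not the actual jusText code: `Para`, `promote_headings`, the `copy_first` flag, and the empty-string initial value of `class_type` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Para:
    """Hypothetical stand-in for a jusText paragraph (only the fields used here)."""
    text: str
    cf_class: str          # context-free class
    heading: bool = False
    class_type: str = ""   # assumed initial value before revision


def promote_headings(paragraphs, max_heading_distance=200, copy_first=True):
    """Relabel short headings that precede a good block as 'neargood'.

    copy_first=True  copies cf_class to class_type for ALL paragraphs up front.
    copy_first=False copies it only when the loop reaches each paragraph,
    which models the ordering problem described in this issue.
    """
    if copy_first:
        for p in paragraphs:
            p.class_type = p.cf_class

    for i, p in enumerate(paragraphs):
        if not copy_first:
            p.class_type = p.cf_class
        if not (p.heading and p.class_type == "short"):
            continue
        distance = 0
        for following in paragraphs[i + 1:]:
            if distance > max_heading_distance:
                break
            # Later paragraphs must already carry their class here.
            if following.class_type == "good":
                p.class_type = "neargood"
                break
            distance += len(following.text)
    return paragraphs


def sample():
    """A short heading, a short block, then a good block."""
    return [
        Para("A heading", "short", heading=True),
        Para("Short note.", "short"),
        Para("A long paragraph of body text. " * 10, "good"),
    ]


if __name__ == "__main__":
    print(promote_headings(sample(), copy_first=True)[0].class_type)
    print(promote_headings(sample(), copy_first=False)[0].class_type)
```

With the up-front copy, the heading is promoted to "neargood". Without it, the heading stays "short" because the paragraphs after it still have an empty `class_type` when the search runs.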

Preprocessing of header blocks #45
