Content extraction issues #5

nzakas · 2025-09-12T20:00:04Z

nzakas
Sep 12, 2025
Maintainer

🚧 Content extraction is very much a work in progress! 🚧

One of the most challenging parts of building Bredbox is correctly identifying articles and extracting their HTML. For pages that correctly implement things like microdata and Open Graph, it's fairly straightforward to find the article content and extract it into a readable form. However, those pages are in the minority, so extraction on everything else is very hit-and-miss.
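To make the "well-marked page" case concrete, here is a minimal sketch of how a page could be checked for article hints. It is not Bredbox's actual code: the names `ArticleSignalParser`, `looks_like_article`, and `ARTICLE_TYPES` are made up for illustration, and the set of schema.org types it accepts is an assumption. It looks only at the Open Graph `og:type` meta tag and at microdata `itemtype` attributes.

```python
from html.parser import HTMLParser

# Assumption: schema.org types treated as "article-like" for this sketch.
ARTICLE_TYPES = {"Article", "NewsArticle", "BlogPosting"}


class ArticleSignalParser(HTMLParser):
    """Collects Open Graph and microdata hints that a page is an article."""

    def __init__(self):
        super().__init__()
        self.og_type = None   # content of <meta property="og:type">
        self.itemtypes = []   # every URL found in an itemtype attribute

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "og:type":
            self.og_type = attributes.get("content")
        # itemtype may hold several space-separated URLs.
        for url in (attributes.get("itemtype") or "").split():
            self.itemtypes.append(url)


def looks_like_article(html):
    """Return True if the markup declares itself an article."""
    parser = ArticleSignalParser()
    parser.feed(html)
    parser.close()
    if parser.og_type == "article":
        return True
    # Compare the last path segment, e.g. https://schema.org/NewsArticle.
    return any(url.rstrip("/").rsplit("/", 1)[-1] in ARTICLE_TYPES
               for url in parser.itemtypes)
```

A page that passes this check still needs its content extracted, and a page that fails it may well be an article anyway, which is exactly the hit-and-miss case described above.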

I'm currently using Mozilla Readability, but I've found that it needs a lot of help on sites that don't use microdata or Open Graph (or use them incorrectly). I'm working on a custom solution that will filter things down further before passing the result to Readability for a final extraction.
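As a rough illustration of the "filter first, then extract" idea, the sketch below drops page chrome before the markup would be handed to an extractor. It is only a guess at the shape of such a step, not the custom solution itself: `NOISY_TAGS`, `NoiseFilter`, and `strip_noise` are hypothetical names, the tag list is an assumption, and Readability is not called here since it is a JavaScript library.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: elements that rarely hold article text in this sketch.
NOISY_TAGS = {"nav", "aside", "footer", "header", "form", "script", "style"}


class NoiseFilter(HTMLParser):
    """Re-emits the HTML it is fed, minus any NOISY_TAGS subtrees."""

    def __init__(self):
        super().__init__()
        self.output = []
        self.skip_depth = 0  # > 0 while inside a noisy element

    def handle_starttag(self, tag, attrs):
        if tag in NOISY_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.output.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if tag not in NOISY_TAGS and self.skip_depth == 0:
            self.output.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in NOISY_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.output.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references arrive decoded, so escape them again.
        if self.skip_depth == 0:
            self.output.append(escape(data, quote=False))


def strip_noise(html):
    """Return html without noisy subtrees (comments and doctype are dropped)."""
    parser = NoiseFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.output)
```

The output of `strip_noise` would then be the input to the final extraction pass. A real pre-filter would likely need site-specific rules and scoring rather than a fixed tag list.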

What you can do: If you add a URL that is an article but Bredbox either doesn't extract the content or the extracted content has problems, please open an issue with the details. I'll use these reports to keep iterating on the content extraction functionality.

yumtwinkle · 2025-09-15T18:58:28Z

yumtwinkle
Sep 15, 2025

👌
