Content extraction issues #5
nzakas
announced in
Announcements
Replies: 1 comment
👌
🚧 Content extraction is very much a work in progress! 🚧
One of the most challenging parts of building Bredbox is correctly identifying articles and extracting their HTML. For pages that correctly implement things like microdata and Open Graph, it's fairly straightforward to find the article content and extract it into a readable form. However, those pages are in the minority, so extraction for the rest is very hit-and-miss.
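To make the "well-behaved page" case concrete, here is a minimal sketch of checking a page for Open Graph and microdata article hints. It is written in Python with only the standard library, purely for illustration; it is not Bredbox's code, and the names `ArticleSignalParser`, `article_signals`, and `ARTICLE_TYPES` are made up for this example. The list of schema.org types is an assumption about what counts as an article.

```python
from html.parser import HTMLParser

# Assumption: schema.org types treated as "this is an article".
ARTICLE_TYPES = {"Article", "NewsArticle", "BlogPosting"}


class ArticleSignalParser(HTMLParser):
    """Collects Open Graph and microdata hints that a page is an article."""

    def __init__(self):
        super().__init__()
        self.og = {}          # Open Graph properties, e.g. {"og:type": "article"}
        self.item_types = []  # microdata itemtype URLs found on the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if tag == "meta" and prop.startswith("og:"):
            self.og[prop] = attrs.get("content") or ""
        # itemtype may hold several space-separated URLs
        if attrs.get("itemtype"):
            self.item_types.extend(attrs["itemtype"].split())


def article_signals(html):
    """Return which metadata conventions mark this page as an article."""
    parser = ArticleSignalParser()
    parser.feed(html)
    parser.close()
    has_og = parser.og.get("og:type", "").strip().lower() == "article"
    has_microdata = any(
        t.rstrip("/").rsplit("/", 1)[-1] in ARTICLE_TYPES
        for t in parser.item_types
    )
    return {"open_graph": has_og, "microdata": has_microdata}
```

A page where both values come back `False` is the hit-and-miss case: nothing in the markup says where the article is, so the extractor has to guess.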
I'm currently using Mozilla Readability, but I've found that it needs a lot of help on sites that don't use microdata or Open Graph (or use them incorrectly). I'm working on a custom solution that will filter the page down further before passing the result to Readability for a final extraction.
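As a rough idea of what a pre-filter step could look like, the sketch below drops subtrees that are rarely article content before the HTML is handed to an extractor. This is an assumption about the general approach, not the custom solution described above: the `SKIP_TAGS` list, the `PreFilter` class, and the `prefilter` function are all hypothetical, and it is written in Python with the standard library only for illustration (Readability itself is a JavaScript library).

```python
from html import escape
from html.parser import HTMLParser

# Assumption: elements that usually hold navigation or chrome, not article text.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "form"}


class PreFilter(HTMLParser):
    """Re-emits HTML while dropping everything inside SKIP_TAGS elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag to emit.
        if tag not in SKIP_TAGS and not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives unescaped, so escape it again on the way out.
            self.out.append(escape(data, quote=False))


def prefilter(html):
    """Return html with the assumed-noise subtrees removed."""
    parser = PreFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Comments and the doctype are dropped as a side effect, which is fine for a sketch. A real filter would likely need site-specific rules on top of a tag list like this.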
What you can do: If you add a URL that is an article but Bredbox either doesn't extract the content or the extracted content has problems, please open an issue with the details. I'll use these reports to keep iterating on the content extraction functionality.