
Refactor to allow content_length limitation to webpreview generation #22

Open

inkhey wants to merge 2 commits into ludbek:master from algoo:master

Conversation


@inkhey inkhey commented Feb 28, 2022

Hello, and thanks for this nice library.
This proposal is a large refactoring to limit the resource usage of webpreview, which we need for the Tracim software:

  • Add an optional content_length limit to cap the amount of data loaded by a query.
  • Reduce webpreview to a single query per request (no longer one request per preview type).
  • No longer check the URL scheme with an extra query (this fixes "duplicate requests sometimes not necessary" #15).
  • Better separate the query code from the extraction code, making it easier for external programs to reuse the response to get extra information if necessary.
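The content_length idea in the first bullet can be sketched in a few lines. This is a hypothetical helper, not the PR's actual implementation: `read_limited` and its parameters are illustrative names, and `io.BytesIO` stands in for a streaming HTTP response body (any binary file-like object would do).

```python
import io

# Hypothetical sketch of a content_length cap: read at most
# `content_length` bytes from a response body in chunks, then stop,
# instead of loading the whole document into memory.
def read_limited(body, content_length, chunk_size=1024):
    chunks = []
    remaining = content_length
    while remaining > 0:
        chunk = body.read(min(chunk_size, remaining))
        if not chunk:  # body exhausted before the limit was reached
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Usage with an in-memory stand-in for a network response:
fake_body = io.BytesIO(b"<html>" + b"x" * 10_000)
data = read_limited(fake_body, content_length=100)
print(len(data))  # never exceeds the requested limit
```

With a real HTTP client the same loop would consume a streamed response chunk by chunk, so at most content_length bytes ever cross the wire into memory.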

@vduseev
Collaborator

vduseev commented May 25, 2022

Hi @inkhey,

Would you be open to the idea of retrieving the content only up to the content_limit_length instead of dropping it altogether? Below is some reasoning behind this idea.

  • BeautifulSoup is able to parse even broken, half-completed HTML content:

    In [6]: example = """
     ...: <!DOCTYPE html>
     ...: <html>
     ...:     <head>
     ...:         <meta charset="utf-8">
     ...:         <meta name="viewport" content="width=device-width">
     ...:         <title></title>
     ...:         <meta property="og:title" content="a title" />
     ...:         <meta property="og:price:amo
     ...: """
    
    In [7]: soup = BeautifulSoup(example, "html.parser")
    
    In [9]: soup.title
    Out[9]: <title></title>
  • Tying the implementation to inheriting from requests.Response limits future options to easily add support for async libraries such as aiohttp.

  • There might be valid cases where users just want the first 1,000-10,000 characters of the webpage for parsing, saving on memory and retrieval speeds.
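The first point holds even without BeautifulSoup: Python's standard-library HTML parser also recovers the complete tags from a truncated document. This is an illustrative sketch, not part of webpreview; `OpenGraphParser` is a hypothetical class name, and the sample markup mirrors the truncated document from the IPython session above.

```python
from html.parser import HTMLParser

# Hypothetical extractor: collects og:* meta properties, tolerating
# markup that was cut off partway through retrieval.
class OpenGraphParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop and prop.startswith("og:") and "content" in attrs:
            self.properties[prop] = attrs["content"]

# Truncated document, cut off mid-attribute like the example above.
truncated = """
<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <title></title>
        <meta property="og:title" content="a title" />
        <meta property="og:price:amo
"""

parser = OpenGraphParser()
parser.feed(truncated)
print(parser.properties)  # the complete og:title tag survives truncation
```

The incomplete final tag is simply buffered and never emitted, while every tag that was fully received before the cutoff is still parsed, which is what makes a content-length cap compatible with metadata extraction.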

Do you think that could be a valid approach?



Development

Successfully merging this pull request may close these issues.

duplicate requests sometimes not necessary
