✨Implement all Enums (🤖🥣scraping from bo4e.de)#218

Merged
hf-kklein merged 18 commits into master from auto_generate_enums on Dec 8, 2021

Contributor

@hf-kklein hf-kklein commented Dec 1, 2021

You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help.

Beautiful Soup

Fixes #82
Fixes #128
Fixes #83
Fixes #94
Fixes #108
Fixes #129
Fixes #84
Fixes #95
Fixes #110
Fixes #96
Fixes #85
Fixes #97
Fixes #130
Fixes #86
Fixes #98
Fixes #112
Fixes #87
Fixes #131
Fixes #99
Fixes #114
Fixes #132
Fixes #88
Fixes #100
Fixes #115
Fixes #89
Fixes #101
Fixes #117
Fixes #133
Fixes #90
Fixes #102
Fixes #118
Fixes #134
Fixes #91
Fixes #103
Fixes #119
Fixes #135
Fixes #136
Fixes #104
Fixes #120
Fixes #137
Fixes #92
Fixes #106
Fixes #124
Fixes #93
Fixes #107
Fixes #125
Fixes #177

I wrote a little scraper:

from os.path import exists
from typing import List

import requests
from bs4 import BeautifulSoup

# plain string (no f-prefix): the placeholders are filled later via str.format
ENUM_CODE_TEMPLATE = '''
# pylint: disable=missing-module-docstring
from bo4e.enum.strenum import StrEnum
class {enum_class_name}(StrEnum):
    """
    {enum_class_docstring}
    """
{enum_members}
'''

doc_main_html = requests.get("https://www.bo4e.de/dokumentation").text
doc_main_soup = BeautifulSoup(doc_main_html, "html.parser")
enum_urls: List[str] = [a.attrs["href"] for a in doc_main_soup.find_all("a") if a.text.startswith("ENUM")]
for enum_url in enum_urls:
    html_doc = requests.get(enum_url).text
    soup = BeautifulSoup(html_doc, "html.parser")
    enum_class_name = soup.find("h2").text.split(" ")[-1]
    relevant_div = soup.find("div", attrs={"class": "large-12 column"})
    python_file_path = "src/bo4e/enum/" + enum_class_name.lower() + ".py"
    if exists(python_file_path):
        continue
    try:
        enum_class_docstring = relevant_div.find("p").text
    except AttributeError:
        # e.g. BDEW Artikelnummer, Arithmetische Operation:
        # crappy CMS, there is not always a <p> paragraph, so fall back to the first sentence
        enum_class_docstring = relevant_div.text.split(".")[0].strip()
    table_rows = relevant_div.find("tbody").find_all("tr")
    enum_member_codes: List[str] = []
    for table_row in table_rows:
        row_cells = list(table_row.find_all("td"))
        if len(row_cells) == 0:
            # this is probably the <th> row.
            use_name_as_docstring = False
            headings = [th.text.strip() for th in table_row.find_all("th")]
            name_index = headings.index("Bezeichnung")  # usually 0
            try:
                doc_index = headings.index("Name")  # usually 1
            except ValueError:
                try:
                    doc_index = headings.index("Beschreibung")
                except ValueError:
                    # e.g. https://www.bo4e.de/dokumentation/enumerations/enum-medium
                    use_name_as_docstring = True
            continue  # probably the <th> row
        enum_member_name = row_cells[name_index].text
        if use_name_as_docstring:
            enum_member_doc = enum_member_name
        else:
            enum_member_doc = row_cells[doc_index].text
        enum_member_codes.append(f'    {enum_member_name} = "{enum_member_name}" #: {enum_member_doc}')
    replacement_dict = {
        "enum_class_name": enum_class_name,
        "enum_class_docstring": enum_class_docstring,
        "enum_members": "\n".join(enum_member_codes),
    }
    enum_code = ENUM_CODE_TEMPLATE.format(**replacement_dict)
    with open(python_file_path, "w", encoding="utf-8") as enum_code_file:
        enum_code_file.write(enum_code)
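
The two fiddly steps of the scraper, picking the right table columns and filling the template, can be pulled into small functions that run without any network access. The following is only a sketch under assumptions: `resolve_columns` and `render_enum_module` are hypothetical names that are not part of this PR, and the rows in the usage note are made-up sample data, not content scraped from bo4e.de. The sketch also adds two checks the script above does not have: it rejects member names that are not valid Python identifiers and parses the generated module with `ast.parse` before returning it.

```python
import ast
from typing import List, Optional, Tuple

# Plain string: the single-brace placeholders are filled by str.format below.
ENUM_CODE_TEMPLATE = '''# pylint: disable=missing-module-docstring
from bo4e.enum.strenum import StrEnum


class {enum_class_name}(StrEnum):
    """
    {enum_class_docstring}
    """

{enum_members}
'''


def resolve_columns(headings: List[str]) -> Tuple[int, Optional[int]]:
    """Return (name_index, doc_index); doc_index is None without a description column."""
    name_index = headings.index("Bezeichnung")
    for candidate in ("Name", "Beschreibung"):
        if candidate in headings:
            return name_index, headings.index(candidate)
    return name_index, None


def render_enum_module(
    class_name: str, docstring: str, rows: List[List[str]], headings: List[str]
) -> str:
    """Render the source code of one enum module from already extracted table cells."""
    name_index, doc_index = resolve_columns(headings)
    member_lines = []
    for cells in rows:
        name = cells[name_index].strip()
        if not name.isidentifier():
            raise ValueError(f"not a valid Python identifier: {name!r}")
        # fall back to the member name if the table has no description column
        doc = name if doc_index is None else cells[doc_index].strip()
        member_lines.append(f'    {name} = "{name}"  #: {doc}')
    code = ENUM_CODE_TEMPLATE.format(
        enum_class_name=class_name,
        enum_class_docstring=docstring.strip(),
        enum_members="\n".join(member_lines),
    )
    ast.parse(code)  # raises SyntaxError if the generated module is not valid Python
    return code
```

With made-up input, `render_enum_module("Beispiel", "Ein Beispiel.", [["EINS", "erster Wert"]], ["Bezeichnung", "Beschreibung"])` returns a module whose class body contains the line `    EINS = "EINS"  #: erster Wert`. Returning `None` for a missing description column replaces the separate `use_name_as_docstring` flag, so no index variable is left unset.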

@hf-kklein hf-kklein self-assigned this Dec 1, 2021
@hf-kklein hf-kklein changed the title from "Add Enum" to "Add Enums (scraping from bo4e.de)" Dec 1, 2021
@hf-kklein hf-kklein requested review from a team and hf-krechan December 1, 2021 17:53
@hf-kklein hf-kklein marked this pull request as ready for review December 1, 2021 17:53
@hf-kklein hf-kklein changed the title from "Add Enums (scraping from bo4e.de)" to "Implement all Enums (scraping from bo4e.de)" Dec 1, 2021
@hf-kklein hf-kklein changed the title from "Implement all Enums (scraping from bo4e.de)" to "✨Implement all Enums (🤖 🥣scraping from bo4e.de)" Dec 1, 2021
@hf-kklein hf-kklein changed the title from "✨Implement all Enums (🤖 🥣scraping from bo4e.de)" to "✨Implement all Enums (🤖🥣scraping from bo4e.de)" Dec 1, 2021
@hf-kklein hf-kklein requested a review from hf-fvesely December 2, 2021 08:27
Collaborator

@hf-krechan hf-krechan left a comment

Very good job.
I found some typos and an encoding error.
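
The review does not say what the encoding error was. One common cause in scrapers of this kind, and only an assumption here, is that `requests` decodes `response.text` as ISO-8859-1 when a text response does not declare a charset in its Content-Type header, which garbles UTF-8 umlauts. The standard-library sketch below reproduces that effect on a sample word and shows the fix of decoding the raw bytes explicitly; `decode_body` is a hypothetical helper, not code from this PR.

```python
from typing import Optional


def decode_body(raw: bytes, declared_charset: Optional[str] = None, fallback: str = "utf-8") -> str:
    """Decode raw response bytes, preferring a declared charset over the fallback."""
    return raw.decode(declared_charset or fallback, errors="replace")


raw = "Zählerstand".encode("utf-8")  # sample bytes as a UTF-8 page would send them
garbled = raw.decode("iso-8859-1")   # what a Latin-1 default produces: "ZÃ¤hlerstand"
fixed = decode_body(raw)             # "Zählerstand"
```

In a scraper built on `requests`, the equivalent would be to hand `response.content` (bytes) to `BeautifulSoup` so that it detects the encoding itself, or to set `response.encoding = response.apparent_encoding` before reading `response.text`.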

hf-kklein and others added 2 commits December 8, 2021 10:07
Co-authored-by: Kevin <68426071+hf-krechan@users.noreply.github.com>
Co-authored-by: Kevin <68426071+hf-krechan@users.noreply.github.com>
hf-kklein and others added 5 commits December 8, 2021 10:14
Co-authored-by: Kevin <68426071+hf-krechan@users.noreply.github.com>
Co-authored-by: Kevin <68426071+hf-krechan@users.noreply.github.com>
Co-authored-by: Kevin <68426071+hf-krechan@users.noreply.github.com>
Co-authored-by: Kevin <68426071+hf-krechan@users.noreply.github.com>
@hf-kklein hf-kklein requested a review from hf-krechan December 8, 2021 09:20
@hf-kklein hf-kklein merged commit 1626368 into master Dec 8, 2021
@hf-kklein hf-kklein deleted the auto_generate_enums branch December 8, 2021 09:35