-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
[WIP] Data/Model storage #1492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Data/Model storage #1492
Conversation
|
Looks nice! Is this your original code? If not, you have to attribute it properly. |
gensim/__main__.py
Outdated
| @@ -1,50 +1,57 @@ | |||
| """Copyright (C) 2016 ExplosionAI UG (haftungsbeschränkt), 2016 spaCy GmbH, | |||
| 2015 Matthew Honnibal | |||
| https://github.com/explosion/spaCy/blob/master/LICENSE | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List the explicit license, not a link (a link can change in time, go dead etc).
Also, if you're making changes, be sure to include yourself in the authors and copyright (say "adapted from ..., original author ..., license ...").
|
Before merging, we'll have to clean up the code a little:
|
|
In fact, drawing inspiration for the API but writing it from scratch, in a few concise functions, is preferable to this mass copy&paste from spaCy. It looks straightforward enough, and will avoid much of the complexity here as well as any copyright issues. |
gensim/console_api/link.py
Outdated
| dictionary. If require is set to True, raise an error if no meta.json | ||
| found. | ||
| """ | ||
| location = package_path / package / 'meta.json' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this / operator do? Where does it come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/ is similar to os.path.join()
https://docs.python.org/3/library/pathlib.html#operators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, interesting, I didn't know about that. Does that work in Python 2 too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
|
Ok, i will start writing my code. Should I first make a template and share it, so that there is no extra or less functions than required? |
|
I have two questions
|
|
Good questions! I'd say 1) no (except in the repo history, which I think is still downloadable? we could add a little how-to to our FAQ or something, but I don't think we need to maintain a full-blown automated dependency resolution packaging system, sounds like a headache) 2) no (just one way to do it -- the fewer moving pieces, the better). |
gensim/api/__init__.py
Outdated
| [sys.executable, '-m', 'pip', 'install', '--no-cache-dir', url], | ||
| env=os.environ.copy()) | ||
| url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/" | ||
| url = url+"download/"+file+"/"+file+".tar.gz" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file is a reserved keyword in Python.
Also, shouldn't it be url-encoded if used like this? Or is the expectation that the argument is already url-encoded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh,I will change the variable name from file.
No, i have not encoded the url, as currently I am not using any special characters or spaces in corpus/model names, but if the plan is to have those in model/corpus names, i'll add it.
gensim/api/__init__.py
Outdated
| base_dir = os.path.join(user_dir, 'gensim-data') | ||
| extracted_folder_dir = os.path.join(base_dir, file) | ||
| if not os.path.exists(base_dir): | ||
| os.makedirs(base_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a clear log message (INFO or even WARNING).
In fact, what is the strategy for communicating information to the user in this PR? Do we use logging (incl. timestamps, log level etc), or just print stuff to stdout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, i am just printing out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that what we want? What are the pros/cons vs logging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just read about it and logging is way better than using print. So i'll add logging.
piskvorky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style comments.
gensim/api/__init__.py
Outdated
| except ImportError: | ||
| from urllib2 import urlopen | ||
|
|
||
| logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', filename="api.log", level=logging.INFO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logging configuration belongs only to __main__.
gensim/api/__init__.py
Outdated
| extracted_folder_dir = os.path.join(base_dir, file) | ||
| extracted_folder_dir = os.path.join(base_dir, file_name) | ||
| if not os.path.exists(base_dir): | ||
| logging.info("Creating {}".format(base_dir)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Module should create a logger and use that for logging (not these global logging wrappers). See other gensim modules for an example.
gensim/api/__init__.py
Outdated
| tar.close() | ||
| logging.info("{} installed".format(file_name)) | ||
| else: | ||
| logging.error("Not able to create {d}. Make sure you have the " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No vertical indent (please use hanging indent; here and everywhere else).
gensim/api/__init__.py
Outdated
| import six | ||
| import tarfile | ||
| import shutil | ||
| from ..utils import SaveLoad |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better use absolute paths, for clarity and readability.
gensim/__main__.py
Outdated
| parser = argparse.ArgumentParser(description="Gensim console API") | ||
| group = parser.add_mutually_exclusive_group() | ||
| group.add_argument( | ||
| "-d", "--download", nargs=1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No needed line breaks (the string is not too long), here and below
gensim/api/__init__.py
Outdated
|
|
||
| def download(file_name): | ||
| url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/" | ||
| url = url + "download/" + file_name + "/" + file_name + ".tar.gz" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use format method (instead of string concatenation through +)
gensim/__main__.py
Outdated
| "-c", "--catalogue", help="To get the list of all models/corpus stored" | ||
| " : python -m gensim -c", action="store_true") | ||
| args = parser.parse_args() | ||
| if sys.argv[1] == "-d" or sys.argv[1] == "--download": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks suspicious: you already use argparse before it -> you no need to look into sys.argv. Please use only argparse.
gensim/api/__init__.py
Outdated
| user_dir = os.path.expanduser('~') | ||
| base_dir = os.path.join(user_dir, 'gensim-data') | ||
| extracted_folder_dir = os.path.join(base_dir, file_name) | ||
| if not os.path.exists(base_dir): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I have file ~/gensim-data, this will not work correctly.
gensim/api/__init__.py
Outdated
| base_dir = os.path.join(user_dir, 'gensim-data') | ||
| extracted_folder_dir = os.path.join(base_dir, file_name) | ||
| if not os.path.exists(base_dir): | ||
| logger.info("Creating {}".format(base_dir)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use logging correct formatting logger.info("Creating %s", base_dir) (here and anywhere). Look at this example.
gensim/api/__init__.py
Outdated
|
|
||
| def catalogue(print_list=True): | ||
| url = "https://raw.githubusercontent.com/chaitaliSaini/Corpus_and_models/" | ||
| url = url + "master/list.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should store full URL as constant (without concatenations)
gensim/api/__init__.py
Outdated
| print("Models available : ") | ||
| for model in models: | ||
| print(model) | ||
| else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need else section, return data always.
gensim/api/__init__.py
Outdated
| print("{} has already been installed".format(file_name)) | ||
|
|
||
|
|
||
| def catalogue(print_list=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change default to False
gensim/api/__init__.py
Outdated
| elif file_name in models: | ||
| print(data['gensim']['model'][file_name]) | ||
| else: | ||
| print("Incorrect model/corpus name.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raise exception with lists (what's correct)
gensim/api/__init__.py
Outdated
| base_dir = os.path.join(user_dir, 'gensim-data') | ||
| folder_dir = os.path.join(base_dir, file_name) | ||
| if not os.path.exists(folder_dir): | ||
| print( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raise exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some addition changes:
- Use something from
gensim.corporafor corpuses - Test that all loaded correctly, this behaviour unacceptable
>>> q = api.load("glove_common_crawl_42B")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "gensim/api/__init__.py", line 164, in load
data = module.load_data()
File "/home/ivan/gensim-data/glove_common_crawl_42B/__init__.py", line 19, in load_data
model = KeyedVectors.load_word2vec_format(output_file_dir)
File "gensim/models/keyedvectors.py", line 255, in load_word2vec_format
raise ValueError("invalid vector on line %s (is this really the text format?)" % (line_no))
ValueError: invalid vector on line 78864 (is this really the text format?)
- Add more datasets
- Add instruction to orig repo "how to add new model"
The interface should be very simple, only load and info functions, that's enough (it's about "merge comments")
gensim/api/__init__.py
Outdated
| logger.info("%s installed", dataset) | ||
|
|
||
|
|
||
| def catalogue(print_list=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested it, print_list is useless, please remove this arg (and printing code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also please merge catalogue and info function to new info function (add arguent for concrete model/dataset, by default, it should be None)
gensim/api/__init__.py
Outdated
| else: | ||
| catalogue(print_list=True) | ||
| raise Exception( | ||
| "Incorrect model/corpus name. Choose the model/corpus from the list " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Substitute list of available models/datasets to exception message (without printing to stdout before)
gensim/api/__init__.py
Outdated
| corpuses = data['gensim']['corpus'] | ||
| models = data['gensim']['model'] | ||
| if dataset in corpuses: | ||
| print(data['gensim']['corpus'][dataset]["desc"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add more detailed descriptions (in storage repo)
gensim/api/__init__.py
Outdated
| return data['gensim']['model'][dataset]["filename"] | ||
|
|
||
|
|
||
| def load(dataset, return_path=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add doc-strings to all functions in google-style format (here and anywhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please merge load and download to new load function (If I have no needed model on my PC, I want to download it, not see an exception).
gensim/api/__init__.py
Outdated
| "above.") | ||
|
|
||
|
|
||
| def get_filename(dataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function only for inner proposes?
gensim/api/__init__.py
Outdated
| f = f.readlines() | ||
| for line in f: | ||
| if installed_message in line: | ||
| print("{} has already been installed".format(dataset)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace all print to logger
gensim/api/__init__.py
Outdated
| corpuses = data['gensim']['corpus'] | ||
| models = data['gensim']['model'] | ||
| if dataset not in corpuses and dataset not in models: | ||
| logger.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raise an exception (not a logger.error + sys.exit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'corpuses' => 'corpora'
gensim/api/__init__.py
Outdated
| if dataset not in corpuses and dataset not in models: | ||
| logger.error( | ||
| "Incorect Model/corpus name. Use catalogue(print_list=TRUE) or" | ||
| " python -m gensim -c to get a list of models/corpuses" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not working now, please update
gensim/api/__init__.py
Outdated
| @@ -0,0 +1,174 @@ | |||
| from __future__ import print_function | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename this file as downloader.py and move to library root gensim/downloader.py, also merge file with cli and this
gensim/api/__init__.py
Outdated
| downloaded_message = "{f} downloaded".format(f=dataset) | ||
| if os.path.exists(data_folder_dir): | ||
| log_file_dir = os.path.join(base_dir, 'api.log') | ||
| with open(log_file_dir) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very suspicious, you write all to this file (not only models), please do several things
- Use this file only for models
- Add checksum (in file and in original repo) and compare it, if something happens (file broken) - warning + download file.
- If file doesn't exists - warning + create this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plaintext isn't best format for this file, please use json/pickle/etc
gensim/api/__init__.py
Outdated
| urllib.urlretrieve(data_url, data_dir) | ||
| logger.info("%s downloaded", dataset) | ||
| if not is_installed: | ||
| tar = tarfile.open(compressed_folder_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all of your files is tar or gz, for example, you also have non-zipped fasttext, big glove in zip, etc ....
| args = parser.parse_args() | ||
| if args.download is not None: | ||
| data_path = load(args.download[0], return_path=True) | ||
| logger.info("Data has been installed and data path is %s", data_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any logging setup here in __main__. Where will this logging go?
We don't want any printing inside a library, but in a user-invoked top-level script, printing is fine.
gensim/downloader.py
Outdated
| Args: | ||
| dataset(string): Name of the corpus/model. | ||
| """ | ||
| url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/download/{f}/{f}.tar.gz".format(f=dataset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a FIXME comment so we don't merge such temporary constructs by accident.
gensim/downloader.py
Outdated
| base_dir = os.path.join(user_dir, 'gensim-data') | ||
| data_log_file_dir = os.path.join(base_dir, 'data.json') | ||
|
|
||
| logging.basicConfig( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please setup it in __main__ section (because we need logging in console, but no need it in programming version without explicit configuration from user side
gensim/downloader.py
Outdated
| data = response.read().decode("utf-8") | ||
| data = json.loads(data) | ||
| if dataset is not None: | ||
| corpora = data['gensim']['corpus'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gensim is useless key in your json, you need only model and dataset keys for upper level
gensim/downloader.py
Outdated
| corpora = data['gensim']['corpus'] | ||
| models = data['gensim']['model'] | ||
| if dataset in corpora: | ||
| logger.info("%s \n", data['gensim']['corpus'][dataset]["desc"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing return
gensim/downloader.py
Outdated
| if dataset in corpora: | ||
| logger.info("%s \n", data['gensim']['corpus'][dataset]["desc"]) | ||
| elif dataset in models: | ||
| logger.info("%s \n", data['gensim']['model'][dataset]["desc"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing return
gensim/downloader.py
Outdated
| return hash_md5.hexdigest() | ||
|
|
||
|
|
||
| def info(dataset=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not dataset, this can be a model too (and same thing everywhere).
gensim/downloader.py
Outdated
|
|
||
| user_dir = os.path.expanduser('~') | ||
| base_dir = os.path.join(user_dir, 'gensim-data') | ||
| data_log_file_dir = os.path.join(base_dir, 'data.json') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's path to file, not a dir
| .format(base_dir)) | ||
|
|
||
|
|
||
| def initialize_data_log_file(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please pass data_log_file explicitly and open it here with with operator
gensim/downloader.py
Outdated
|
|
||
|
|
||
| def get_data_status(dataset): | ||
| """Function for finding the status of the dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataset and model
gensim/downloader.py
Outdated
| dataset(string): Name of the corpus/model. | ||
| status(string): Status to be updates to i.e downloaded or installed. | ||
| """ | ||
| jdata = json.loads(open(data_log_file_dir).read()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use with for file open.
gensim/downloader.py
Outdated
| logger.info("%s installed", dataset) | ||
| else: | ||
| logger.error("There was a problem in installing the file. Retrying.") | ||
| _download(dataset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recursion here
2017-09-26 13:22:03,650 :gensim.api :INFO : Creating /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:22:03,652 :gensim.api :INFO : Creation of /home/ivan/gensim-data/glove_common_crawl_42B successful.
2017-09-26 13:22:03,654 :gensim.api :INFO : Downloading glove_common_crawl_42B
2017-09-26 13:33:58,964 :gensim.api :INFO : glove_common_crawl_42B downloaded
2017-09-26 13:33:58,967 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:03,320 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:03,590 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:07,628 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:07,910 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:12,020 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:12,306 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:16,379 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
|
|
||
|
|
||
| def get_data_list(): | ||
| """Function getting the list of all datasets/models. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please re-write your docstrings according to numpy style (here and anywhere) + add missing .rst to docs/src + change apiref.rst
| return data_names | ||
|
|
||
|
|
||
| def get_data_name(data_): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be a good idea to hide functions that no needed to user (with underscores)
| models = data['model'] | ||
| json_list = [] | ||
| for corpus in corpora: | ||
| json_object = {"name": corpus, "status": "None"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to store anything with "None" status if we check file with data/model from data repo each time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for initialising the json log file that stores the status if the model has been downloaded or installed on the users computer.
| compressed_folder_name = "{f}.tar.gz".format(f=data_) | ||
| compressed_folder_dir = os.path.join(base_dir, compressed_folder_name) | ||
| if get_data_status(data_) != "downloaded": | ||
| if not os.path.exists(data_folder_dir): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If exists - need to remove all broken files and after it creates all that needed.
| _download(data_) | ||
|
|
||
| if get_data_status(data_) != "installed": | ||
| tar = tarfile.open(compressed_folder_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very long indentation, need 4 spaces instead of 8
|
|
||
|
|
||
| def load(data_, return_path=False): | ||
| """Loads the corpus/model to the memory, if return_path is False. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style: Python docstrings use imperative mode ("Load X", not "Loads X"). Here and elsewhere.
Api for model/data storage