refactor: extract models.py, ui.py, docx_utils.py from index.py#40
…x.py

- Extract shared model utilities (_get_cached_model_path, resolve_flashrank_model_name, configure_offline_mode) into models.py, eliminating duplication between index.py and search.py
- Extract IndexingUI, FileProcessingContext, FileProcessingTimeoutError, GracefulAbort into ui.py (~560 lines)
- Extract DOCX processing (_parse_heading_level, _get_doc_temp_dir, _convert_doc_to_docx, split_docx_into_heading_documents) into docx_utils.py (~250 lines)
- Modernize type annotations in index.py to Python 3.11+ syntax (list[], dict[], X | None)
- Add ruff linter configuration to pyproject.toml

index.py drops from 2,655 to 1,720 lines (-35%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the import of EXCLUDED_EMBED_METADATA_KEYS, EXCLUDED_LLM_METADATA_KEYS, and get_heading_store to inside split_docx_into_heading_documents() to avoid the index -> docx_utils -> index circular import at module load time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
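A minimal sketch of the function-local import described in this commit, using two throwaway modules written to a temporary directory. The module names (lazy_index, lazy_docx_utils), the EXCLUDED_KEYS constant, and the function bodies are hypothetical stand-ins and do not reproduce the real chunksilo code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module sources; they only mimic the index <-> docx_utils cycle.
INDEX_SRC = (
    "import lazy_docx_utils\n"
    "EXCLUDED_KEYS = ['file_path']\n"
    "def build():\n"
    "    return lazy_docx_utils.split()\n"
)
DOCX_SRC = (
    "def split():\n"
    "    # Deferred import: runs at call time, after lazy_index has loaded.\n"
    "    from lazy_index import EXCLUDED_KEYS\n"
    "    return list(EXCLUDED_KEYS)\n"
)


def demo_deferred_import() -> list[str]:
    """Import the pair of modules and call through the cycle."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "lazy_index.py").write_text(INDEX_SRC)
        Path(tmp, "lazy_docx_utils.py").write_text(DOCX_SRC)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            return importlib.import_module("lazy_index").build()
        finally:
            # Undo the temporary changes to the interpreter state.
            sys.path.remove(tmp)
            for name in ("lazy_index", "lazy_docx_utils"):
                sys.modules.pop(name, None)
```

The import succeeds because the back-reference is resolved only when split() is called, by which point the importing module is fully initialized. The trade-off is that the dependency between the two modules is still there, just hidden inside a function body.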
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a4ee99ddb9
src/chunksilo/docx_utils.py (outdated)

    from llama_index.core import Document as LlamaIndexDocument

    from . import cfgload
    from .index import EXCLUDED_EMBED_METADATA_KEYS, EXCLUDED_LLM_METADATA_KEYS, get_heading_store
Remove circular import between index and DOCX utils
Importing chunksilo.index now fails at module import time because index.py imports docx_utils, and docx_utils.py immediately imports EXCLUDED_EMBED_METADATA_KEYS, EXCLUDED_LLM_METADATA_KEYS, and get_heading_store back from index.py; those names are not defined yet when the first import is in progress, so Python raises ImportError from a partially initialized module. This blocks any workflow that loads chunksilo.index (including CLI indexing entrypoints) before runtime logic can execute.
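The failure mode described in this comment can be reproduced in isolation. The sketch below writes two throwaway modules that import each other at module level and captures the resulting error; the module names (cyc_index, cyc_docx_utils) and the EXCLUDED_KEYS constant are hypothetical and only mirror the shape of the index.py / docx_utils.py cycle.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# cyc_index imports cyc_docx_utils before it has defined EXCLUDED_KEYS.
INDEX_SRC = "import cyc_docx_utils\nEXCLUDED_KEYS = ['file_path']\n"
# cyc_docx_utils immediately imports the name back at module level.
DOCX_SRC = "from cyc_index import EXCLUDED_KEYS\n"


def demo_circular_import() -> str:
    """Return the ImportError message raised by the circular import."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "cyc_index.py").write_text(INDEX_SRC)
        Path(tmp, "cyc_docx_utils.py").write_text(DOCX_SRC)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module("cyc_index")
        except ImportError as exc:
            return str(exc)
        finally:
            # Undo the temporary changes to the interpreter state.
            sys.path.remove(tmp)
            for name in ("cyc_index", "cyc_docx_utils"):
                sys.modules.pop(name, None)
    return ""
```

Calling demo_circular_import() returns the interpreter's error text, which names the constant that was not yet defined when the second module tried to import it.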
Pass heading_store, excluded_embed_metadata_keys, and excluded_llm_metadata_keys as parameters to split_docx_into_heading_documents() instead of importing them from index.py. This cleanly breaks the circular dependency without any runtime import tricks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
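The parameter-passing approach might look roughly like the sketch below. Only the three parameter names come from the commit message; the HeadingStore and Doc classes, the function name split_into_heading_documents, its other arguments, and its body are hypothetical stand-ins, not the real docx_utils.py code.

```python
from dataclasses import dataclass, field


@dataclass
class HeadingStore:
    """Hypothetical stand-in for the object returned by get_heading_store()."""
    headings: dict[str, list[str]] = field(default_factory=dict)


@dataclass
class Doc:
    """Hypothetical stand-in for a LlamaIndex Document."""
    text: str
    metadata: dict[str, str]
    excluded_embed_metadata_keys: list[str]
    excluded_llm_metadata_keys: list[str]


def split_into_heading_documents(
    sections: dict[str, str],
    source: str,
    *,
    heading_store: HeadingStore,
    excluded_embed_metadata_keys: list[str],
    excluded_llm_metadata_keys: list[str],
) -> list[Doc]:
    """Build one Doc per heading using only what the caller passes in."""
    # Record the headings in the store supplied by the caller.
    heading_store.headings[source] = list(sections)
    return [
        Doc(
            text=body,
            metadata={"heading": heading, "source": source},
            excluded_embed_metadata_keys=list(excluded_embed_metadata_keys),
            excluded_llm_metadata_keys=list(excluded_llm_metadata_keys),
        )
        for heading, body in sections.items()
    ]
```

With this shape the utility module needs no import from its caller at all, so the dependency runs in one direction only and the function can be exercised in tests with stub arguments.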
Summary
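- Extract shared model utilities into models.py, eliminating duplication between index.py and search.py
- Extract UI classes (IndexingUI, FileProcessingContext, GracefulAbort) into ui.py (~560 lines)
- Extract DOCX processing into docx_utils.py (~250 lines)
- Modernize type annotations in index.py to Python 3.11+ syntax (list[], dict[], X | None)
- Add ruff linter configuration to pyproject.toml

index.py drops from 2,655 to 1,720 lines (-35%).

Test plan
- ruff check src/chunksilo/ passes on new files

🤖 Generated with Claude Code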