Tasks
Phase 1: Foundation & Setup (Sprint 0)
- [ ] Provision and configure MinIO bucket.
- [ ] Provision and configure PostgreSQL server.
- [ ] Design and finalize the database schema; create initial migration scripts.
- [ ] Set up Python project structure, virtual environment, and Git repository.
- [ ] Acquire API keys for the generator LLM and establish connection patterns.
Phase 2: Document Processing Pipeline
- [ ] Task: Implement MinIO connector to list and download documents.
- [ ] Task: Implement PDF text extractor using a library like PyMuPDF.
- [ ] Task: Implement DOCX text extractor using python-docx.
- [ ] Task: Implement text chunking strategy (e.g., Recursive Character Text Splitter).
- [ ] Task: Research and implement logic for creating "context chunks" (e.g., combining N consecutive chunks).
- [ ] Task: Implement database logic to store document metadata and chunk content.
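The chunking and "context chunk" tasks above could be prototyped along these lines. This is a minimal sketch in plain Python, not the project's chosen implementation: the function names `split_text` and `context_windows`, the separator order, and the default sizes are all assumptions made for illustration. A library splitter could replace `split_text` once the strategy is settled.

```python
from typing import List, Sequence

# Assumed separator order: paragraphs, lines, words, then a hard character cut.
DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def split_text(text: str, chunk_size: int = 200,
               separators: Sequence[str] = DEFAULT_SEPARATORS) -> List[str]:
    """Recursively split text on coarser-to-finer separators until pieces fit."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur; try the next, finer one.
        return split_text(text, chunk_size, rest)
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece alone is too large; split it with finer separators.
            chunks.extend(split_text(piece, chunk_size, rest))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


def context_windows(chunks: Sequence[str], n: int = 3, stride: int = 1) -> List[str]:
    """Combine N consecutive chunks into overlapping context chunks."""
    if not chunks:
        return []
    last_start = max(len(chunks) - n, 0)
    return ["\n".join(chunks[i:i + n]) for i in range(0, last_start + 1, stride)]
```

With `stride=1` the windows overlap heavily, which gives the generator more surrounding context at the cost of more LLM calls; a larger stride trades coverage for cost.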
Phase 3: Core Generation & Storage
- [ ] Task: Develop a module for interacting with the LLM API.
- [ ] Task: Engineer the initial prompt templates for generating Q&A pairs from a context chunk.
- [ ] Task: Implement the main orchestration logic: fetch chunks -> create context -> call LLM -> parse response.
- [ ] Task: Implement robust error handling and retry mechanisms for LLM API calls.
- [ ] Task: Implement database logic to store generated questions, answers, and link them to source documents/chunks in the Master table.
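The "call LLM -> parse response" step and its retry handling might look roughly like the sketch below. It assumes the prompt asks the model to return a JSON list of objects with `question` and `answer` keys; that response shape, the `LLMError` class, and the function names are hypothetical, and `call_llm` stands in for whichever client module the project ends up with.

```python
import json
import time
from typing import Callable, Dict, List, Optional


class LLMError(Exception):
    """Hypothetical error raised for failed or malformed LLM responses."""


def parse_qa_pairs(raw: str) -> List[Dict[str, str]]:
    """Validate the assumed response shape: a JSON list of Q&A objects."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise LLMError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise LLMError("expected a JSON list of Q&A objects")
    pairs = []
    for item in data:
        if not isinstance(item, dict) or "question" not in item or "answer" not in item:
            raise LLMError("each item needs 'question' and 'answer' keys")
        pairs.append({"question": str(item["question"]), "answer": str(item["answer"])})
    return pairs


def generate_with_retry(call_llm: Callable[[str], str], context: str,
                        max_attempts: int = 3, base_delay: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep) -> List[Dict[str, str]]:
    """Call the LLM and parse its output, retrying with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return parse_qa_pairs(call_llm(context))
        except Exception as exc:  # a real client would catch narrower error types
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise LLMError(f"giving up after {max_attempts} attempts") from last_error
```

Treating a malformed response the same as a transport failure means a bad generation is simply retried; passing `sleep` as a parameter keeps the backoff testable without real delays.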
Phase 4: SDK & Export
- [ ] Task: Design the public-facing API for the Python SDK.
- [ ] Task: Develop the SDK wrapper functions that call the core orchestration logic.
- [ ] Task: Implement the JSON export functionality, ensuring the output format is compatible with common training libraries.
- [ ] Task: Package the project for distribution (e.g., setup.py, pyproject.toml).
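As a starting point for the JSON export task, the sketch below serializes stored Q&A rows while keeping a link back to the source document and chunk. The plan does not fix an output schema, so the field names (`question`, `answer`, `source`, `document`, `chunk_id`) and the function names are assumptions to be revisited once the target training libraries are chosen.

```python
import json
from typing import Any, Dict, Iterable, List


def to_training_records(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Map stored Q&A rows to export records, keeping provenance fields."""
    return [
        {
            "question": row["question"],
            "answer": row["answer"],
            "source": {"document": row["document"], "chunk_id": row["chunk_id"]},
        }
        for row in rows
    ]


def dumps_qa(rows: Iterable[Dict[str, Any]]) -> str:
    """Serialize rows to a JSON string (UTF-8 friendly, human-readable)."""
    return json.dumps(to_training_records(rows), ensure_ascii=False, indent=2)


def export_json(rows: Iterable[Dict[str, Any]], path: str) -> None:
    """Write the export to a file at the given path."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(dumps_qa(rows))
```

Separating `dumps_qa` from `export_json` lets the SDK return the export as a string or write it to disk from the same code path.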
Phase 5: Documentation & Testing
- [ ] Task: Set up the MkDocs site and theme.
- [ ] Task: Write "Getting Started" and "Installation" guides.
- [ ] Task: Write detailed documentation for the SDK with code examples.
- [ ] Task: Write unit tests for individual modules (extractors, chunkers, DB connectors).
- [ ] Task: Write integration tests for the end-to-end pipeline (from document drop to JSON export).
- [ ] Task: Create a final README.md for the project repository.