Tasks

Phase 1: Foundation & Setup (Sprint 0)

  • [ ] Provision and configure MinIO bucket.
  • [ ] Provision and configure PostgreSQL server.
  • [ ] Design and finalize the database schema; create initial migration scripts.
  • [ ] Set up Python project structure, virtual environment, and Git repository.
  • [ ] Acquire API keys for the generator LLM and establish connection patterns.

Phase 2: Document Processing Pipeline

  • [ ] Task: Implement MinIO connector to list and download documents.
  • [ ] Task: Implement PDF text extractor using a library like PyMuPDF.
  • [ ] Task: Implement DOCX text extractor using python-docx.
  • [ ] Task: Implement text chunking strategy (e.g., Recursive Character Text Splitter).
  • [ ] Task: Research and implement logic for creating "context chunks" (e.g., combining N consecutive chunks).
  • [ ] Task: Implement database logic to store document metadata and chunk content.

Phase 3: Core Generation & Storage

  • [ ] Task: Develop a module for interacting with the LLM API.
  • [ ] Task: Engineer the initial prompt templates for generating Q\&A pairs from a context chunk.
  • [ ] Task: Implement the main orchestration logic: fetch chunks -> create context -> call LLM -> parse response.
  • [ ] Task: Implement robust error handling and retry mechanisms for LLM API calls.
  • [ ] Task: Implement database logic to store generated questions, answers, and link them to source documents/chunks in the Master table.

Phase 4: SDK & Export

  • [ ] Task: Design the public-facing API for the Python SDK.
  • [ ] Task: Develop the SDK wrapper functions that call the core orchestration logic.
  • [ ] Task: Implement the JSON export functionality, ensuring the output format is compatible with common training libraries.
  • [ ] Task: Package the project for distribution (e.g., setup.py, pyproject.toml).

Phase 5: Documentation & Testing

  • [ ] Task: Set up the MkDocs site and theme.
  • [ ] Task: Write "Getting Started" and "Installation" guides.
  • [ ] Task: Write detailed documentation for the SDK with code examples.
  • [ ] Task: Write unit tests for individual modules (extractors, chunkers, DB connectors).
  • [ ] Task: Write integration tests for the end-to-end pipeline (from document drop to JSON export).
  • [ ] Task: Create a final README.md for the project repository.