Business Requirements

Project Name: SynthDataGen
Version: 1.0
Date: August 2, 2025

1. Introduction

This document outlines the business requirements for the SynthDataGen project. The project's goal is to create a system that automates the generation of synthetic datasets from internal company documents to support LLM fine-tuning and optimization efforts.

2. Business Problem

The manual creation of question-answer datasets from domain-specific documents is a critical bottleneck for enterprise AI adoption. This process is slow, expensive, and non-scalable, which hinders the ability to build and deploy custom AI solutions that understand the unique context of the business.

3. Project Goals and Objectives

  • Goal: To significantly reduce the time and effort required to create training datasets for LLMs.
  • Objective 1: Automate the end-to-end process from document ingestion to dataset export.
  • Objective 2: Generate high-relevance question-answer pairs from unstructured PDF and DOCX documents.
  • Objective 3: Provide a robust, developer-friendly SDK for integration into CI/CD and MLOps workflows.
  • Objective 4: Ensure data is stored in a structured, queryable format that allows for future enhancements like manual review and curation.

4. Scope

  • In-Scope:
      • Ingestion of .pdf and .docx files from a specified MinIO bucket.
      • Text extraction from these documents.
      • Strategic chunking of extracted text.
      • Creation of "context chunks" by combining related text chunks.
      • Using a pre-existing LLM (via API) to generate question-answer pairs.
      • Storing documents, chunks, and Q&A pairs in a PostgreSQL database with the specified schema.
      • Functionality to export the Q&A dataset as a JSON file.
      • A Python SDK for programmatic control of the pipeline.
      • Project documentation created using MkDocs.
  • Out-of-Scope:
      • A user interface (UI) for uploading documents or reviewing Q&A pairs.
      • The fine-tuning process of the target LLM itself.
      • Hosting or deployment of the fine-tuned models.
      • Processing of structured data sources (e.g., CSV, JSON) in this phase.
      • Real-time data generation. The process is designed to be run as a batch job.
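The in-scope stages above form a batch pipeline: ingest, extract, chunk, group into context chunks, generate Q&A, store, export. A minimal sketch of the chunking and context-chunk steps is shown below; the naive sentence-based splitting and all names here are illustrative assumptions, not a committed design.

```python
# Illustrative sketch of the chunking and context-chunk stages.
# Function and class names are hypothetical, not a committed API.
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    index: int
    text: str


def chunk_text(doc_id: str, text: str, max_chars: int = 200) -> list[Chunk]:
    """Split extracted text into coherent chunks (naive sentence-based split)."""
    chunks: list[Chunk] = []
    buf = ""
    for sentence in text.split(". "):
        if buf and len(buf) + len(sentence) > max_chars:
            chunks.append(Chunk(doc_id, len(chunks), buf.strip()))
            buf = ""
        buf += sentence + ". "
    if buf.strip():
        chunks.append(Chunk(doc_id, len(chunks), buf.strip()))
    return chunks


def build_context_chunks(chunks: list[Chunk], window: int = 2) -> list[str]:
    """Combine neighbouring chunks into larger context chunks for the LLM."""
    return [
        " ".join(c.text for c in chunks[i:i + window])
        for i in range(0, len(chunks), window)
    ]
```

A real implementation would likely use semantic or heading-aware splitting rather than the character budget shown here.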

5. Functional Requirements

  • FR-1: Document Ingestion: The system shall monitor a designated MinIO bucket and process new PDF and DOCX documents.
  • FR-2: Content Processing: The system shall accurately extract all text content from the documents and split it into smaller, coherent chunks.
  • FR-3: Context Generation: The system shall implement a mechanism to group related chunks to form a larger "context chunk" to be fed to the LLM.
  • FR-4: Q&A Generation: The system shall send context chunks to a configured LLM API and parse the returned question-answer pairs.
  • FR-5: Data Storage: The system shall store all data and metadata in a PostgreSQL database according to the defined schema (Documents, Chunks, Context Chunks, Master Q&A Table).
  • FR-6: Data Export: The system shall provide a function to export all Q&A pairs from the Master table into a single JSON file.
  • FR-7: SDK: The system shall provide a Python SDK with functions to initiate the generation process, check status, and trigger the export.

6. Non-Functional Requirements

  • NFR-1: Scalability: The system should be able to process a corpus of at least 1,000 documents without significant performance degradation.
  • NFR-2: Reliability: The system must include robust error handling for failed document parsing, API calls, and database connections.
  • NFR-3: Usability: The SDK must be well-documented with clear examples to ensure ease of use for ML Engineers.
  • NFR-4: Maintainability: The code should be modular and well-commented to allow for future extensions (e.g., supporting new file types).
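One possible shape for the error handling NFR-2 calls for is a small retry helper wrapped around parsing, API, and database calls. The attempt count and backoff values below are illustrative defaults, not requirements.

```python
# Sketch of a retry helper for NFR-2; parameters are illustrative.
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.0) -> T:
    """Call fn, retrying on any exception with simple exponential backoff."""
    last_exc: Exception | None = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # parsing, API, or DB errors alike
            last_exc = exc
            time.sleep(base_delay * (2 ** attempt))
    raise last_exc  # all attempts exhausted
```

A production version would retry only transient error types and log each failure, so a malformed document cannot stall the whole batch.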

7. Assumptions & Constraints

  • Assumption: Access to a MinIO instance, a PostgreSQL database, and valid API keys for a generator LLM will be provided.
  • Assumption: Documents contain machine-readable text. Scanned PDFs (images of text) are not in scope for V1.
  • Constraint: The initial development will be in Python.
  • Constraint: The quality of the generated Q&A pairs depends on the quality of the source documents and the capabilities of the generator LLM.