Business Requirements

Project Name: SynthDataGen
Version: 1.0
Date: August 2, 2025

1. Introduction

This document outlines the business requirements for the SynthDataGen project. The project's goal is to create a system that automates the generation of synthetic datasets from internal company documents to support LLM fine-tuning and optimization efforts.

2. Business Problem

The manual creation of question-answer datasets from domain-specific documents is a critical bottleneck for enterprise AI adoption. This process is slow, expensive, and non-scalable, which hinders the ability to build and deploy custom AI solutions that understand the unique context of the business.

3. Project Goals and Objectives

Goal: To significantly reduce the time and effort required to create training datasets for LLMs.
Objective 1: Automate the end-to-end process from document ingestion to dataset export.
Objective 2: Generate high-relevance question-answer pairs from unstructured PDF and DOCX documents.
Objective 3: Provide a robust, developer-friendly SDK for integration into CI/CD and MLOps workflows.
Objective 4: Ensure data is stored in a structured, queryable format that allows for future enhancements like manual review and curation.

4. Scope

In-Scope:
Ingestion of .pdf and .docx files from a specified MinIO bucket.
Text extraction from these documents.
Strategic chunking of extracted text.
Creation of "context chunks" by combining related text chunks.
Using a pre-existing LLM (via API) to generate question-answer pairs.
Storing documents, chunks, and Q&A pairs in a PostgreSQL database with the specified schema.
Functionality to export the Q&A dataset as a JSON file.
A Python SDK for programmatic control of the pipeline.
Project documentation created using MkDocs.
Out-of-Scope:
A user interface (UI) for uploading documents or reviewing Q&A pairs.
The fine-tuning process of the target LLM itself.
Hosting or deployment of the fine-tuned models.
Processing of structured data sources (e.g., CSV, JSON) in this phase.
Real-time data generation. The process is designed to be run as a batch job.
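
The chunking and context-chunk items in the in-scope list can be illustrated with a minimal sketch. It assumes that "strategic chunking" means packing paragraphs up to a character budget and that "related" chunks are simply adjacent ones. Neither rule is specified in this document, and the function names and default values are hypothetical.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Pack paragraphs into chunks of up to max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue  # skip empty paragraphs left by repeated blank lines
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def build_context_chunks(chunks: List[str], window: int = 3, stride: int = 2) -> List[str]:
    """Join `window` adjacent chunks into one context chunk, advancing by `stride`.

    Assumption: adjacency stands in for "relatedness"; a real system might
    group by section heading or embedding similarity instead.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    contexts: List[str] = []
    for start in range(0, len(chunks), stride):
        contexts.append("\n\n".join(chunks[start:start + window]))
        if start + window >= len(chunks):
            break  # the last window already reaches the final chunk
    return contexts
```

With a stride smaller than the window, consecutive context chunks overlap, so a fact that straddles a chunk boundary still appears intact in at least one context chunk.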

5. Functional Requirements

FR-1: Document Ingestion: The system shall monitor a designated MinIO bucket and process new PDF and DOCX documents.
FR-2: Content Processing: The system shall accurately extract all text content from the documents and split it into smaller, coherent chunks.
FR-3: Context Generation: The system shall implement a mechanism to group related chunks to form a larger "context chunk" to be fed to the LLM.
FR-4: Q&A Generation: The system shall send context chunks to a configured LLM API and parse the returned question-answer pairs.
FR-5: Data Storage: The system shall store all data and metadata in a PostgreSQL database according to the defined schema (Documents, Chunks, Context Chunk, Master Q&A Table).
FR-6: Data Export: The system shall provide a function to export all reviewed Q&A pairs from the Master Q&A table into a single JSON file.
FR-7: SDK: The system shall provide a Python SDK with functions to initiate the generation process, check status, and trigger the export.
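
As an illustration of FR-4 and FR-6, the sketch below parses an LLM reply and writes the result to a JSON file. It assumes the LLM is prompted to return a JSON array of objects with "question" and "answer" keys. That reply format, the QAPair record, and both function names are hypothetical and not part of the defined schema. Filtering by review status is left out because this document does not define how that status is stored.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class QAPair:
    """Hypothetical in-memory record for one generated question-answer pair."""
    context_chunk_id: int
    question: str
    answer: str


def parse_qa_response(raw: str, context_chunk_id: int) -> List[QAPair]:
    """Parse a reply assumed to be a JSON array of {"question", "answer"} objects.

    Malformed replies and incomplete items are dropped rather than raised,
    so one bad reply does not abort a batch run.
    """
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    pairs: List[QAPair] = []
    for item in items:
        if not isinstance(item, dict):
            continue
        question = str(item.get("question") or "").strip()
        answer = str(item.get("answer") or "").strip()
        if question and answer:
            pairs.append(QAPair(context_chunk_id, question, answer))
    return pairs


def export_qa_json(pairs: List[QAPair], path: str) -> None:
    """Write all pairs to a single UTF-8 JSON file as a list of objects."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump([asdict(p) for p in pairs], fh, ensure_ascii=False, indent=2)
```

In the full pipeline the pairs would be read from the Master Q&A table rather than held in memory; the in-memory list keeps the sketch independent of any database driver.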

6. Non-Functional Requirements

NFR-1: Scalability: The system should be able to process a corpus of at least 1,000 documents without significant performance degradation.
NFR-2: Reliability: The system must include robust error handling for failed document parsing, API calls, and database connections.
NFR-3: Usability: The SDK must be well-documented with clear examples to ensure ease of use for ML Engineers.
NFR-4: Maintainability: The code should be modular and well-commented to allow for future extensions (e.g., supporting new file types).
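
One possible reading of NFR-2 for transient failures (LLM API calls, database connections) is retry with exponential backoff. The sketch below is an assumption about how that could look, not a mandated design. The helper name, the default exception types, and the retry limits are hypothetical.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    retry_on: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on the listed exceptions with doubling delays.

    The final failure is re-raised so the caller can record the document
    or chunk as failed and continue with the rest of the batch.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter lets tests substitute a no-op so retries can be verified without waiting. Errors that are not transient, such as an unparseable document, would not be listed in `retry_on` and surface immediately.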

7. Assumptions & Constraints

Assumption: Access to a MinIO instance, a PostgreSQL database, and valid API keys for a generator LLM will be provided.
Assumption: Documents contain machine-readable text. Scanned PDFs (images of text) are not in scope for V1.
Constraint: The initial development will be in Python.
Constraint: The quality of the generated Q&A pairs depends on the quality of the source documents and the capabilities of the generator LLM.
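
The first assumption above implies that the pipeline needs connection details for MinIO, PostgreSQL, and the generator LLM before a batch run can start. A minimal sketch of validating those settings up front is shown below. The environment variable names and the Settings fields are hypothetical, since this document does not define a configuration format.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    minio_endpoint: str
    minio_bucket: str
    postgres_dsn: str
    llm_api_key: str


# Hypothetical variable names; this document does not define any.
_REQUIRED = {
    "minio_endpoint": "SYNTHDATAGEN_MINIO_ENDPOINT",
    "minio_bucket": "SYNTHDATAGEN_MINIO_BUCKET",
    "postgres_dsn": "SYNTHDATAGEN_POSTGRES_DSN",
    "llm_api_key": "SYNTHDATAGEN_LLM_API_KEY",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read required settings, failing fast with a list of what is missing."""
    source = os.environ if env is None else env
    missing = [name for name in _REQUIRED.values() if not source.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(sorted(missing)))
    return Settings(**{field: source[name] for field, name in _REQUIRED.items()})
```

Checking every setting before the run starts avoids a batch job that ingests documents and only then fails on a missing API key.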