Building an Ownership-First, On-Device AI
11/25. Mitwa Research Team

The Mental Health Crisis

Mental health is universal. At some point in our lives, we all struggle with anxiety, stress, loneliness, grief, trauma, or simply the weight of being human. Yet accessing mental health support remains strained by barriers: cost, availability, stigma, geographic limitations, and perhaps most critically, fear. Fear that our most vulnerable moments will be recorded, analyzed, monetized, or exposed.

This fear isn't paranoia. It's a rational response to how modern technology works. When you open a mental health app at 3 AM and pour out your struggles with depression or your relationship fears, your thoughts can spiral anxiously: where does that data go? Who can access it? What happens if the company gets acquired? If there's a data breach? If a government subpoenas records?

The Cloud-Based Dilemma

The intersection of Artificial Intelligence (AI) and mental health presents unprecedented opportunities for accessible psychological support. However, current implementations of conversational AI in healthcare predominantly rely on cloud-based architectures that necessitate transmitting, storing, and analyzing sensitive user conversations on remote servers. Your deeply personal disclosures travel across the internet, get stored in corporate databases, and feed into analytics pipelines.

This creates a fundamental tension between the therapeutic requirement for open, honest communication and users' legitimate concerns about data privacy, corporate surveillance, and potential breaches of confidential mental health information. Privacy policies promise protection, but policies can change. Companies can be sold. Servers can be breached. Promises can be broken.

Why Traditional AI Development Fails Mental Health

Traditional Machine Learning (ML) development follows an iterative cycle, represented in Figure 1. This paradigm has powered advances across numerous AI applications, from recommendation engines to virtual assistants. But it becomes ethically untenable when applied to mental health conversations.

Users sharing struggles with depression, anxiety, trauma, or suicidal ideation deserve absolute assurance that these vulnerable disclosures remain private: not merely protected by privacy policies subject to change, but fundamentally inaccessible by design.


A “We Cannot See Anything” Approach

Most companies treat privacy as a policy question: "How can we protect the data we collect?"
This frames data collection as inevitable and protection as the challenge.
We asked a different question: "What if we never collected the data in the first place?"

At Mitwa.ai, we built something different.
We built something that doesn't ask you to trust us with your secrets.
We built something where we cannot see your conversations, not because we promise not to look, but because the architecture makes it impossible. By design. By principle. By code.
This single principle of "we cannot see anything" became the foundation that shaped every technical decision in building Bloom.
It's not a feature we added.
It's the constraint we designed around.
This decision to build a completely offline mental health model that operates entirely on-device, without any data transmission, created a fundamental research challenge that required rethinking the entire AI development paradigm.

Mitwa, the Hindi word for "friend."
Not just any friend, but the kind of friend you turn to in your darkest moments.
The one who listens without judgment. The one who sees you at your most vulnerable and stays.
The one you trust with the thoughts you can't share anywhere else.
At Mitwa.ai, we didn't set out to build just another AI companion; we set out to enable genuine companionship, the kind that requires absolute safety, trust, and ownership.



What is Bloom?

Bloom is a foundational language model designed specifically for mental health support: not a companion itself, but the enabling technology that makes true companionship possible. Bloom helps users work through their struggles, process their emotions, and find healthier perspectives. But unlike every other AI model in this space, Bloom runs completely offline, entirely on the user's device. No internet connection required. No data transmitted. No cloud servers. No analytics. No company dashboard showing your conversations. No possibility of a data breach, because there's no data to breach.

Your conversations with Bloom happen in a digital space that only you control. When you close the app, those conversations remain locked on your device. We don't have copies. We can't access them. We genuinely, architecturally, and fundamentally cannot see anything. This isn't marketing language. This is technical reality. This is how Mitwa.ai enables companionship, by building technology that gets out of the way, that protects rather than exploits, that enables trust through technical impossibility of betrayal.



System Architecture


The Mitwa.ai Radical Shift

Our system architecture reflects a fundamental shift from conventional ML pipelines. Rather than a feedback loop connecting deployed models to training systems through user data, we constructed a closed-loop synthetic data factory. The architecture is depicted in Figure 2 and consists of three concentric layers:

  • Ownership Principle (core): The guiding constraint "we cannot see anything" that informs every technical decision, from data collection policies to deployment architecture.
  • Knowledge Base (middle): A scraping agent collects ethically sourced content while a Retrieval-Augmented Generation (RAG) model extracts and organizes psychological frameworks into a unified knowledge base.
  • Synthetic Data Generation (outer): A Generative Adversarial Network (GAN) pipeline generates realistic training dialogues using the complete knowledge base as its foundation.


The Core Layer: Ownership Principle

This principle manifests as both a philosophical commitment and a technical implementation. The philosophical commitment is the understanding that mental health conversations represent the most intimate form of human vulnerability: users discussing trauma, suicidal ideation, relationship struggles, or psychological distress deserve absolute privacy, not as a courtesy but as a fundamental right. The technical implementation is complete on-device model deployment with zero data transmission. The application functions entirely offline, with all processing occurring locally on the user's device. No conversation logs, no analytics, no cloud synchronization, no "anonymized" data collection. The company infrastructure is architecturally incapable of accessing user interactions.

This constraint forced innovation across our entire development pipeline. Unable to rely on A/B testing with real users, behavioral analytics, or iterative fine-tuning from deployment feedback, we instead invested in synthetic data generation of sufficient sophistication that it could substitute for real conversational data.


The Middle Layer: Knowledge Base

The Knowledge Base serves as the foundational layer for generating psychologically grounded synthetic data. As illustrated in Figure 3, the middle layer integrates two complementary knowledge acquisition pathways: (1) a web-scale data collection pipeline that gathers multilingual psychological content through keyword-guided scraping, and (2) a curated corpus of psychological literature drawn from books. Both sources are ingested into a unified preprocessing and tokenization pipeline to ensure structural consistency and cross-source comparability. Following preprocessing, the combined corpus is filtered through a specialized Retrieval-Augmented Generation (RAG) system that performs semantic selection and synthesis across both data streams. This RAG-based filtering stage extracts psychologically grounded semantic understanding while simultaneously capturing syntactic conversational patterns relevant to psychological discourse.

This dual-source approach ensures both breadth of conversational patterns and depth of evidence-based psychological frameworks. The next section analyzes the integration and purpose of each individual process.


Ethical Multilingual Data Collection

We developed a specialized scraping agent that operates under strict ethical constraints, collecting only content explicitly released for commercial training purposes. The agent employs keyword-based filtering focused on domains relevant to mental wellness: emotional intelligence, conversational patterns, psychological concepts, supportive communication, and general knowledge enabling contextual understanding. All collected data undergoes automated license verification, accepting only Creative Commons Zero (CC0) content or materials with explicit commercial training permissions. This ensures our training pipeline respects intellectual property while maintaining the ethical sourcing central to our mission.
Our scraping infrastructure collects data in 21 languages, spanning major linguistic families to ensure our eventual model can serve diverse global populations. The multilingual approach required developing language-specific keyword taxonomies, handling varied character encodings, and managing diverse document formats. Our collection spans textual content from public repositories, government databases, educational institutions, and openly licensed publications. Figure 4 depicts the web-scraping agent's process.

The pipeline begins by querying the web to list and sort domain-specific dataset files, using language-specific prefixes. For each language, shards are downloaded as gzipped JSON files via streaming requests with a 60-second timeout for reliability. Text extraction then occurs line by line, appending a newline to each document's content while skipping malformed entries to maintain data integrity. To manage large-scale data efficiently, the pipeline implements resume logic: it scans existing output directories using glob patterns to identify the last processed chunk and appends accordingly. Output is chunked into TXT files limited to approximately 500 MB each, preventing memory overflows during processing. This approach not only respects intellectual property through automated license verification but also ensures scalability for multilingual corpora.
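As a concrete illustration, the following is a minimal Python sketch of this download loop. The file naming scheme, helper names, and the `content` JSON field are assumptions for illustration, not the production implementation.

```python
import glob
import gzip
import json
import os

import requests

CHUNK_LIMIT_BYTES = 500 * 1024 * 1024  # ~500 MB per output TXT chunk


def last_chunk_index(out_dir: str) -> int:
    """Resume logic: find the highest chunk number already written to out_dir."""
    chunks = glob.glob(os.path.join(out_dir, "chunk_*.txt"))
    if not chunks:
        return 0
    return max(int(os.path.basename(c).split("_")[1].split(".")[0]) for c in chunks)


def process_shard(url: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    idx = last_chunk_index(out_dir)
    out = open(os.path.join(out_dir, f"chunk_{idx:05d}.txt"), "a", encoding="utf-8")
    # Stream the gzipped JSON shard with the 60-second timeout noted above.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with gzip.open(resp.raw, mode="rt", encoding="utf-8") as lines:
            for line in lines:
                try:
                    doc = json.loads(line)  # one JSON document per line
                except json.JSONDecodeError:
                    continue  # skip malformed entries to preserve data integrity
                out.write(doc.get("content", "") + "\n")
                if out.tell() >= CHUNK_LIMIT_BYTES:  # roll over near the ~500 MB limit
                    out.close()
                    idx += 1
                    out = open(os.path.join(out_dir, f"chunk_{idx:05d}.txt"),
                               "a", encoding="utf-8")
    out.close()
```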

Raw text collected from web sources requires substantial processing and tokenization before being injected into the RAG database. We built an end-to-end preprocessing agent, explained in Section 2.3.3.



Document Intelligence and Literature Processing

A significant portion of our psychology knowledge base comes from scanned PDFs with non-descriptive filenames, artifacts of mass digitization projects containing more than 300 books covering clinical psychology, emotional wellness, therapeutic approaches, communication strategies, and mental health fundamentals. Figure 5 depicts the complete pipeline for processing the literature data.

We developed an automated system that reads the first page of each PDF using optical character recognition, extracts the publication title using Google's document understanding API, and renames files accordingly. In addition to file renaming, the document intelligence system extracts and records structured metadata for each scanned document to support traceability, retrieval, and downstream processing. The extracted metadata includes the publication title, document identifier, language, page count, OCR confidence score, and source category (book, report, or academic text).

When available, auxiliary information such as section headings and publication year is also captured. All metadata is stored alongside the processed text in a structured format (JSON), maintaining a persistent linkage between the original scanned document, its textual representation, and its downstream embeddings. This metadata is not used as training content itself but serves as contextual and organizational information that improves document management, enables accurate retrieval within the RAG system, and supports auditing and reproducibility of the knowledge base construction process.
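To make the schema concrete, a hypothetical metadata record might look as follows; the key names track the fields listed above, but the exact layout and all sample values are illustrative only.

```json
{
  "title": "Foundations of Emotional Wellness",
  "doc_id": "lit_0147",
  "language": "en",
  "page_count": 312,
  "ocr_confidence": 0.94,
  "source_category": "book",
  "section_headings": ["Preface", "Chapter 1"],
  "publication_year": 2011
}
```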

Following the renaming process, OCR technology extracts the full textual content from each document, converting scanned images into machine-readable text files. This extraction process handles various document layouts, font styles, and image qualities while maintaining the semantic structure of the original content. All text inputs are processed using the common preprocessing and tokenization method described in Section 2.3.3.

To construct a high-quality and domain-aligned knowledge base from large volumes of unstructured text, we employed a RAG pipeline combined with prompt-guided semantic retrieval. After the preprocessing stage, all textual documents are segmented into semantically coherent chunks using an overlapping recursive splitting strategy. Each chunk is embedded into a dense vector representation and indexed within a FAISS-based vector store, enabling efficient similarity-based retrieval over the entire corpus.
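A minimal sketch of this chunk-embed-index stage is shown below, assuming LangChain's recursive splitter, a sentence-transformers embedding model, and a flat FAISS index; the model name, chunk size, and overlap are illustrative rather than our production configuration.

```python
import faiss
import numpy as np
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model


def index_documents(documents: list[str]) -> tuple[faiss.Index, list[str]]:
    # Split each document into overlapping, semantically coherent chunks.
    chunks = [c for doc in documents for c in splitter.split_text(doc)]
    # Embed chunks into dense vectors, normalized so inner product == cosine.
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(np.asarray(vectors, dtype=np.float32))
    return index, chunks
```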

When new content or queries are processed, semantically relevant text segments are retrieved from the vector store and supplied to a large language model through a structured prompt. This prompt serves a dual role: grounding the model’s reasoning in external knowledge while simultaneously guiding semantic evaluation of the retrieved content. Rather than relying on static rules or keyword-based heuristics, the model assesses each segment for domain relevance, contextual alignment, and informational value.

Content that satisfies these semantic criteria is retained and consolidated into the knowledge base, while off-domain, redundant, or weakly informative segments are automatically excluded. 
By embedding the filtering logic directly within prompt instructions, the system enables flexible and interpretable domain control without requiring modifications to the retrieval or embedding infrastructure. This design ensures consistent semantic grounding across tasks, reduces redundancy in pipeline components, and supports scalable knowledge integration. By decoupling retrieval, filtering, and generation while coordinating them through prompt-level control, the system maintains adaptability, reproducibility, and high precision in knowledge base construction.

Data extracted by the RAG pipeline is annotated and converted into a consistent format by the Semantic Annotation agent and the Universal Data Formatter agent, which are elaborated in Section 2.3.4.



Preprocessing Methodology

Figure 6 presents an overview of the unified preprocessing pipeline applied to all collected text data. Irrespective of source, language, or acquisition method, every document is processed through a single, standardized pipeline to ensure semantic consistency, predictable resource usage, and compatibility with downstream training and retrieval systems.

Given the large-scale and multilingual nature of the corpus, preprocessing is preceded by an explicit token analysis phase designed for resource planning rather than text transformation. Languages and scripts exhibit substantial variation in token density when processed by subword tokenizers, which directly affects memory consumption, batching efficiency, indexing cost, and training feasibility. To ensure alignment across training, retrieval indexing, and inference, the LLaMA 3.2 tokenizer is adopted as the only tokenizer throughout the system.

Before any cleaning or normalization is applied, a Token Analysis Agent performs a non-persistent, stateless token counting pass over raw documents using the LLaMA 3.2 tokenizer. This analysis intentionally operates on unprocessed text to capture worst-case tokenization behavior, avoiding underestimation of computational and memory requirements caused by later noise removal.
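A minimal sketch of this counting pass, assuming the LLaMA 3.2 tokenizer is accessed through Hugging Face transformers (the repository id shown is gated upstream and the exact model variant is an assumption):

```python
from transformers import AutoTokenizer

# Assumed repo id; requires accepted access on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")


def token_stats(raw_documents: list[str]) -> list[dict]:
    # Counting runs on raw, uncleaned text to capture worst-case tokenization,
    # so later noise removal can only reduce the estimated budget.
    stats = []
    for doc in raw_documents:
        n_tokens = len(tokenizer.encode(doc, add_special_tokens=False))
        stats.append({
            "chars": len(doc),
            "tokens": n_tokens,
            "tokens_per_char": n_tokens / max(len(doc), 1),
        })
    return stats  # nothing is persisted; results feed resource planning only
```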

Document-level token statistics serve as a foundational signal for multiple downstream decisions in the preprocessing pipeline. They are used to identify and filter excessively long or structurally inefficient documents that would otherwise distort batching efficiency or exceed context constraints. These statistics further enable language-aware inclusion thresholds that account for script- and language-specific token density, preventing disproportionate representation of high-token-density languages. By incorporating these signals, the pipeline maintains a balanced dataset composition across languages while selecting batch sizes that remain compatible with memory-constrained training and deployment environments. In addition, token-level estimates allow accurate forecasting of preprocessing, indexing, and training costs prior to execution, ensuring that large-scale processing remains predictable and resource-efficient.

Following token analysis, all data passes through a modular, language-aware preprocessing pipeline designed to transform raw text into a clean, normalized, and training-ready representation. The pipeline consists of the following stages:

  1. Data Cleaning: Removal of OCR-induced artifacts, malformed characters, HTML remnants, boilerplate text, and encoding inconsistencies. Language-specific rules are applied to preserve grammatical and orthographic integrity.
  2. Deduplication: A multi-level strategy combining exact hashing, fuzzy string similarity, and semantic embedding similarity to eliminate both exact and near-duplicate documents. This reduces redundancy and mitigates training bias caused by repeated content.
  3. Normalization: Standardization of punctuation, whitespace, and structural elements while preserving language-specific conventions. This step ensures statistical consistency across the corpus without erasing meaningful linguistic variation.
  4. Quality Filtering: Automated rejection of incomplete, incoherent, or low-information documents based on structural and semantic thresholds. Only content meeting minimum coherence and density criteria is retained.

The pipeline is implemented as a configurable system, allowing individual stages to be enabled or disabled depending on dataset characteristics, while preserving a consistent processing contract across all data sources.
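As an illustration of this processing contract, the sketch below wires the four stages behind a toggle-style configuration; the stage implementations are placeholders and the configuration keys are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PreprocessConfig:
    # Any stage can be switched off per dataset; keys are illustrative.
    enabled: dict = field(default_factory=lambda: {
        "clean": True, "dedupe": True, "normalize": True, "quality_filter": True,
    })


def run_pipeline(docs: list[str], cfg: PreprocessConfig,
                 stages: dict[str, Callable[[list[str]], list[str]]]) -> list[str]:
    # Stages run in a fixed order, but the input/output contract
    # (a list of documents in, a list of documents out) never changes.
    for name in ("clean", "dedupe", "normalize", "quality_filter"):
        if cfg.enabled.get(name, False):
            docs = stages[name](docs)
    return docs
```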

After completion of all preprocessing stages, retained documents are tokenized using the LLaMA 3.2 tokenizer to generate the canonical token sequences used by downstream systems. This tokenization step is distinct from the earlier token analysis phase, which is limited to non-persistent token counting for resource planning. Final tokenization produces deterministic token ID sequences that are directly consumed by model training, fine-tuning, and RAG indexing.

Tokenization is performed only after cleaning, deduplication, normalization, and quality filtering to ensure that token boundaries are computed on structurally stable and semantically valid text. This ordering minimizes token inflation caused by noise, formatting artifacts, and duplicated content. A single tokenizer is enforced across all languages and data sources to maintain consistency with the model’s vocabulary and context window constraints. Documents exceeding predefined token limits are truncated or segmented into overlapping chunks based on downstream requirements, such as embedding generation or long-context training, ensuring predictable memory usage and runtime behavior during indexing and inference.



Semantic Annotation Agent and Universal Data Formatter

Large-scale language model training requires consistently structured supervision signals; however, manual annotation at the scale required for modern systems is infeasible due to cost, time, and inherent variability across annotators. To address this limitation, we implement a fully automated Semantic Annotation Agent that operates as a shared semantic processing layer across both the web-scraping pipeline and the RAG data preparation workflow. This agent constitutes the final semantic refinement stage prior to ingestion into the knowledge base.

The annotation agent performs deterministic, model-driven semantic labeling at both document and conversational levels using large language models guided by versioned, fixed prompt templates. Each categorical label is accompanied by a normalized confidence score in the range [0,1], derived from the model’s relative likelihood over candidate classes. All annotations conform to a fixed internal schema, independent of output serialization format, ensuring consistency across heterogeneous data sources and across time. Table 1 summarizes the annotation structure.
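Since Table 1 is summarized rather than reproduced here, the following hypothetical record illustrates the shape of such an annotation; the field names and values are assumptions consistent with the description above.

```json
{
  "doc_id": "conv_000381",
  "level": "conversation",
  "labels": {
    "topic": {"value": "anxiety", "confidence": 0.91},
    "tone": {"value": "supportive", "confidence": 0.87}
  },
  "prompt_template_version": "v3.1"
}
```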

After semantic annotation, all data is transformed into structures compatible with downstream training, fine-tuning, and retrieval indexing stages. Given the diversity of model architectures and training objectives, a single dataset may need to be represented in multiple formats. To avoid maintaining format-specific transformation scripts, we implement a Universal Data Formatter that serves as the final structural normalization step before knowledge base ingestion.

The formatter operates on semantically annotated data and converts it into standardized training representations, including conversational fine-tuning formats, instruction–response pairs, and retrieval-optimized document layouts. Format requirements are specified declaratively using controlled natural language descriptions that map to predefined output templates.

Format interpretation is handled by an LLM-based parser, while transformation execution is performed using deterministic, rule-based logic to ensure structural correctness, schema compliance, and reproducibility. This hybrid design prevents malformed outputs while preserving flexibility across training configurations. The formatter is applied uniformly to data produced by both the web-scraping pipeline and the RAG synthesis system. As a result, all content entering the knowledge base conforms to a consistent structural contract, independent of source. This eliminates downstream special-case handling and enables rapid iteration over alternative training and ingestion strategies without modifying transformation code.
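The sketch below illustrates the deterministic half of this hybrid design: once the LLM-based parser has resolved a natural-language format request to a template name (not shown), a rule-based transform produces the output. The record structure and template names are assumptions.

```python
def to_instruction_pairs(annotated: dict) -> list[dict]:
    """Deterministic transform: annotated conversation -> instruction-response pairs."""
    turns = annotated["turns"]  # assumed: list of {"content": ...} in speaking order
    pairs = []
    for i in range(0, len(turns) - 1, 2):  # user turn followed by assistant turn
        pairs.append({
            "instruction": turns[i]["content"],
            "response": turns[i + 1]["content"],
            "meta": annotated.get("labels", {}),  # carry annotations along
        })
    return pairs


# Registry the LLM-based parser would resolve format requests into;
# rule-based execution guarantees schema compliance and reproducibility.
TEMPLATES = {"instruction_pairs": to_instruction_pairs}


def format_dataset(records: list[dict], template_name: str) -> list[dict]:
    transform = TEMPLATES[template_name]
    return [pair for rec in records for pair in transform(rec)]
```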



The Outer Layer: Synthetic Conversation Generation

The synthetic conversation generation layer represents the culmination of our knowledge base construction, transforming processed psychological literature and multilingual web-scraped data into training-ready dialogue sequences.
This layer of the architecture is responsible for generating large-scale, high-fidelity, psychologically grounded, multi-turn conversational data without ever requiring access to real user interactions.



Generation Architecture Overview

The generation system employs a prompt-driven LLM inference pipeline that uses the complete knowledge base as contextual grounding. Unlike traditional data augmentation techniques that apply rule-based transformations or template filling, our approach uses a generative model to produce novel, contextually rich conversations that exhibit the linguistic diversity, emotional nuance, and psychological realism required for mental health support training.

We employ the open-source OSS 120B-parameter model, served via Azure OpenAI, as the generation engine. This model class provides sufficient capacity for nuanced psychological reasoning, multilingual generation, and long-context coherence while maintaining inference feasibility within cloud-based batch generation workflows. The selection prioritizes models with demonstrated performance on instruction-following, emotional intelligence, and conversational depth rather than purely maximizing parameter count.

Rather than generating isolated utterances or single-turn exchanges, the system produces complete multi-turn conversations ranging from 12 to 20 turns, simulating extended therapeutic dialogues spanning approximately 30-minute sessions. This design decision reflects the reality that effective mental health support emerges through sustained engagement, gradual emotional exploration, and iterative meaning-making rather than rapid problem-solving exchanges.



Prompt Engineering for Psychological Grounding

The generation quality depends critically on prompt design that encodes psychological principles, conversational dynamics, and linguistic constraints within a single coherent instruction. Our prompt engineering methodology integrates three distinct knowledge sources:

System Identity Definition: A fixed system prompt defines the conversational agent's therapeutic orientation, combining psychoanalytic depth psychology, existential exploration, and trauma-informed presence. This prompt explicitly specifies core therapeutic stances including emotional containment, non-directive curiosity, resistance to premature closure, and ethical boundaries. The system prompt is versioned and frozen during each generation batch to ensure consistency across millions of synthetic samples.

Conversational Rhythm Specification: The prompt explicitly defines a distribution over response types that mirrors evidence-based therapeutic practice: 60% reflective questioning (open-ended invitations to self-exploration), 25% empathic mirroring (concise emotional reflections), and 15% gentle guidance (subtle reframing or grounding interventions). This quantitative specification prevents generation bias toward interrogative patterns or advice-giving responses, both of which contradict therapeutic best practices for mental health support.

Linguistic and Cultural Constraints: Generation instructions enforce code-switching patterns representative of multilingual user populations, particularly Indian English with embedded Hindi lexical items ("yaar," "matlab," "bas," "theek"). The prompt explicitly prohibits repetitive sentence structures, prescriptive advice, solution-oriented responses, and premature conversation termination. These constraints address failure modes observed in early generation experiments where models defaulted to problem-solving rather than emotional exploration.



Batch Generation Pipeline

Synthetic data generation operates as a stateless, parallelizable batch inference process with built-in resilience mechanisms for large-scale production. The pipeline architecture balances generation throughput with API rate limits, cost constraints, and quality maintenance.

Each generation request specifies a target sample count (typically 5-20 conversations per batch) and includes the complete system prompt, generation guidelines, and five reference examples drawn from human-curated seed conversations. These examples are not templates but serve as distributional anchors that guide stylistic consistency, emotional range, and structural diversity. The generation prompt explicitly instructs the model to produce novel content distinct from the examples, preventing memorization or template replication.

Generated conversations are returned in JSON Lines (JSONL) format, where each line represents a complete conversation containing a system message and an array of alternating user ("Mitwa") and assistant ("Bloom") turns. This schema enables efficient streaming, incremental validation, and parallel processing during downstream training data preparation.
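A hypothetical conversation in this schema is shown below, pretty-printed for readability (the actual output places each conversation on a single JSONL line); the utterances and key names are illustrative.

```json
{
  "messages": [
    {"role": "system", "content": "<frozen therapeutic system prompt>"},
    {"role": "user", "speaker": "Mitwa", "content": "I keep waking up at 3 AM, yaar, and the thoughts just spiral."},
    {"role": "assistant", "speaker": "Bloom", "content": "What is it like for you in those hours, when the spiral begins?"}
  ]
}
```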

Every generated conversation undergoes schema validation to ensure structural correctness: presence of the system prompt, minimum turn count (≥12), balanced turn alternation, and non-empty utterances. Conversations failing validation are logged but excluded from the training corpus. This validation occurs synchronously during generation to provide immediate feedback on prompt effectiveness and model behavior.
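A minimal sketch of these checks, written against the message layout illustrated above (the key names remain assumptions):

```python
def validate_conversation(conv: dict) -> bool:
    msgs = conv.get("messages", [])
    if not msgs or msgs[0].get("role") != "system":
        return False                       # system prompt must be present
    turns = msgs[1:]
    if len(turns) < 12 or len(turns) % 2:  # minimum turn count, balanced pairs
        return False
    for i, msg in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:    # strict user/assistant alternation
            return False
        if not str(msg.get("content", "")).strip():
            return False                   # non-empty utterances only
    return True
```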



Quality Control Mechanisms

Maintaining high-quality synthetic data at scale requires automated quality assurance mechanisms that operate without human annotation.

  1. Structural Diversity Enforcement: The generation prompt explicitly prohibits pattern replication across conversations. To enforce diversity, we implement post-generation deduplication using both exact string matching and semantic similarity thresholds (see the sketch after this list). Conversations exhibiting high lexical overlap (>70% n-gram similarity) or semantic redundancy (cosine similarity >0.85 in embedding space) are flagged and excluded from training data.
  2. Psychological Realism Verification: While we cannot validate therapeutic effectiveness without real user interactions, we employ automated heuristics to detect generation failures that violate psychological principles:
    • Advice Detection: Regex-based pattern matching identifies prescriptive language (e.g., "you should," "you need to," "I recommend") that contradicts the non-directive therapeutic stance.
    • Premature Closure Detection: Conversations concluding with solution proposals or definitive resolutions are flagged, as authentic mental health dialogues rarely achieve clean narrative closure.
    • Emotional Flatness Detection: Sentiment analysis across turns identifies conversations lacking emotional variation, which would fail to train models for affective responsiveness.
  3. Linguistic Naturalism Assessment: Generated text undergoes perplexity measurement using an independently trained language model to identify unnaturally low-perplexity sequences indicative of template replication or formulaic generation. Conversations below a perplexity threshold are examined for repetitive structures and excluded if deemed insufficiently naturalistic.
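A minimal sketch of two of these checks, the n-gram overlap test and advice detection (the trigram tokenization and the exact pattern list are simplifications):

```python
import re

# Simplified pattern list for prescriptive language; not exhaustive.
ADVICE_PATTERNS = re.compile(r"\b(you should|you need to|i recommend)\b", re.IGNORECASE)


def ngram_set(text: str, n: int = 3) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def too_similar(a: str, b: str, threshold: float = 0.70) -> bool:
    """Flag conversation pairs whose trigram overlap exceeds the 70% threshold."""
    na, nb = ngram_set(a), ngram_set(b)
    if not na or not nb:
        return False
    return len(na & nb) / min(len(na), len(nb)) > threshold


def contains_advice(utterance: str) -> bool:
    """Detect prescriptive language that violates the non-directive stance."""
    return bool(ADVICE_PATTERNS.search(utterance))
```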



Resilience and Rate Limit Management

Large-scale generation targeting gigabyte-scale datasets requires robust error handling and adaptive rate limiting to maintain production stability. The exponential backoff strategy is a retry mechanism for transient failures (HTTP 429, 500, 502, 503, 504): the initial retry delay is 2 seconds, doubling after each consecutive failure up to a maximum of 300 seconds. This strategy accommodates temporary quota exhaustion or infrastructure instability without prematurely terminating generation runs.

A consecutive-failure circuit breaker prevents infinite retry loops during sustained API unavailability or quota depletion: the system tracks consecutive empty batch responses, and after 10 consecutive failures, generation terminates with detailed logging of the failure context, enabling manual diagnosis of quota limits, endpoint configuration errors, or systematic prompt failures.

Quota-aware pacing introduces inter-batch delays calibrated to the API quota structure. For Azure OpenAI deployments with daily token limits, we enforce a 60-second inter-batch delay to distribute request load evenly across the quota window. This conservative pacing prevents burst exhaustion of daily allocations while maintaining generation throughput sufficient to produce gigabyte-scale datasets within multi-day windows.

Token usage monitoring relies on each API response including token consumption metadata (prompt tokens, completion tokens, total tokens). These metrics are logged synchronously and aggregated to track cumulative quota consumption, enabling real-time projection of dataset completion time and early detection of quota exhaustion risk.
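The sketch below combines these mechanisms: exponential backoff capped at 300 seconds, a 10-failure circuit breaker, and the 60-second inter-batch delay. The API call and its error type are placeholders, not a real client.

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}


class TransientHTTPError(Exception):
    """Placeholder for the HTTP error surfaced by the API client."""
    def __init__(self, status: int):
        self.status = status


def generate_all_batches(batches, call_api):
    consecutive_failures = 0
    for batch in batches:
        delay = 2.0  # initial retry delay
        while True:
            try:
                result = call_api(batch)  # placeholder for the LLM batch call
                break
            except TransientHTTPError as err:
                if err.status not in RETRYABLE:
                    raise  # non-transient errors propagate immediately
                time.sleep(delay)
                delay = min(delay * 2, 300.0)  # exponential backoff, 300 s cap
        if not result:  # empty batch response
            consecutive_failures += 1
            if consecutive_failures >= 10:  # circuit breaker trips
                raise RuntimeError("10 consecutive empty batches; aborting run")
        else:
            consecutive_failures = 0
            yield result
        time.sleep(60)  # quota-aware pacing between batches
```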



Post-Generation Transformation Pipeline

Raw generated conversations undergo structural transformation before ingestion into training datasets, ensuring compatibility with model training frameworks and enabling metadata-driven filtering during training.

Schema Transformation: Each conversation is converted from the flat JSONL generation format into a structured training format containing:

  1. A system message role specifying the therapeutic identity
  2. An alternating sequence of user and assistant message objects with explicit role labels
  3. A metadata object containing conversation type ("multi"), detected language code, and generation timestamp

This transformation is performed by a deterministic Python function that validates message alternation, ensures non-empty content, and applies language detection heuristics based on character set analysis (ASCII-only → English, Devanagari presence → Hindi, mixed → Hindi-English code-switching).

Language Detection: Language classification operates on concatenated conversation text using character-level analysis. The detection algorithm identifies the presence of Devanagari Unicode ranges (U+0900 to U+097F) and ASCII alphabetic characters. Mixed presence is classified as "hi_en" (Hindi-English code-switching), pure Devanagari as "hi", and ASCII-only as "en". This lightweight approach avoids the computational overhead and potential biases of ML based language detection while providing sufficient granularity for training data stratification.
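This heuristic is simple enough to sketch in full; the following follows the classification rules above (the function name is an assumption):

```python
def detect_language(text: str) -> str:
    # Devanagari Unicode block: U+0900 through U+097F.
    has_devanagari = any("\u0900" <= ch <= "\u097f" for ch in text)
    has_ascii_alpha = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_devanagari and has_ascii_alpha:
        return "hi_en"   # Hindi-English code-switching
    if has_devanagari:
        return "hi"
    return "en"          # ASCII-only alphabetic content
```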

Metadata Enrichment: Each conversation is annotated with automatically derived metadata that enables training-time filtering and evaluation stratification:

  • Conversation Type: Always "multi" for generated conversations, distinguishing them from potential single-turn question-answering data in hybrid training scenarios
  • Language Code: Detected language classification enabling language-specific sampling strategies during training
  • Turn Count: Explicit count of message pairs, enabling length-based curriculum learning or filtering
  • Generation Timestamp: ISO 8601 timestamp enabling temporal tracking of dataset composition and retrospective analysis of prompt engineering iterations



Integration with Knowledge Base

The synthetic generation layer operates as the terminal consumer of the knowledge base, directly leveraging both web-scraped data and RAG-extracted psychological frameworks. While the generation prompt does not explicitly query the RAG system during inference, the base LLM's pretraining and instruction-tuning on psychological literature creates implicit knowledge transfer. The model's understanding of therapeutic techniques, emotional regulation strategies, and psychological frameworks derives from the same corpus of 300+ curated texts processed by the RAG system.

The generation prompt serves as a knowledge grounding mechanism by explicitly encoding therapeutic principles, conversational rhythms, and ethical constraints extracted from psychological literature. Rather than retrieving specific passages during generation, the prompt distills evidence-based practices into operational instructions that guide the model's generation behavior.

Dual-Path Knowledge Synthesis: The final training dataset embodies knowledge synthesis from both acquisition pathways: syntactic diversity and multilingual coverage from web-scraped data inform linguistic naturalism, while semantic depth and therapeutic validity from RAG-processed literature inform psychological appropriateness. This dual-path synthesis occurs implicitly through the generation model's learned representations rather than explicit retrieval or template instantiation.
