LLMs · Fine-tuning · Knowledge Awareness · Question Answering · DPO

KnowTuning: Knowledge-aware Fine-tuning for Large Language Models

Yougang Lyu et al.

This paper introduces KnowTuning (Knowledge-aware Fine-tuning), a novel two-stage method that enhances the fine-grained and coarse-grained knowledge awareness of LLMs. The method targets LLMs' shortcomings on complex knowledge-intensive tasks, where they often generate incomplete, non-factual, or illogical answers. It combines fine-grained knowledge augmentation with coarse-grained knowledge comparison across completeness, factuality, and logicality.


Summary of the KnowTuning Research Paper

1. Problem Focus

The research addresses the limitation that, despite their general success, Large Language Models (LLMs) struggle to effectively leverage knowledge in complex knowledge-intensive tasks. Vanilla fine-tuning often results in answers that are incomplete, non-factual, or illogical.

The core problems stem from inadequate knowledge awareness during standard fine-tuning:

  1. Fine-grained Knowledge Awareness: LLMs struggle to precisely identify detailed, atomic pieces of knowledge within answers.
  2. Coarse-grained Knowledge Awareness: LLMs frequently fail to distinguish effectively between reliable and unreliable knowledge across critical aspects: completeness, factuality, and logicality.

While prior work like FactTune focuses mainly on improving factuality, KnowTuning aims to enhance knowledge awareness across all these essential aspects simultaneously for solving complex knowledge-intensive Question Answering (QA) tasks.

2. Contributions

The main contributions of the paper are summarized as follows:

  • Systematic Enhancement: Focusing on systematically enhancing LLM knowledge awareness at both fine-grained and coarse-grained levels to address complex knowledge-intensive tasks.
  • Novel Method: Introducing KnowTuning, a method that fine-tunes LLMs using two stages: fine-grained knowledge augmentation and coarse-grained knowledge comparison.
  • Demonstrated Effectiveness: Verifying the effectiveness of KnowTuning on generic and medical domain QA datasets through automatic and human evaluations across various LLM sizes.
  • Factual Precision: Demonstrating that KnowTuning generates more facts with a lower factual error rate during fine-grained facts evaluation.

3. Proposed Method (KnowTuning)

KnowTuning is structured into two sequential stages designed to improve different levels of knowledge awareness.

Stage I: Fine-grained Knowledge Augmentation

This stage trains LLMs to identify difficult knowledge points to improve fine-grained knowledge awareness.

  1. Atomic Knowledge Extraction: Answers ($a_i$) are broken down into individual facts, called atomic knowledge ($K_i$). This process is implemented by prompting OpenAI models (Extract(·)).
  2. Difficult Knowledge Filtering: The generation perplexity ($ppl_{ji}$) of each atomic knowledge piece ($k_{ji}$) is calculated using the baseline SFT model ($\pi_{SFT}$). High perplexity indicates a lack of knowledge awareness. An $\alpha$ percentage of atomic knowledge with the highest perplexity is selected as the difficult knowledge ($K^*_i$).
  3. QA Pair Rewriting: The original question ($q_i$) is rewritten into a fine-grained question ($q^*_i$) relevant to $K^*_i$ using Rewrite(·). A fine-grained answer ($a^*_i$) is also rewritten based on $K^*_i$.
  4. Training: The LLM is trained on the combined dataset ($D_{ka}$), which includes both the original QA pairs and the newly generated fine-grained QA pairs. A minimal sketch of the filtering step appears after this list.
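The sketch below illustrates the difficult-knowledge filtering step under the assumption that the atomic knowledge has already been extracted by the Extract(·) prompt. The model checkpoint, the way perplexity is conditioned on the question, and the helper names are assumptions for illustration, not the paper's code.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the baseline SFT model is a HuggingFace causal-LM checkpoint;
# the paper fine-tunes Llama2-base models, but any causal LM illustrates the idea.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
sft_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
sft_model.eval()

def snippet_perplexity(question: str, snippet: str) -> float:
    """Generation perplexity ppl_ji of one atomic-knowledge piece k_ji
    under the SFT model pi_SFT, conditioned on the question (approximate)."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + snippet, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the snippet tokens
    with torch.no_grad():
        loss = sft_model(full_ids, labels=labels).loss  # mean NLL over snippet tokens
    return math.exp(loss.item())

def filter_difficult_knowledge(question: str, atomic_knowledge: list[str], alpha: float = 0.5):
    """Keep the alpha fraction of atomic knowledge with the highest perplexity (K*_i)."""
    scored = sorted(atomic_knowledge,
                    key=lambda k: snippet_perplexity(question, k),
                    reverse=True)
    n_keep = max(1, int(len(scored) * alpha))
    return scored[:n_keep]

# Extract(.) and Rewrite(.) are prompting calls to an OpenAI model in the paper;
# they are left unimplemented here rather than guessed at.
```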

Stage II: Coarse-grained Knowledge Comparison

This stage trains the LLM (initialized as $\pi_{ka}$ from Stage I) to distinguish between reliable ($a_w$) and unreliable ($a_l$) knowledge across completeness, factuality, and logicality using Direct Preference Optimization (DPO).

To prevent overfitting, the better answer ($a_w$) used is a rephrased original answer ($a_{ri}$) based on the original atomic knowledge. The worse answers ($a_l$) are created using knowledge-disturbing techniques (a minimal sketch follows the table below):

| Comparison Aspect | Disturbing Technique | Worse Answer ($a_l$) | How It's Created (Source Detail) |
| --- | --- | --- | --- |
| Completeness ($D_{kcc}$) | Deleting ($\beta$ percent of) atomic knowledge | Incomplete answer ($a_{ci}$) | Randomly delete atomic knowledge |
| Factuality ($D_{kfc}$) | Revising atomic knowledge | Nonfactual answer ($a_{fi}$) | Prompting OpenAI models (Revise(·)) to revise atomic knowledge into wrong knowledge |
| Logicality ($D_{klc}$) | Shuffling atomic knowledge | Illogical answer ($a_{li}$) | Randomly shuffle the order of all atomic knowledge |
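A minimal sketch of the three disturbing operations, assuming the atomic knowledge is already available as a list of fact strings. The Revise(·) step is an OpenAI prompting call in the paper and is passed in here as a placeholder callable; function names and details are illustrative assumptions.

```python
import random

def incomplete_answer(atomic_knowledge: list[str], beta: float = 0.5) -> str:
    """Completeness disturbance: randomly delete a beta fraction of atomic knowledge."""
    n_keep = max(1, int(len(atomic_knowledge) * (1 - beta)))
    kept = set(random.sample(atomic_knowledge, n_keep))
    # Preserve the original relative order of the surviving facts.
    return " ".join(k for k in atomic_knowledge if k in kept)

def nonfactual_answer(atomic_knowledge: list[str], revise) -> str:
    """Factuality disturbance: rewrite facts into wrong ones via a Revise(.)-style
    callable (an OpenAI prompting step in the paper)."""
    return " ".join(revise(k) for k in atomic_knowledge)

def illogical_answer(atomic_knowledge: list[str]) -> str:
    """Logicality disturbance: randomly shuffle the order of all atomic knowledge."""
    shuffled = atomic_knowledge[:]
    random.shuffle(shuffled)
    return " ".join(shuffled)
```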

The coarse-grained knowledge comparison dataset ($D_{kc}$) combines these three comparison sets. The model is optimized using a combined loss function $L_{kc} = L_{dpo} + \gamma L_{sft}$, where $L_{dpo}$ is the DPO loss and $L_{sft}$ is an SFT loss term on the better answers ($a_w$).
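A minimal sketch of this combined objective, assuming sequence log-probabilities of the preferred and dispreferred answers have already been computed under the policy and a frozen reference model (typically the Stage-I model $\pi_{ka}$). The DPO temperature value and variable names are assumptions; only $\gamma = 0.2$ comes from the paper.

```python
import torch.nn.functional as F

def knowledge_comparison_loss(
    policy_logp_w, policy_logp_l,   # log p_theta(a_w | q), log p_theta(a_l | q)
    ref_logp_w, ref_logp_l,         # same quantities under the frozen reference model
    beta_dpo: float = 0.1,          # DPO temperature (not stated in this summary; illustrative)
    gamma: float = 0.2,             # SFT weighting term gamma from the paper
):
    """L_kc = L_dpo + gamma * L_sft, with the SFT term placed on the better answers a_w."""
    # Standard DPO loss on the preference pair (a_w preferred over a_l).
    logits = beta_dpo * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    dpo_loss = -F.logsigmoid(logits).mean()
    # SFT term: maximize the likelihood of the better answer under the policy.
    sft_loss = -policy_logp_w.mean()
    return dpo_loss + gamma * sft_loss
```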

4. Implementation Details

  • Backbone Models: Llama2-base models with 7B and 13B parameters were used.
  • External Tools: The OpenAI model gpt-3.5-turbo was utilized for Extract(·), Rewrite(·), and Revise(·) functions.
  • Hyperparameters: The filtering percentage ($\alpha$) and deleting percentage ($\beta$) were both fixed at 0.5. The scalar weighting hyperparameter ($\gamma$) was set to 0.2.
  • Training Process: SFT training ran for 3 epochs and DPO ran for 1 epoch. Training used the AdamW optimizer, PEFT, LLaMA-Factory, and LoRA (a minimal setup sketch follows this list).
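A minimal setup sketch using HuggingFace `peft` directly (the paper uses LLaMA-Factory, whose configs are not reproduced here). The LoRA rank, alpha, dropout, target modules, and learning rate are illustrative assumptions; only the backbone, optimizer, and epoch counts come from the summary above.

```python
import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Assumption: illustrative LoRA settings; the summary does not specify them.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, lora_cfg)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # learning rate is an assumption
# Per the paper: SFT (Stage I) runs for 3 epochs; DPO (Stage II) runs for 1 epoch.
```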

5. Performance Comparison

KnowTuning demonstrated superior performance compared to baselines, including vanilla SFT, RLAIF, and FactTune.

Automatic and GPT-4 Evaluation (RQ1)

  • Absolute Quality: KnowTuning consistently improved the absolute quality of answers and outperformed baselines on lexicon-based (METEOR) and semantic-based (BERTScore) metrics across generic (Dolly, NQ, ELI5) and medical (MedQuAD) QA datasets; a small metric-computation sketch follows this list.
  • Knowledge Awareness Metrics (GPT-4): KnowTuning consistently outperformed baselines in terms of completeness, factuality, and logicality. Compared to RLAIF and FactTune (which focus narrowly on helpfulness or factuality), KnowTuning was more effective at improving performance across these multiple aspects simultaneously.
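To make the automatic metrics concrete, here is a minimal sketch of computing METEOR and BERTScore with common open-source libraries. The paper's exact evaluation scripts and settings are not reproduced here, so the library choices and the toy reference/candidate strings are assumptions.

```python
# May require nltk.download("wordnet") and nltk.download("punkt") on first use.
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score
from bert_score import score as bertscore

reference = "Amsterdam is the capital of the Netherlands."
candidate = "The capital of the Netherlands is Amsterdam."

# METEOR (lexicon-based): NLTK expects pre-tokenized references and hypothesis.
meteor = meteor_score([word_tokenize(reference)], word_tokenize(candidate))

# BERTScore (semantic-based): returns precision, recall, and F1 tensors.
P, R, F1 = bertscore([candidate], [reference], lang="en")
print(f"METEOR={meteor:.3f}  BERTScore-F1={F1.item():.3f}")
```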

Fine-Grained Fact Evaluation (RQ2)

KnowTuning showed success at generating high-quality facts at a fine-grained level.

  • Correct Facts: KnowTuning consistently generated a larger number of correct facts (# Correct) across generic and domain-specific QA datasets compared to SFT and FactTune.
  • Proportion of Correct Facts: KnowTuning generated answers with a higher proportion of correct facts (% Correct) across various sizes. For the Llama2-7b-base model on Dolly, KnowTuning achieved 85.92% correct facts, surpassing FactTune's 85.42%.
  • Factual Error Rate: KnowTuning generated more facts while maintaining a lower factual error rate.

Ablation Study (RQ3)

Ablation studies confirmed the necessity of both stages:

  • Removing Fine-grained Augmentation (-KA): Decreased performance across completeness, factuality, and logicality, indicating its effectiveness in improving fine-grained knowledge awareness.
  • Removing Coarse-grained Comparison (-KC): Removing all coarse-grained comparison sets resulted in the most substantial performance degradation across all aspects. Removing individual comparison sets (e.g., -KCC, -KFC, -KLC) adversely affected their corresponding metric.

Human Evaluation and Cost

  • Human Evaluation: Human judgment confirmed that KnowTuning consistently surpassed FactTune in terms of completeness, factuality, and logicality performance across both generic and medical domains.
  • Cost Efficiency: KnowTuning was found to be more cost-effective for dataset construction (Dolly cost: $8.45) compared to RLAIF ($9.94) and FactTune ($10.53).