
Score Prediction from User Logs with BERT

This project aims to predict user scores based on sequences of their interaction logs with a quiz system, without knowledge of the correct answers or the choices made by the user. The primary challenge is to infer the score from patterns in user behavior during the quiz.

Problem statement:

  • Actions: Distinct types of interactions like changing an answer, requesting a hint, visiting educational resources (EDA), and interacting with a chatbot.
  • States: Representations of the user’s quiz attempt at each point in the action log, including time spent, number of hints requested, and other engagement metrics.
  • Labels: Final scores of students, which are the target predictions of the model.

Implementation steps:

  • 1- Data Collection and Preprocessing: Gather and preprocess data into a format compatible with the Transformer model.
  • 2- Feature Engineering: Develop comprehensive features that encapsulate diverse aspects of user interactions.
  • 3- Model Training: Train the model on the prepared dataset, adjusting parameters as necessary.
  • 4- Model Evaluation: Validate the model’s performance on a separate test set to ensure its effectiveness.

Data Collection and Preprocessing

The dataset consists of user interaction logs from a quiz system (DaTu). Each log entry records the type of action performed by the user and the corresponding time interval. The actions include various interactions such as changing an answer, requesting hints, visiting educational resources, and interacting with a chatbot. The raw data is preprocessed to convert actions and time intervals into a sequence of tokens. Each action is assigned a unique token (e.g., ‘UA’ for changing an answer, ‘FH’ for requesting the first hint), and time intervals are binned into categories (e.g., ‘0’ for 0-1 seconds, ‘1’ for 1-5 seconds). Additionally, problem IDs (‘Q1’, ‘Q2’, ‘Q3’) are incorporated to indicate the specific problem being attempted.

Token assigned to each action:

| Action | Token | Action | Token |
| --- | --- | --- | --- |
| Change answer | ‘UA’ | First answer | ‘FA’ |
| Paste answer | ‘PA’ | Update answer explanation | ‘UE’ |
| Request first hint | ‘FH’ | Request another hint | ‘UH’ |
| Respond to hint feedback | ‘RH’ | New answer explanation | ‘FE’ |
| Freeform code run | ‘RF’ | Run code | ‘RC’ |
| User request | ‘B’ | Update confidence | ‘C’ |
| Complete sub-module | ‘M’ | Streamlit interaction | ‘S’ |
| Problem | ‘Q1’, ‘Q2’, ‘Q3’ | Time | ‘T’ |

Each time interval is binned as follows:

| Interval (s) | Token | Interval (s) | Token |
| --- | --- | --- | --- |
| 0–1 | ‘0’ | 1–5 | ‘1’ |
| 5–10 | ‘2’ | 10–15 | ‘3’ |
| 15–20 | ‘4’ | 20–30 | ‘5’ |
| 30–60 | ‘6’ | 60–120 | ‘7’ |
| 120–300 | ‘8’ | >300 | ‘MAX’ |
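
To make the encoding concrete, here is a minimal sketch of the preprocessing step under the mappings above. The log-entry format, the function names, and the ‘T’ + bin spelling of time tokens (e.g. ‘T3’) are assumptions for illustration:

```python
# Upper bounds (seconds) for each time bin, per the table above.
TIME_BINS = [(1, "0"), (5, "1"), (10, "2"), (15, "3"), (20, "4"),
             (30, "5"), (60, "6"), (120, "7"), (300, "8")]

def time_token(seconds):
    """Bin a raw interval in seconds into its time token."""
    for upper, tok in TIME_BINS:
        if seconds < upper:
            return f"T{tok}"
    return "TMAX"  # intervals longer than 300 s

def encode_log(entries):
    """Turn [(problem_id, action_token, seconds), ...] into a token sequence."""
    sequence, current_problem = [], None
    for problem, action, seconds in entries:
        if problem != current_problem:  # mark a switch to a new problem
            sequence.append(problem)
            current_problem = problem
        sequence.append(action)
        sequence.append(time_token(seconds))
    return sequence

# First answer on Q1 after 12 s, then an answer change after 3 s:
print(encode_log([("Q1", "FA", 12), ("Q1", "UA", 3)]))
# -> ['Q1', 'FA', 'T3', 'UA', 'T1']
```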

Model Training

The core model architecture used is a DistilBERT-based Transformer model, enhanced with Low-Rank Adaptation (LoRA) for efficient fine-tuning. The model incorporates custom embeddings for the action-time tokens. The training loop includes standard components such as loss calculation, gradient descent optimization, and learning rate scheduling. Curriculum learning is employed to gradually introduce more complex sequences, starting with simpler ones, to enhance model robustness.
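
The post does not include the model code; the sketch below shows one plausible wiring with the Hugging Face transformers and peft libraries. The regression head, the vocabulary size, and the LoRA settings (r=8, alpha=16) are illustrative assumptions; q_lin and v_lin are DistilBERT’s attention projection layers:

```python
import torch.nn as nn
from transformers import DistilBertModel
from peft import LoraConfig, get_peft_model

class ScoreRegressor(nn.Module):
    """DistilBERT encoder with LoRA adapters and a scalar regression head."""

    def __init__(self, vocab_size):
        super().__init__()
        base = DistilBertModel.from_pretrained("distilbert-base-uncased")
        base.resize_token_embeddings(vocab_size)  # custom action-time vocabulary
        lora_cfg = LoraConfig(
            r=8, lora_alpha=16, lora_dropout=0.1,  # illustrative LoRA settings
            target_modules=["q_lin", "v_lin"],     # DistilBERT attention projections
        )
        self.encoder = get_peft_model(base, lora_cfg)
        self.head = nn.Linear(base.config.dim, 1)  # predicted score

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return self.head(hidden[:, 0])  # regress from the first token's state
```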

Curriculum Learning Implementation

Curriculum learning is applied by sorting the training data based on sequence length and complexity. Initially, the model is trained on shorter, simpler sequences. As training progresses, more complex sequences are gradually introduced. This approach helps the model to build a strong foundation before tackling more difficult examples.
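
The staging schedule itself is not spelled out; a minimal sketch, assuming sequence length as the complexity proxy and three stages:

```python
def curriculum_stages(examples, num_stages=3):
    """Yield progressively larger training subsets, easiest first."""
    # Sort by sequence length, the stand-in for complexity.
    ordered = sorted(examples, key=lambda ex: len(ex["tokens"]))
    for stage in range(1, num_stages + 1):
        # Each stage trains on a growing prefix of the sorted data,
        # so longer, harder sequences enter gradually.
        yield ordered[: int(len(ordered) * stage / num_stages)]
```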

Data Augmentation Algorithm

Input:

  • sequences: A list of sequences where each sequence is a list of tokens.
  • max_changes: The maximum number of changes allowed per sequence.
  • action_prob: The probability of performing an action (repeat, skip, insert).
  • time_prob: The probability of varying time intervals.

Output:

  • all_sequences: A list of original and augmented sequences.

Algorithm:

  1. Initialize all_sequences as an empty list.
  2. For each sequence in sequences:
    1. Append the original sequence to all_sequences.
    2. Initialize augmented_sequence as an empty list, number_of_changes to 0, and index i to 0.
    3. While i is less than the length of the sequence:
      1. If number_of_changes exceeds max_changes:
        • Append augmented_sequence followed by the remaining part of the original sequence to all_sequences.
        • Reset augmented_sequence to the first i elements of the original sequence and reset number_of_changes to 0.
      2. Set token to the i-th element of the sequence.
      3. If token starts with ‘Q’ (a problem marker): append token to augmented_sequence, increment i by 1, and continue with the next iteration.
      4. Else if token starts with ‘T’ (a time token):
        • If a random number is less than time_prob, vary the time interval: extract the time value from token, draw a new value from a normal distribution centred on it, clamp it to the bounds [0, 8], append the resulting time token to augmented_sequence, and increment number_of_changes by 1.
        • Otherwise, append token to augmented_sequence unchanged.
      5. Else (an action token): generate a random number p and increment number_of_changes by 1.
        • If p < action_prob: repeat the action by appending the action, a random short time interval (T0, T1, or T2), and the action again to augmented_sequence.
        • Else if p < 2 × action_prob: skip the action and its time token by incrementing i by 2, then continue with the next iteration.
        • Else if p < 2.5 × action_prob: insert a random action and time interval before the current action in augmented_sequence.
        • Else: append the action unchanged and decrement number_of_changes by 1.
      6. Increment i by 1.
    4. Append augmented_sequence to all_sequences.
  3. Return all_sequences.
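
A direct Python rendering of this routine, as a sketch: the pool of insertable actions is a hypothetical choice, ‘TMAX’ is treated as bin 8 when jittering, and skipping advances past both the action and its time token:

```python
import random

def augment(sequences, max_changes, action_prob, time_prob,
            action_pool=("UA", "UH", "RC")):
    """Sketch of the routine above; action_pool is a hypothetical set
    of action tokens eligible for random insertion."""
    all_sequences = []
    for seq in sequences:
        all_sequences.append(list(seq))              # keep the original
        aug, changes, i = [], 0, 0
        while i < len(seq):
            if changes > max_changes:
                # Emit the variant built so far plus the untouched tail,
                # then restart from the original prefix.
                all_sequences.append(aug + list(seq[i:]))
                aug, changes = list(seq[:i]), 0
            token = seq[i]
            if token.startswith("Q"):                # problem marker: copy as-is
                aug.append(token)
                i += 1
                continue
            if token.startswith("T"):                # time token: maybe jitter it
                if random.random() < time_prob:
                    old = int(token[1:]) if token[1:].isdigit() else 8  # 'TMAX' ~ 8
                    new = min(max(round(random.gauss(old, 1)), 0), 8)
                    aug.append(f"T{new}")
                    changes += 1
                else:
                    aug.append(token)
                i += 1
                continue
            p = random.random()                      # action token
            changes += 1
            if p < action_prob:                      # repeat with a short pause
                aug += [token, f"T{random.randint(0, 2)}", token]
            elif p < 2 * action_prob:                # skip the action + its time
                i += 2
                continue
            elif p < 2.5 * action_prob:              # insert a random action first
                aug += [random.choice(action_pool), f"T{random.randint(0, 2)}", token]
            else:                                    # keep unchanged
                aug.append(token)
                changes -= 1
            i += 1
        all_sequences.append(aug)
    return all_sequences
```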

Hyperparameter Comparison Table

| Hyperparameter | BERT | DistilBERT | DistilBERT + Dropout | DistilBERT + Curriculum Learning | DistilBERT + LoRA | DistilBERT + Data Augmentation |
| --- | --- | --- | --- | --- | --- | --- |
| Model Name | bert-base-uncased | distilbert-base-uncased | distilbert-base-uncased | distilbert-base-uncased | distilbert-base-uncased | distilbert-base-uncased |
| Learning Rate | 8e-5 | 8e-5 | 8e-5 | 8e-5 | 8e-5 | 8e-5 |
| Batch Size | 8 | 8 | 8 | 8 | 8 | 8 |
| Max Sequence Length | 512 | 512 | 512 | 512 | 512 | 512 |
| Epochs | 50 | 50 | 50 | 50 | 50 | 50 |
| Warmup Steps | 4 | 4 | 4 | 4 | 4 | 4 |
| Gradient Clipping | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Dropout Rate | 0.1 (default) | 0.1 (default) | 0.3 (custom) | 0.1 (default) | 0.1 (default) | 0.1 (default) |
| Curriculum Learning | No | No | No | Yes | No | No |
| Low-Rank Adaptation | No | No | No | No | Yes | No |
| Data Augmentation | No | No | No | No | No | Yes |
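
Under these settings, the shared training loop looks roughly like the following; model and train_loader are assumed from the earlier setup, and the MSE loss follows from the regression framing:

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Hyperparameters from the table above.
EPOCHS, LR, CLIP, WARMUP = 50, 8e-5, 1.0, 4
optimizer = AdamW(model.parameters(), lr=LR)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP,
    num_training_steps=EPOCHS * len(train_loader))
loss_fn = torch.nn.MSELoss()

for epoch in range(EPOCHS):
    for batch in train_loader:
        optimizer.zero_grad()
        preds = model(batch["input_ids"], batch["attention_mask"]).squeeze(-1)
        loss = loss_fn(preds, batch["score"].float())
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)  # clip at 1.0
        optimizer.step()
        scheduler.step()
```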

Results

| Model | MSE | MAE | R² | Max error |
| --- | --- | --- | --- | --- |
| BERT | 0.056 | 0.178 | 0.322 | 0.567 |
| DistilBERT | 0.055 | 0.183 | 0.335 | 0.543 |
| DistilBERT + dropout | 0.057 | 0.181 | 0.341 | 0.552 |
| DistilBERT + curriculum | 0.052 | 0.170 | 0.062 | 0.521 |
| LoRA + BERT + curriculum | 0.049 | 0.179 | 0.113 | 0.517 |
| LoRA + DistilBERT + curriculum | 0.054 | 0.172 | 0.030 | 0.540 |
| DistilBERT + data augmentation | 0.048 | 0.178 | 0.125 | 0.467 |

Conclusion

The proposed approach effectively leverages Transformer models and advanced techniques like curriculum learning and LoRA to predict user scores from interaction logs. The comprehensive feature engineering and targeted fine-tuning strategies result in a robust model capable of providing accurate predictions, demonstrating the potential for enhancing educational tools with advanced machine learning methodologies.

