PVE - Physics VarExtractor

Structured Information Retrieval with LLMs on Physics Academic Research Papers

Project Overview

Research papers are a vital source of knowledge, yet vast amounts of them reach the web as unstructured data, leaving significant insights locked away as raw text that is not easily accessible. The aim of this project is to compare the ability of several LLMs to extract variables and their names from physics research papers. This task is known as sequence labelling: given a sentence, each of its tokens is assigned to the appropriate class.

A collection of papers was gathered from arXiv, an open-access repository for research papers, particularly in the fields of physics, mathematics, and computer science. The data is in PDF format and was converted to text using Python PDF-to-text tools. NLP-specific preprocessing techniques were used to clean the text, which was then fed into a feature extraction pipeline that uses grammatical tools such as POS tagging and NER to build a dataframe of variables and their names. The resulting dataset, produced in two iterations (the first with 300 rows, the second with 2,100 rows), was used to train two different types of models: one pre-trained (DistilBERT) and one built from scratch (an encoder-only Transformer).

Ethical Considerations

  • Cornell University's Privacy Principles ("Support for US and International Data Privacy Standards") and arXiv's Privacy Policy ("Special Notice for EU Residents") both acknowledge GDPR requirements. They list principles such as Notice, Data Integrity, Purpose Limitation, Access, and Security, which correspond to GDPR rules (anyone creating, maintaining, using, or disseminating personal data must take "reasonable and appropriate" security measures).
  • The UH policy focuses on the integrity of the research itself, not necessarily the platform (arXiv). This project involves no copyright infringement or data fabrication (all data uploaded to arXiv is verified by its moderators).
  • Under arXiv's terms and submission agreements ("Grant of the License to arXiv"), submitters agree to grant a non-exclusive, perpetual, irrevocable, and royalty-free license to include and use their work, which permits content users such as this project to make use of it.

Document Control

This project maintains a well-organized directory structure to ensure efficient document control and project management. The /data directory contains the final datasets from both iterations of the study. The /documentation folder holds comprehensive documentation detailing the methodology and steps undertaken. Saved models are stored in the /models directory, while the /notebooks folder includes the Jupyter notebooks used throughout the project. Finally, the /scripts directory holds the feature extraction script, designed to be executed on a cluster for enhanced performance and time efficiency. The library requirements are specified in the requirements.txt file.

[Figure: project directory structure]

Computational Environment

I - Feature Extraction Phase - UHHPC Cluster:

For the feature extraction phase of this project, I utilized the University of Hertfordshire's cluster computing resources (UHHPC). Jobs were submitted using PBS (Portable Batch System), a workload manager that handles job scheduling on a computing cluster. The feature extraction job was submitted under the name feat-extract using the -N flag and queued in the main queue (-q). The job requested one node with 16 processors per node and a maximum runtime of 168 hours (-l), equivalent to one week. Throughout execution, the text output and errors were logged in output.log and error.log, respectively. This can be summarised in the following table and the example script below it:

| Resource | Details |
| --- | --- |
| System | UHHPC |
| Job Scheduler | PBS |
| Job Name | feat-extract |
| Queue | Main queue |
| Nodes Requested | 1 |
| Processors/Node | 16 |
| Max Runtime | 168 hours (1 week) |
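
For reproducibility, the submission can be expressed as a PBS script along the following lines (a sketch: the script body and paths are assumptions, while the -N, -q, and -l values match the table above):

```bash
#!/bin/bash
#PBS -N feat-extract            # job name (-N)
#PBS -q main                    # main queue (-q)
#PBS -l nodes=1:ppn=16          # one node, 16 processors per node (-l)
#PBS -l walltime=168:00:00      # maximum runtime: 168 hours (one week)
#PBS -o output.log              # standard output log
#PBS -e error.log               # standard error log

cd "$PBS_O_WORKDIR"             # run from the directory the job was submitted from
python scripts/feature_extraction.py   # hypothetical name for the feature extraction script
```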

II - Modeling Phase - Google Colab:

Given the limited time and the extensive computational demands of the task, I utilized Google Colab’s A100 GPUs for the modeling phase. The A100 GPU is powered by NVIDIA’s Ampere architecture, featuring 40 GB of high-bandwidth memory (HBM2) and offering up to 312 teraflops of performance. This GPU provides a significant acceleration for deep learning tasks, making it well-suited for training large language models. This can be summarised in the following table:

| Resource | Details |
| --- | --- |
| System | Google Colab |
| GPU Used | NVIDIA A100 |
| GPU Architecture | NVIDIA Ampere |
| GPU Memory | 40 GB HBM2 |
| Performance | Up to 312 teraflops |
| Use Case | Training large language models |

Methodology

I - Data Collection

The dataset comes from arXiv and is open access. The data was collected from Google Cloud Storage (GCS), where it is freely available in buckets for bulk access. The command-line tool gsutil was used to access arXiv's physics PDF buckets and download them to the local machine. The dataset was then uploaded to Google Drive for easy access through Google Colab.
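
As an illustration, the download resembled the following (a sketch: the exact bucket prefix and destination directory are assumptions based on arXiv's public GCS layout):

```bash
# recursively copy physics PDFs from arXiv's public GCS bucket, in parallel (-m)
gsutil -m cp -r "gs://arxiv-dataset/arxiv/physics/pdf/" ./arxiv_pdfs/
```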

II - Data Preparation

NOTE: The complete implementation for this phase is available and can be found in the notebooks folder here.

This is the first step of the pipeline. The goal is to convert the PDF data into text data. Since the files are native PDFs (meaning the text is already digitally encoded), there is no need to apply OCR (Optical Character Recognition) techniques, which are less accurate and would introduce more errors. Instead, three parsing libraries were compared: PyMuPDF, PyPDF2, and PDFMiner.six. Two tests were conducted: the first is general text extraction on one PDF file from the dataset, and the second assesses the handling of mathematical notation. The pipeline is summarized in the following diagram:

[Figure: data preparation pipeline]


In terms of text parsing, PDFMiner.six had the best results, with impressive formatting and no text-spacing issues. Its symbol detection was moderate, but still the best of the three tools, and its performance on mathematical notation was likewise moderate yet acceptable. Since the spacing issue with PyPDF2 could not be fixed, PDFMiner.six was selected for converting the PDFs into textual data. The following table summarizes the performance of all three tools on the PDF sample (Test I) and on math notations (Test II):

| | PyMuPDF | PyPDF2 | PDFMiner.six |
| --- | --- | --- | --- |
| Test I | Moderate | Bad | Good |
| Test II | Bad | Good | Moderate |
| Upsides | Moderate formatting | Good symbol detection | Good formatting |
| Downsides | Extremely bad symbol detection | Very bad formatting | Moderate symbol detection |
| Decision | Excluded | Excluded | Selected |
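
For reference, pdfminer.six covers this use case through its high-level API (a minimal sketch; the file name is a placeholder):

```python
from pdfminer.high_level import extract_text

# convert a native (digitally encoded) PDF straight to plain text, no OCR needed
text = extract_text("paper.pdf")  # placeholder file name
print(text[:500])  # preview the first 500 characters
```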

III - Data Preprocessing

NOTE: The complete implementation for this phase is available and can be found in the notebooks folder here.

The initial phase involves processing the PDF documents, which have been converted into text files and subsequently organized into a designated directory. These text files then undergo a comprehensive preprocessing procedure, as displayed in the following diagram:

[Figure: data preprocessing pipeline]


Upon conversion of the PDF data into text format, it is crucial to undertake a thorough cleaning and validation process to ensure the integrity and accuracy of the text. This preprocessing phase is critical for preparing the data for further analysis and involves a series of methodical steps, which are outlined below:

  • Regex Preprocessing: Uses regular expressions to clean and standardize text, removing unwanted characters and fixing extraction errors for consistency.
  • Text Reconstruction: Reassembles and organizes text to correct formatting issues and restore readability, ensuring a coherent and well-structured corpus.

These preprocessing procedures are designed to refine the raw text data and enhance its quality, thereby improving the reliability and effectiveness of subsequent analyses and applications.
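
To make the regex step concrete, the cleaning might look something like this (a hypothetical sketch; the patterns below are illustrative, not the project's exact rules):

```python
import re

def clean_text(raw: str) -> str:
    """Normalize common PDF-extraction artifacts in a chunk of text."""
    text = raw.replace("\xad", "")          # drop soft hyphens left by the extractor
    text = re.sub(r"-\n(\w)", r"\1", text)  # rejoin words hyphenated across line breaks
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap consecutive blank lines
    return text.strip()
```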

IMPORTANT: For a more in-depth exploration, please refer to the following document.

IV - Feature Extraction

After meticulously transforming and cleaning the text data, we now move to a crucial phase in the data processing pipeline: the extraction of (variable, name) pairs from the documents. This phase is instrumental in structuring the data for meaningful analysis. To achieve accurate and reliable extraction, it is broken down into a series of eight steps, each designed to systematically address a different aspect of the data and ensure that the resulting pairs are both precise and relevant (a sketch of steps 2, 3, and 5 follows the list). The steps involved in this phase are as follows:

  1. Defining Weak Labels
  2. Defining Custom Tokenizer
  3. Defining Custom NER
  4. Creating the DataFrame
  5. Adding POS Tags
  6. Cleaning Dataframe
  7. Extracting Variable-Name Couples
  8. Refining the Results
NOTE: The complete implementation for this phase is available and can be found in the notebooks folder here.
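
Steps 2, 3, and 5 can be pictured as a spaCy-style pipeline (a sketch assuming spaCy is the underlying toolkit; the special case and the variable-matching regex are illustrative only):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# step 2: a custom tokenizer rule so an expression like "f(x)" stays one token
nlp.tokenizer.add_special_case("f(x)", [{"ORTH": "f(x)"}])

# step 3: a custom NER via an entity ruler encoding a weak-label rule:
# single letters or calls like f(x) are tagged as variables
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "VAR", "pattern": [{"TEXT": {"REGEX": r"^([A-Za-z]|[A-Za-z]\([A-Za-z]\))$"}}]},
])

# step 5: POS tags come from the pipeline's tagger component
doc = nlp("The electric potential energy f(x)")
for token in doc:
    print(token.text, token.pos_, token.ent_type_)
```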

Lastly, all of the previous steps are combined to create the full feature extraction pipeline. Due to limited time and the long, demanding process, I created only two iterations (versions) of the resulting dataframe, summarized in the following table:

| Iteration | Row Count |
| --- | --- |
| First Iteration | 300 |
| Second Iteration | 2,100 |

V - Modelling

NOTE: The complete implementation for this phase is available and can be found in the notebooks folder here.

To transition into the modeling phase, we start with pre-modeling processing, a critical step in which the data undergoes thorough preparation and refinement before it is fed into the models. This ensures that the dataset is in the best possible condition, free from noise and ready for accurate analysis. Following this, we move into the model definition stage, where careful consideration is given to selecting the appropriate models: a pre-trained model that can leverage existing knowledge, and a custom model built from scratch and tailored to the specific nuances of our dataset. The architectures of these models are designed to effectively capture the underlying patterns and relationships, providing a robust foundation for subsequent training and evaluation. This phase therefore has two steps (a loading sketch for the pre-trained model follows the list):

  1. Pre-modelling processing
  2. Model Selection
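
For the pre-trained side, loading DistilBERT for token classification might look like this (a sketch assuming the Hugging Face transformers API; the label set extends the B-VAR/B-NAME/I-NAME tags seen in the Results section with assumed "O" and "I-VAR" entries):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO label scheme for the sequence labelling task
labels = ["O", "B-VAR", "I-VAR", "B-NAME", "I-NAME"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
```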


All of the modelling pipeline parameters for both models are summarized in this table:

| Parameter | Value |
| --- | --- |
| Save Directory | Custom to each model |
| Learning Rate | 2e-5 |
| Epochs | 10 or 20 |
| Steps | 100 |
| Weight Decay | 0.01 |
| Logging Steps | 100 |
| Evaluation Strategy | "epochs" |
| Save Strategy | "epochs" |
| Per-Device Batch Size | 16 |
| Early Stopping | Only for fine-tuned model |
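
These settings map onto Hugging Face TrainingArguments roughly as follows (a sketch: the output directory is a placeholder, the strategy values use the API's "epoch" spelling, and the early-stopping patience is assumed):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="models/distilbert-pve",  # placeholder; custom to each model
    learning_rate=2e-5,
    num_train_epochs=10,                 # 10 or 20 depending on the run
    per_device_train_batch_size=16,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=100,
    load_best_model_at_end=True,         # needed for early stopping
)

# early stopping was applied only to the fine-tuned (DistilBERT) model
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)  # patience assumed
```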

Results

The graphs show the evolution of loss and accuracy for both models (DistilBERT and the Custom Transformer) across the two dataset versions (Iteration 1 and Iteration 2). Overall, DistilBERT demonstrates superior performance: the Custom Transformer converges more slowly in both iterations and exhibits lower accuracy and higher loss.

[Figure: loss and accuracy curves for DistilBERT and the Custom Transformer across both iterations]


In particular, the loss graph indicates that DistilBERT in the first iteration starts with a high loss close to 100%, which decreases rapidly and then levels off after approximately 5 epochs. It then begins to increase steadily after the 8th epoch, eventually reaching a loss of 55%. In contrast, both Custom Transformer curves are more stable. The second iteration of the Custom Transformer shows about 5% more loss than the first, the two ending at 78% and 65% loss, respectively. Conversely, the second iteration of DistilBERT starts with the lowest loss at 45%, drops below 30% after the third epoch, then stabilizes and fluctuates before settling at a minimal loss of around 30%.

For the accuracy graph, which mirrors the loss, the model with the least loss achieved the highest accuracy of approximately 94% (DistilBERT II), with a 96% F1-score for detecting variables and around 66% F1-score for name extraction. DistilBERT I follows with 87% accuracy, while the Custom Transformers (Iteration 2 and Iteration 1) achieve 69% and 65% accuracy, respectively.

Both successfully fine-tuned models were tested on a couple of sentences with the following results:

Example 1

"The electric potential energy f(x)"


| Token | Prediction | DistilBERT I |
| --- | --- | --- |
| the | B-NAME | 0.96 |
| electric | I-NAME | 0.49 |
| potential | I-NAME | 0.73 |
| energy | I-NAME | 0.86 |
| f(x) | B-VAR | 0.96 |

Example 2

"The velocity v"


| Token | Prediction | DistilBERT I | DistilBERT II |
| --- | --- | --- | --- |
| the | B-NAME | 0.96 | 0.99 |
| velocity | I-NAME | 0.87 | 0.98 |
| v | B-VAR | 0.98 | 0.99 |
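
Predictions in this form can be reproduced with the transformers token-classification pipeline (a sketch; the model path is a placeholder for a saved checkpoint):

```python
from transformers import pipeline

# load a fine-tuned checkpoint; the path is a placeholder
ner = pipeline("token-classification", model="models/distilbert-pve")

for pred in ner("The velocity v"):
    # each prediction carries the token, its BIO label, and a confidence score
    print(pred["word"], pred["entity"], round(pred["score"], 2))
```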