Enhancing Discovery in Scientific Research through Object-Oriented Approach for Large Language Models

Enhancing Discovery in Scientific Research through Object-Oriented Approach for Large Language Models


The Motivation

This article presents an idea of combining Object-Oriented programming (OOP) paradigm for Large Language Models (LLM), particularly, GPT to enhance scientific discovery throught GPT-learning from scientific papers. This idea came about as a result of release of several new features in Python LangChain package for LLM in late 2023.

Pydantic package was used to facilitate the use of OOP.  Another reason Pydantic was used is to address 2 of the challenges facing LLM – malicious intent of hijacking LLM applications and hallucinations.  Pydantic is primarily a data validation and parsing library.  It doesn't inherently deter malicious intent on its own, its main purpose is to ensure that the data received or sent follows a predefined structure and correct data type. Nevertheless, it indirectly helps in deterring malicious intent by enforcing strict data validation, serving as a contract for expected data schema, and reducing the risk of unexpected type-related issues that could be exploited maliciously.

Text summarization, entity recognition, question-answering, chatbots are some of the most common applications of LLMs.  In these applications, both input and output are UNSTRUCTURED data.  In this experiment, a combination of OOP, Pydantic, LangChain and GPT was used to facilitate the generation of STRUCTURED output.  LangChain has also offered a short course on Function Calls via DeepLearning.ai.  Some of the codes here have been inspired by the course.

There are 2 LLM tasks in this discovery:

  • Tag and extract important information from a scientific paper
  • Extract citations into structured data for further analysis


The Scientific Paper

The scientific paper selected for this experiment was authored by Professor David C. Page and his co-workers published in American Journal of Human Genetics in 2018 titled “Selection Has Countered High Mutability to Preserve the Ancestral Copy Number of Y Chromosome Amplicons in Diverse Human Lineages”.  Professor Page was the director at the Whitehead Institute for Biomedical Research at Cambridge, Massachusetts.  He has spent nearly his entire scientific career studying and defending the honor of Y chromosome which was widely believed to be degrading leading to the extinction of human males.  His ground breaking research led to discoveries of pivotal role of Y genes beyond merely sex determination, suggesting the well being of the human race could be hinging on this diminutive Y.

I have also previously authored an article on Y-chromosome based on Page's research here.


Required Python Packages for the Experiment

The required packages consists of standard Python, LangChain and Pydantic libraries.  Access to GPT was brokered by LangChain ChatOpenAI in the langchain.chat_models package.  OpenAI token is required and must be loaded into the environment.


Tagging Instantiation

The selected paper was first subjected to tagging through the creation of a Tagging class that contains C++-like declaration of variable annotations using colons.  This declaration was introduced in Python 3.6.  The Tagging class was passed to LangChain’s convert_pydantic_to_openai_function and converted into a list object.

Notice also the Tagging class inherit Pydantic’s BaseModel so that it has access to the variables and functions of the package.


Tagging instructions were provided to the prompt template to teach GPT on how to behave when carrying out Tagging work.  The chaining process was made up of 3 components.  The first one was functions collection to enable automated function calling. The second component was tagging of model that contain API call to ChatGPT and finally the chaining process.


Scientific Paper Content

The URL of the paper was passed to LangChain’s WebBaseLoader to load the content.  After loading the content, a small chunk of was sampled to ensure the content was there.

Class Object for Paper Overview

The purpose of PaperOverview class is to extract information such as summary, statistics, discipline and keywords from the paper.  The chaining took place with the collection of functions for automated function calls, template creation and chaining.  Notice that the class name was passed to LangChain pydantic function and the tagging model.

The output obtained is as follow:

Summary: This paper investigates the evolutionary forces that govern the formation, maintenance, and diversification of Y chromosome amplicons in humans. The authors develop computational tools to detect amplicon copy number with unprecedented accuracy from high-throughput sequencing data. They find that amplicon copy number is maintained among divergent branches of the Y chromosome phylogeny, indicating that the reference copy number is ancestral to all modern human Y chromosomes. The distribution of males with copy number variants within the phylogenetic tree is incompatible with neutral evolution and instead displays hallmarks of mutation-selection balance. The authors also observe cases of amplicon rescue, in which deleted amplicons are restored through subsequent duplications.

 Statistics: 16.9% of males in the dataset have an amplicon copy number variant. The study analyzed whole-genome sequencing data of 1,216 males from the 1000 Genomes Project.

Discipline: Genetics, Evolutionary Biology

Keywords: Y chromosome, amplicons, copy number variation, mutation-selection balance, human evolution

Extract Citations

Citation Classes

The next task of discovery was to extract and analyze citations that came with Professor Page’s paper.  To accomplish this, 2 Python classes were created that contained attributes for the extraction.   The Citations class was simply referring to the Page’s paper itself.  This class was passed to CitationInfo class to ensure only citations from Page’s paper were extracted and not other papers in GPT’s memory. Notice the citations annotation is CitationInfo class was passed to the chain.

Prompt Template

The classes in this task worked differently from the previous one. Here, a separate instruction template was created to allow a set of rules to be defined so that GPT can extract information more precisely.  The subsequent chaining process worked similarly.  The instructions told GPT to extract only authors, title, journal name and year from each citation found only in Page’s paper.


Check for Hallucinations

Before starting the extraction process, irrelevant prompts to the paper were used to check if our guardrail is working satisfactorily to deter hallucinations.  The prompts were ‘How are you?’ and ‘Have a nice weekend’ . As illustrated in the following, GPT returns nothing.

Split Text

For LangChain and GPT to work more efficiently, RecursiveCharacterTextSplitter was used to split the content into manageable chunks.  The following illustration shows Page’s paper was splitted into 22 chunks.

To carry out Extraction, splitted text was subjected to LangChain tracing using RunnableLambda followed by the chaining process.  The flatten function ensured the output is in 1-dimension.  Otherwise, errors would be thrown.  Invocation of the chaining process returned a JSON-formatted output containing title, author, journal and year as instructed earlier.

Citations Dataframe

To better analyze the citations, JSON output was converted into pandas dataframe as follow. Notice the outcome was imperfect.  Data cleaning was done to remove citations that did not have journal names.

The following illustration shows cleaned data.


Top-Rated and Regular Journals

It is every scientist aspiration to get his / her works published in top-rated journals such as AAAS Science publications, Nature Publications, Cell, and Proceedings of the National Academy (PNAS).  For this reason, investigations were done to see if LLM helped.


The following illustration shows the list of journal names used to identify top-rated journals. This list was used to separate top-rated from regular journals.  Journal names that were not on the list were automatically taken out as regular.

The resulting top-rated journal articles are as follow:

The resulting regular journal articles are as follow:


Retrieve Abstracts from PUBMED using Titles

In order to better understand the characteristics of top-rated and regular journals, we needed much more than just titles. Titles from both tables were extracted and into Markdown bullet points.  These titles were then used by GPT to retrieve the corresponding abstracts from PUBMED.

The following illustration shows several functions were created to orchestrate the retrieval of abstracts from PUBMED. get_gpt_response is a generic GPT completion function. Functions get_abstract_prompt and get_abstract_prompt_json were created to construct prompt to retrieve completions with different output format.

Prompt constructed using get_abstract_prompt returns a standard output while get_abstract_prompt_json returns a JSON-formatted output.

Top-Rated Title-Abstract

Similarities and Differences

With abstracts from Top-Rated and Regular journals properly separated, finding the similarities and differences of both groups could be carried out.  GPT was again enlisted to sort this out.  Lists of both groups were used to construct a prompt with the instruction to find similarities and differences with regards to the study of Y chromosome on human health.

Similarities between Top-Rated and Regular Journals with Respect to the Study of Y-Chromosome Associated with Human Health

Differences between Top-Rated and Regular Journals with Respect to the Study of Y-Chromosome Associated with Human Health

Refining Discovery Through the Retrieve of Relevant Abstracts

Research discovery can be improved and refined by having the ability to efficiently, quickly and accurately retrieving specific abstracts relevant to a question asked.  This can be achieved by using a technique called Vectorstore Embedding.

The following function uses Vectorstore to store the abstracts that we built earlier. The parameter in_question takes a query, passes it to the Vectorstore to retrieve the relevant abstracts.

It's now time to ask some questions to see if we can get the relevant abstracts.

Question 1: Genes involved in male fertility such as sperm production and function

Question 2: What are the challenges in the analysis of Y chromosome genome sequences?

Question 3: Implications for human evolution and health associated with human Y chromosome











To view or add a comment, sign in

Insights from the community

Others also viewed

Explore topics