Raj Ratn Pranesh

I'm Raj, a first-year graduate student in Computer Science at Penn State University. I'm a researcher and a developer, passionate about research and about developing new skills.

Research interests

I have worked in related fields such as machine learning, artificial intelligence, software engineering, the Internet of Things, and data mining. I strive to contribute to machine learning because its concepts and broad applicability fascinate me and give me room to develop my own ideas. "Growth occurs when one goes beyond one's limit. Realizing that is also a part of the training." I enjoy learning new things and keep moving forward so that I can learn as much as possible. I consider work an ongoing process, and I'm always looking for opportunities to work with people who are willing to share their knowledge. At the end of the day, my primary goal is to work hard and gather knowledge.

My current research focuses on training and improving NLP models for tasks that involve language models, telemedical facilities, learning, social computing, and core linguistics. I love teaching machines through large textual, speech, visual, and/or multimodal datasets. I am currently working on research projects in health informatics and social informatics to advance my research on robust, trustworthy, and sustainable data-driven ML systems with broad real-world applications. I am open to suggestions and discussions, so do reach out to me :)

When I'm not in front of a computer screen, I'm probably reading books, thinking about robotics, playing football, reading manga, or cooking. Do check out my blog page, where I post interesting articles (I am trying to be consistent there).



  • [June 2021] [New!] One paper on multilingual NLP and CSS got accepted at ACL 2021. See you there!
  • [December 2020] [New!] I'll be attending NeurIPS 2020 online. Come check out our work at MLPH 2020. Shoot me an email if you want to meet and talk about anything!
  • [November 2020] [New!] I'll be attending EMNLP 2020 online. Check out our work at SustaiNLP 2020. See you there!
  • [October 2020] [New!] Our paper presenting an ensemble Q&A language model got accepted at ICCCS 2020. See you there!
  • [July 2020] [New!] I'll be attending ICML 2020 online. See you there!
  • [June 2020] [New!] Our paper on OCR table extraction has been accepted to ICDAR 2020.


Accepted Publications

  • CMTA: COVID-19 Misinformation Multilingual Analysis on Twitter

    Raj Ratn Pranesh, Mehrdad Farokhnejad, Ambesh Shekhar, Genoveva Vargas-Solar
    Association for Computational Linguistics (ACL), 2021

    TLDR: This paper presents CMTA, a multilingual COVID-19-related tweet analysis method that uses BERT, a deep learning model, for multilingual tweet misinformation detection and classification. CMTA extracts features from multilingual textual data, which are then categorized into specific information classes. Classification is done by a Dense-CNN model trained on tweets manually annotated into information classes (i.e., "false", "partly false", "misleading"). The paper presents an analysis of multilingual tweets from February to June, showing the distribution of the types of information spread across different languages. To assess the performance of the CMTA multilingual model, we performed a comparative analysis of 8 monolingual models and CMTA on the misinformation detection task. The results show that our proposed CMTA model surpassed various monolingual models, which supports the claim that a multilingual framework can be developed through transfer learning.

  • Looking for COVID-19 misinformation in multilingual social media texts

    Raj Ratn Pranesh, Mehrdad Farokhnejad, Ambesh Shekhar, Genoveva Vargas-Solar
    ADBIS 2021: 25th European Conference on Advances in Databases and Information Systems

  • Biomedical Network Link Prediction using Neural Network Graph Embedding

    Sumit Kumar, Raj Ratn Pranesh, Ambesh Shekhar
    ACM CoDS-COMAD, 2021 (Young Researchers’ Symposium)

    TLDR: In this paper, we study graph embedding learning for automatically learning low-dimensional node representations on biomedical networks. The purpose is to use different neural graph embedding methods to analyze 3 major biomedical link prediction tasks: drug-disease association (DDA) prediction, drug-drug interaction (DDI) classification, and protein-protein interaction (PPI) classification. We observe that graph embedding methods achieve promising results without using any biological features.

  • A Conglomerate of Multiple OCR Table Detection and Extraction

    Smita Pallavi, Raj Ratn Pranesh, Sumit Kumar
    22nd International Conference on Document Analysis and Recognition, 2020

    TLDR: Representing information as tables is a compact and concise method that eases searching, indexing, and storage. Extracting and cloning tables from parsable documents is easy and widely done; however, industry still faces challenges in detecting and extracting tables from OCR documents or images. This paper proposes an algorithm that detects and extracts multiple tables from an OCR document. The algorithm uses a combination of image processing techniques, text recognition, and procedural coding to identify distinct tables in the same image and map the text to the appropriate cell in a dataframe, which can be stored as comma-separated values, a database, Excel, and other usable formats.

  • Automated Medical Assistance: Attention Based Consultation System

    Raj Ratn Pranesh, Ambesh Shekhar, Sumit Kumar
    NeurIPS, 2020 (MLPH: Machine Learning in Public Health Workshop)

    TLDR: We designed three transformer-based encoder-decoder models, namely BERT, GPT2, and BART, and trained them on a large dialogue dataset for text generation. We performed a comparative study of the models, and in our analysis we found that the BART model generates doctor-like responses that contain clinically informative data. The overall generated results were very promising and show that, through transfer learning, pre-trained transformers are reliable for developing automated medical assistance systems and doctor-like treatments.

  • MemeSem: A Multi-modal Framework for Sentimental Analysis of Meme via Transfer Learning

    Raj Ratn Pranesh, Ambesh Shekhar
    37th International Conference on Machine Learning (ICML), 2020 (4th Lifelong Learning Workshop)

    TLDR: In this paper, we present MemeSem, a multimodal deep neural network framework for sentiment analysis of memes via transfer learning. Our proposed model uses VGG19 pretrained on the ImageNet dataset and the BERT language model to learn the visual and textual features of a meme and combines them to make predictions.

  • QuesBELM: A BERT based Ensemble Language Model for Natural Questions

    Raj Ratn Pranesh, Ambesh Shekhar, Smita Pallavi
    5th IEEE ICCCS (International Conference)

    TLDR: In our work, we systematically compare the performance of powerful variants of the Transformer architecture (BERT-base, BERT-large-WWM, and ALBERT-XXL) on the Natural Questions dataset. We also propose QuesBELM, a state-of-the-art BERT-based ensemble language model. QuesBELM combines existing BERT variants to build a more accurate stacking ensemble model for question answering (QA) systems.

  • Analysis of Resource-efficient Predictive Models for Natural Language Processing

    Raj Ratn Pranesh, Ambesh Shekhar
    EMNLP, 2020 (SustaiNLP Workshop)

    TLDR: In this paper, we present an analysis of resource-efficient predictive models, namely Bonsai, Binary Neighbor Compression (BNC), ProtoNN, Random Forest, Naive Bayes, and Support Vector Machine (SVM), for resource-constrained devices. These models try to minimize resource requirements such as RAM and storage without hurting accuracy much. We applied these models to multiple benchmark natural language processing tasks: sentiment analysis, spam message detection, emotion analysis, and fake news classification.

  • Classifying Micro-text Document Datasets: Application to Query Expansion of Crisis-Related Tweets

    Mehrdad Farokhnejad, Raj Ratn Pranesh, Javier A. Espinosa-Oviedo
    ICSOC, 2020 (STRAPS 2020 Workshop).

    TLDR: This paper introduces an approach based on classification and query expansion techniques for searching micro-texts (i.e., tweets). In our approach, a user's query is rewritten using a classified vocabulary derived from the top-k results, to better reflect her search intent.

  • A Fine-Grained Analysis of Misinformation in COVID-19 Tweets

    Sumit Kumar, Raj Ratn Pranesh, Kathleen M. Carley
    Computational and Mathematical Organization Theory, Springer.

    TLDR: In this paper, we propose a Twitter dataset for fine-grained classification. Our dataset consists of 1970 manually annotated tweets categorized into 4 misinformation classes, i.e., "Irrelevant", "Conspiracy", "True Information", and "False Information", based on the responses that emerged during COVID-19. In this work, we also generate useful insights from our dataset and perform a systematic analysis of various language models.

  • CLPLM: Character Level Pretrained Language Model for Extracting Support Phrases for Sentiment Labels

    Raj Ratn Pranesh, Sumit Kumar, Ambesh Shekhar
    17th International Conference on Natural Language Processing, 2020.

    TLDR: In this paper, we design a character-level pre-trained language model for extracting support phrases from tweets based on the sentiment label. We also propose a character-level ensemble model that blends Pre-trained Contextual Embeddings (PCE) models (RoBERTa, BERT, and ALBERT) with neural network models (RNN, CNN, and WaveNet) at different stages of the model. For a given tweet and its associated sentiment label, our model predicts the span of phrases in the tweet that prompts that sentiment.

Experience


  • Marsview.ai

    ML/NLP Engineer at Marsview.ai, Sunnyvale, California, June 2021 – August 2021

    Topic: Developed a transformer- and Elasticsearch-based Machine Reading Comprehension ML pipeline for extracting answers from long context inputs. Designed an end-to-end Automatic Speech Recognition system with online real-time decoding for transcribing live audio inputs.

  • Samsung Research

    Research & Development Intern at Samsung India, January 2021 – May 2021

    Topic: Worked on intent detection for a social media application with the on-device AI NLP team.

  • Tiny Banyan Technologies

    Machine Learning Engineer Intern, January 2021 – April 2021

    Mentor: Nemji Vora

    Topic: Worked with the ML team on developing an automated heart arrhythmia detection system using ECG signals. Used the MIT-BIH Arrhythmia Dataset to extract 10-second ECG strips, followed by a median filter to remove baseline wander and a discrete wavelet transform to remove the remaining noise. Worked toward developing a frugal, efficient, yet robust deep learning model for the ECG classification task.

  • School of Computer Science, Carnegie Mellon University

    Undergraduate Research Assistant, June 2020 – November 2020

    Mentor: Prof. Kathleen M. Carley

    Topic: Worked under Prof. Kathleen M. Carley on fine-grained classification and inference tasks for misinformation in tweets during the COVID-19 outbreak.

  • EECS, University of California, Berkeley

    Undergraduate Research Assistant, May 2020 – November 2020

    Mentor: Prof. Kurt Keutzer

    Topic: Worked under Prof. Kurt Keutzer on the word segmentation task for Sanskrit, a low-resource language.

  • AI-ML-NLP Lab, Indian Institute of Technology, Patna

    Winter Research Intern, December 2019 – February 2020

    Mentor: Prof. Sriparna Saha

    Topic: Worked at the AI-ML-NLP lab under Prof. Sriparna Saha on developing a deep learning multimodal framework for protein-protein interaction modeling using protein sequences and 3D protein structures.

  • Université Grenoble Alpes, CNRS, LIG

    Summer Research Assistant, May 2019 – July 2019

    Mentor: Prof. Genoveva Vargas Solar

    Topic: Worked with the HADAS Team of LIG under the supervision of Dr. Genoveva Vargas Solar on developing a human-guided data exploration approach for efficiently extracting knowledge from a disaster dataset (micro-blogs) based on a given user-generated query.

Awards and Recognition

  • Recipient of a visiting scholar fellowship at Sorbonne University, Paris, France. Fully funded research internship under the supervision of Prof. Antoine Miné with the APR Team of the LIP-6 Lab (CNRS).
  • Received an ICML 2020 registration grant to support attending and presenting my work at ICML 2020
  • Received a NeurIPS 2020 registration grant to support attending and presenting my work at NeurIPS 2020
  • Virtually presented our work "COVID-19 question-answering exploration system" at the LIG-CNRS workshop on databases and information systems.
  • Recognized on the university's website as a young research scholar on the recommendation of the Institute Director
  • Received a special mention in the yearly university magazine for excellence in research and an internship abroad
  • Start-up idea accepted for presentation at the annual start-up summit "National Seminar on Pragmatic Role of Technological Innovation in Start-Up NPTIS-2020" organized by the Government of India.
  • Developed and deployed a Google Assistant service for Google console that was used worldwide for promoting e-learning
  • Selected as a Microsoft Student Partner for the year 2019-2020. Conducted seminars and workshops to create awareness
  • Actively participated in competitive coding contests and data science competitions. Six-star coder on the HackerRank platform and Kaggle competitions contributor

Technical Skills

  • Languages: Java, Python, C/C++, OCaml, SQL, HTML/CSS, LaTeX
  • Scientific Libraries: TensorFlow, Keras, PyTorch, Sklearn, Pandas, NumPy, NLTK
  • Developer Tools: Git, Google Cloud Platform, VS Code, Google Colab, Terminal, Android Studio
