TLDR: Despite significant advances of deep neural networks (DNNs) in text classification, one of the most crucial factors behind achieving human-level accuracy is the quality of large, manually annotated training data, which is time-consuming and costly to collect. Several low-cost data repositories are available on the internet, but they often contain incorrect labels. DNNs easily overfit to label noise when trained on noisily labeled datasets, which degrades performance. To address this problem, we present a noise-tolerant, model-agnostic training algorithm. In this method, prior to the conventional gradient update, a meta-learning update is performed by simulating actual training on each set of generated synthetic noisy labels. Training on randomized synthetic noise keeps the model from overfitting to any specific label noise, which consequently enhances its robustness to the label noise present in the dataset. We perform extensive experiments on multiple large text classification datasets and show that, with our method, DNNs learn finer textual representations, making them robust to erroneous labels.
I'm Raj, a first-year graduate student in Computer Science at Penn State University. I'm a researcher and a developer, passionate about research and developing new skills.
Research interests
I have worked in related fields such as machine learning, artificial intelligence, software engineering, the Internet of Things, and data mining. I strive to contribute to machine learning because its concepts and broad applicability fascinate my curious mind and give me space to develop my own ideas. "Growth occurs when one goes beyond one's limit. Realizing that is also a part of the training." I enjoy learning new things and keep moving forward so that I can learn as much as possible. I consider work an ongoing process, and I'm always looking for opportunities to work with those who are willing to share their knowledge. At the end of the day, my primary goal is to work hard and gather knowledge.
My current research focuses on training and improving NLP models for various tasks that involve language models, telemedical facilities, learning, social computing, and core linguistics. I love teaching machines through large textual, speech, visual, and/or multimodal datasets. I am currently working on research projects in health informatics and social informatics to advance my research on robust, trustworthy, and sustainable data-driven ML systems with broad real-world applications. I am open to suggestions and discussions, so do reach out to me :)
When I'm not in front of a computer screen, I'm probably reading books, thinking about robotics, playing football, reading manga, or cooking. Do check out my blog page, where I post interesting articles (I am trying to be consistent there).
Updates
- [October 2022] [New!] One paper on multimodal disaster tweet analysis got accepted at COLING 2022. See you there!
- [June 2021] [New!] One paper on multilingual NLP and CSS got accepted at ACL 2021. See you there!
- [December 2020] [New!] I'll be attending NeurIPS 2020 online. Come check out our work at MLPH 2020. Shoot me an email if you want to meet and talk about anything!
- [November 2020] [New!] I'll be attending EMNLP 2020 online. Check out our work at SustaiNLP 2020. See you there!
- [October 2020] [New!] Our paper presenting an ensemble Q&A language model got accepted at ICCCS 2020. See you there!
- [July 2020] [New!] I'll be attending ICML 2020 online. See you there!
- [June 2020] [New!] Our paper on OCR table extraction has been accepted to ICDAR 2020.
Publications
Current/Submitted Research Papers
-
-
TLDR: In modern society, mental health is one of the most critical concerns. Over the past several years, a large proportion of the population has been affected by serious mental disorders. People with mental illness require effective mental health intervention and treatment as early as possible to decrease the chances of any further mental deterioration. In this paper, we present RMHIDD, a new dialogue corpus for automated mental health intervention. The dataset consists of over 200K Reddit posts collected from 18 different subreddits, with each post consisting of a sequential conversation between users. On this dataset, we also trained various models for the dialogue generation task, namely Seq2Seq, BART, and DialoGPT. In our analysis, we found that the BART model outperformed the other models with a perplexity score of 19.7. We also found that the DialoGPT model surpasses the other models on various machine translation evaluation metrics. The results generated by the various language models were promising and showed the possibility of building automated mental health intervention systems.
Accepted Publications
-
CMTA: COVID-19 Misinformation Multilingual Analysis on Twitter
Raj Ratn Pranesh, Mehrdad Farokhnejad, Ambesh Shekhar, Genoveva Vargas-Solar
Association for Computational Linguistics (ACL), 2021
Paper
TLDR: This paper presents a multilingual COVID-19 related tweet analysis method, CMTA, that uses BERT, a deep learning model, for multilingual tweet misinformation detection and classification. CMTA extracts features from multilingual textual data, which is then categorized into specific information classes. Classification is done by a Dense-CNN model trained on tweets manually annotated into information classes (i.e., 'false', 'partly false', 'misleading'). The paper presents an analysis of multilingual tweets from February to June, showing the distribution of the types of information spread across different languages. To assess the performance of the CMTA multilingual model, we performed a comparative analysis of 8 monolingual models and CMTA on the misinformation detection task. The results show that our proposed CMTA model surpassed various monolingual models, which confirms that a multilingual framework can be developed through transfer learning.
-
Looking for COVID-19 misinformation in multilingual social media texts
Raj Ratn Pranesh, Mehrdad Farokhnejad, Ambesh Shekhar, Genoveva Vargas-Solar
ADBIS 2021: 25th European Conference on Advances in Databases and Information Systems
Paper
-
Biomedical Network Link Prediction using Neural Network Graph Embedding
Sumit Kumar, Raj Ratn Pranesh, Ambesh Shekhar
ACM CoDS-COMAD, 2021 (Young Researchers’ Symposium)
Paper
TLDR: In this paper, we study graph embedding learning for automatically learning low-dimensional node representations on biomedical networks. The purpose is to use different neural graph embedding methods to conduct analysis on 3 major biomedical link prediction tasks: drug-disease association (DDA) prediction, drug-drug interaction (DDI) classification, and protein-protein interaction (PPI) classification. We observe that graph embedding methods achieve promising results without the use of any biological features.
-
A Conglomerate of Multiple OCR Table Detection and Extraction
Smita Pallavi, Raj Ratn Pranesh, Sumit Kumar
22nd International Conference on Document Analysis and Recognition, 2020
Paper
TLDR: Representing information as tables is a compact and concise method that eases searching, indexing, and storage. Extracting and cloning tables from parsable documents is easier and widely used; however, industry still faces challenges in detecting and extracting tables from OCR documents or images. This paper proposes an algorithm that detects and extracts multiple tables from an OCR document. The algorithm uses a combination of image processing techniques, text recognition, and procedural coding to identify distinct tables in the same image and map the text to the appropriate corresponding cell in a dataframe, which can be stored as comma-separated values, a database, Excel, and multiple other usable formats.
-
Automated Medical Assistance: Attention Based Consultation System
Raj Ratn Pranesh, Ambesh Shekhar, Sumit Kumar
NeurIPS, 2020 (MLPH: Machine Learning in Public Health Workshop)
Paper
TLDR: We designed three transformer-based encoder-decoder models, namely BERT, GPT2, and BART, and trained them on a large dialogue dataset for text generation. We performed a comparative study of the models, and in our analysis we found that the BART model generates doctor-like responses that contain clinically informative data. The overall generated results were very promising and show that, through transfer learning, pre-trained transformers are reliable for developing automated medical assistance systems and doctor-like treatments.
-
MemeSem: A Multi-modal Framework for Sentimental Analysis of Meme via Transfer Learning
Raj Ratn Pranesh, Ambesh Shekhar
37th International Conference on Machine Learning (ICML), 2020 (4th Lifelong Learning Workshop)
Paper
TLDR: In this paper, we present MemeSem, a multimodal deep neural network framework for sentiment analysis of memes via transfer learning. Our proposed model uses VGG19 pretrained on the ImageNet dataset and the BERT language model to learn the visual and textual features of the meme and combines them to make predictions.
-
QuesBELM: A BERT based Ensemble Language Model for Natural Questions
Raj Ratn Pranesh, Ambesh Shekhar, Smita Pallavi
5th IEEE ICCCS (International Conference)
Paper
TLDR: In our work, we systematically compare the performance of powerful variants of the Transformer architecture (BERT-base, BERT-large-WWM, and ALBERT-XXL) on the Natural Questions dataset. We also propose a state-of-the-art BERT-based ensemble language model, QuesBELM. QuesBELM leverages the power of existing BERT variants combined to build a more accurate stacking ensemble model for question answering (QA) systems.
-
Analysis of Resource-efficient Predictive Models for Natural Language Processing
Raj Ratn Pranesh, Ambesh Shekhar
EMNLP, 2020 (SustaiNLP Workshop)
Paper
TLDR: In this paper, we present an analysis of resource-efficient predictive models, namely Bonsai, Binary Neighbor Compression (BNC), ProtoNN, Random Forest, Naive Bayes, and Support Vector Machine (SVM), for resource-constrained devices. These models try to minimize resource requirements such as RAM and storage without hurting accuracy much. We applied these models to multiple benchmark natural language processing tasks: sentiment analysis, spam message detection, emotion analysis, and fake news classification.
-
Classifying Micro-text Document Datasets: Application to Query Expansion of Crisis-Related Tweets
Mehrdad Farokhnejad, Raj Ratn Pranesh, Javier A. Espinosa-Oviedo
ICSOC, 2020 (STRAPS 2020 Workshop).
Paper
TLDR: This paper introduces an approach based on classification and query expansion techniques in the context of micro-text (i.e., tweet) search. In our approach, a user’s query is rewritten using a classified vocabulary derived from the top-k results, to better reflect her search intent.
-
A Fine-Grained Analysis of Misinformation in COVID-19 Tweets
Sumit Kumar, Raj Ratn Pranesh, Kathleen M. Carley
Computational and Mathematical Organization Theory, Springer.
Paper
TLDR: In this paper, we propose a Twitter dataset for fine-grained classification. Our dataset consists of 1,970 manually annotated tweets categorized into 4 misinformation classes, i.e., “Irrelevant”, “Conspiracy”, “True Information”, and “False Information”, on the basis of the responses that emerged during COVID-19. In this work, we also generated useful insights on our dataset and performed a systematic analysis of various language models.
-
CLPLM: Character Level Pretrained Language Model for Extracting Support Phrases for Sentiment Labels
Raj Ratn Pranesh, Sumit Kumar, Ambesh Shekhar
17th International Conference on Natural Language Processing, 2020.
Paper
TLDR: In this paper, we design a character-level pre-trained language model for extracting support phrases from tweets based on the sentiment label. We also propose a character-level ensemble model that blends Pre-trained Contextual Embeddings (PCE) models (RoBERTa, BERT, and ALBERT) with neural network models (RNN, CNN, and WaveNet) at different stages of the model. For a given tweet and its associated sentiment label, our model predicts the span of phrases in the tweet that prompts that sentiment.
Experience
-
ML/NLP Engineer at Marsview.ai, Sunnyvale, California, June 2021 – August 2021
Topic: Developed a transformer- and Elasticsearch-based machine reading comprehension ML pipeline for extracting answers from long context inputs. Designed an end-to-end automatic speech recognition system with online real-time decoding for transcribing live audio inputs.
-
Research & Development Intern at Samsung India, January 2021 – May 2021
Topic: Worked on intent detection for a social media application with the on-device AI NLP team.
-
Machine Learning Engineer Intern, January 2021 – April 2021
Mentor: Nemji Vora
Topic: Working with the ML team on developing an automated heart arrhythmia detection system using ECG signals. Using the MIT-BIH Arrhythmia Dataset to extract 10-second ECG strips, followed by a median filter to remove baseline wander and a discrete wavelet transform to remove the remaining noise. Working toward a frugal, efficient, yet robust deep learning model for the ECG classification task.
-
School of Computer Science, Carnegie Mellon University
Undergraduate Research Assistant, June 2020 – November 2020
Mentor: Prof. Kathleen M. Carley
Topic: Working under Prof. Kathleen M. Carley on fine-grained classification and inference tasks for tweet misinformation during the COVID-19 outbreak.
-
EECS, University of California, Berkeley
Undergraduate Research Assistant, May 2020 – November 2020
Mentor: Prof. Kurt Keutzer
Topic: Working under Prof. Kurt Keutzer on word segmentation for Sanskrit, a low-resource language.
-
AI-ML-NLP Lab, Indian Institute of Technology, Patna
Winter Research Intern, December 2019 – February 2020
Mentor: Prof. Sriparna Saha
Topic: Working at the AI-ML-NLP lab under Prof. Sriparna Saha on developing a deep learning multi-modal framework for protein-protein interaction using protein sequence and 3D protein structure.
-
Université Grenoble Alpes, CNRS, LIG
Summer Research Assistant, May 2019 – July 2019
Mentor: Prof. Genoveva Vargas Solar
Topic: Worked with the HADAS team of LIG under the supervision of Dr. Genoveva Vargas Solar on developing a human-guided data exploration approach for efficiently extracting knowledge from a disaster dataset (micro-blogs) based on a given user-generated query.
Awards and Recognition
- Recipient of a visiting scholar fellowship at Sorbonne University, Paris, France: a fully funded research internship under the supervision of Prof. Antoine Miné in the APR team of the LIP6 lab (CNRS).
- Received an ICML 2020 registration grant to support attending and presenting my work at ICML 2020.
- Received a NeurIPS 2020 registration grant to support attending and presenting my work at NeurIPS 2020.
- Virtually presented our work "COVID-19 question-answering exploration system" at the LIG-CNRS workshop on databases and information systems.
- Recognized on the university’s website as a young research scholar on the recommendation of the Institute Director.
- Received a special mention for excellence in research and an internship abroad in the yearly university magazine.
- Start-up idea accepted for presentation at the annual start-up summit "National Seminar on Pragmatic Role of Technological Innovation in Start-Up NPTIS-2020", organized by the Government of India.
- Developed and deployed a Google Assistant service for Google console that was used worldwide for promoting e-learning.
- Selected as a Microsoft Student Partner for 2019-2020. Conducted a seminar and workshop to raise awareness.
- Actively participated in competitive coding contests and data science competitions. Six-star coder on the HackerRank platform and contributor to Kaggle competitions.
Open-Source Contribution
- Contributed to HuggingFace pretrained models: Trained BERT-base and RoBERTa language models from scratch using the Sanskrit DCS corpus in IAST and SLP1 formats. The models were published. [August 2020 – October 2020]
- Google Code-In Mentor, TensorFlow: Selected as a mentor by Paige Bailey for the TensorFlow organization to supervise and guide young students in deep learning and the open-source community through TensorFlow. [November 2019 – January 2020]
- Microsoft repository bug fix: Contributed an Apex and PyTorch bug fix to the Microsoft MT-DNN repository. [August 2019 – October 2019]
Technical Skills
- Languages: Java, Python, C/C++, OCaml, SQL, HTML/CSS, LaTeX
- Scientific Libraries: TensorFlow, Keras, PyTorch, Scikit-learn, Pandas, NumPy, NLTK
- Developer Tools: Git, Google Cloud Platform, VS Code, Google Colab, Terminal, Android Studio