Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions
Ali Mohammadi,Vedula Bhaskara Hanuma,Hemank Lamba,Edward Raff,Ponnurangam Kumaraguru,Francis Ferraro,Manas Gaur
@inproceedings{bib_Do_L_2025, AUTHOR = {Ali Mohammadi, Vedula Bhaskara Hanuma, Hemank Lamba, Edward Raff, Ponnurangam Kumaraguru, Francis Ferraro, Manas Gaur}, TITLE = {Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}. YEAR = {2025}}
Do LLMs genuinely incorporate external definitions, or do they primarily rely on their parametric knowledge? To address these questions, we conduct controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped definitions. Our results reveal that while explicit label definitions can enhance accuracy and explainability, their integration into an LLM's task-solving processes is neither guaranteed nor consistent, suggesting reliance on internalized representations in many cases. Models often default to their internal representations, particularly in general tasks, whereas domain-specific tasks benefit more from explicit definitions. These findings underscore the need for a deeper understanding of how LLMs process external knowledge alongside their pre-existing capabilities.
SEMMA: A Semantic Aware Knowledge Graph Foundation Model
Arvindh A,Sumit Kumar,Mojtaba Nayyeri,Bo Xiong,Ponnurangam Kumaraguru,Antonio Vergari,Steffen Staab
@inproceedings{bib_SEMM_2025, AUTHOR = {Arvindh A, Sumit Kumar, Mojtaba Nayyeri, Bo Xiong, Ponnurangam Kumaraguru, Antonio Vergari, Steffen Staab}, TITLE = {SEMMA: A Semantic Aware Knowledge Graph Foundation Model}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}. YEAR = {2025}}
Knowledge Graph Foundation Models (KGFMs) have shown promise in enabling zero-shot reasoning over unseen graphs by learning transferable patterns. However, most existing KGFMs rely solely on graph structure, overlooking the rich semantic signals encoded in textual attributes. We introduce SEMMA, a dual-module KGFM that systematically integrates transferable textual semantics alongside structure. SEMMA leverages Large Language Models (LLMs) to enrich relation identifiers, generating semantic embeddings that subsequently form a textual relation graph, which is fused with the structural component. Across 54 diverse KGs, SEMMA outperforms purely structural baselines like ULTRA in fully inductive link prediction. Crucially, we show that in more challenging generalization settings, where the test-time relation vocabulary is entirely unseen, structural methods collapse while SEMMA is 2x more effective. Our findings demonstrate that textual semantics are critical for generalization in settings where structure alone fails, highlighting the need for foundation models that unify structural and linguistic signals in knowledge reasoning.
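One way to picture the textual module described above is sketched below: relation identifiers are embedded with an off-the-shelf text encoder and connected into a relation graph by cosine similarity. The encoder, the threshold, and the absence of the LLM enrichment step are all stand-ins rather than SEMMA's actual pipeline.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    relations = ["born_in", "place_of_birth", "employer", "works_for"]
    enc = SentenceTransformer("all-MiniLM-L6-v2")        # stand-in text encoder
    E = enc.encode(relations, normalize_embeddings=True)

    sim = E @ E.T                                        # cosine similarities
    threshold = 0.5                                      # hypothetical cutoff
    edges = [(relations[i], relations[j], float(sim[i, j]))
             for i in range(len(relations)) for j in range(i + 1, len(relations))
             if sim[i, j] > threshold]
    # `edges` defines a textual relation graph that would later be fused with the
    # structural component; the fusion itself is not sketched here.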
Enhancing AI Safety Through the Fusion of Low Rank Adapters
G Satya Swaroop,Sreeram Vipparla,Harpreet Singh,Shashwat Goel,Ponnurangam Kumaraguru
@inproceedings{bib_Enha_2025, AUTHOR = {G Satya Swaroop, Sreeram Vipparla, Harpreet Singh, Shashwat Goel, Ponnurangam Kumaraguru}, TITLE = {Enhancing AI Safety Through the Fusion of Low Rank Adapters}, BOOKTITLE = {Research and Applications of Foundation Models for Data Mining and Affective Computing Workshop}. YEAR = {2025}}
Instruction fine-tuning of large language models (LLMs) is a powerful method for improving task-specific performance, but it can inadvertently lead to a phenomenon where models generate harmful responses when faced with malicious prompts. In this paper, we explore Low-Rank Adapter Fusion (LoRA) as a means to mitigate these risks while preserving the model’s ability to handle diverse instructions effectively. Through an extensive comparative analysis against established baselines using recognized benchmark datasets, we demonstrate a 42% reduction in the harmfulness rate by leveraging LoRA fusion between a task adapter and a safety adapter, the latter of which is specifically trained on our safety dataset. In addition, we made noteworthy observations related to exaggerated safety behavior, where the model rejects safe prompts that closely resemble unsafe ones.
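The adapter-fusion idea can be pictured with a minimal PyTorch sketch. The shapes, the zero-initialised adapter weights, and the equal fusion weights below are placeholders rather than the paper's setup; in practice one would fuse trained adapters, for example via a PEFT-style weighted merge.

    import torch

    d, r = 4096, 8                          # hypothetical hidden size and LoRA rank
    W_base = torch.randn(d, d)              # frozen base weight of one linear layer

    # Hypothetical task and safety adapters; each contributes a low-rank update B @ A.
    A_task, B_task = torch.randn(r, d), torch.zeros(d, r)
    A_safe, B_safe = torch.randn(r, d), torch.zeros(d, r)

    def fused_forward(x, w_task=1.0, w_safe=1.0):
        # Base layer plus a weighted sum of the two low-rank updates.
        delta = w_task * (B_task @ A_task) + w_safe * (B_safe @ A_safe)
        return x @ (W_base + delta).T

    y = fused_forward(torch.randn(2, d))    # setting w_safe=0 recovers task-only behaviour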
TAMAS: A Dataset for Investigating Security Risks in Multi-Agent LLM Systems
Kavathekar Ishan Kishorkumar,Jain Hemang Ashok,Ameya Sandesh Rathod,Ponnurangam Kumaraguru,Tanuja Ganu
@inproceedings{bib_TAMA_2025, AUTHOR = {Kavathekar Ishan Kishorkumar, Jain Hemang Ashok, Ameya Sandesh Rathod, Ponnurangam Kumaraguru, Tanuja Ganu}, TITLE = {TAMAS: A Dataset for Investigating Security Risks in Multi-Agent LLM Systems}, BOOKTITLE = {International Conference on Machine Learning}. YEAR = {2025}}
Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents through tool use, planning, and decision-making abilities, leading to their widespread adoption across diverse tasks. As task complexity grows, multi-agent LLM systems are increasingly used to collaboratively solve problems. However, the safety and security of these multi-agent systems remain largely unexplored. Existing benchmarks and datasets predominantly focus on single-agent settings, failing to capture the unique vulnerabilities of multi-agent dynamics and coordination. To address this gap, we introduce Threats and Attacks in Multi-Agent Systems (TAMAS), a dataset designed to evaluate the robustness and security of multi-agent LLM systems. TAMAS includes five distinct scenarios comprising 250 adversarial instances across five attack types and 163 different normal and attack tools, along with 100 harmless tasks. We assess system performance across 5 backbone LLMs and 3 agent interaction configurations from the Autogen framework, highlighting critical challenges and failure modes in current multi-agent deployments. Our findings show that multi-agent systems are highly vulnerable to adversarial attacks, with Impersonation reaching a 73% success rate and other attacks ranging from 27% to 67%, underscoring the need for stronger defenses.
Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes
Rahul Garg,Jain Hemang Ashok,Ponnurangam Kumaraguru,Ugur Kursuncu
@inproceedings{bib_Just_2025, AUTHOR = {Rahul Garg, Jain Hemang Ashok, Ponnurangam Kumaraguru, Ugur Kursuncu}, TITLE = {Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes}, BOOKTITLE = {Association for Computational Linguistics - Findings}. YEAR = {2025}}
Toxicity identification in online multimodal environments remains a challenging task due to the complexity of contextual connections across modalities (e.g., textual and visual). In this paper, we propose a novel framework that integrates Knowledge Distillation (KD) from Large Visual Language Models (LVLMs) and knowledge infusion to enhance the performance of toxicity detection in hateful memes. Our approach extracts sub-knowledge graphs from ConceptNet, a large-scale commonsense Knowledge Graph (KG), to be infused within a compact VLM framework. The relational context between toxic phrases in captions and memes, as well as visual concepts in memes, enhances the model's reasoning capabilities. Experimental results from our study on two hate speech benchmark datasets demonstrate superior performance over the state-of-the-art baselines across AU-ROC, F1, and Recall, with improvements of 1.1%, 7%, and 35%, respectively. Given the contextual complexity of the toxicity detection task, our approach showcases the significance of learning from both explicit (i.e., KG) and implicit (i.e., LVLMs) contextual cues incorporated through a hybrid neurosymbolic approach. This is crucial for real-world applications where accurate and scalable recognition of toxic content is critical for creating safer online environments.
From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences
Kodali Prashant,Anmol Goel,Likhith Asapu,Vamshi Krishna Bonagiri,Anirudh Govil,Monojit Choudhury,Ponnurangam Kumaraguru,Manish Shrivastava
@inproceedings{bib_From_2025, AUTHOR = {Kodali Prashant, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Ponnurangam Kumaraguru, Manish Shrivastava}, TITLE = {From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences}, BOOKTITLE = {ACM Transactions on Asian and Low-Resource Language Information Processing}. YEAR = {2025}}
Current computational approaches for analysing or generating code-mixed sentences do not explicitly model the ``naturalness'' or ``acceptability'' of code-mixed sentences, but rely on training corpora to reflect the distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples drawn from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter/curate/compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics as features are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, among encoder models, XLM-RoBERTa and Bernice outperform IndicBERT across different configurations. Among encoder-decoder models, mBART performs better than mT5; however, encoder-decoder models are not able to outperform encoder-only models. Decoder-only models perform the best compared to all other MLLMs, with Llama 3.2 3B models outperforming similarly sized Qwen and Phi models. Comparison with the zero- and few-shot capabilities of ChatGPT shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from En-Hi to En-Te acceptability judgments is better than random baselines.
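For reference, one of the metrics named above, the Code-Mixing Index (CMI), can be computed from per-token language tags. The sketch below uses the standard CMI formulation and assumes tags are already available; the tag set 'en'/'hi'/'univ' is illustrative.

    from collections import Counter

    def cmi(lang_tags):
        """Code-Mixing Index from a list of per-token language tags.
        Tags like 'en' or 'hi'; 'univ' marks language-independent tokens."""
        n = len(lang_tags)
        u = sum(1 for t in lang_tags if t == "univ")
        if n == u:                        # no language-tagged tokens
            return 0.0
        counts = Counter(t for t in lang_tags if t != "univ")
        return 100.0 * (1.0 - max(counts.values()) / (n - u))

    print(cmi(["en", "hi", "hi", "en", "univ"]))  # 50.0: an evenly mixed example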
A shot of Cognac to forget bad memories: Corrective Unlearning in GNNs
Varshita Kolipaka,Akshit Sinha,Debangan Mishra,Sumit Kumar,Arvindh A,Shashwat Goel,Ponnurangam Kumaraguru
@inproceedings{bib_A_sh_2025, AUTHOR = {Varshita Kolipaka, Akshit Sinha, Debangan Mishra, Sumit Kumar, Arvindh A, Shashwat Goel, Ponnurangam Kumaraguru}, TITLE = {A shot of Cognac to forget bad memories: Corrective Unlearning in GNNs}, BOOKTITLE = {International Conference on Machine Learning}. YEAR = {2025}}
Graph Neural Networks (GNNs) are increasingly being used for a variety of ML applications on graph data. Because graph data does not follow the independently and identically distributed (i.i.d.) assumption, adversarial manipulations or incorrect data can propagate to other data points through message passing, which deteriorates the model's performance. To allow model developers to remove the adverse effects of manipulated entities from a trained GNN, we study the recently formulated problem of Corrective Unlearning. We find that current graph unlearning methods fail to unlearn the effect of manipulations even when the whole manipulated set is known. We introduce a new graph unlearning method, Cognac, which can unlearn the effect of the manipulation set even when only a small fraction of it is identified. It recovers most of the performance of a strong oracle with fully corrected training data, even beating retraining from scratch without the deletion set, and is 8x more efficient while also scaling to large datasets. We hope our work assists GNN developers in mitigating harmful effects caused by issues in real-world data, post-training.
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
Shiven Sinha,Shashwat Goel,Ponnurangam Kumaraguru,Jonas Geiping,Matthias Bethge,Ameya Prabhu
@inproceedings{bib_Can__2025, AUTHOR = {Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, Ameya Prabhu}, TITLE = {Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation}, BOOKTITLE = {workshop on International Conference on Learning Representations}. YEAR = {2025}}
There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than challenge them. We advocate for developing benchmarks that evaluate this inverse capability — creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for only < 9% of incorrect solutions in REFUTE, even though ratings indicate its ability to solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs’ ability to falsify incorrect solutions — a capability that is crucial for both accelerating research and making models self-improve through reliable reflective reasoning.
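The automatic evaluation described here reduces to checking whether a proposed input makes the incorrect submission and a reference solution disagree. A minimal harness is sketched below; the file names, the plain equality check, and the timeout are assumptions, and real competitive-programming judges often need problem-specific checkers.

    import subprocess

    def is_counterexample(candidate_input: str,
                          incorrect_cmd=("python", "incorrect.py"),
                          reference_cmd=("python", "reference.py"),
                          timeout=5) -> bool:
        """True if the two programs produce different outputs on this input."""
        def run(cmd):
            res = subprocess.run(list(cmd), input=candidate_input, text=True,
                                 capture_output=True, timeout=timeout)
            return res.stdout.strip()
        return run(incorrect_cmd) != run(reference_cmd)

    # A language model proposes candidate inputs; one is accepted as a counterexample
    # only if is_counterexample(...) returns True.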
Deep learning and transfer learning to understand emotions: a PoliEMO dataset and multi-label classification in Indian elections.
Anuradha Gupta,Shikha Mehta,Ponnurangam Kumaraguru
@inproceedings{bib_Deep_2025, AUTHOR = {Anuradha Gupta, Shikha Mehta, Ponnurangam Kumaraguru}, TITLE = {Deep learning and transfer learning to understand emotions: a PoliEMO dataset and multi-label classification in Indian elections.}, BOOKTITLE = {International Journal of Data Science and Analytics}. YEAR = {2025}}
Understanding user emotions to identify user opinion, sentiment, stance, and preferences has become a hot topic of research in the last few years. Many studies and datasets have been designed for user emotion analysis, drawing on sources such as news websites, blogs, and user tweets. However, there is little exploration of political emotions in the Indian context for multi-label emotion detection. This paper presents the PoliEMO dataset—a novel benchmark corpus of political tweets in a multi-label setup for Indian elections, consisting of over 3,512 manually annotated tweets. In this work, 6,792 labels were generated for six emotion categories: anger, insult, joy, neutral, sadness, and shameful. Next, the PoliEMO dataset is used to understand emotions in a multi-label context using state-of-the-art machine learning algorithms with multi-label classifiers (binary relevance (BR), label powerset (LP), classifier chain (CC), and multi-label k-nearest neighbors (MkNN)) and deep learning models like convolutional neural network (CNN), long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM), and a transfer learning model, i.e., bidirectional encoder representations from transformers (BERT). Experiments and results show that Bi-LSTM performs best, with a micro-averaged F1 score of 0.81, a macro-averaged F1 score of 0.78, and an accuracy of 0.68, as compared to state-of-the-art approaches.
Framing the Fray: Conflict Framing in Indian Election News Coverage
C S Ramakrishna Tejasvi,Rohan Chowdary V Modepalle,Nemani Harsha Vardhan,Ponnurangam Kumaraguru,Ashwin Rajadesingan
@inproceedings{bib_Fram_2025, AUTHOR = {C S Ramakrishna Tejasvi, Rohan Chowdary V Modepalle, Nemani Harsha Vardhan, Ponnurangam Kumaraguru, Ashwin Rajadesingan}, TITLE = {Framing the Fray: Conflict Framing in Indian Election News Coverage}, BOOKTITLE = {ACM Web Science Conference}. YEAR = {2025}}
In covering elections, journalists often use conflict frames which depict events and issues as adversarial, often highlighting confrontations between opposing parties. Although conflict frames result in more citizen engagement, they may distract from substantive policy discussion. In this work, we analyze the use of conflict frames in online English-language news articles by seven major news outlets in the 2014 and 2019 Indian general elections. We find that the use of conflict frames is not linked to the news outlets' ideological biases but is associated with TV-based (rather than print-based) media. Further, the majority of news outlets do not exhibit ideological biases in portraying parties as aggressors or targets in articles with conflict frames. Finally, comparing news articles reporting on political speeches to their original speech transcripts, we find that, on average, news outlets tend to consistently report on attacks on the opposition party in the speeches but under-report on more substantive electoral issues covered in the speeches such as farmers' issues and infrastructure.
Personal Narratives Empower Politically Disinclined Individuals to Engage in Political Discussions
C S Ramakrishna Tejasvi,Ponnurangam Kumaraguru,Ashwin Rajadesingan
ACM Web Science Conference, ACMWSC, 2025
@inproceedings{bib_Pers_2025, AUTHOR = {C S Ramakrishna Tejasvi, Ponnurangam Kumaraguru, Ashwin Rajadesingan}, TITLE = {Personal Narratives Empower Politically Disinclined Individuals to Engage in Political Discussions}, BOOKTITLE = {ACM Web Science Conference}. YEAR = {2025}}
Engaging in political discussions is crucial in democratic societies, yet many individuals remain politically disinclined due to various factors such as perceived knowledge gaps, conflict avoidance, or a sense of disconnection from the political system. In this paper, we explore the potential of personal narratives—short, first-person accounts emphasizing personal experiences—as a means to empower these individuals to participate in online political discussions. Using a text classifier that identifies personal narratives, we conducted a large-scale computational analysis to evaluate the relationship between the use of personal narratives and participation in political discussions on Reddit. We find that politically disinclined individuals (PDIs) are more likely to use personal narratives than more politically active users. Personal narratives are more likely to attract and retain politically disinclined individuals in political discussions than other comments. Importantly, personal narratives posted by politically disinclined individuals are received more positively than their other comments in political communities. These results emphasize the value of personal narratives in promoting inclusive political discourse.
COBIAS: Assessing the Contextual Reliability of Bias Benchmarks for Language Models
Priyanshul Govil,Jain Hemang Ashok,Vamshi Krishna Bonagiri,Aman Chadha,Ponnurangam Kumaraguru,Manas Gaur,Sanorita Dey
ACM Web Science Conference, ACMWSC, 2025
@inproceedings{bib_COBI_2025, AUTHOR = {Priyanshul Govil, Jain Hemang Ashok, Vamshi Krishna Bonagiri, Aman Chadha, Ponnurangam Kumaraguru, Manas Gaur, Sanorita Dey}, TITLE = {COBIAS: Assessing the Contextual Reliability of Bias Benchmarks for Language Models}, BOOKTITLE = {ACM Web Science Conference}. YEAR = {2025}}
Large Language Models (LLMs) often inherit biases from the web data they are trained on, which contains stereotypes and prejudices. Current methods for evaluating and mitigating these biases rely on bias-benchmark datasets. These benchmarks measure bias by observing an LLM's behavior on biased statements. However, these statements lack contextual considerations of the situations they try to present. To address this, we introduce a contextual reliability framework, which evaluates model robustness to biased statements by considering the various contexts in which they may appear. We develop the Context-Oriented Bias Indicator and Assessment Score (COBIAS) to measure a biased statement's reliability in detecting bias, based on the variance in model behavior across different contexts. To evaluate the metric, we augmented 2,291 stereotyped statements from two existing benchmark datasets by adding contextual information. We show that COBIAS aligns with human judgment on the contextual reliability of biased statements (Spearman's ρ = 0.65, p = 3.4 × 10⁻⁶⁰) and can be used to create reliable benchmarks, which would assist bias mitigation efforts.
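The variance-across-contexts idea behind the metric can be sketched as follows. Here score_statement is a hypothetical hook returning some model behaviour score for a piece of text, and the toy scorer only demonstrates the wiring; the actual COBIAS formulation and its aggregation into a benchmark-level score are defined in the paper, not here.

    import statistics

    def contextual_variability(statement, contexts, score_statement):
        """Variance of a model's behaviour on one statement placed in different contexts.
        High variance suggests the bare statement is an unreliable probe of bias."""
        scores = [score_statement(f"{ctx} {statement}") for ctx in contexts]
        return statistics.pvariance(scores)

    # Toy scorer standing in for a real model-based score.
    toy_scorer = lambda text: float(len(text) % 7)
    print(contextual_variability("[group] are bad drivers.",
                                 ["In a comedy sketch,", "In a news report,"],
                                 toy_scorer))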
Higher Order Structures For Graph Explanations
Akshit Sinha,Sreeram Reddy Vennam,Charu Sharma,Ponnurangam Kumaraguru
AAAI Conference on Artificial Intelligence, AAAI, 2025
@inproceedings{bib_High_2025, AUTHOR = {Akshit Sinha, Sreeram Reddy Vennam, Charu Sharma, Ponnurangam Kumaraguru}, TITLE = {Higher Order Structures For Graph Explanations}, BOOKTITLE = {AAAI Conference on Artificial Intelligence}. YEAR = {2025}}
Graph Neural Networks (GNNs) have emerged as powerful tools for learning representations of graph-structured data, demonstrating remarkable performance across various tasks. Recognising their importance, there has been extensive research focused on explaining GNN predictions, aiming to enhance their interpretability and trustworthiness. However, GNNs and their explainers face a notable challenge: graphs are primarily designed to model pair-wise relationships between nodes, which can make it difficult to capture higher-order, multi-node interactions. This characteristic can pose difficulties for existing explainers in fully representing multi-node relationships. To address this gap, we present Framework For Higher-Order Representations In Graph Explanations (FORGE), a framework that enables graph explainers to capture such interactions by incorporating higher-order structures, resulting in more accurate and faithful explanations. Extensive evaluation shows that, averaged across various graph explainers, FORGE improves explanation accuracy by 1.9x on real-world datasets from the GraphXAI benchmark and by 2.25x on synthetic datasets. We perform ablation studies to confirm the importance of higher-order relations in improving explanations, while our scalability analysis demonstrates FORGE's efficacy on large graphs.
Corrective Machine Unlearning
https://openreview.net/pdf/cf959d04ea5f33179a7208c85c5bce756b1bcf3f.pdf
Shashwat Goel,Ameya Prabhu,Philip Torr,Ponnurangam Kumaraguru,Amartya Sanyal
Transactions in Machine Learning Research, TMLR, 2024
@inproceedings{bib_http_2024, AUTHOR = {Shashwat Goel, Ameya Prabhu, Philip Torr, Ponnurangam Kumaraguru, Amartya Sanyal}, TITLE = {Corrective Machine Unlearning}, BOOKTITLE = {Transactions in Machine Learning Research}. YEAR = {2024}}
Machine Learning models increasingly face data integrity challenges due to the use of large-scale training datasets drawn from the Internet. We study what model developers can do if they detect that some data was manipulated or incorrect. Such manipulated data can cause adverse effects including vulnerability to backdoored samples, systemic biases, and reduced accuracy on certain input domains. Realistically, all manipulated training samples cannot be identified, and only a small, representative subset of the affected data can be flagged.
We formalize ``Corrective Machine Unlearning'' as the problem of mitigating the impact of data affected by unknown manipulations on a trained model, only having identified a subset of the corrupted data. We demonstrate that the problem of corrective unlearning has significantly different requirements from traditional privacy-oriented unlearning. We find most existing unlearning methods, including retraining-from-scratch without the deletion set, require most of the manipulated data to be identified for effective corrective unlearning. However, one approach, Selective Synaptic Dampening, achieves limited success, unlearning adverse effects with just a small portion of the manipulated samples in our setting, which shows encouraging signs for future progress. We hope our work spurs research towards developing better methods for corrective unlearning and offers practitioners a new strategy to handle data integrity challenges arising from web-scale training.
Effectiveness of Higuchi fractal dimension in differentiating subgroups of stressed and non-stressed individuals.
Nishtha Phutela,Goldie Gabrani,Ponnurangam Kumaraguru,Devanjali Relan
Multimedia Tools and Applications, MT&A, 2024
@inproceedings{bib_Effe_2024, AUTHOR = {Nishtha Phutela, Goldie Gabrani, Ponnurangam Kumaraguru, Devanjali Relan}, TITLE = {Effectiveness of Higuchi fractal dimension in differentiating subgroups of stressed and non-stressed individuals.}, BOOKTITLE = {Multimedia Tools and Applications}. YEAR = {2024}}
Stress is a significant mental health problem of the 21st century. The number of people suffering from stress is increasing rapidly. Thus, easy-to-use, inexpensive, and accurate biomarkers are needed to detect stress during its inception. Early detection of stress-related diseases allows people to access health-care services and leads to the development of new therapies. Thus, for the early detection of stress, a biomarker would be beneficial. We aim to find if there are significant differences between stressed and non-stressed groups of participants and which brain region gets impacted during stress. We conducted experiments to acquire EEG signals to identify the most significant brain region that gets affected due to stress. This research investigates if the Higuchi Fractal Dimension (HFD) extracted from EEG signals can act as a potential stress biomarker. A gamified mobile application, known as the Color Word and Memory Test (CWMT), was developed, inspired by the well-known Stroop Test, to elicit mental stress at two different difficulty levels. A MUSE headband with four EEG sensors (TP9, AF7, AF8, TP10) was used to collect the EEG signals at the two difficulty levels. The HFD was extracted from EEG signals acquired from 32 participants while they were exposed to stress stimuli (using the proposed CWMT application), and we then performed three experimental analyses, i.e., Analyses I, II, and III. Analysis I was performed on all 32 participants performing low- and high-difficulty tasks on the CWMT application. Analysis II was performed on sub-groups of participants. These sub-groups were made based on the scores obtained by participants while using the CWMT application. Analysis II aimed to find the most significantly impacted brain region during stress. For Analysis III, we used the same sub-groups as Analysis II, and it aimed to identify any differences between the left and right hemispheres during stress. We conducted the statistical analysis, and p-values were calculated between the two groups (non-stressed and stressed) to detect EEG channels and brain frequencies significantly associated with stress. We performed three experimental analyses (Analyses I and II are intra-hemisphere, and Analysis III is inter-hemisphere). In Analysis I, we inferred that beta and alpha frequencies from the AF8 region of the brain are affected during stress. In Analysis II, we inferred that beta waves from the AF8 region are a characteristic indicator of stress. In Analysis III, we identified significant differences between the left and right parts of the brain during stress. We found a significant difference (p<0.05) between HFD values in the stressed and non-stressed groups in the AF8 region. Our results indicated that stressed patients have a significantly higher value of HFD in the frontal areas.
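Since the analysis rests on the Higuchi Fractal Dimension, a standard NumPy implementation of HFD is sketched below. The k_max = 10 setting is an arbitrary choice, and the paper's preprocessing and parameter settings are not reproduced here.

    import numpy as np

    def higuchi_fd(x, k_max=10):
        """Higuchi fractal dimension of a 1-D signal (standard formulation)."""
        x = np.asarray(x, dtype=float)
        N = len(x)
        L = []
        for k in range(1, k_max + 1):
            Lmk = []
            for m in range(k):
                idx = np.arange(m, N, k)          # subsampled series x[m], x[m+k], ...
                if len(idx) < 2:
                    continue
                curve_len = np.sum(np.abs(np.diff(x[idx]))) * (N - 1) / ((len(idx) - 1) * k)
                Lmk.append(curve_len / k)
            L.append(np.mean(Lmk))
        k_vals = np.arange(1, k_max + 1)
        slope, _ = np.polyfit(np.log(1.0 / k_vals), np.log(L), 1)
        return slope

    # Sanity check: white noise should give a value close to 2, a smooth ramp close to 1.
    print(higuchi_fd(np.random.randn(1000)), higuchi_fd(np.linspace(0, 1, 1000)))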
Efficient Knowledge Graph Embeddings via Kernelized Random Projections
Nidhi Goyal,Anmol Goel,Tanuj Garg,Niharika Sachdeva,Ponnurangam Kumaraguru
Big Data Analytics in Astronomy, Science, and Engineering, BDA-ASE, 2024
@inproceedings{bib_Effi_2024, AUTHOR = {Nidhi Goyal, Anmol Goel, Tanuj Garg, Niharika Sachdeva, Ponnurangam Kumaraguru}, TITLE = {Efficient Knowledge Graph Embeddings via Kernelized Random Projections}, BOOKTITLE = {Big Data Analytics in Astronomy, Science, and Engineering}. YEAR = {2024}}
Knowledge Graph Completion (KGC) aims to predict missing entities or relations in a knowledge graph, but it becomes computationally expensive as the KG scales. Existing research focuses on bilinear pooling-based factorization methods (LowFER, TuckER) to solve this problem. These approaches introduce too many trainable parameters, which obstructs the deployment of these techniques in many real-world scenarios. In this paper, we introduce a novel parameter-efficient framework, KGRP, which a) approximates bilinear pooling using a Kernelized Random Projection matrix and b) employs a CNN for the better fusion of entities and relations to infer missing links. Our experimental results show that KGRP has 73% fewer parameters as compared to the state-of-the-art approaches (LowFER, TuckER) for the knowledge graph completion task while retaining 88% of the performance of the best baseline. Furthermore, we also provide novel insights on the interpretability of relation embeddings. We also test the effectiveness of KGRP on a large-scale recruitment knowledge graph of 0.25M entities.
Sanity Checks for Evaluating Graph Unlearning
Varshita Kolipaka,Akshit Sinha,Debangan Mishra,Sumit Kumar,Arvindh A,Shashwat Goel,Ponnurangam Kumaraguru
Conference on Lifelong Learning Agents. PMLR, CoLLAs, 2024
@inproceedings{bib_Sani_2024, AUTHOR = {Varshita Kolipaka, Akshit Sinha, Debangan Mishra, Sumit Kumar, Arvindh A, Shashwat Goel, Ponnurangam Kumaraguru}, TITLE = {Sanity Checks for Evaluating Graph Unlearning}, BOOKTITLE = {Conference on Lifelong Learning Agents. PMLR}. YEAR = {2024}}
Graph neural networks (GNNs) are increasingly being used on sensitive graph-structured data, necessitating techniques for handling unlearning requests on the trained models, particularly node unlearning. However, unlearning nodes on GNNs is challenging due to the interdependence between the nodes in a graph. We compare MEGU, a state-of-the-art graph unlearning method, and SCRUB, a general unlearning method for classification, to investigate the efficacy of graph unlearning methods over traditional unlearning methods. Surprisingly, we find that SCRUB performs comparably or better than MEGU on random node removal and on removing an adversarial node injection attack. Our results suggest that 1) graph unlearning studies should incorporate general unlearning methods like SCRUB as baselines, and 2) there is a need for more rigorous behavioral evaluations that reveal the differential advantages of proposed graph unlearning methods. Our work, therefore, motivates future research into more comprehensive evaluations for assessing the true utility of graph unlearning algorithms.
Towards Infusing Auxiliary Knowledge for Distracted Driver Detection
Ishwar B Balappanawar,Ashmit Chamoli,Ruwan Wickramarachchi,Aditya Mishra,Ponnurangam Kumaraguru,Amit Sheth
KNOWLEDGE DISCOVERY AND DATA MINING WORKSHOPS, KDD-W, 2024
@inproceedings{bib_Towa_2024, AUTHOR = {Ishwar B Balappanawar, Ashmit Chamoli, Ruwan Wickramarachchi, Aditya Mishra, Ponnurangam Kumaraguru, Amit Sheth}, TITLE = {Towards Infusing Auxiliary Knowledge for Distracted Driver Detection}, BOOKTITLE = {KNOWLEDGE DISCOVERY AND DATA MINING WORKSHOPS}. YEAR = {2024}}
Distracted driving is a leading cause of road accidents globally. Identifying distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets.
In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) that infuses auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver’s pose. Specifically, we construct a unified framework that integrates scene graphs and driver pose information with visual cues from video frames to create a holistic representation of the driver’s actions.
Our results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating auxiliary knowledge with visual information.
InSaAF: Incorporating Safety Through Accuracy and Fairness - Are LLMs Ready for the Indian Legal Domain?
Yogesh Tripathi,Ponnurangam Kumaraguru,Raghav Donakanti,Sahil Girhepuje,Kavathekar Ishan Kishorkumar,Vedula Bhaskara Hanuma,Gokul S Krishnan,Shreya Goyal,Anmol Goel,Balaraman Ravindran
International Conference on Legal Knowledge and Information Systems, JURIX, 2024
@inproceedings{bib_InSa_2024, AUTHOR = {Yogesh Tripathi, Ponnurangam Kumaraguru, Raghav Donakanti, Sahil Girhepuje, Kavathekar Ishan Kishorkumar, Vedula Bhaskara Hanuma, Gokul S Krishnan, Shreya Goyal, Anmol Goel, Balaraman Ravindran}, TITLE = {InSaAF: Incorporating Safety Through Accuracy and Fairness - Are LLMs Ready for the Indian Legal Domain?}, BOOKTITLE = {International Conference on Legal Knowledge and Information Systems}. YEAR = {2024}}
Large Language Models (LLMs) have emerged as powerful tools to perform various tasks in the legal domain, ranging from generating summaries to predicting judgments. Despite their immense potential, these models have been proven to learn and exhibit societal biases and make unfair predictions. Hence, it is essential to evaluate these models prior to deployment. In this study, we explore the ability of LLMs to perform Binary Statutory Reasoning in the Indian legal landscape across various societal disparities. We present a novel metric, the β-weighted Legal Safety Score (LSSβ), to evaluate the legal usability of the LLMs. Additionally, we propose a finetuning pipeline, utilising specialised legal datasets, as a potential method to reduce bias. Our proposed pipeline effectively reduces bias in the model, as indicated by improved LSSβ. This highlights the potential of our approach to enhance fairness in LLMs, making them more reliable for legal tasks in socially diverse contexts.
Put Your Money Where Your Mouth Is: Dataset and Analysis of Real World Habit Building Attempts
Hitkul Jangra,Rajiv Ratn Shah,Ponnurangam Kumaraguru
International Conference on Web and Social Media, ICWSM, 2024
@inproceedings{bib_Put__2024, AUTHOR = {Hitkul Jangra, Rajiv Ratn Shah, Ponnurangam Kumaraguru}, TITLE = {Put Your Money Where Your Mouth Is: Dataset and Analysis of Real World Habit Building Attempts}, BOOKTITLE = {International Conference on Web and Social Media}. YEAR = {2024}}
The pursuit of habit building is challenging, and most people struggle with it. Research on successful habit formation is mainly based on small human trials focusing on the same habit for all the participants, as conducting long-term heterogeneous habit studies can be logistically expensive. With the advent of self-help, there has been an increase in online communities and applications that are centered around habit building and logging. Habit building applications can provide large-scale data on real-world habit building attempts and unveil the commonalities among successful ones. We collect public data on stickk.com, which allows users to track progress on habit building attempts called commitments. A commitment can have an external referee, regular check-ins about the progress, and a monetary stake in case of failure. Our data consists of 742,923 users and 397,456 commitments. In addition to the dataset, rooted in theories like the Fresh Start Effect, Accountability, and Loss Aversion, we ask questions about how commitment properties like start date, external accountability, monetary stake, and pursuing multiple habits together affect the odds of success. We found that people tend to start habits on temporal landmarks, but that does not affect the probability of their success. Practices like accountability and stakes are not often used but are strong determinants of success. Commitments of 6 to 8 weeks in length, weekly reporting with an external referee, and a monetary amount at stake tend to be most successful. Finally, around 40% of all commitments are attempted simultaneously with other goals. Simultaneous attempts of pursuing commitments may fail early, but if pursued through the initial phase, they are statistically more successful than building one habit at a time.
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li,Gabriel Mukobi,Nathan Helm-Burger,Rassin Lababidi,Lennart Justen,Andrew Bo Liu,Ponnurangam Kumaraguru,Alexander Pan,Anjali Gopal,Summer Yue,Daniel Berrios,Alice Gatti,Justin D. Li,Ann-Kathrin Dombrowski,Shashwat Goel
International Conference on Machine Learning, ICML, 2024
@inproceedings{bib_The__2024, AUTHOR = {Nathaniel Li, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Bo Liu, Ponnurangam Kumaraguru, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel}, TITLE = {The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning}, BOOKTITLE = {International Conference on Machine Learning}. YEAR = {2024}}
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai
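The unlearning objective behind RMU can be sketched schematically: on forget-set inputs, push one layer's activations towards a fixed random direction; on retain-set inputs, keep them close to a frozen copy of the model. The layer choice and the constants c and alpha below are placeholders rather than the paper's settings.

    import torch
    import torch.nn.functional as F

    def rmu_style_loss(h_forget, h_retain, h_retain_frozen, control_vec,
                       c=10.0, alpha=100.0):
        """Schematic RMU-style objective on hidden states from one chosen layer.
        h_*: (batch, seq, dim) activations of the model being updated;
        h_retain_frozen: the same layer from a frozen copy of the original model;
        control_vec: a fixed random unit vector of shape (dim,)."""
        forget_loss = F.mse_loss(h_forget, c * control_vec.expand_as(h_forget))
        retain_loss = F.mse_loss(h_retain, h_retain_frozen)
        return forget_loss + alpha * retain_loss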
Game-on: graph attention network based multimodal fusion for fake news detection
Mudit Dhawan,Shakshi Sharma,Rajesh Sharma,Kadam Aditya Santosh,Ponnurangam Kumaraguru
Social Network Analysis and Mining, SNAM, 2024
@inproceedings{bib_Game_2024, AUTHOR = {Mudit Dhawan, Shakshi Sharma, Rajesh Sharma, Kadam Aditya Santosh, Ponnurangam Kumaraguru}, TITLE = {Game-on: graph attention network based multimodal fusion for fake news detection}, BOOKTITLE = {Social Network Analysis and Mining}. YEAR = {2024}}
Fake news being spread on social media platforms has a disruptive and damaging impact on our lives. Multimedia content improves the visibility of posts more than text data but is also being used for creating fake news. Previous multimodal works have tried to address the problem of modeling heterogeneous modalities in identifying fake news. However, these works have the following limitations: (1) inefficient encoding of inter-modal relations by utilizing a simple concatenation operator on the modalities at a later stage in a model, which might result in information loss; (2) training very deep neural networks with a disproportionate number of parameters on small multimodal datasets results in higher chances of overfitting. To address these limitations, we propose GAME-ON, a Graph Neural Network based end-to-end trainable framework that allows granular interactions within and across different modalities to learn more robust data representations for multimodal fake news detection. We use two publicly available fake news datasets, Twitter and Weibo, for evaluations. GAME-ON outperforms baselines on Twitter by an average of 11% and achieves state-of-the-art performance on Weibo while using 91% fewer parameters than the best comparable state-of-the-art baseline. For deployment in real-world applications, GAME-ON can be used as a lightweight model (less memory and latency requirements), which makes it more feasible than previous state-of-the-art models.
Counter Turing Test (CT^2): Investigating AI-Generated Text Detection for Hindi - Ranking LLMs based on Hindi AI Detectability Index (ADI_hi)
Kavathekar Ishan Kishorkumar,Anku Rani,Ashmit Chamoli,Ponnurangam Kumaraguru,Amit Sheth,Amitava Das
Empirical Methods in Natural Language Processing-Findings, EMNLP-F, 2024
@inproceedings{bib_Coun_2024, AUTHOR = {Kavathekar Ishan Kishorkumar, Anku Rani, Ashmit Chamoli, Ponnurangam Kumaraguru, Amit Sheth, Amitava Das}, TITLE = {Counter Turing Test (CT^2): Investigating AI-Generated Text Detection for Hindi - Ranking LLMs based on Hindi AI Detectability Index (ADI_hi)}, BOOKTITLE = {Empirical Methods in Natural Language Processing-Findings}. YEAR = {2024}}
The widespread adoption of Large Language Models (LLMs) and awareness around multilingual LLMs have raised concerns regarding the potential risks and repercussions linked to the misapplication of AI-generated text, necessitating increased vigilance. While these models are primarily trained for English, their extensive training on vast datasets covering almost the entire web equips them with capabilities to perform well in numerous other languages. AI-Generated Text Detection (AGTD) has emerged as a topic that has already received immediate attention in research, with some initial methods having been proposed, soon followed by the emergence of techniques to bypass detection. In this paper, we report our investigation on AGTD for an Indic language, Hindi. Our major contributions are fourfold: i) we examine 26 LLMs to evaluate their proficiency in generating Hindi text, ii) we introduce the AI-generated news article in Hindi (AG_hi) dataset, iii) we evaluate the effectiveness of five recently proposed AGTD techniques: ConDA, J-Guard, RADAR, RAIDAR, and Intrinsic Dimension Estimation for detecting AI-generated Hindi text, and iv) we propose the Hindi AI Detectability Index (ADI_hi), which shows a spectrum to understand the evolving landscape of eloquence of AI-generated text in Hindi. The code and dataset are available at https://github.com/ishank31/Counter_Turing_Test
Improving Bias Metrics in Vision-Language Models by Addressing Inherent Model Disabilities
Darur Lakshmipathi Balaji,Gouravarapu Shanmukha Sai Keerthi,Shashwat Goel,Ponnurangam Kumaraguru
Neural Information Processing Systems Workshops, NeurIPS-W, 2024
@inproceedings{bib_Impr_2024, AUTHOR = {Darur Lakshmipathi Balaji, Gouravarapu Shanmukha Sai Keerthi, Shashwat Goel, Ponnurangam Kumaraguru}, TITLE = {Improving Bias Metrics in Vision-Language Models by Addressing Inherent Model Disabilities}, BOOKTITLE = {Neural Information Processing Systems Workshops}. YEAR = {2024}}
The integration of Vision-Language Models (VLMs) into various applications has highlighted the importance of evaluating these models for inherent biases, especially along gender and racial lines. Traditional bias assessment methods in VLMs typically rely on accuracy metrics, assessing disparities in performance across different demographic groups. These methods, however, often overlook the impact of the model's disabilities, like a lack of spatial reasoning, which may skew the bias assessment. In this work, we propose an approach that systematically examines how current bias evaluation metrics account for the model's limitations. We introduce two methods that circumvent these disabilities by integrating spatial guidance from textual and visual modalities. Our experiments aim to refine bias quantification by effectively mitigating the impact of spatial reasoning limitations, offering a more accurate assessment of biases in VLMs.
Exposing Privacy Risks in Indoor Air Pollution Monitoring Systems
Krishna,Shreyash Narendra Gujar,Sachin Chaudhari,Ponnurangam Kumaraguru
International Conference on Environment Pollution and Prevention, ICEPP, 2024
@inproceedings{bib_Expo_2024, AUTHOR = {Krishna, Shreyash Narendra Gujar, Sachin Chaudhari, Ponnurangam Kumaraguru}, TITLE = {Exposing Privacy Risks in Indoor Air Pollution Monitoring Systems}, BOOKTITLE = {International Conference on Environment Pollution and Prevention}. YEAR = {2024}}
Indoor air pollution monitoring has been an area of interest in recent times. Multiple Internet of Things (IoT) enabled devices are available for this purpose. With the growing number of sensors in our daily environment, huge amounts of data are being collected and pushed to the servers through the Internet. This study aims to show that seemingly trivial indoor air pollution data, containing particulate matter, carbon dioxide, and temperature readings, can reveal complex insights about an individual's lifestyle. Data was collected over a period of four months in a real-world environment. The study demonstrates the inference of cooking activities by using machine learning and deep learning techniques. The study further demonstrates that different food items and culinary practices have different air pollution signatures, which can be identified and distinguished with great accuracy (>90%). In the practice of inferential analysis, it is not necessary to rely on data characterised by high frequency or granularity. Less detailed data, like hourly averages, can be used to make meaningful conclusions that might intrude on an individual's privacy. With the rapid advancement in machine learning and deep learning, a proactive approach to privacy is needed to ensure that the collected data and its usage do not intentionally or unintentionally breach individual privacy.
Emergence of Text Semantics in CLIP Image Encoders
Sreeram Reddy Vennam,Shashwat Singh,Anirudh Govil,Ponnurangam Kumaraguru
Neural Information Processing Systems Workshops, NeurIPS-W, 2024
@inproceedings{bib_Emer_2024, AUTHOR = {Sreeram Reddy Vennam, Shashwat Singh, Anirudh Govil, Ponnurangam Kumaraguru}, TITLE = {Emergence of Text Semantics in CLIP Image Encoders}, BOOKTITLE = {Neural Information Processing Systems Workshops}. YEAR = {2024}}
Certain self-supervised approaches to train image encoders, like CLIP, align images with their text captions. However, these approaches do not have an a priori incentive to learn to associate text inside the image with the semantics of the text. Humans process text visually; our work studies the semantics of text rendered in images. We show that the semantic information captured by image representations can decisively classify the sentiment of sentences and is robust against visual attributes like font and not based on simple character frequency associations.
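The experimental setup described here can be approximated in a few lines: render sentences as images, embed them with a CLIP image encoder, and train a linear probe on those embeddings. The checkpoint name, rendering details, and the probe are illustrative choices, not necessarily those used in the paper.

    import torch
    from PIL import Image, ImageDraw
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def render(sentence, size=(224, 224)):
        # Draw the sentence as black text on a white canvas.
        img = Image.new("RGB", size, "white")
        ImageDraw.Draw(img).text((8, 100), sentence, fill="black")
        return img

    sentences = ["I loved this movie.", "This was a terrible experience."]
    inputs = processor(images=[render(s) for s in sentences], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)   # one embedding per rendered sentence
    # A simple linear classifier trained on `feats` is then used to probe whether
    # the image embeddings carry the sentiment of the rendered text.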
LLM Vocabulary Compression for Low-Compute Environments
Sreeram Reddy Vennam,Anish R Joishy,Ponnurangam Kumaraguru
Neural Information Processing Systems Workshops, NeurIPS-W, 2024
@inproceedings{bib_LLM__2024, AUTHOR = {Sreeram Reddy Vennam, Anish R Joishy, Ponnurangam Kumaraguru}, TITLE = {LLM Vocabulary Compression for Low-Compute Environments}, BOOKTITLE = {Neural Information Processing Systems Workshops}. YEAR = {2024}}
We present a method to compress the final linear layer of language models, reducing memory usage by up to 3.4x without significant performance loss. By grouping tokens based on Byte Pair Encoding (BPE) merges, we prevent materialisation of the memory-intensive logits tensor. Evaluations on the TinyStories dataset show that our method performs on par with GPT-Neo and GPT-2 while significantly improving throughput by up to 3x, making it suitable for low-compute environments.
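The memory saving comes from never building a full vocabulary-sized logits matrix per position. A generic two-level output head of the kind sketched below predicts a token group and then a token within the group, so only n_groups + group_size logits are materialised. The grouping is assumed to come from BPE merges, and sharing one intra-group head across all groups is a simplification rather than the paper's exact design.

    import torch
    import torch.nn as nn

    class FactorizedLMHead(nn.Module):
        """Predict a token as (group id, index within group) instead of a flat softmax."""
        def __init__(self, dim, n_groups, group_size):
            super().__init__()
            self.group_head = nn.Linear(dim, n_groups)    # which BPE-derived group
            self.intra_head = nn.Linear(dim, group_size)  # which token inside the group

        def log_prob(self, h, group_id, intra_id):
            # h: (batch, dim); returns log p(group) + log p(index) per example.
            lg = torch.log_softmax(self.group_head(h), dim=-1)
            li = torch.log_softmax(self.intra_head(h), dim=-1)
            return lg.gather(-1, group_id[:, None]).squeeze(-1) + \
                   li.gather(-1, intra_id[:, None]).squeeze(-1)

    head = FactorizedLMHead(dim=256, n_groups=512, group_size=128)   # covers a ~65K vocab
    lp = head.log_prob(torch.randn(4, 256),
                       torch.tensor([3, 1, 0, 7]), torch.tensor([5, 2, 9, 0]))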
Random Representations Outperform Online Continually Learned Representations
Ameya Prabhu,Shiven Sinha,Ponnurangam Kumaraguru,Philip H.S. Torr,Ozan Sener,Puneet K. Dokania
Neural Information Processing Systems, NeurIPS, 2024
@inproceedings{bib_Rand_2024, AUTHOR = {Ameya Prabhu, Shiven Sinha, Ponnurangam Kumaraguru, Philip H.S. Torr, Ozan Sener, Puneet K. Dokania}, TITLE = {Random Representations Outperform Online Continually Learned Representations}, BOOKTITLE = {Neural Information Processing Systems}. YEAR = {2024}}
Continual learning has primarily focused on the issue of catastrophic forgetting and the associated stability-plasticity tradeoffs. However, little attention has been paid to the efficacy of continually learned representations, as representations are learned alongside classifiers throughout the learning process. Our primary contribution is empirically demonstrating that existing online continually trained deep networks produce inferior representations compared to a simple pre-defined random transform. Our approach embeds raw pixels using a fixed random transform, approximating an RBF kernel initialized before any data is seen. We then train a simple linear classifier on top without storing any exemplars, processing one sample at a time in an online continual learning setting. This method, called RanDumb, significantly outperforms state-of-the-art continually learned representations across all standard online continual learning benchmarks. Our study reveals the significant limitations of representation learning, particularly in low-exemplar and online continual learning scenarios. Extending our investigation to popular exemplar-free scenarios with pretrained models, we find that training only a linear classifier on top of pretrained representations surpasses most continual fine-tuning and prompt-tuning strategies. Overall, our investigation challenges the prevailing assumptions about effective representation learning in online continual learning.
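The recipe described above, a fixed random feature map plus an online linear classifier with no stored exemplars, can be sketched with scikit-learn. The input dimensionality, gamma, number of components, and the synthetic data stream are placeholders, and RanDumb's exact embedding differs in detail.

    import numpy as np
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import SGDClassifier

    d, n_classes = 784, 10                          # e.g. flattened 28x28 images
    embed = RBFSampler(gamma=1.0, n_components=2000, random_state=0)
    embed.fit(np.zeros((1, d)))                     # only fixes the random projection

    clf = SGDClassifier()                           # linear classifier trained online
    rng = np.random.default_rng(0)
    # Toy stream standing in for an online continual-learning sequence.
    stream = ((rng.standard_normal(d), int(rng.integers(n_classes))) for _ in range(100))
    for x, y in stream:
        z = embed.transform(x.reshape(1, -1))
        clf.partial_fit(z, [y], classes=np.arange(n_classes))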
Television Discourse Decoded: Comprehensive Multimodal Analytics at Scale
Anmol Agarwal,Pratyush Priyadarshi,Shiven Sinha,Shrey Gupta,Hitkul Jangra,Ponnurangam Kumaraguru,Kiran Garimella
ACM International Conference on Knowledge Discovery and Data Mining, KDD, 2024
@inproceedings{bib_Tele_2024, AUTHOR = {Anmol Agarwal, Pratyush Priyadarshi, Shiven Sinha, Shrey Gupta, Hitkul Jangra, Ponnurangam Kumaraguru, Kiran Garimella}, TITLE = {Television Discourse Decoded: Comprehensive Multimodal Analytics at Scale}, BOOKTITLE = {ACM International Conference on Knowledge Discovery and Data Mining}. YEAR = {2024}}
In this paper, we tackle the complex task of analyzing televised debates, with a focus on a prime time news debate show from India. Previous methods, which often relied solely on text, fall short in capturing the multimedia essence of these debates [27]. To address this gap, we introduce a comprehensive automated toolkit that employs advanced computer vision and speech-to-text techniques for large-scale multimedia analysis. Utilizing state-of-the-art computer vision algorithms and speech-to-text methods, we transcribe, diarize, and analyze thousands of YouTube videos of prime-time television debates in India. These debates are a central part of Indian media but have been criticized for compromised journalistic integrity and excessive dramatization [18]. Our toolkit provides concrete metrics to assess bias and incivility, capturing a comprehensive multimedia perspective that includes text, audio utterances, and video frames. Our findings reveal significant biases in topic selection and panelist representation, along with alarming levels of incivility. This work offers a scalable, automated approach for future research in multimedia analysis, with profound implications for the quality of public discourse and democratic debate. We will make our data analysis pipeline and collected data publicly available to catalyze further research in this domain.
Representation Surgery: Theory and Practice of Affine Steering
Shashwat Singh,Shauli Ravfogel,Ryan Cotterell,Jonathan Herzig,Roee Aharoni,Ponnurangam Kumaraguru
International Conference on Machine Learning, ICML, 2024
@inproceedings{bib_Repr_2024, AUTHOR = {Shashwat Singh, Shauli Ravfogel, Ryan Cotterell, Jonathan Herzig, Roee Aharoni, Ponnurangam Kumaraguru}, TITLE = {Representation Surgery: Theory and Practice of Affine Steering}, BOOKTITLE = {International Conference on Machine Learning}. YEAR = {2024}}
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformation of the neural language model's representations that alter its behavior. First, we derive two optimal, in the least-squares sense, affine steering functions under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.
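For intuition, the simplest affine steering function, a pure translation that moves the mean of undesired-class representations onto the desired-class mean, is sketched below. The paper derives optimal affine maps (W, b) under several constraints, of which this is only a special case, and the data here is synthetic.

    import torch

    def fit_mean_shift(H_undesired, H_desired):
        # Steering function h -> h + b with b chosen so the two class means coincide.
        b = H_desired.mean(dim=0) - H_undesired.mean(dim=0)
        return lambda h: h + b

    H_bad = torch.randn(200, 768) + 1.0     # toy representations of undesired generations
    H_good = torch.randn(200, 768) - 1.0    # toy representations of desired generations
    steer = fit_mean_shift(H_bad, H_good)
    print(torch.allclose(steer(H_bad).mean(0), H_good.mean(0), atol=1e-5))  # True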
Tight Sampling in Unbounded Networks
Kshitijaa Jaglan,Meher Chaitanya Pindiprolu,Triansh Sharma,Abhijeeth Reddy Singam,Nidhi Goyal,Ponnurangam Kumaraguru,Ulrik Brandes
International Conference on Web and Social Media, ICWSM, 2024
@inproceedings{bib_Tigh_2024, AUTHOR = {Kshitijaa Jaglan, Meher Chaitanya Pindiprolu, Triansh Sharma, Abhijeeth Reddy Singam, Nidhi Goyal, Ponnurangam Kumaraguru, Ulrik Brandes}, TITLE = {Tight Sampling in Unbounded Networks}, BOOKTITLE = {International Conference on Web and Social Media}. YEAR = {2024}}
The default approach to deal with the enormous size and limited accessibility of many Web and social media networks is to sample one or more subnetworks from a conceptually unbounded unknown network. Clearly, the extracted subnetworks will crucially depend on the sampling scheme. Motivated by studies of homophily and opinion formation, we propose a variant of snowball sampling designed to prioritize the inclusion of entire cohesive communities rather than any kind of representativeness, breadth, or depth of coverage. The method is illustrated on a concrete example, and experiments on synthetic networks suggest that it behaves as desired.
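As a baseline for comparison, plain snowball sampling is sketched below with networkx; the paper's variant instead prioritizes completing cohesive communities before expanding further, and that prioritization logic is not shown here.

    from collections import deque
    import networkx as nx

    def snowball_sample(G, seed, max_nodes=100):
        """Plain snowball sampling: expand outward from a seed node, adding whole
        neighborhoods breadth-first until the node budget is reached."""
        sampled, frontier = {seed}, deque([seed])
        while frontier and len(sampled) < max_nodes:
            u = frontier.popleft()
            for v in G.neighbors(u):
                if v not in sampled:
                    sampled.add(v)
                    frontier.append(v)
                    if len(sampled) >= max_nodes:
                        break
        return G.subgraph(sampled).copy()

    sub = snowball_sample(nx.karate_club_graph(), seed=0, max_nodes=20)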
Corrective Machine Unlearning
Shashwat Goel,Ameya Prabhu,Philip Torr,Ponnurangam Kumaraguru,Amartya Sanyal
Transactions on Machine Learning Research, Trans Mach Learn Res, 2024
@inproceedings{bib_Corr_2024, AUTHOR = {Shashwat Goel, Ameya Prabhu, Philip Torr, Ponnurangam Kumaraguru, Amartya Sanyal}, TITLE = {Corrective Machine Unlearning}, BOOKTITLE = {Transactions on Machine Learning Research}. YEAR = {2024}}
SaGE: Evaluating Moral Consistency in Large Language Models
Vamshi Krishna Bonagiri,Sreeram Reddy Vennam,Priyanshul Govil,Ponnurangam Kumaraguru,Manas Gaur
International Conference on Computational Linguistics, COLING, 2024
@inproceedings{bib_SaGE_2024, AUTHOR = {Vamshi Krishna Bonagiri, Sreeram Reddy Vennam, Priyanshul Govil, Ponnurangam Kumaraguru, Manas Gaur}, TITLE = {SaGE: Evaluating Moral Consistency in Large Language Models}, BOOKTITLE = {International Conference on Computational Linguistics}. YEAR = {2024}}
Despite recent advancements showcasing the impressive capabilities of Large Language Models (LLMs) in conversational systems, we show that even state-of-the-art LLMs are morally inconsistent in their generations, questioning their reliability (and trustworthiness in general). Prior works in LLM evaluation focus on developing ground-truth data to measure accuracy on specific tasks. However, for moral scenarios that often lack universally agreed-upon answers, consistency in model responses becomes crucial for their reliability. To address this issue, we propose an information-theoretic measure called Semantic Graph Entropy (SaGE), grounded in the concept of "Rules of Thumb" (RoTs), to measure a model's moral consistency. RoTs are abstract principles learned by a model and can help explain their decision-making strategies effectively. To this end, we construct the Moral Consistency Corpus (MCC), containing 50K moral questions, responses to them by LLMs, and the RoTs that these models followed. Furthermore, to illustrate the generalizability of SaGE, we use it to investigate LLM consistency on two popular datasets, TruthfulQA and HellaSwag. Our results reveal that task accuracy and consistency are independent problems, and there is a dire need to investigate these issues further.
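At its core, the consistency measure asks how spread out a model's answers to paraphrases of the same question are. The toy sketch below computes an entropy over clusters of semantically equivalent responses; the clustering itself, the graph construction, and the role of RoTs in SaGE are left out.

    from collections import Counter
    import numpy as np

    def cluster_entropy(labels):
        """Entropy (in bits) of responses over semantic clusters.
        labels[i] is the cluster id of the i-th response to a paraphrased prompt;
        lower entropy means the model answers more consistently."""
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        ent = -(p * np.log2(p)).sum()
        return float(ent) if ent > 0 else 0.0

    print(cluster_entropy([0, 0, 0, 0, 0]))   # 0.0  -> all answers in one cluster
    print(cluster_entropy([0, 1, 2, 3, 4]))   # ~2.32 -> five distinct answers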
Measuring Moral Inconsistencies in Large Language Models
Vamshi Krishna Bonagiri,Sreeram Reddy Vennam,Manas Gaur,Ponnurangam Kumaraguru
EMNLP Workshop, EMNLP-W, 2024
@inproceedings{bib_Meas_2024, AUTHOR = {Vamshi Krishna Bonagiri, Sreeram Reddy Vennam, Manas Gaur, Ponnurangam Kumaraguru}, TITLE = {Measuring Moral Inconsistencies in Large Language Models}, BOOKTITLE = {EMNLP Workshop}. YEAR = {2024}}
A Large Language Model (LLM) is considered consistent if semantically equivalent prompts produce semantically equivalent responses. Despite recent advancements showcasing the impressive capabilities of LLMs in conversational systems, we show that even state-of-the-art LLMs are highly inconsistent in their generations, questioning their reliability. Prior research has tried to measure this with task-specific accuracy. However, this approach is unsuitable for moral scenarios, such as the trolley problem, with no "correct" answer. To address this issue, we propose a novel information-theoretic measure called Semantic Graph Entropy (SGE) to measure the consistency of an LLM in moral scenarios. We leverage "Rules of Thumb" (RoTs) to explain a model's decision-making strategies and further enhance our metric. Compared to existing consistency metrics, SGE correlates better with human judgments across five LLMs. In the future, we aim to investigate the root causes of LLM inconsistencies and propose improvements.
CAFIN: Centrality Aware Fairness inducing IN-processing for Unsupervised Representation Learning on Graphs
Arvindh A,Aakash Aanegola,Amul Agrawal,Ramasuri Narayanam,Ponnurangam Kumaraguru
European Conference on Artificial Intelligence, ECAI, 2023
@inproceedings{bib_CAFI_2023, AUTHOR = {Arvindh A, Aakash Aanegola, Amul Agrawal, Ramasuri Narayanam, Ponnurangam Kumaraguru}, TITLE = {CAFIN: Centrality Aware Fairness inducing IN-processing for Unsupervised Representation Learning on Graphs}, BOOKTITLE = {European Conference on Artificial Intelligence}. YEAR = {2023}}
Unsupervised Representation Learning on graphs is gaining traction due to the increasing abundance of unlabelled network data and the compactness, richness, and usefulness of the representations generated. In this context, the need to consider fairness and bias constraints while generating the representations has been well-motivated and studied to some extent in prior works. One major limitation of most of the prior works in this setting is that they do not aim to address the bias generated due to connectivity patterns in the graphs, such as varied node centrality, which leads to a disproportionate performance across nodes. In our work, we aim to address this issue of mitigating bias due to inherent graph structure in an unsupervised setting. To this end, we propose CAFIN, a centrality-aware fairness-inducing framework that leverages the structural information of graphs to tune the representations generated by existing frameworks. We deploy it on GraphSAGE (a popular framework in this domain) and showcase its efficacy on two downstream tasks – Node Classification and Link Prediction. Empirically, CAFIN consistently reduces the performance disparity across popular datasets (varying from 18 to 80% reduction in performance disparity) from various domains while incurring only a minimal cost of fairness.
Explaining Finetuned Transformers on Hate Speech Predictions Using Layerwise Relevance Propagation
Ritwik Mishra,Ajeet Yadav,Rajiv Ratn Shah,Ponnurangam Kumaraguru
International Conference on Big Data Analytics, BDA, 2023
@inproceedings{bib_Expl_2023, AUTHOR = {Ritwik Mishra, Ajeet Yadav, Rajiv Ratn Shah, Ponnurangam Kumaraguru}, TITLE = {Explaining Finetuned Transformers on Hate Speech Predictions Using Layerwise Relevance Propagation}, BOOKTITLE = {International Conference on Big Data Analytics}. YEAR = {2023}}
Towards Adversarial Evaluations of Inexact Machine Unlearning
Shashwat Goel,Ameya Prabhu,Amartya Sanyal,Ser-Nam Lim,Phillip Torr,Ponnurangam Kumaraguru
Technical Report, arXiv, 2023
@inproceedings{bib_Towa_2023, AUTHOR = {Shashwat Goel, Ameya Prabhu, Amartya Sanyal, Ser-Nam Lim, Phillip Torr, Ponnurangam Kumaraguru}, TITLE = {Towards Adversarial Evaluations of Inexact Machine Unlearning}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
Machine Learning models face increased concerns regarding the storage of personal user data and adverse impacts of corrupted data like backdoors or systematic bias. Machine Unlearning can address these by allowing post-hoc deletion of affected training data from a learned model. Achieving this task exactly is computationally expensive; consequently, recent works have proposed inexact unlearning algorithms to solve this approximately as well as evaluation methods to test the effectiveness of these algorithms. In this work, we first outline some necessary criteria for evaluation methods and show no existing evaluation satisfies them all. Then, we design a stronger black-box evaluation method called the Interclass Confusion (IC) test which adversarially manipulates data during training to detect the insufficiency of unlearning procedures. We also propose two analytically motivated baseline methods (EU-k and CF-k) which outperform several popular inexact unlearning methods. Overall, we demonstrate how adversarial evaluation strategies can help in analyzing various unlearning phenomena which can guide the development of stronger unlearning algorithms.
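The spirit of such an adversarial evaluation can be sketched as follows, with details (manipulation, metric, models) simplified relative to the paper: interchange the labels of two classes inside a designated forget set before training, then check how much interclass confusion remains after unlearning that set.

    import numpy as np

    def make_confused_forget_set(y, class_a, class_b, rng):
        """Swap labels a<->b for a random half of their training examples."""
        y = y.copy()
        idx = np.where((y == class_a) | (y == class_b))[0]
        forget = rng.choice(idx, size=len(idx) // 2, replace=False)
        y[forget] = np.where(y[forget] == class_a, class_b, class_a)
        return y, forget

    def interclass_confusion(y_true, y_pred, class_a, class_b):
        """Fraction of a/b test examples predicted as the *other* class."""
        mask = (y_true == class_a) | (y_true == class_b)
        swapped = ((y_true == class_a) & (y_pred == class_b)) | \
                  ((y_true == class_b) & (y_pred == class_a))
        return swapped[mask].mean()

    rng = np.random.default_rng(0)
    y_train = rng.integers(0, 10, size=1000)
    y_poisoned, forget_idx = make_confused_forget_set(y_train, 3, 7, rng)

    # A model that still mixes up classes 3 and 7 after "unlearning" forget_idx
    # scores high; a successful method should approach the retrain-from-scratch level.
    y_test = rng.integers(0, 10, size=200)
    y_pred_bad = np.where(np.isin(y_test, [3, 7]), 10 - y_test, y_test)
    print(interclass_confusion(y_test, y_pred_bad, 3, 7))  # 1.0: unlearning failed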
Probing Negation in Language Models
Shashwat Singh,Shashwat Goel,Saujas Srinivasa Vaduguru,Ponnurangam Kumaraguru
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2023
@inproceedings{bib_Prob_2023, AUTHOR = {Shashwat Singh, Shashwat Goel, Saujas Srinivasa Vaduguru, Ponnurangam Kumaraguru}, TITLE = {Probing Negation in Language Models}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2023}}
Prior work has shown that pretrained language models often make incorrect predictions for negated inputs. The reason for this behaviour has remained unclear. It has been argued that since language models (LMs) don’t change their predictions about factual propositions under negation, they might not detect negation. We show encoder LMs do detect negation as their representations across layers reliably distinguish negated inputs from non-negated inputs, and when negation leads to contradictions. However, probing experiments show that these LMs indeed don’t use negation when evaluating whether a factual statement is true, even when fine-tuned with the objective of changing outputs on negated sentences (Hosseini et al., 2021). We hypothesize about why pretrained LMs are inconsistent under negation: when the statement could refer to multiple ground entities with conflicting properties, negation may not entail a change in output. This means negation minimal pairs in different training samples can have the same completion in pretraining corpora. We argue pretraining may not provide enough signal to learn the distribution of ground referents a token could have, confusing the LM on how to handle negation.
Blind Leading the Blind: A Social-Media Analysis of the Tech Industry
Tanishq Chaudhary,Pulak Malhotra,Radhika Mamidi,Ponnurangam Kumaraguru
International Conference on Natural Language Processing., ICON, 2023
@inproceedings{bib_Blin_2023, AUTHOR = {Tanishq Chaudhary, Pulak Malhotra, Radhika Mamidi, Ponnurangam Kumaraguru}, TITLE = {Blind Leading the Blind: A Social-Media Analysis of the Tech Industry}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2023}}
Representation Learning for Identifying Depression Causes in Social Media
Priyanshul Govil,Vamshi Krishna Bonagiri,Muskan Garg,Ponnurangam Kumaraguru
KNOWLEDGE DISCOVERY AND DATA MINING WORKSHOPS, KDD-W, 2023
@inproceedings{bib_Repr_2023, AUTHOR = {Priyanshul Govil, Vamshi Krishna Bonagiri, Muskan Garg, Ponnurangam Kumaraguru}, TITLE = {Representation Learning for Identifying Depression Causes in Social Media}, BOOKTITLE = {KNOWLEDGE DISCOVERY AND DATA MINING WORKSHOPS}. YEAR = {2023}}
Social media provides a supportive and anonymous environment for discussing mental health issues, including depression. Existing research on identifying the cause of depression focuses primarily on improving classifier models, while neglecting the importance of learning better data representations. To address this gap, we introduce an architecture that enhances the identification of the cause of depression by learning improved data representations. Our work enables a deeper interpretation of the cause of depression in social media contexts, emphasizing the significance of effective representation learning for this task. Our work can act as a foundation for self-help applications in the field of mental health.
Tight Sampling in Unbounded Networks
Kshitijaa Jaglan,Meher Chaitanya,Triansh Sharma,Abhijeeth Reddy Singam,Nidhi Goyal,Ponnurangam Kumaraguru,Ulrik Brandes
Computing Research Repository, CoRR, 2023
@inproceedings{bib_Tigh_2023, AUTHOR = {Kshitijaa Jaglan, Meher Chaitanya, Triansh Sharma, Abhijeeth Reddy Singam, Nidhi Goyal, Ponnurangam Kumaraguru, Ulrik Brandes}, TITLE = {Tight Sampling in Unbounded Networks}, BOOKTITLE = {Computing Research Repository}. YEAR = {2023}}
The default approach to deal with the enormous size and limited accessibility of many Web and social media networks is to sample one or more subnetworks from a conceptually unbounded unknown network. Clearly, the extracted subnetworks will crucially depend on the sampling scheme. Motivated by studies of homophily and opinion formation, we propose a variant of snowball sampling designed to prioritize inclusion of entire cohesive communities rather than any kind of representativeness, breadth, or depth of coverage. The method is illustrated on a concrete example, and experiments on synthetic networks suggest that it behaves as desired.
Exploring Graph Neural Networks for Indian Legal Judgment Prediction
Mann Khatri,Mirza Yusuf,Rajiv Ratn Shah,Ponnurangam Kumaraguru
Technical Report, arXiv, 2023
@inproceedings{bib_Expl_2023, AUTHOR = {Mann Khatri, Mirza Yusuf, Rajiv Ratn Shah, Ponnurangam Kumaraguru}, TITLE = {Exploring Graph Neural Networks for Indian Legal Judgment Prediction}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
The burdensome impact of a skewed judges-to-cases ratio on the judicial system manifests in an overwhelming backlog of pending cases alongside an ongoing influx of new ones. To tackle this issue and expedite the judicial process, the proposition of an automated system capable of suggesting case outcomes based on factual evidence and precedent from past cases gains significance. This research paper centres on developing a graph neural network-based model to address the Legal Judgment Prediction (LJP) problem, recognizing the intrinsic graph structure of judicial cases and making it a binary node classification problem. We explored various embeddings as model features, while nodes such as time nodes and judicial acts were added and pruned to evaluate the model's performance. The study also considers the ethical dimension of fairness in these predictions, examining gender and name biases. A link prediction task is also conducted to assess the model's proficiency in anticipating connections between two specified nodes. By harnessing the capabilities of graph neural networks and incorporating fairness analyses, this research aims to contribute insights towards streamlining the adjudication process, enhancing judicial efficiency, and fostering a more equitable legal landscape, ultimately alleviating the strain imposed by mounting case backlogs. Our best-performing model, which uses XLNet pre-trained embeddings as its features, achieves a macro F1 score of 75% on the LJP task. For link prediction, the same feature set performs best, giving an ROC of more than 80%.
Are Models Trained on Indian Legal Data Fair?
Sahil Girhepuje,Anmol Goel,Gokul S Krishnan,Shreya Goyal,Satyendra Pandey,Ponnurangam Kumaraguru,Balaraman Ravindran
Technical Report, arXiv, 2023
@inproceedings{bib_Are__2023, AUTHOR = {Sahil Girhepuje, Anmol Goel, Gokul S Krishnan, Shreya Goyal, Satyendra Pandey, Ponnurangam Kumaraguru, Balaraman Ravindran}, TITLE = {Are Models Trained on Indian Legal Data Fair?}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
Recent advances and applications of language technology and artificial intelligence have enabled much success across multiple domains such as law, medicine, and mental health. AI-based Language Models, like Judgement Prediction, have recently been proposed for the legal sector. However, these models are rife with encoded social biases picked up from the training data. While bias and fairness have been studied across NLP, most studies primarily locate themselves within a Western context. In this work, we present an initial investigation of fairness from the Indian perspective in the legal domain. We highlight the propagation of learnt algorithmic biases in the bail prediction task for models trained on Hindi legal documents. We evaluate the fairness gap using demographic parity and show that a decision tree model trained for the bail prediction task has an overall fairness disparity of 0.237 between input features associated with Hindus and Muslims. Additionally, we highlight the need for further research and studies in the avenues of fairness/bias in applying AI in the legal sector with a specific focus on the Indian context.
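For reference, the demographic parity gap the abstract reports can be computed as the absolute difference in positive-prediction rates between the two groups; the snippet below uses synthetic binary predictions and a synthetic group attribute, and the 0.237 figure comes from the paper, not from this code.

    import numpy as np

    def demographic_parity_gap(y_pred, group):
        """|P(y_pred = 1 | group = 1) - P(y_pred = 1 | group = 0)|."""
        y_pred, group = np.asarray(y_pred), np.asarray(group)
        return abs(y_pred[group == 1].mean() - y_pred[group == 0].mean())

    rng = np.random.default_rng(42)
    group = rng.integers(0, 2, size=1000)                       # 0/1 protected attribute
    y_pred = rng.binomial(1, np.where(group == 1, 0.6, 0.4))    # biased toy bail predictor
    print(round(demographic_parity_gap(y_pred, group), 3))      # roughly 0.2 on this toy data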
Towards Effective Paraphrasing for Information Disguise
Anmol Agarwal,Shrey Gupta,Vamshi Krishna Bonagiri,Manas Gaur,Joseph Reagle,Ponnurangam Kumaraguru
European Conference on Information Retrieval, ECIR, 2023
@inproceedings{bib_Towa_2023, AUTHOR = {Anmol Agarwal, Shrey Gupta, Vamshi Krishna Bonagiri, Manas Gaur, Joseph Reagle, Ponnurangam Kumaraguru}, TITLE = {Towards Effective Paraphrasing for Information Disguise}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2023}}
Information Disguise (ID), a part of computational ethics in Natural Language Processing (NLP), is concerned with best practices of textual paraphrasing to prevent the non-consensual use of authors’ posts on the Internet. Research on ID becomes important when authors’ written online communication pertains to sensitive domains, e.g., mental health. Over time, researchers have utilized AI-based automated word spinners (e.g., SpinRewriter, WordAI) for paraphrasing content. However, these tools fail to satisfy the purpose of ID as their paraphrased content still leads to the source when queried on search engines. There is limited prior work on judging the effectiveness of paraphrasing methods for ID on search engines or their proxies, neural retriever (NeurIR) models. We propose a framework where, for a given sentence from an author’s post, we perform iterative perturbation on the sentence in the direction of paraphrasing with an attempt to confuse the search mechanism of a NeurIR system when the sentence is queried on it. Our experiments involve the subreddit “r/AmItheAsshole” as the source of public content and Dense Passage Retriever as a NeurIR system-based proxy for search engines. Our work introduces a novel method of phrase-importance rankings using perplexity scores and involves multi-level phrase substitutions via beam search. Our multi-phrase substitution scheme succeeds in disguising sentences 82% of the time and hence takes an essential step towards enabling researchers to disguise sensitive content effectively before making it public. We also release the code of our approach.
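A rough sketch of the phrase-importance step described above, assuming access to a language-model perplexity scorer; perplexity() below is a hypothetical stand-in (average token length), so only the ranking logic, not the scores, is meaningful.

    import re

    def perplexity(text):
        # Hypothetical stand-in scorer; a real implementation would query an LM.
        words = text.split()
        return sum(len(w) for w in words) / max(len(words), 1)

    def rank_phrases_by_importance(sentence, phrases):
        """Score each phrase by how much deleting it shifts the sentence's perplexity."""
        base = perplexity(sentence)
        scores = {}
        for phrase in phrases:
            reduced = re.sub(re.escape(phrase), "", sentence).strip()
            scores[phrase] = abs(perplexity(reduced) - base)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    sentence = "my roommate borrowed my car without asking and returned it damaged"
    phrases = ["my roommate", "borrowed my car", "returned it damaged"]
    print(rank_phrases_by_importance(sentence, phrases))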
Social Re-Identification Assisted RTO Detection for E-Commerce
Hitkul Jangra,Abinaya.K,SOHAM SAHA,Satyajit Banerjee,Muthusamy Chelliah,Ponnurangam Kumaraguru
Companion Proceedings of the ACM Web Conference, WWW- companion, 2023
@inproceedings{bib_Soci_2023, AUTHOR = {Hitkul Jangra, Abinaya.K, SOHAM SAHA, Satyajit Banerjee, Muthusamy Chelliah, Ponnurangam Kumaraguru}, TITLE = {Social Re-Identification Assisted RTO Detection for E-Commerce}, BOOKTITLE = {Companion Proceedings of the ACM Web Conference}. YEAR = {2023}}
E-commerce features like easy cancellations, returns, and refunds can be exploited by bad actors or uninformed customers, leading to revenue loss for the organization. One such problem faced by e-commerce platforms is Return To Origin (RTO), where the user cancels an order while it is in transit for delivery. In such a scenario, the platform faces logistics and opportunity costs. Traditionally, models trained on historical trends are used to predict the propensity of an order becoming RTO. Sociology literature has highlighted clear correlations between socio-economic indicators and users’ tendency to exploit systems to gain financial advantage. Social media profiles have information about location, education, and profession which have been shown to be an estimator of socio-economic condition. We believe combining social media data with e-commerce information can lead to improvements in a variety of tasks like RTO …
CiteCaseLAW: Citation Worthiness Detection in Caselaw for Legal Assistive Writing
Mann Khatri,Gitansh Satija,Reshma Sheik,Yaman Kumar,Rajiv Ratn Shah,Ponnurangam Kumaraguru
Technical Report, arXiv, 2023
@inproceedings{bib_Cite_2023, AUTHOR = {Mann Khatri, Gitansh Satija, Reshma Sheik, Yaman Kumar, Rajiv Ratn Shah, Ponnurangam Kumaraguru}, TITLE = {CiteCaseLAW: Citation Worthiness Detection in Caselaw for Legal Assistive Writing}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
In legal document writing, one of the key elements is properly citing the case laws and other sources to substantiate claims and arguments. Understanding the legal domain and identifying appropriate citation context or cite-worthy sentences are challenging tasks that demand expensive manual annotation. The presence of jargon, language semantics, and high domain specificity makes legal language complex, making any associated legal task hard for automation. The current work focuses on the problem of citation-worthiness identification. It is designed as the initial step in today's citation recommendation systems to lighten the burden of extracting an adequate set of citation contexts. To accomplish this, we introduce a labeled dataset of 178M sentences for citation-worthiness detection in the legal domain from the Caselaw Access Project (CAP). The performance of various deep learning models was examined on this novel dataset. The domain-specific pre-trained model tends to outperform other models, with an 88% F1-score for the citation-worthiness detection task.
Effect of Feedback on Drug Consumption Disclosures on Social Media
Hitkul Jangra,Rajiv Shah,Ponnurangam Kumaraguru
International Conference on Web and Social Media, ICWSM, 2023
@inproceedings{bib_Effe_2023, AUTHOR = {Hitkul Jangra, Rajiv Shah, Ponnurangam Kumaraguru}, TITLE = {Effect of Feedback on Drug Consumption Disclosures on Social Media}, BOOKTITLE = {International Conference on Web and Social Media}. YEAR = {2023}}
Deaths due to drug overdose in the US have doubled in the last decade. Drug-related content on social media has also exploded in the same time frame. The pseudo-anonymous nature of social media platforms enables users to discourse about taboo and sometimes illegal topics like drug consumption. User-generated content (UGC) about drugs on social media can be used as an online proxy to detect offline drug consumption. UGC also gets exposed to the praise and criticism of the community. The law of effect proposes that positive reinforcement on an experience can incentivize the users to engage in the experience repeatedly. Therefore, we hypothesize that positive community feedback on a user's online drug consumption disclosure will increase the probability of the user doing an online drug consumption disclosure post again. To this end, we collect data from 10 drug-related subreddits. First, we build a deep learning model to classify UGC as indicative of drug consumption offline or not, and analyze the extent of such activities. Further, we use matching-based causal inference techniques to unravel community feedback's effect on users' future drug consumption behavior. We discover that 84% of posts and 55% of comments on drug-related subreddits indicate real-life drug consumption. Users who get positive feedback generate up to two times more drug consumption content in the future. Finally, we conducted an anonymous user study on drug-related subreddits to compare members' opinions with our experimental findings and show that users tend to underestimate the effect community peers can have on their decision to interact with drugs.
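The matching idea can be illustrated with a toy estimator: pair each user who received positive feedback with the most similar user who did not, and average the outcome differences. The data, covariates, and effect size below are simulated, not taken from the paper.

    import numpy as np

    def matched_att(X, treated, outcome):
        """Nearest-neighbour matching on covariates X; returns an ATT estimate."""
        X, treated, outcome = map(np.asarray, (X, treated, outcome))
        t_idx, c_idx = np.where(treated == 1)[0], np.where(treated == 0)[0]
        diffs = []
        for i in t_idx:
            d = np.linalg.norm(X[c_idx] - X[i], axis=1)
            j = c_idx[np.argmin(d)]                 # closest untreated user
            diffs.append(outcome[i] - outcome[j])
        return float(np.mean(diffs))

    rng = np.random.default_rng(7)
    X = rng.normal(size=(500, 3))                   # e.g. karma, account age, past posts
    treated = rng.binomial(1, 0.4, size=500)        # 1 = received positive feedback
    outcome = 2 + X[:, 2] + 1.5 * treated + rng.normal(size=500)   # simulated effect = 1.5
    print(round(matched_att(X, treated, outcome), 2))              # lands near 1.5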
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents
Mehrad Moradshahi,Tianhao Shen,Kalika Bali,Monojit Choudhury,Gaël de Chalendar,Anmol Goel,Kodali Prashant,Ponnurangam Kumaraguru,Manish Shrivastava
Technical Report, arXiv, 2023
@inproceedings{bib_X-Ri_2023, AUTHOR = {Mehrad Moradshahi, Tianhao Shen, Kalika Bali, Monojit Choudhury, Gaël De Chalendar, Anmol Goel, Kodali Prashant, Ponnurangam Kumaraguru, Manish Shrivastava}, TITLE = {X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with
PrecogIIITH@WASSA2023: Emotion Detection for Urdu-English Code-mixed Text
Vedula Bhaskara Hanuma,Kodali Prashant,Manish Shrivastava,Ponnurangam Kumaraguru
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA, 2023
@inproceedings{bib_Prec_2023, AUTHOR = {Vedula Bhaskara Hanuma, Kodali Prashant, Manish Shrivastava, Ponnurangam Kumaraguru}, TITLE = {PrecogIIITH@WASSA2023: Emotion Detection for Urdu-English Code-mixed Text}, BOOKTITLE = {Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis}. YEAR = {2023}}
Code-mixing refers to the phenomenon of using two or more languages interchangeably within a speech or discourse context. This practice is particularly prevalent on social media platforms, and determining the embedded affects in a code-mixed sentence remains a challenging problem. In this submission, we describe our system for the WASSA 2023 Shared Task on Emotion Detection in English-Urdu code-mixed text. In our system, we implement a multiclass emotion detection model with a label space of 11 emotions. Samples are code-mixed English-Urdu text, where Urdu is written in romanised form. Our submission is limited to one of the subtasks, Multi Class classification, and we leverage transformer-based Multilingual Large Language Models (MLLMs), XLM-RoBERTa and Indic-BERT. We fine-tune MLLMs on the released data splits, with and without pre-processing steps (translation to English), for classifying texts into the appropriate emotion category. Our methods did not surpass the baseline, and our submission is ranked sixth overall.
“Help! I need some music!”: Analysing music discourse & depression on Reddit
Bhavyajeet Singh,Kunal Mukesh Vaswani,Paruchuri Venkata Surya Sreeharsha,Suvi Saarikallio,Ponnurangam Kumaraguru,Vinoo A R
Plos One, Plos One, 2023
@inproceedings{bib_“H_2023, AUTHOR = {Bhavyajeet Singh, Kunal Mukesh Vaswani, Paruchuri Venkata Surya Sreeharsha, Suvi Saarikallio, Ponnurangam Kumaraguru, Vinoo A R}, TITLE = {“Help! I need some music!”: Analysing music discourse & depression on Reddit}, BOOKTITLE = {Plos One}. YEAR = {2023}}
Individuals choose varying music listening strategies to fulfill particular mood-regulation goals. However, ineffective musical choices and a lack of cognizance of the effects thereof can be detrimental to their well-being and may lead to adverse outcomes like anxiety or depression. In our study, we use the social media platform Reddit to perform a large-scale analysis to unearth the several music-mediated mood-regulation goals that individuals opt for in the context of depression. A mixed-methods approach involving natural language processing techniques followed by qualitative analysis was performed on all music-related posts to identify the various music-listening strategies and group them into healthy and unhealthy associations. Analysis of the music content (acoustic features and lyrical themes) accompanying healthy and unhealthy associations showed significant differences. Individuals resorting to unhealthy strategies gravitate towards low-valence tracks. Moreover, lyrical themes associated with unhealthy strategies incorporated tracks with low optimism, high blame, and high self-reference. Our findings demonstrate that being mindful of the objectives of using music, the subsequent effects thereof, and aligning both for well-being outcomes is imperative for a comprehensive understanding of the effectiveness of music.
JobXMLC: EXtreme Multi-Label Classification of Job Skills with Graph Neural Networks
Nidhi Goyal,Jushaan Singh Kalra,Charu Sharma,Raghava Mutharaju,Niharika Sachdeva,Ponnurangam Kumaraguru
Conference of the European Chapter of the Association for Computational Linguistics (EACL), EACL, 2023
@inproceedings{bib_JobX_2023, AUTHOR = {Nidhi Goyal, Jushaan Singh Kalra, Charu Sharma, Raghava Mutharaju, Niharika Sachdeva, Ponnurangam Kumaraguru}, TITLE = {JobXMLC: EXtreme Multi-Label Classification of Job Skills with Graph Neural Networks}, BOOKTITLE = {Conference of the European Chapter of the Association for Computational Linguistics (EACL)}. YEAR = {2023}}
Writing a good job description is an important step in the online recruitment process to hire the best candidates. Most recruiters forget to include some relevant skills in the job description. These missing skills affect the performance of recruitment tasks such as job suggestions, job search, candidate recommendations, etc. Existing approaches are limited to contextual modelling, do not exploit inter-relational structures like job-job and job-skill relationships, and are not scalable. In this paper, we exploit these structural relationships using a graph-based approach. We propose a novel skill prediction framework called JobXMLC, which uses graph neural networks with skill attention to predict missing skills using job descriptions. JobXMLC enables joint learning over a job-skill graph consisting of 22.8K entities (jobs and skills) and 650K relationships. We experiment with real-world recruitment datasets to evaluate our proposed approach. We train JobXMLC on 20,298 jobs and 2,548 skills within 30 minutes on a single GPU machine. JobXMLC outperforms the state-of-the-art approaches by 6% on precision and 3% on recall. JobXMLC is 18X faster for training tasks and up to 634X faster in skill prediction on benchmark datasets enabling JobXMLC to scale up on larger datasets. We have made our code and dataset public at https://precog.iiit.ac.in/resources.html.
Warning: It’s a scam!! Towards understanding the Employment Scams using Knowledge Graphs
Nidhi Goyal,Niharika Sachdeva,Radhika Mamidi,Ponnurangam Kumaraguru
Joint International Conference on Data Science & Management of Data, CODS-COMAD, 2023
@inproceedings{bib_Warn_2023, AUTHOR = {Nidhi Goyal, Niharika Sachdeva, Radhika Mamidi, Ponnurangam Kumaraguru}, TITLE = {Warning: It’s a scam!! Towards understanding the Employment Scams using Knowledge Graphs}, BOOKTITLE = {Joint International Conference on Data Science & Management of Data}. YEAR = {2023}}
Employment scams, such as scapegoat positions, clickbait and non-existing jobs, etc., are among the top five scams registered over online platforms. Generally, scam complaints contain heterogeneous information (money, location, employment type, organization, email, and phone number), which can provide critical insights for appropriate interventions to avoid scams. Despite substantial efforts to analyze employment scams, integrating relevant scam-related information in structured form remains unexplored. In this work, we extract this information and construct a large-scale Employment Scam Knowledge Graph consisting of 0.1M entities and 0.2M relationships. Our findings include discovering different modes of employment scams, entities, and relationships among entities to alert job seekers. We plan to extend this work by utilizing the knowledge graph to identify and avoid potential scams in the future.
A Suspect Identification Framework Using Contrastive Relevance Feedback
Drishti Bhasin,Ponnurangam Kumaraguru,Rajiv Ratn Shah
Winter Conference on Applications of Computer Vision, WACV, 2023
@inproceedings{bib_A_Su_2023, AUTHOR = {Drishti Bhasin, Ponnurangam Kumaraguru, Rajiv Ratn Shah}, TITLE = {A Suspect Identification Framework Using Contrastive Relevance Feedback}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2023}}
Suspect Identification is one of the most pivotal aspects of a forensic and criminal investigation. A significant amount of time and skill is devoted to creating sketches, which require a fair amount of recollection from the witness to be useful. We devise a method that aims to automate the process of suspect identification and model this problem by iteratively retrieving images from feedback provided by the user. Compared to standard image retrieval tasks, interactive facial image retrieval is specifically more challenging due to the high subjectivity involved in describing a person’s facial attributes and appropriately evolving with the preferences put forward by the user. Our method uses a relatively simpler form of supervision by utilizing the user’s feedback to label images as either similar or dissimilar to their mental image of the suspect, based on which we propose a loss function using the contrastive learning paradigm that is optimized in an online fashion. We validate the efficacy of our proposed approach using a carefully designed testbed to simulate user feedback and a large-scale user study. We empirically show that our method iteratively improves personalization, leading to faster convergence and enhanced recommendation relevance, thereby improving user satisfaction. Our proposed framework is being designed for real-time use in the metropolitan crime investigation department, and thus is also equipped with a user-friendly web interface with a real-time experience for suspect retrieval.
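A loose sketch of online contrastive feedback of this kind (not the paper's exact loss or architecture): keep a query embedding, pull it toward gallery images the user marks as similar to the suspect, push it away from dissimilar ones, and re-rank the gallery. All embeddings here are random stand-ins.

    import torch
    import torch.nn.functional as F

    def feedback_step(query, pos, neg, lr=0.1, margin=0.2):
        """One online update of the query embedding from similar/dissimilar feedback."""
        query = query.clone().requires_grad_(True)
        pos_sim = F.cosine_similarity(query.unsqueeze(0), pos).mean()
        neg_sim = F.cosine_similarity(query.unsqueeze(0), neg).mean()
        loss = F.relu(margin + neg_sim - pos_sim)   # hinge-style contrastive objective
        loss.backward()
        with torch.no_grad():
            query -= lr * query.grad
        return query.detach()

    torch.manual_seed(0)
    gallery = F.normalize(torch.randn(100, 64), dim=1)   # toy face embeddings
    query = F.normalize(torch.randn(64), dim=0)
    pos, neg = gallery[:3], gallery[3:6]                 # user-labelled feedback
    query = feedback_step(query, pos, neg)
    ranking = torch.argsort(F.cosine_similarity(query.unsqueeze(0), gallery), descending=True)
    print(ranking[:5])                                   # gallery indices now ranked closest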
EEG Based Stress Classification in Response to Stress Stimulus
Nishtha Phutela,Devanjali Relan,Goldie Gabrani,Ponnurangam Kumaraguru
International Conference on Artificial Intelligence and Speech Technology, AIST, 2022
@inproceedings{bib_EEG__2022, AUTHOR = {Nishtha Phutela, Devanjali Relan, Goldie Gabrani, Ponnurangam Kumaraguru}, TITLE = {EEG Based Stress Classification in Response to Stress Stimulus}, BOOKTITLE = {International Conference on Artificial Intelligence and Speech Technology}. YEAR = {2022}}
Stress, either physical or mental, is experienced by almost every person at some point in their lifetime. Stress is one of the leading causes of various diseases and burdens society globally. Stress badly affects an individual's well-being. Thus, stress-related study is an emerging field, and in the past decade, a lot of attention has been given to the detection and classification of stress. The estimation of stress in the individual helps in stress management before it invades the human mind and body. In this paper, we propose a system for the detection and classification of stress. We compare various machine learning algorithms for stress classification using EEG signal recordings. Interaxon Muse device having four dry electrodes
Stress Classification Using Brain Signals Based on LSTM Network
Nishtha Phutela,Devanjali Relan,Goldie Gabrani,Ponnurangam Kumaraguru, Mesay Samuel
Computational Intelligence and Neuroscience, CIN, 2022
@inproceedings{bib_Stre_2022, AUTHOR = {Nishtha Phutela, Devanjali Relan, Goldie Gabrani, Ponnurangam Kumaraguru, Mesay Samuel}, TITLE = {Stress Classification Using Brain Signals Based on LSTM Network}, BOOKTITLE = {Computational Intelligence and Neuroscience}. YEAR = {2022}}
The early diagnosis of stress symptoms is essential for preventing various mental disorders such as depression. Electroencephalography (EEG) signals are frequently employed in stress detection research as an inexpensive and noninvasive modality. This paper proposes a stress classification system that utilizes EEG signals. EEG signals from thirty-five volunteers, acquired with a commercially available 4-electrode Muse EEG headband, were analysed. Four movie clips were chosen as stress elicitation material: two clips containing emotionally inductive scenes were selected to induce stress, and two clips with many comedy scenes were chosen as non-stress material. The recorded signals were then used to build the stress classification model. We compared the Multilayer Perceptron (MLP) and Long Short-Term Memory (LSTM) architectures for classifying the stress and non-stress groups. The maximum classification accuracy of 93.17% was achieved using a two-layer LSTM architecture.
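For orientation, a two-layer LSTM classifier of the kind the abstract describes might look like the sketch below; the channel count, window length, and hidden size are illustrative assumptions, not the authors' configuration.

    import torch
    import torch.nn as nn

    class StressLSTM(nn.Module):
        def __init__(self, n_channels=4, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(n_channels, hidden, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden, 2)         # stress vs. non-stress

        def forward(self, x):                        # x: (batch, time, channels)
            out, _ = self.lstm(x)
            return self.head(out[:, -1])             # classify from the last timestep

    model = StressLSTM()
    dummy = torch.randn(8, 256, 4)                   # 8 windows of 256 EEG samples
    print(model(dummy).shape)                        # torch.Size([8, 2])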
Ask It Right! Identifying Low-Quality questions on Community Question Answering Services
Udit Arora,Nidhi Goyal,Anmol Goel,Niharika Sachdeva,Ponnurangam Kumaraguru
International Joint Conference on Neural Networks, IJCNN, 2022
@inproceedings{bib_Ask__2022, AUTHOR = {Udit Arora, Nidhi Goyal, Anmol Goel, Niharika Sachdeva, Ponnurangam Kumaraguru}, TITLE = {Ask It Right! Identifying Low-Quality questions on Community Question Answering Services}, BOOKTITLE = {International Joint Conference on Neural Networks}. YEAR = {2022}}
Stack Overflow is a Community Question Answering service that attracts millions of users to seek answers to their questions. Maintaining high-quality content is necessary for relevant question retrieval, question recommendation, and enhancing the user experience. Manually removing low-quality content from the platform is time-consuming and challenging for site moderators. Thus, it is imperative to assess the content quality by automatically detecting and ‘closing’ the low-quality questions. Previous works have explored lexical, community-based, vote-based, and style-based features to detect low-quality questions. These approaches are limited to writing styles, textual, and handcrafted features. However, these features fall short in understanding semantic features and capturing the implicit relationships between tags and questions. In contrast, we propose LQuaD (Low-Quality Question Detection), a multi-tier hybrid framework that, a) incorporates semantic information of questions associated with each post using transformers, b) includes the question and tag information that enables learning via a graph convolutional network. LQuaD outperforms the state-of-the-art methods by a 21% higher F1-score on the dataset of 2.8 million questions. Furthermore, we apply survival analysis which acts as a proactive intervention to reduce the number of questions closed by informing users to take appropriate action. We find that the timeframe between the stages from the question’s creation till it gets ‘closed’ varies significantly for tags and different ‘closing’ reasons for these questions.
“The Times They Are-a-Changin”: The Effect of the Covid-19 Pandemic on Online Music Sharing in India
Tanvi Kamble,Pooja Govind Desur,Amanda Krause,Ponnurangam Kumaraguru,Vinoo A R
International Conference on Social Informatics, SocInfo, 2022
@inproceedings{bib_“T_2022, AUTHOR = {Tanvi Kamble, Pooja Govind Desur, Amanda Krause, Ponnurangam Kumaraguru, Vinoo A R}, TITLE = {“The Times They Are-a-Changin”: The Effect of the Covid-19 Pandemic on Online Music Sharing in India}, BOOKTITLE = {International Conference on Social Informatics}. YEAR = {2022}}
Music sharing trends have been shown to change during times of socio-economic crises. Studies have also shown that music can act as a social surrogate, helping to significantly reduce loneliness by acting as an empathetic friend. We explored these phenomena through a novel study of online music sharing during the Covid-19 pandemic in India. We collected tweets from the popular social media platform Twitter during India’s first and second wave of the pandemic (n = 1,364). We examined the different ways in which music was able to accomplish the role of a social surrogate via analyzing tweet text using Natural Language Processing techniques. Additionally, we analyzed the emotional connotations of the music shared through the acoustic features and lyrical content and compared the results between pandemic and pre-pandemic times. It was observed that the role of music shifted to a more community focused function rather than tending to a more self-serving utility. Results demonstrated that people shared music during the Covid-19 pandemic which had lower valence and shared songs with topics that reflected turbulent times such as Hardship and Exclusion when compared to songs shared during pre-Covid times. The results are further discussed in the context of individualistic versus collectivistic cultures.
Leveraging Intra and Inter Modality Relationship for Multimodal Fake News Detection
Shivangi Singhal,Tanisha Pandey,Saksham Mrig,Rajiv Ratn Shah,Ponnurangam Kumaraguru
International Conference on World wide web, WWW, 2022
@inproceedings{bib_Leve_2022, AUTHOR = {Shivangi Singhal, Tanisha Pandey, Saksham Mrig, Rajiv Ratn Shah, Ponnurangam Kumaraguru}, TITLE = {Leveraging Intra and Inter Modality Relationship for Multimodal Fake News Detection}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2022}}
Recent years have witnessed a massive growth in the proliferation of fake news online. User-generated content is a blend of text and visual information leading to producing different variants of fake news. As a result, researchers started targeting multimodal methods for fake news detection. Existing methods capture high-level information from different modalities and jointly model them to decide. Given multiple input modalities, we hypothesize that not all modalities may be equally responsible for decision-making. Hence, this paper presents a novel architecture that effectively identifies and suppresses information from weaker modalities and extracts relevant information from the strong modality on a per-sample basis. We also establish intra-modality relationship by extracting fine-grained image and text features. We conduct extensive experiments on real-world datasets to show that our approach outperforms the state-of-the-art by an average of 3.05% and 4.525% on accuracy and F1-score, respectively. We also release the code, implementation details, and model checkpoints for the community’s interest.
TweetBoost: Influence of Social Media on NFT Valuation
Arnav Kapoor,Dipanwita Guhathakurta,Mehul Mathur,Rupanshu Yadav,Manish Gupta,Ponnurangam Kumaraguru
International Conference on World wide web, WWW, 2022
@inproceedings{bib_Twee_2022, AUTHOR = {Arnav Kapoor, Dipanwita Guhathakurta, Mehul Mathur, Rupanshu Yadav, Manish Gupta, Ponnurangam Kumaraguru}, TITLE = {TweetBoost: Influence of Social Media on NFT Valuation}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2022}}
NFT or Non-Fungible Token is a token that certifies a digital asset to be unique. A wide range of assets including, digital art, music, tweets, memes, are being sold as NFTs. NFT-related content has been widely shared on social media sites such as Twitter. We aim to understand the dominant factors that influence NFT asset valuation. Towards this objective, we create a first-of-its-kind dataset linking Twitter and OpenSea (the largest NFT marketplace) to capture social media profiles and linked NFT assets. Our dataset contains 245,159 tweets posted by 17,155 unique users, directly linking 62,997 NFT assets on OpenSea worth 19 Million USD. We have made the dataset public. We analyze the growth of NFTs, characterize the Twitter users promoting NFT assets, and gauge the impact of Twitter features on the virality of an NFT. Further, we investigate the effectiveness of different social media and NFT platform features by experimenting with multiple machine learning and deep learning models to predict an asset's value. Our results show that social media features improve the accuracy by 6% over baseline models that use only NFT platform features. Among social media features, count of user membership lists, number of likes and retweets are important features.
Contrastive Personalization Approach to Suspect Identification (Student Abstract)
Devansh Gupta,Drishti Bhasin,Sarthak Bhagat,Shagun Uppal,Ponnurangam Kumaraguru,Rajiv Ratn Shah
AAAI Conference on Artificial Intelligence, AAAI, 2022
@inproceedings{bib_Cont_2022, AUTHOR = {Devansh Gupta, Drishti Bhasin, Sarthak Bhagat, Shagun Uppal, Ponnurangam Kumaraguru, Rajiv Ratn Shah}, TITLE = {Contrastive Personalization Approach to Suspect Identification (Student Abstract)}, BOOKTITLE = {AAAI Conference on Artificial Intelligence}. YEAR = {2022}}
Targeted image retrieval has long been a challenging problem since each person has a different perception of different features leading to inconsistency among users in describing the details of a particular image. Due to this, each user needs a system personalized according to the way they have structured the image in their mind. One important application of this task is suspect identification in forensic investigations where a witness needs to identify the suspect from an existing criminal database. Existing methods require the attributes for each image or suffer from poor latency during training and inference. We propose a new approach to tackle this problem through explicit relevance feedback by introducing a novel loss function and a corresponding scoring function. For this, we leverage contrastive learning on the user feedback to generate the next set of suggested images while improving the level of personalization with each user feedback iteration.
HashSet - A Dataset For Hashtag Segmentation
Kodali Prashant,Akshala Bhatnagar,Naman Ahuja,Manish Shrivastava,Ponnurangam Kumaraguru
International Conference on Language Resources and Evaluation, LREC, 2022
@inproceedings{bib_Hash_2022, AUTHOR = {Kodali Prashant, Akshala Bhatnagar, Naman Ahuja, Manish Shrivastava, Ponnurangam Kumaraguru}, TITLE = {HashSet - A Dataset For Hashtag Segmentation}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2022}}
Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information like topic and sentiment, which are useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways - transliterating and mixing languages, spelling variations, creative named entities. Benchmark datasets used for the hashtag segmentation task - STAN, BOUN - are small in size and extracted from a single set of tweets. However, datasets should reflect the variations in writing styles of hashtags and also account for domain and language specificity, failing which the results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and datasets should be carefully curated. To this end, we propose HashSet, a dataset comprising: a) a 1.9k manually annotated dataset; and b) a 3.3M loosely supervised dataset. The HashSet dataset is sampled from a different set of tweets when compared to existing datasets and provides an alternate distribution of hashtags to build and validate hashtag segmentation models. We show that the performance of SOTA models for Hashtag Segmentation drops substantially on the proposed dataset, indicating that the proposed dataset provides an alternate set of hashtags to train and assess models. Datasets and results are released publicly and can be accessed from https://github.com/prashantkodali/HashSet
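To make the task concrete, a toy segmenter (unrelated to the models benchmarked in the paper) can split a hashtag by dynamic programming over a small vocabulary, penalizing characters not covered by known words:

    def segment_hashtag(tag, vocab):
        """Split a hashtag into words via DP; unknown characters cost their length."""
        tag = tag.lstrip("#").lower()
        n = len(tag)
        best = [None] * (n + 1)                      # best[i] = (cost, split of tag[:i])
        best[0] = (0, [])
        for i in range(1, n + 1):
            for j in range(max(0, i - 20), i):
                if best[j] is None:
                    continue
                piece = tag[j:i]
                cost = best[j][0] + (0 if piece in vocab else len(piece))
                if best[i] is None or cost < best[i][0]:
                    best[i] = (cost, best[j][1] + [piece])
        return best[n][1]

    vocab = {"good", "morning", "india", "monday", "motivation"}
    print(segment_hashtag("#goodmorningindia", vocab))   # ['good', 'morning', 'india']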
SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing
Kodali Prashant,Anmol Goel,Monojit Choudhury,Manish Shrivastava,Ponnurangam Kumaraguru
Findings of the Association for Computational Linguistics, FACL, 2022
@inproceedings{bib_SyMC_2022, AUTHOR = {Kodali Prashant, Anmol Goel, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru}, TITLE = {SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing}, BOOKTITLE = {Findings of the Association for Computational Linguistics}. YEAR = {2022}}
Code mixing is the linguistic phenomenon where bilingual speakers tend to switch between two or more languages in conversations. Recent work on code-mixing in computational settings has leveraged social media code-mixed texts to train NLP models. To capture the variety of code mixing within and across corpora, measures based on Language ID (LID) tags, such as CMI, have been proposed. The syntactic variety and patterns of code-mixing, and their relationship to computational models' performance, remain underexplored. In this work, we investigate a collection of English(en)-Hindi(hi) code-mixed datasets from a syntactic lens to propose SyMCoM, an indicator of syntactic variety in code-mixed text with intuitive theoretical bounds. We train a SoTA en-hi PoS tagger with an accuracy of 93.4% to reliably compute PoS tags on a corpus, demonstrate the utility of SyMCoM by applying it to various syntactic categories across a collection of datasets, and compare datasets using the measure.
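One natural reading of a per-category score with the stated [-1, 1] bounds is the signed ratio of language-ID counts within a syntactic category; the sketch below illustrates that reading on a toy tagged sentence and may differ from the paper's exact formula.

    from collections import Counter

    def symcom_category(tokens, category):
        """tokens: (word, pos_tag, lang) triples with lang in {'en', 'hi'}."""
        counts = Counter(lang for _, pos, lang in tokens if pos == category)
        en, hi = counts.get("en", 0), counts.get("hi", 0)
        return (en - hi) / (en + hi) if (en + hi) else 0.0

    sentence = [("mujhe", "PRON", "hi"), ("weekend", "NOUN", "en"), ("par", "ADP", "hi"),
                ("movie", "NOUN", "en"), ("dekhni", "VERB", "hi"), ("hai", "AUX", "hi")]
    print(symcom_category(sentence, "NOUN"))   # 1.0: nouns in this sentence are all English
    print(symcom_category(sentence, "VERB"))   # -1.0: verbs are all Hindi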
Diagnosing Data from ICTs to Provide Focused Assistance in Agricultural Adoptions
Ashwin Singh,Lokesh Garg,Erica Arya,Mallika Subramanian,Anmol Agarwal,Pratyush Pratap Priyadarshi,Shrey Gupta,Kiran Garimella,Ponnurangam Kumaraguru,Sanjeev Kumar,Ritesh Kumar
International Conference on Information and Communication Technologies and Development, ICTD, 2022
@inproceedings{bib_Diag_2022, AUTHOR = {Ashwin Singh, Lokesh Garg, Erica Arya, Mallika Subramanian, Anmol Agarwal, Pratyush Pratap Priyadarshi, Shrey Gupta, Kiran Garimella, Ponnurangam Kumaraguru, Sanjeev Kumar, Ritesh Kumar}, TITLE = {Diagnosing Data from ICTs to Provide Focused Assistance in Agricultural Adoptions}, BOOKTITLE = {International Conference on Information and Communication Technologies and Development}. YEAR = {2022}}
In the last two decades, Information and Communication Technologies (ICTs) have played a pivotal role in empowering rural populations in India by making knowledge more accessible. Digital Green is one such ICT that employs a participatory approach with smallholder farmers to produce instructional agricultural videos that encompass content specific to them. With the help of human mediators, they disseminate these videos to farmers using projectors to improve the adoption of agricultural practices. Digital Green’s web-based data tracker (CoCo) stores the attendance and adoption logs of millions of farmers, the videos screened to them and their demographic information. In our work, we leverage this data for a period of ten years between 2010-2020 across five states in India where Digital Green is most active and use it to conduct a holistic evaluation of the ICT. First, we find disparities in the adoption rates of farmers, following which we use statistical tests to identify the different factors that lead to these disparities as well as gender-based inequalities. We find that farmers with higher adoption rates adopt videos of shorter duration and belong to smaller villages. Second, to provide assistance to farmers facing challenges, we model the adoption of practices from a video as a prediction problem and experiment with different model architectures. Our classifier achieves accuracies ranging from 79% to 90% across the five states, demonstrating its potential for assisting future ethnographic investigations. Third, we use SHAP values in conjunction with our model for explaining the impact of various network, content and demographic features on adoption. Our research finds that farmers greatly benefit from past adopters of a video from their group and village. We also discover that videos with a low content-specificity benefit some farmers more than others. Next, we highlight the implications of our findings by translating them into recommendations for providing focused assistance, community building, video screening, revisiting participatory approach and mitigating inequalities. Lastly, we conclude with a discussion on how our work can assist future investigations into the lived experiences of farmers.
Learning to Automate Follow-up Question Generation using Process Knowledge for Depression Triage on Reddit Posts
Shrey Gupta,Anmol Agarwal, Manas Gaur,Kaushik Roy,Vignesh Narayanan,Ponnurangam Kumaraguru,Amit Sheth
Workshop on Computational Linguistics and Clinical Psychology, CLPsych-w, 2022
@inproceedings{bib_Lear_2022, AUTHOR = {Shrey Gupta, Anmol Agarwal, Manas Gaur, Kaushik Roy, Vignesh Narayanan, Ponnurangam Kumaraguru, Amit Sheth}, TITLE = {Learning to Automate Follow-up Question Generation using Process Knowledge for Depression Triage on Reddit Posts}, BOOKTITLE = {Workshop on Computational Linguistics and Clinical Psychology}. YEAR = {2022}}
Conversational Agents (CAs) powered with deep language models (DLMs) have shown tremendous promise in the domain of mental health. Prominently, the CAs have been used to provide informational or therapeutic services (e.g., cognitive behavioral therapy) to patients. However, the utility of CAs to assist in mental health triaging has not been explored in the existing work as it requires a controlled generation of follow-up questions (FQs), which are often initiated and guided by the mental health professionals (MHPs) in clinical settings. In the context of ‘depression’, our experiments show that DLMs coupled with process knowledge in a mental health questionnaire generate 12.54% and 9.37% better FQs based on similarity and longest common subsequence matches to questions in the PHQ-9 dataset respectively, when compared with DLMs without process knowledge support. Despite coupling with process knowledge, we find that DLMs are still prone to hallucination, i.e., generating redundant, irrelevant, and unsafe FQs. We demonstrate the challenge of using existing datasets to train a DLM for generating FQs that adhere to clinical process knowledge. To address this limitation, we prepared an extended PHQ-9 based dataset, PRIMATE, in collaboration with MHPs. PRIMATE contains annotations regarding whether a particular question in the PHQ-9 dataset has already been answered in the user’s initial description of the mental health condition. We used PRIMATE to train a DLM in a supervised setting to identify which of the PHQ-9 questions can be answered directly from the user’s
Understanding the Impact of Awards on Award Winners and the Community on Reddit
Avinash Tulasi,Mainack Mondal,Arun Balaji Buduru,Ponnurangam Kumaraguru
IEEE International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2022
@inproceedings{bib_Unde_2022, AUTHOR = {Avinash Tulasi, Mainack Mondal, Arun Balaji Buduru, Ponnurangam Kumaraguru}, TITLE = {Understanding the Impact of Awards on Award Winners and the Community on Reddit}, BOOKTITLE = {IEEE International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2022}}
Non-financial incentives in the form of awards often act as a driver of positive reinforcement and elevation of social status in the offline world. The elevated social status results in people becoming more active, aligning to a change in the communities' expectations. However, the longevity of the social influence and community acceptance that such awards confer is not well understood in the online world. Our work aims to shed light on the impact of these awards on the awardee and the community. We focus on three large subreddits with a snapshot of 219K posts and 5.8 million comments contributed by 88K Reddit users who received 14,146 awards. Our work establishes that the behaviour of awardees changes statistically significantly for a short time after getting an award; however, the change is ephemeral since the awardees return to their pre-award behaviour within days. Additionally, via a user survey, we identified a long-lasting impact of awards: we found that the community's stance softened towards awardees.
The Pursuit of Being Heard: An Unsupervised Approach to Narrative Detection in Online Protest
Kumari Neha,Vibhu Agrawal,Arun Balaji Buduru,Ponnurangam Kumaraguru
IEEE International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2022
@inproceedings{bib_The__2022, AUTHOR = {Kumari Neha, Vibhu Agrawal, Arun Balaji Buduru, Ponnurangam Kumaraguru}, TITLE = {The Pursuit of Being Heard: An Unsupervised Approach to Narrative Detection in Online Protest}, BOOKTITLE = {IEEE International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2022}}
Protests and mass mobilization are scarce; however, they may lead to dramatic outcomes when they occur. Social media such as Twitter has become a center point for the organization and development of online protests worldwide. It becomes crucial to decipher various narratives shared during an online protest to understand people’s perceptions. In this work, we propose an unsupervised clustering-based framework to understand the narratives present in a given online protest. Through a comparative analysis of tweet clusters in 3 protests around government policy bills, we contribute novel insights about narratives shared during an online protest. Across case studies of government policy-induced online protests in India and the United Kingdom, we found familiar mass mobilization narratives across protests. We found reports of on-ground activities and call-to-action for people’s participation narrative clusters in all three protests under study. We also found protest-centric narratives in different protests, such as skepticism around the topic. The results from our analysis can be used to understand and compare people’s perceptions of future mass mobilizations.
Note: Urbanization and Literacy as factors in Politicians’ Social Media Use in a largely Rural State: Evidence from Uttar Pradesh, India
Asmit Kumar Singh,Jivitesh Jain,V A Lalitha Kameswari,Ponnurangam Kumaraguru,Joyojeet Pal
SIGCAS Conference on Computing and Sustainable Societies, COMPASS, 2022
@inproceedings{bib_Note_2022, AUTHOR = {Asmit Kumar Singh, Jivitesh Jain, V A Lalitha Kameswari, Ponnurangam Kumaraguru, Joyojeet Pal}, TITLE = {Note: Urbanization and Literacy as factors in Politicians’ Social Media Use in a largely Rural State: Evidence from Uttar Pradesh, India}, BOOKTITLE = {SIGCAS Conference on Computing and Sustainable Societies}. YEAR = {2022}}
With Twitter growing as a preferred channel for outreach among major politicians, there have been focused efforts on online communication, even in election campaigns in primarily rural regions. In this paper, we examine the relationship between politicians' use of social media and the levels of urbanization and literacy by compiling a comprehensive list of Twitter handles of political party functionaries and election candidates in the run-up to the 2022 State Assembly elections in Uttar Pradesh, India. We find statistically significant relationships between political Twitter presence and levels of urbanization, as well as levels of literacy. We also find a strong correlation between vote share and Twitter presence in the winning party, a relationship that is even stronger in urban districts. This provides empirical evidence that social media is already a central part of electoral outreach processes in the Global South, but that this is …
Fake news in India: scale, diversity, solution, and opportunities
Shivangi Singhal,Rishabh Kaushal,Rajiv Ratn Shah,Ponnurangam Kumaraguru
Communications of the ACM, CACM, 2022
@inproceedings{bib_Fake_2022, AUTHOR = {SHIVANGI SINGHAL, RISHABH KAUSHAL, RAJIV RATN SHAH, Ponnurangam Kumaraguru}, TITLE = {Fake news in India: scale, diversity, solution, and opportunities}, BOOKTITLE = {Communications of the ACM}. YEAR = {2022}}
India is a nation that realizes unity in diversity. Indians follow different religions, practice different customs and traditions, and speak diverse languages. These factors and others make it difficult to detect fake news in India. More specifically, we observe the following: Multilinguality. The mother tongues of Indians are diverse; there are 22 official languages and only 10.67% of the population converse in English. Current fake news detection solutions are most effective for English and might fail to identify and process information in other languages. Instant messaging platforms. We cannot undermine WhatsApp's role in forming and mobilizing the online public. Since WhatsApp is end-to-end encrypted, identifying and quashing false stories is possible …
Erasing Labor with Labor: Dark Patterns and Lockstep Behaviors on Google Play
Ashwin Singh,Arvindh A,Pulak Malhotra,Pooja Govind Desur,Ayushi Jain,Duen Horng Chau,Ponnurangam Kumaraguru
ACM Conference on Hypertext and Social Media, HT&SM, 2022
@inproceedings{bib_Eras_2022, AUTHOR = {Ashwin Singh, Arvindh A, Pulak Malhotra, Pooja Govind Desur, Ayushi Jain, Duen Horng Chau, Ponnurangam Kumaraguru}, TITLE = {Erasing Labor with Labor: Dark Patterns and Lockstep Behaviors on Google Play}, BOOKTITLE = {ACM Conference on Hypertext and Social Media}. YEAR = {2022}}
Google Play's policy forbids the use of incentivized installs, ratings, and reviews to manipulate the placement of apps. However, there still exist apps that incentivize installs for other apps on the platform. To understand how install-incentivizing apps affect users, we examine their ecosystem through a socio-technical lens and perform a mixed-methods analysis of their reviews and permissions. Our dataset contains 319K reviews collected daily over five months from 60 such apps that cumulatively account for over 160.5M installs. We perform qualitative analysis of reviews to reveal various types of dark patterns that developers incorporate in install-incentivizing apps, highlighting their normative concerns at both user and platform levels. Permissions requested by these apps validate our discovery of dark patterns, with over 92% of apps accessing sensitive user information. We find evidence of fraudulent reviews on install-incentivizing apps, following which we model them as an edge stream in a dynamic bipartite graph of apps and reviewers. Our proposed reconfiguration of a state-of-the-art microcluster anomaly detection algorithm yields promising preliminary results in detecting this fraud. We discover highly significant lockstep behaviors exhibited by reviews that aim to boost the overall rating of an install-incentivizing app. Upon evaluating the 50 most suspicious clusters of boosting reviews detected by the algorithm, we find (i) near-identical pairs of reviews across 94% (47 clusters), and (ii) over 35% of reviews (1,687 of 4,717) forming near-identical pairs within their cluster. Finally, we conclude with a discussion on how fraud is intertwined with labor and poses a threat to the trust and transparency of Google Play.
Contrastive Personalization Approach to Suspect Identification (Student Abstract)
Devansh Gupta,Drishti Bhasin,Sarthak Bhagat,Shagun Uppal,Ponnurangam Kumaraguru,Rajiv Ratn Shah
Association for the Advancement of Artificial Intelligence, AAAI, 2022
@inproceedings{bib_Cont_2022, AUTHOR = {Devansh Gupta, Drishti Bhasin, Sarthak Bhagat, Shagun Uppal, Ponnurangam Kumaraguru, Rajiv Ratn Shah}, TITLE = {Contrastive Personalization Approach to Suspect Identifcation (Student Abstract)}, BOOKTITLE = {Association for the Advancement of Artificial Intelligence}. YEAR = {2022}}
Targeted image retrieval has long been a challenging problem since each person has a different perception of different features, leading to inconsistency among users in describing the details of a particular image. Due to this, each user needs a system personalized according to the way they have structured the image in their mind. One important application of this task is suspect identification in forensic investigations, where a witness needs to identify the suspect from an existing criminal database. Existing methods require the attributes for each image or suffer from poor latency during training and inference. We propose a new approach to tackle this problem through explicit relevance feedback by introducing a novel loss function and a corresponding scoring function. For this, we leverage contrastive learning on the user feedback to generate the next set of suggested images while improving the level of personalization with each user feedback iteration.
CamPros at CASE 2022 Task 1: Transformer-based Multilingual Protest News Detection
Kumari Neha,Mrinal Anand,Tushar Mohan,Arun Balaji Buduru,Ponnurangam Kumaraguru
Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text, CASE - W, 2022
@inproceedings{bib_CamP_2022, AUTHOR = {Kumari Neha, Mrinal Anand, Tushar Mohan, Arun Balaji Buduru, Ponnurangam Kumaraguru}, TITLE = {CamPros at CASE 2022 Task 1: Transformer-based Multilingual Protest News Detection}, BOOKTITLE = {work shop on Challenges and Applications of Automated Extraction of Socio-political Events from Text}. YEAR = {2022}}
Socio-political protests often lead to grave consequences when they occur. The early detection of such protests is very important for taking early precautionary measures. However, the main shortcoming of protest event detection is the scarcity of sufficient training data for specific language categories, which makes it difficult to train data-hungry deep learning models effectively. Therefore, cross-lingual and zero-shot learning models are needed to detect events in various low-resource languages. This paper proposes a multilingual cross-document level event detection approach using pre-trained transformer models, developed for Shared Task 1 at CASE 2022. The shared task constituted four subtasks for event detection at different granularity levels, i.e., document level to token level, spread over multiple languages (English, Spanish, Portuguese, Turkish, Urdu, and Mandarin). Our system achieves an average F1 score of 0.73 for document-level event detection tasks. Our approach secured 2nd position for the Hindi language in subtask 1 with an F1 score of 0.80, while for Spanish, we secured 4th position with an F1 score of 0.69. Our code is available at https://github.com/nehapspathak/campros/.
A Tale of Two Sides: Study of Protesters and Counter-protesters on #CitizenshipAmendmentAct Campaign on Twitter
Kumari Neha,Vibhu Agrawal,Vishwesh Kumar,Tushar Mohan,Abhishek Chopra,Arun Balaji Buduru,Rajesh Sharma,Ponnurangam Kumaraguru
ACM Web Science Conference, ACMWSC, 2022
@inproceedings{bib_A_Ta_2022, AUTHOR = {Kumari Neha, Vibhu Agrawal, Vishwesh Kumar, Tushar Mohan, Abhishek Chopra, Arun Balaji Buduru, Rajesh Sharma, Ponnurangam Kumaraguru}, TITLE = {A Tale of Two Sides: Study of Protesters and Counter-protesters on# CitizenshipAmendmentAct Campaign on Twitter}, BOOKTITLE = {ACM Web Science Conference}. YEAR = {2022}}
Online social media platforms have evolved into a significant place for debate around socio-political phenomena such as government policies and bills. Studying online debates on such topics can help infer people's perception and acceptance of the happenings. At the same time, various inauthentic users that often pollute the democratic discussion of the subject need to be weeded out from the debate. The characterization of a campaign, keeping in mind the various forms of involved actors, thus becomes very important. On December 12, 2019, the Citizenship Amendment Act (CAA) was enacted by the Indian Government, triggering a debate on whether the act was unfair. In this work, we investigate users' perception of the #CitizenshipAmendmentAct on Twitter as the campaign unrolled with divergent discourse in the country. Keeping the campaign participants as the prime focus, we study 9,947,814 tweets produced by 275,111 users during the starting 3 months of the protest. Our study includes the analysis of user engagement, content, and network properties, with online accounts divided into authentic (genuine users) and inauthentic (bots, suspended, and deleted) users. Our findings show different themes in shared tweets among protesters and counter-protesters. We find the presence of inauthentic users on both sides of the discourse, with counter-protesters having more inauthentic users than protesters. The follow network of the users suggests homophily among users on the same side of the discourse and connections between various inauthentic and authentic users. This work contributes to filling the gap in understanding the role of users (on both sides) in a less-studied geo-location, India.
An Unsupervised, Geometric and Syntax-aware Quantification of Polysemy
Anmol Goel,Charu Sharma,Ponnurangam Kumaraguru
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2022
@inproceedings{bib_An_U_2022, AUTHOR = {Anmol Goel, Charu Sharma, Ponnurangam Kumaraguru}, TITLE = {An Unsupervised, Geometric and Syntax-aware Quantification of Polysemy}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}. YEAR = {2022}}
Polysemy is the phenomenon where a single word form possesses two or more related senses. It is an extremely ubiquitous part of natural language and analyzing it has sparked rich discussions in the linguistics, psychology and philosophy communities alike. With scarce attention paid to polysemy in computational linguistics, and even scarcer attention toward quantifying polysemy, in this paper, we propose a novel, unsupervised framework to compute and estimate polysemy scores for words in multiple languages. We infuse our proposed quantification with syntactic knowledge in the form of dependency structures. This informs the final polysemy scores of the lexicon motivated by recent linguistic findings that suggest there is an implicit relation between syntax and ambiguity/polysemy. We adopt a graph-based approach by computing the discrete Ollivier-Ricci curvature on a graph of the contextual nearest neighbors. We test our framework on curated datasets controlling for different sense distributions of words in 3 typologically diverse languages - English, French and Spanish. The effectiveness of our framework is demonstrated by significant correlations of our quantification with expert human annotated language resources like WordNet. We observe a 0.3 point increase in the correlation coefficient as compared to previous quantification studies in English. Our research leverages contextual language models and syntactic structures to empirically support the widely held theoretical linguistic notion that syntax is intricately linked to ambiguity/polysemy.
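As a rough illustration of the curvature-based quantification described above (not the authors' released code), the sketch below builds a k-nearest-neighbour graph over stand-in contextual embeddings and computes edge-level Ollivier-Ricci curvature as 1 - W1(m_u, m_v) / d(u, v) with lazy random-walk neighbour measures. The POT (`ot`), networkx (>= 2.8) and scikit-learn libraries, the random embeddings, and all parameter values are assumptions.

```python
import numpy as np
import networkx as nx
import ot  # POT: Python Optimal Transport
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 16))                      # stand-in contextual embeddings
knn = kneighbors_graph(emb, n_neighbors=5, mode="connectivity")
G = nx.from_scipy_sparse_array(knn)                  # undirected k-NN graph
G = G.subgraph(max(nx.connected_components(G), key=len)).copy()
dist = dict(nx.all_pairs_shortest_path_length(G))    # graph distances as ground cost

def lazy_walk_measure(G, x, alpha=0.5):
    """Mass alpha stays at x; the rest spreads uniformly over x's neighbours."""
    nbrs = list(G.neighbors(x))
    return [x] + nbrs, np.array([alpha] + [(1 - alpha) / len(nbrs)] * len(nbrs))

def ollivier_ricci(G, u, v, alpha=0.5):
    """kappa(u, v) = 1 - W1(m_u, m_v) / d(u, v) for an edge (u, v)."""
    nodes_u, m_u = lazy_walk_measure(G, u, alpha)
    nodes_v, m_v = lazy_walk_measure(G, v, alpha)
    cost = np.array([[dist[a][b] for b in nodes_v] for a in nodes_u], dtype=float)
    return 1.0 - ot.emd2(m_u, m_v, cost) / dist[u][v]

curvatures = [ollivier_ricci(G, u, v) for u, v in G.edges()]
print("mean edge curvature:", float(np.mean(curvatures)))
```

Aggregating such edge curvatures around a word's neighbourhood is one plausible way to turn local graph geometry into a per-word score; the paper's exact aggregation may differ.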
The Times They Are-a-Changin': The Effect of the Covid-19 Pandemic on Online Music Sharing in India
Tanvi Kamble,Pooja Govind Desur,Amanda Krause,Ponnurangam Kumaraguru,Vinoo A R
International Conference on Social Informatics, SocInfo, 2022
@inproceedings{bib_The__2022, AUTHOR = {Tanvi Kamble, Pooja Govind Desur, Amanda Krause, Ponnurangam Kumaraguru, Vinoo A R}, TITLE = {The Times They Are-a-Changin”: The Effect of the Covid-19 Pandemic on Online Music Sharing in India}, BOOKTITLE = {International Conference on Social Informatics}. YEAR = {2022}}
Music sharing trends have been shown to change during times of socio-economic crises. Studies have also shown that music can act as a social surrogate, helping to significantly reduce loneliness by acting as an empathetic friend. We explored these phenomena through a novel study of online music sharing during the Covid-19 pandemic in India. We collected tweets from the popular social media platform Twitter during India’s first and second wave of the pandemic (n = 1,364). We examined the different ways in which music was able to accomplish the role of a social surrogate via analyzing tweet text using Natural Language Processing techniques. Additionally, we analyzed the emotional connotations of the music shared through the acoustic features and lyrical content and compared the results between pandemic and pre-pandemic times. It was observed that the role of music shifted to a more community focused function rather than tending to a more self-serving utility. Results demonstrated that people shared music during the Covid-19 pandemic which had lower valence and shared songs with topics that reflected turbulent times such as Hardship and Exclusion when compared to songs shared during pre-Covid times. The results are further discussed in the context of individualistic versus collectivistic cultures.
SyMCoM - Syntactic Measure of Code Mixing: A Study of English-Hindi Code-Mixing
Kodali Prashant,Anmol Goel,Monojit Choudhury,Manish Shrivastava,Ponnurangam Kumaraguru
Conference of the Association of Computational Linguistics, ACL, 2022
@inproceedings{bib_SyMC_2022, AUTHOR = {Kodali Prashant, Anmol Goel, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru}, TITLE = {SyMCoM-Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing}, BOOKTITLE = {Conference of the Association of Computational Linguistics}. YEAR = {2022}}
Code mixing is the linguistic phenomenon where bilingual speakers tend to switch between two or more languages in conversations. Recent work on code-mixing in computational settings has leveraged social media code-mixed texts to train NLP models. To capture the variety of code mixing within and across corpora, measures based on Language ID (LID) tags, such as CMI, have been proposed. The syntactic variety and patterns of code-mixing, and their relationship to computational models' performance, remain underexplored. In this work, we investigate a collection of English(en)-Hindi(hi) code-mixed datasets from a syntactic lens to propose SyMCoM, an indicator of syntactic variety in code-mixed text with intuitive theoretical bounds. We train a SoTA en-hi PoS tagger with an accuracy of 93.4% to reliably compute PoS tags on a corpus, and demonstrate the utility of SyMCoM by applying it to various syntactic categories across a collection of datasets.
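A toy sketch of the kind of LID-aware, per-syntactic-category score the abstract above describes: for each PoS category, contrast how often tokens of that category come from each language. The exact SyMCoM formula may differ from this contrast score, and the tokens, tags, and language labels are made up.

```python
from collections import Counter

def per_pos_language_contrast(tokens):
    """tokens: (word, pos_tag, lang) triples, lang in {'en', 'hi'}.
    Returns, per PoS category, a score in [-1, 1]: +1 all-English, -1 all-Hindi."""
    by_pos = {}
    for _, pos, lang in tokens:
        by_pos.setdefault(pos, Counter())[lang] += 1
    return {pos: (c["en"] - c["hi"]) / (c["en"] + c["hi"]) for pos, c in by_pos.items()}

sentence = [("main", "PRON", "hi"), ("office", "NOUN", "en"),
            ("late", "ADJ", "en"), ("pahunchunga", "VERB", "hi")]
print(per_pos_language_contrast(sentence))
# {'PRON': -1.0, 'NOUN': 1.0, 'ADJ': 1.0, 'VERB': -1.0}
```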
PreCogIIITH at HinglishEval: Leveraging code-mixing metrics & language model embeddings to estimate code-mix quality
Kodali Prashant,Tanmay Sachan,Akshay Goindani,Anmol Goel,Naman Ahuja,Manish Shrivastava,Ponnurangam Kumaraguru
Technical Report, arXiv, 2022
@inproceedings{bib_Prec_2022, AUTHOR = {Kodali Prashant, Tanmay Sachan, Akshay Goindani, Anmol Goel, Naman Ahuja, Manish Shrivastava, Ponnurangam Kumaraguru}, TITLE = {Precogiiith at hinglisheval: Leveraging code-mixing metrics & language model embeddings to estimate code-mix quality}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Code-Mixing is a phenomenon of mixing two or more languages in a speech event and is prevalent in multilingual societies. Given the low-resource nature of Code-Mixing, machine generation of code-mixed text is a prevalent approach for data augmentation. However, evaluating the quality of such machine-generated code-mixed text is an open problem. In our submission to HinglishEval, a shared task collocated with INLG 2022, we attempt to model the factors that impact the quality of synthetically generated code-mixed text by predicting ratings for code-mix quality. The HinglishEval Shared Task consists of two sub-tasks: a) quality rating prediction; b) disagreement prediction. We leverage popular code-mixing metrics and embeddings of multilingual large language models (MLLMs) as features, and train task-specific MLP regression models. Our approach could not beat the baseline results. However, for Subtask-A our team ranked a close second on the F-1 and Cohen's Kappa Score measures and first on the Mean Squared Error measure. For Subtask-B our approach ranked third on F1 score, and first on the Mean Squared Error measure. Code of our submission can be accessed here.
HLDC: Hindi Legal Documents Corpus
Arnav Kapoor,Mudit Dhawan,Anmol Goel,T H Arjun,Akshala Bhatnagar,Vibhu Agrawal,Amul Agrawal,Arnab Bhattacharya,Ponnurangam Kumaraguru
Findings of the Association for Computational Linguistics, FACL, 2022
@inproceedings{bib_HLDC_2022, AUTHOR = {Arnav Kapoor, Mudit Dhawan, Anmol Goel, T H Arjun, Akshala Bhatnagar, Vibhu Agrawal, Amul Agrawal, Arnab Bhattacharya, Ponnurangam Kumaraguru}, TITLE = {HLDC: Hindi Legal Documents Corpus}, BOOKTITLE = {Findings of the Association for Computational Linguistics}. YEAR = {2022}}
Many populous countries including India are burdened with a considerable backlog of legal cases. Development of automated systems that could process legal documents and augment legal practitioners can mitigate this. However, there is a dearth of high-quality corpora that is needed to develop such data-driven systems. The problem gets even more pronounced in the case of low-resource languages such as Hindi. In this resource paper, we introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. Documents are cleaned and structured to enable the development of downstream applications. Further, as a use-case for the corpus, we introduce the task of bail prediction. We experiment with a battery of models and propose a Multi-Task Learning (MTL) based model for the same. MTL models use summarization as an auxiliary task along with bail prediction as the main task. Experiments with different models are indicative of the need for further research in this area. We release the corpus and model implementation code with this paper: https://github.com/Exploration-Lab/HLDC
On the Vulnerability of Community Structure in Complex Networks
Viraj Parimi,Arindam Pal,Sushmita Ruj,Ponnurangam Kumaraguru,Tanmoy Chakraborty
Principles of Social Networking, PSN, 2022
@inproceedings{bib_On_t_2022, AUTHOR = {Viraj Parimi, Arindam Pal, Sushmita Ruj, Ponnurangam Kumaraguru, Tanmoy Chakraborty}, TITLE = {On the Vulnerability of Community Structure in Complex Networks}, BOOKTITLE = {Principles of Social Networking}. YEAR = {2022}}
In this paper, we study the role of nodes and edges in a complex network in dictating the robustness of a community structure towards structural perturbations. Specifically, we attempt to identify all vital nodes, which, when removed, would lead to a large change in the underlying community structure of the network. This problem is critical because the community structure of a network allows us to explore deep underlying insights into how the function and topology of the network affect each other. Moreover, it even provides a way to condense large networks into smaller modules where each community acts as a meta node and aids in more straightforward network analysis. If the community structure were to be compromised by either accidental or intentional perturbations to the network, that would make such analysis difficult. Since identifying such vital nodes is computationally intractable, we propose a suite of heuristics that allow us to find solutions close to optimality. To show the effectiveness of our approach, we first test these heuristics on small networks and then move to larger networks to show that we achieve similar results. Further analysis reveals that the proposed approaches are useful for analyzing the vulnerability of communities in networks irrespective of their size and scale. Additionally, we show the performance through an extrinsic evaluation framework: we employ two tasks, i.e., link prediction and information diffusion, and show that the effect of our algorithms on these tasks is higher than the other baselines.
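The abstract above frames vital-node identification as finding nodes whose removal most disturbs the community structure. The sketch below is a naive brute-force illustration of that objective, not one of the paper's heuristics: it scores each node by the drop in modularity of the detected communities after the node is removed, on a toy graph.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.karate_club_graph()                       # toy network
base_q = modularity(G, greedy_modularity_communities(G))

def modularity_drop(G, node):
    """How much community quality degrades when `node` is deleted."""
    H = G.copy()
    H.remove_node(node)
    return base_q - modularity(H, greedy_modularity_communities(H))

scores = {v: modularity_drop(G, v) for v in G.nodes()}
top = sorted(scores, key=scores.get, reverse=True)[:5]
print("candidate vital nodes:", top)
```

Brute force like this does not scale, which is exactly why the paper resorts to heuristics for larger networks.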
COVID-19 Mask Usage and Social Distancing in Social Media Images: Large-scale Deep Learning Analysis
Asmit Kumar Singh,Paras Mehan,Divyanshu Sharma,Rohan Pandey,Tavpritesh Sethi,Ponnurangam Kumaraguru
JMIR Public Health and Surveillance, JMIR-PHS, 2022
@inproceedings{bib_COVI_2022, AUTHOR = {Asmit Kumar Singh, Paras Mehan, Divyanshu Sharma, Rohan Pandey, Tavpritesh Sethi, Ponnurangam Kumaraguru}, TITLE = {COVID-19 Mask Usage and Social Distancing in Social Media Images: Large-scale Deep Learning Analysis}, BOOKTITLE = {JMIR Public Health and Surveillance}. YEAR = {2022}}
Background: The adoption of nonpharmaceutical interventions and their surveillance are critical for detecting and stopping possible transmission routes of COVID-19. A study of the effects of these interventions can help shape public health decisions. The efficacy of nonpharmaceutical interventions can be affected by public behaviors in events, such as protests. We examined mask use and mask fit in the United States, from social media images, especially during the Black Lives Matter (BLM) protests, representing the first large-scale public gatherings in the pandemic. Objective: This study assessed the use and fit of face masks and social distancing in the United States and events of large physical gatherings through public social media images from 6 cities and BLM protests. Methods: We collected and analyzed 2.04 million public social media images from New York City, Dallas, Seattle, New Orleans, Boston, and Minneapolis between February 1, 2020, and May 31, 2020. We evaluated correlations between online mask usage trends and COVID-19 cases. We looked for significant changes in mask use patterns and group posting around important policy decisions. For BLM protests, we analyzed 195,452 posts from New York and Minneapolis from May 25, 2020, to July 15, 2020. We looked at differences in adopting the preventive measures in the BLM protests through the mask fit score. Results: The average percentage of group pictures dropped from 8.05% to 4.65% after the lockdown week. New York City, Dallas, Seattle, New Orleans, Boston, and Minneapolis observed increases of 5.0%, 7.4%, 7.4%, 6.5%, 5.6%, and 7.1%, respectively, in mask use between February 2020 and May 2020. Boston and Minneapolis observed significant increases of 3.0% and 7.4%, respectively, in mask use after the mask mandates. Differences of 6.2% and 8.3% were found in group pictures between BLM posts and non-BLM posts for New York City and Minneapolis, respectively. In contrast, the differences in the percentage of masked faces in group pictures between BLM and non-BLM posts were 29.0% and 20.1% for New York City and Minneapolis, respectively. Across protests, 35% of individuals wore a mask with a fit score greater than 80%. Conclusions: The study found a significant drop in group posting when the stay-at-home laws were applied and a significant increase in mask use for 2 of 3 cities where masks were mandated. Although a positive trend toward mask use and social distancing was observed, a high percentage of posts showed disregard for the guidelines. BLM-related posts captured the lack of seriousness to safety measures, with a high percentage of group pictures and low mask fit scores. Thus, the methodology provides a directional indication of how government policies can be indirectly monitored through social media.
‘Will I Regret for This Tweet?’—Twitter User’s Behavior Analysis System for Private Data Disclosure
R Geetha,S Karthika,Ponnurangam Kumaraguru
The Computer Journal, CJ, 2022
@inproceedings{bib_‘W_2022, AUTHOR = {R Geetha, S Karthika, Ponnurangam Kumaraguru}, TITLE = {‘Will I Regret for This Tweet?’—Twitter User’s Behavior Analysis System for Private Data Disclosure}, BOOKTITLE = {The Computer Journal}. YEAR = {2022}}
Twitter is an extensively used micro-blogging site for publishing users' views on recent happenings. The wide reachability of messages over a large audience poses a threat, as the degree of personally identifiable information disclosed might lead to user regrets. The Tweet-Scan-Post system scans tweets contextually for sensitive messages. The tweet repository was generated using cyber-keywords for personal, professional and health tweets. The Rules of Sensitivity and Contextuality were defined based on standards established by various national regulatory bodies. The naive sensitivity regression function uses the Bag-of-Words model built from short text messages. The imbalanced classes in the dataset, with 25% sensitive and 75% insensitive tweets, result in misclassification. The system adopted stacked classification to combat the problem of imbalanced classes. The system initially applied various state-of-the-art algorithms and predicted 26% of the tweets to be sensitive. The proposed stacked classification approach increased the overall proportion of sensitive tweets to 35%. The system contributes a vocabulary set of 201 Sensitive Privacy Keywords using the boosting approach for three tweet categories. Finally, the system formulates a sensitivity scaling called TSP's Tweet Sensitivity Scale based on Senti-Cyber features composed of Sensitive Privacy Keywords, Cyber-keywords with Non-Sensitive Privacy Keywords and Non-Cyber-keywords to detect the degree of disclosed sensitive information.
Co-WIN: Really Winning? Analysing Inequity in India's Vaccination Response
Tanvi Karandikar,Puppala Avinash Prabhu,Mehul Mathur,Megha Arora,Hemank Lamba,Ponnurangam Kumaraguru
Technical Report, arXiv, 2022
@inproceedings{bib_Co-W_2022, AUTHOR = {Tanvi Karandikar, Puppala Avinash Prabhu, Mehul Mathur, MEGHA ARORA, HEMANK LAMBA, Ponnurangam Kumaraguru}, TITLE = {Co-WIN: Really Winning? Analysing Inequity in India's Vaccination Response}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
The COVID-19 pandemic has so far accounted for a reported 5.5M deaths worldwide, with 8.7% of these coming from India. The pandemic exacerbated the weakness of the Indian healthcare system. As of January 20, 2022, India is the second worst affected country with 38.2M reported cases and 487K deaths. According to epidemiologists, vaccines are an essential tool to prevent the spread of the pandemic. India's vaccination drive began on January 16, 2021 with governmental policies being introduced to prioritize different populations of the society. Through the course of the vaccination drive, multiple new policies were also introduced to ensure that vaccines are readily available and vaccination coverage is increased. However, at the same time, some of the government policies introduced led to unintended inequities in the populations being targeted. In this report, we enumerate and analyze the inequities that existed in India's vaccination policy drive, and also compute the effect of the new policies that were introduced. We analyze these potential inequities not only qualitatively but also quantitatively by leveraging the data that was made available through the government portals. Specifically, (a) we discover inequities that might exist in the policies, (b) we quantify the effect of new policies introduced to increase vaccination coverage, and (c) we also point out the data discrepancies that exist across different data sources.
GAME-ON: Graph Attention Network based Multimodal Fusion for Fake News Detection
Mudit Dhawan,Shakshi Sharma,Kadam Aditya Santosh,Rajesh Sharma,Ponnurangam Kumaraguru
Technical Report, arXiv, 2022
@inproceedings{bib_GAME_2022, AUTHOR = {Mudit Dhawan, Shakshi Sharma, Kadam Aditya Santosh, Rajesh Sharma, Ponnurangam Kumaraguru}, TITLE = {GAME-ON: Graph Attention Network based Multimodal Fusion for Fake News Detection}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Social media in present times has a significant and growing influence. Fake news being spread on these platforms has a disruptive and damaging impact on our lives. Furthermore, as multimedia content improves the visibility of posts more than text data, it has been observed that multimedia is often used for creating fake content. A plethora of previous multimodal-based work has tried to address the problem of modeling heterogeneous modalities in identifying fake content. However, these works have the following limitations: (1) inefficient encoding of inter-modal relations by utilizing a simple concatenation operator on the modalities at a later stage in a model, which might result in information loss; (2) training very deep neural networks with a disproportionate number of parameters on small but complex real-life multimodal datasets results in higher chances of overfitting. To address these limitations, we propose GAME-ON, a Graph Neural Network based end-to-end trainable framework that allows granular interactions within and across different modalities to learn more robust data representations for multimodal fake news detection. We use two publicly available fake news datasets, Twitter and Weibo, for evaluations. Our model outperforms on Twitter by an average of 11% and keeps competitive performance on Weibo, within a 2.6% margin, while using 65% fewer parameters than the best comparable state-of-the-art baseline.
Differential privacy: a privacy cloak for preserving utility in heterogeneous datasets
Saurabh Gupta,Arun Balaji Buduru,Ponnurangam Kumaraguru
CSI Transactions on ICT, CSIT, 2022
@inproceedings{bib_Diff_2022, AUTHOR = {Saurabh Gupta, Arun Balaji Buduru, Ponnurangam Kumaraguru}, TITLE = {Differential privacy: a privacy cloak for preserving utility in heterogeneous datasets}, BOOKTITLE = {CSI Transactions on ICT}. YEAR = {2022}}
Data has become an integral part of day-to-day human life. Users leave behind a trail of digital footprints that includes their personal and non-personal information. A normal user puts 1.7 megabytes of data every second into the hands of service providers and trusts them to keep it safe. However, researchers have found that in the name of improving the quality of service, the service providers, knowingly or accidentally, put users' personal information at risk of getting into the hands of an adversary. The service providers usually apply masking or anonymization before releasing the users' data. Anonymization techniques do not guarantee privacy preservation and are proven to be prone to cross-linking attacks. In the past, researchers were able to successfully cross-link multiple datasets to leak the sensitive information of various users. Cross-linking attacks are always possible on anonymized datasets, and therefore, service providers must use a technique that guarantees privacy preservation. Differential privacy is superior for publishing sensitive information while protecting privacy. It provides mathematical guarantees and prevents background knowledge attacks such that information remains private regardless of whatever information an adversary might have. This paper discusses how differential privacy can help achieve privacy guarantees for the release of sensitive heterogeneous datasets while preserving their utility.
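To make the differential-privacy guarantee discussed above concrete, here is a minimal sketch of the classic Laplace mechanism for a counting query (sensitivity 1). The data and the epsilon value are illustrative and are not tied to any specific release described in the paper.

```python
import numpy as np

def dp_count(data, predicate, epsilon):
    """Release a count with epsilon-differential privacy via Laplace noise."""
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)  # sensitivity of a count is 1
    return true_count + noise

ages = [23, 35, 41, 29, 52, 61, 38, 27]                      # toy dataset
print("noisy count of users over 30:", dp_count(ages, lambda a: a > 30, epsilon=0.5))
```

Smaller epsilon means more noise and stronger privacy; the utility-privacy trade-off the abstract alludes to is governed by this single parameter.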
FakeNewsIndia: A benchmark dataset of fake news incidents in India, collection methodology and impact assessment in social media
Apoorva Dhawan,Malvika Bhalla,Deeksha Arora,Rishabh Kaushal,Ponnurangam Kumaraguru
Computer Communications, CC, 2022
@inproceedings{bib_Fake_2022, AUTHOR = {Apoorva Dhawan, Malvika Bhalla, Deeksha Arora, Rishabh Kaushal, Ponnurangam Kumaraguru}, TITLE = {FakeNewsIndia: A benchmark dataset of fake news incidents in India, collection methodology and impact assessment in social media}, BOOKTITLE = {Computer Communications}. YEAR = {2022}}
Online Social Media platforms (OSMs) have become an essential source of information. The high speed at which OSM users submit data makes moderation extremely hard. Consequently, besides offering online networking to users, the OSMs have also become carriers for spreading fake news. Knowingly or unknowingly, users circulate fake news on OSMs, adversely affecting an individual's offline activity. To counter fake news, several dedicated websites (referred to as fact-checkers) have sprung up whose sole purpose is to identify and report fake news incidents. There are well-known datasets of fake news; however, not much work has been done regarding credible datasets of fake news in India. Therefore, in this work, we design an automated data collection pipeline to collect fake news incidents reported by fact-checkers. We gather 4,803 fake news incidents from June 2016 to December 2019 reported by six popular fact-checking websites in India and make this dataset (FakeNewsIndia) available to the research community. We find 5,031 tweets on Twitter and 866 videos on YouTube mentioned in these 4,803 fake news incidents. Further, we evaluate the impact of fake news incidents on the two prominent OSM platforms, namely, Twitter and YouTube. We use popularity metrics based on engagement rate and likes ratio to measure impact and categorize impact into three levels: low, medium, and high. Our learning models use features extracted from text, images, and videos present in the fake news incident articles written by fact-checking websites. Experiments show that we can predict the impact (popularity) of videos (appearing in fake news incident articles) on YouTube more accurately (with baseline accuracy ranging from 86% to 92%) than the impact (popularity) of tweets on Twitter (with baseline accuracy of 37% to 41%). Building more intelligent models that predict the impact of tweets appearing in fact-checking incident articles on Twitter remains future work.
HashSet--A Dataset For Hashtag Segmentation
Kodali Prashant,Akshala Bhatnagar,Naman Ahuja,Manish Srivastava,Ponnurangam Kumaraguru
Technical Report, arXiv, 2022
@inproceedings{bib_Hash_2022, AUTHOR = {Kodali Prashant, Akshala Bhatnagar, Naman Ahuja, Manish Srivastava, Ponnurangam Kumaraguru}, TITLE = {HashSet--A Dataset For Hashtag Segmentation}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information like topic and sentiment, which are useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways: transliterating and mixing languages, spelling variations, creative named entities. Benchmark datasets used for the hashtag segmentation task, STAN and BOUN, are small in size and extracted from a single set of tweets. However, datasets should reflect the variations in writing styles of hashtags and also account for domain and language specificity, failing which the results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and datasets should be carefully curated. To this end, we propose HashSet, a dataset comprising: a) a 1.9k manually annotated dataset; b) a 3.3M loosely supervised dataset. The HashSet dataset is sampled from a different set of tweets when compared to existing datasets and provides an alternate distribution of hashtags to build and validate hashtag segmentation models. We show that the performance of SOTA models for hashtag segmentation drops substantially on the proposed dataset, indicating that it provides an alternate set of hashtags to train and assess models. Datasets and results are released publicly and can be accessed.
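For readers unfamiliar with the task defined above, the toy dynamic program below segments a hashtag against a tiny hand-picked vocabulary. Real segmenters, including those benchmarked on HashSet, use far larger lexicons or neural scorers; the vocabulary and examples here are purely illustrative.

```python
def segment(hashtag, vocab):
    """Return one segmentation of `hashtag` into vocabulary words, or None."""
    text = hashtag.lstrip("#").lower()
    best = [None] * (len(text) + 1)   # best[i] = a segmentation of text[:i]
    best[0] = []
    for i in range(1, len(text) + 1):
        for j in range(i):
            word = text[j:i]
            if best[i] is None and best[j] is not None and word in vocab:
                best[i] = best[j] + [word]
    return best[-1]

vocab = {"koo", "vs", "twitter", "corona", "jihad"}
print(segment("#KooVsTwitter", vocab))   # ['koo', 'vs', 'twitter']
print(segment("#CoronaJihad", vocab))    # ['corona', 'jihad']
```

The hard cases HashSet targets (transliteration, code-mixing, spelling variation, named entities) are exactly the ones a fixed vocabulary like this cannot handle.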
Spy the Lie: Fraudulent Jobs Detection in Recruitment Domain using Knowledge Graphs
Ponnurangam Kumaraguru,Nidhi Goyal,Niharika Sachdeva
International Conference on Knowledge Science, Engineering and Management, KSEM, 2021
@inproceedings{bib_Spy__2021, AUTHOR = {Ponnurangam Kumaraguru, Nidhi Goyal, Niharika Sachdeva}, TITLE = {Spy the Lie: Fraudulent Jobs Detection in Recruitment Domain using Knowledge Graphs}, BOOKTITLE = {International Conference on Knowledge Science, Engineering and Management}. YEAR = {2021}}
Fraudulent jobs are an emerging threat on online recruitment platforms such as LinkedIn and Glassdoor. Fraudulent job postings affect the platform's trustworthiness and have a negative impact on user experience. Therefore, these platforms need to detect and remove these fraudulent jobs. Generally, fraudulent job postings contain untenable facts about domain-specific entities, such as mismatches in skills, industries, offered compensation, etc. However, existing approaches focus on studying writing styles, linguistics, and context-based features, and ignore the relationships among domain-specific entities. To bridge this gap, we propose an approach based on a Knowledge Graph (KG) of domain-specific entities to detect fraudulent jobs. In this paper, we present a novel multi-tier end-to-end framework called FRaudulent Jobs Detection (FRJD) Engine, which considers a) a fact validation module using KGs, b) a contextual module using deep neural networks, and c) a meta-data module to capture the semantics of job postings. We conduct our experiments using a fact validation dataset containing 4 million facts extracted from job postings. Extensive evaluation shows that FRJD yields a 0.96 F1-score on a curated dataset of 157,880 job postings. Finally, we provide insights on the performance of different fact-checking algorithms on the recruitment domain.
"A Virus Has No Religion": Analyzing Islamophobia on Twitter During the COVID-19 Outbreak
Mohit Chandra,Manvith Muthukuru Reddy,Shradha Sehgal,Saurabh Gupta,Arun Balaji Buduru,Ponnurangam Kumaraguru
ACM Conference on Hypertext and Social Media, HT&SM, 2021
@inproceedings{bib_"A_V_2021, AUTHOR = {Mohit Chandra, Manvith Muthukuru Reddy, Shradha Sehgal, Saurabh Gupta, Arun Balaji Buduru, Ponnurangam Kumaraguru}, TITLE = {"A Virus Has No Religion": Analyzing Islamophobia on Twitter During the COVID-19 Outbreak}, BOOKTITLE = {ACM Conference on Hypertext and Social Media}. YEAR = {2021}}
The COVID-19 pandemic has disrupted people's lives, driving them to act in fear, anxiety, and anger, leading to worldwide racist events in the physical world and online social networks. Though there are works focusing on Sinophobia during the COVID-19 pandemic, less attention has been given to the recent surge in Islamophobia. A large number of positive cases arising out of the religious Tablighi Jamaat gathering has driven people towards forming anti-Muslim communities around hashtags like #coronajihad and #tablighijamaatvirus on Twitter. In addition to the online spaces, the rise in Islamophobia has also resulted in increased hate crimes in the real world. Hence, an investigation is required to create interventions. To the best of our knowledge, we present the first large-scale quantitative study linking …
“Subverting the Jewtocracy”: Online Antisemitism Detection Using Multimodal Deep Learning
Mohit Chandra,Dheeraj Pailla,Himanshu Bhatia,Aadilmehdi J Sanchawala,Manish Gupta,Manish Shrivastava,Ponnurangam Kumaraguru
WEB SCIENCE, WEBSCI, 2021
@inproceedings{bib_“S_2021, AUTHOR = {Mohit Chandra, Dheeraj Pailla, Himanshu Bhatia, Aadilmehdi J Sanchawala, Manish Gupta, Manish Shrivastava, Ponnurangam Kumaraguru}, TITLE = {“Subverting the Jewtocracy”: Online Antisemitism Detection Using Multimodal Deep Learning}, BOOKTITLE = {WEB SCIENCE}. YEAR = {2021}}
The exponential rise of online social media has enabled the creation, distribution, and consumption of information at an unprecedented rate. However, it has also led to the burgeoning of various forms of online abuse. Increasing cases of online antisemitism have become one of the major concerns because of its socio-political consequences. Unlike other major forms of online abuse like racism, sexism, etc., online antisemitism has not been studied much from a machine learning perspective. To the best of our knowledge, we present the first work in the direction of automated multimodal detection of online antisemitism. The task poses multiple challenges that include extracting signals across multiple modalities, contextual references, and handling multiple aspects of antisemitism. Unfortunately, there does not exist any publicly available benchmark corpus for this critical task. Hence, we collect and label two datasets with 3,102 and 3,509 social media posts from Twitter and Gab respectively. Further, we present a multimodal deep learning system that detects the presence of antisemitic content and its specific antisemitism category using text and images from posts. We perform an extensive set of experiments on the two datasets to evaluate the efficacy of the proposed system. Finally, we also present a qualitative analysis of our study.
What's kooking?: characterizing India's emerging social network, Koo
Asmit Kumar Singh,Chirag Jain,Jivitesh Jain,Rishi Raj Jain,Shradha Sehgal,Tanisha Pandey,Ponnurangam Kumaraguru
KNOWLEDGE DISCOVERY AND DATA MINING, KDD, 2021
@inproceedings{bib_What_2021, AUTHOR = {Asmit Kumar Singh, Chirag Jain, Jivitesh Jain, Rishi Raj Jain, Shradha Sehgal, Tanisha Pandey, Ponnurangam Kumaraguru}, TITLE = {What's kooking?: characterizing India's emerging social network, Koo}, BOOKTITLE = {KNOWLEDGE DISCOVERY AND DATA MINING}. YEAR = {2021}}
Social media has grown exponentially in a short period, coming to the forefront of communications and online interactions. Despite their rapid growth, social media platforms have been unable to scale to different languages globally and remain inaccessible to many. In this paper, we characterize Koo, a multilingual micro-blogging site that rose in popularity in 2021, as an Indian alternative to Twitter. We collected a dataset of 4.07 million users, 163.12 million follower-following relationships, and their content and activity across 12 languages. We study the user demographic along the lines of language, location, gender, and profession. The prominent presence of Indian languages in the discourse on Koo indicates the platform's success in promoting regional languages. We observe Koo's follower-following network to be much denser than Twitter's, comprising of closely-knit linguistic communities. An N-gram analysis of posts on Koo shows a #KooVsTwitter rhetoric, revealing the debate comparing the two platforms. Our characterization highlights the dynamics of the multilingual social network and its diverse Indian user base.
Psychometric Analysis and Coupling of Emotions Between State Bulletins and Twitter in India During COVID-19 Infodemic
Palash Aggrawal,Baani Leen Kaur Jolly,Amogh Gulati,Amarjit Sethi,Tavpritesh Sethi,Ponnurangam Kumaraguru
Frontiers in Communication, FIC, 2021
@inproceedings{bib_Psyc_2021, AUTHOR = {Palash Aggrawal, Baani Leen Kaur Jolly, Amogh Gulati, Amarjit Sethi, Tavpritesh Sethi, Ponnurangam Kumaraguru}, TITLE = {Psychometric Analysis and Coupling of Emotions Between State Bulletins and Twitter in India During COVID-19 Infodemic}, BOOKTITLE = {Frontiers in Communication}. YEAR = {2021}}
The COVID-19 infodemic has been spreading faster than the pandemic itself. The misinformation riding upon the infodemic wave poses a major threat to people's health and governance systems. Managing this infodemic not only requires mitigating misinformation but also an early understanding of underlying psychological patterns. In this study, we present a novel epidemic response management strategy. We analyze the psychometric impact and coupling of the COVID-19 infodemic with official COVID-19 bulletins at the national and state level in India. We looked at them from the psycholinguistic lens of emotions and quantified the extent and coupling between them. We modified Empath, a deep skipgram-based lexicon builder, for effective capture of health-related emotions. Using this, we analyzed the lead-lag relationships between the time evolution of these emotions in social media and official bulletins using Granger causality. It showed that state bulletins led social media for some emotions such as Medical Emergency. In contrast, social media led the government bulletins for some topics such as hygiene, government, fun, and leisure. Further insights potentially relevant for policymakers and communicators engaged in mitigating misinformation are also discussed. We also introduce CoronaIndiaDataset, the first social-media-based Indian COVID-19 dataset at the national and state levels, with over 5.6 million national and 2.6 million state-level tweets for the first wave of COVID-19 in India and 1.2 million national tweets for the second wave of COVID-19 in India.
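A minimal sketch of the lead-lag test mentioned above, assuming the statsmodels implementation of the Granger-causality test; the two series are synthetic stand-ins for a bulletin-derived and a tweet-derived emotion time series, not the paper's data.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(42)
bulletin = rng.normal(size=120).cumsum()                          # stand-in bulletin emotion series
social = np.roll(bulletin, 2) + rng.normal(scale=0.5, size=120)   # lags the bulletin by 2 steps

df = pd.DataFrame({"social": social, "bulletin": bulletin}).iloc[5:]  # drop wrap-around rows
# Tests H0: the second column ('bulletin') does NOT Granger-cause the first ('social').
results = grangercausalitytests(df[["social", "bulletin"]], maxlag=4)
```

Running the test in both directions (bulletin leading social, and social leading bulletin) is what lets one say which signal leads for a given emotion, as the abstract reports.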
Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring
Sezal Chug,Priya Kaushal,Ponnurangam Kumaraguru,Tavpritesh Sethi
Technical Report, arXiv, 2021
@inproceedings{bib_Stat_2021, AUTHOR = {Sezal Chug, Priya Kaushal, Ponnurangam Kumaraguru, Tavpritesh Sethi}, TITLE = {Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
Data is expanding at an unimaginable rate, and with this development comes the responsibility for the quality of data. Data quality refers to the relevance of the information present and helps in various operations like decision making and planning in a particular organization. Data quality is mostly measured on an ad hoc basis, and hence none of the developed concepts provide any practical application.
Tweet-scan-post: a system for analysis of sensitive private data disclosure in online social media
R. Geetha,S. Karthika,Ponnurangam Kumaraguru
Knowledge and Information Systems, KAIS, 2021
@inproceedings{bib_Twee_2021, AUTHOR = {R. Geetha, S. Karthika, Ponnurangam Kumaraguru}, TITLE = {Tweet-scan-post: a system for analysis of sensitive private data disclosure in online social media}, BOOKTITLE = {Knowledge and Information Systems}. YEAR = {2021}}
Social media technologies are open to users who are interested in creating a community and publishing their opinions on recent incidents. The participants of online social networking sites remain ignorant of the criticality of disclosing personal data to the public audience. The private data of users are at high risk, leading to many adverse effects like cyberbullying, identity theft, and job loss. This research work aims to define user entities or data like phone number, email address, family details, and health-related information as a user's sensitive private data (SPD) on a social media platform. The proposed system, Tweet-Scan-Post (TSP), is mainly focused on identifying the presence of SPD in users' posts under personal, professional, and health domains. The TSP framework is built based on the standards and privacy regulations established by social networking sites and organizations like NIST, DHS …
Influence of NaMo App on Twitter
Shreya Sharma,Samiya Caur,Hitkul,Ponnurangam Kumaraguru
Technical Report, arXiv, 2021
@inproceedings{bib_Infl_2021, AUTHOR = {SHREYA SHARMA, SAMIYA CAUR, HITKUL, Ponnurangam Kumaraguru}, TITLE = {Influence of NaMo App on Twitter}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
Social media plays a crucial role in today's society. It results in paradigm changes in how people relate and communicate, convey and exchange ideas. Moreover, social media has evolved into critical knowledge networks for consumers and also affects decision-making. In elections, social media became an integral part of political campaigning to reach a greater audience and gather more support. The 2019 Lok Sabha election saw a massive spike in the usage of online social media platforms such as Twitter, Facebook, and WhatsApp, with every major political party launching its own organized social media campaigns by investing vast amounts of money. Without limiting themselves to mainstream media, almost all Indian political leaders started using social media, especially Facebook and Twitter, to express themselves. In 2014, the Bhartiya Janta Party (BJP) took one step ahead in organizing its campaign by launching its own app, the NaMo App. We focus our research on Twitter and the NaMo App during the 2019 Lok Sabha elections and the Citizenship Amendment Act (CAA) protests. Twitter is one such platform where every individual can express their views and is not biased. In contrast, the NaMo App is one of the first apps centered around a specific political party; it acted as a digital medium for the BJP to organize its political campaign and sway people's opinion in its favor. This research aims to characterize the role of the NaMo App in a more traditional network such as Twitter in shaping political discourse, and studies the existence of an online echo chamber on Twitter. We began by analyzing the amount and type of content shared using the NaMo App on Twitter. We performed content and network analysis for the existence of the echo chamber. We also applied the Hawkes process to estimate the influence that the NaMo App has on Twitter. Through this research, we conclude that the users who share content using the NaMo App may be part of an online echo chamber and are likely to be BJP workers. We show that the reach and influence of the NaMo App on Twitter is significantly limited, indicating its inability to break through to a diverse audience and change the narrative on Twitter.
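To give a feel for the Hawkes-process analysis mentioned above, the sketch below evaluates an exponential-kernel Hawkes intensity excited by a set of event timestamps. The parameters and the stand-in "NaMo App share" event times are invented, and the paper's actual fitting procedure is not reproduced here.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.5):
    """lambda(t) = mu + alpha * sum_{t_i < t} beta * exp(-beta * (t - t_i))."""
    past = history[history < t]
    return mu + alpha * np.sum(beta * np.exp(-beta * (t - past)))

namo_shares = np.array([1.0, 1.2, 3.5, 3.6, 7.0])   # stand-in event times (hours)
for t in np.linspace(0.0, 10.0, 6):
    print(f"t={t:4.1f}h  intensity={hawkes_intensity(t, namo_shares):.3f}")
```

In an influence analysis, a large fitted excitation (alpha) from one event stream onto another is read as that stream driving subsequent activity; a small one, as limited influence.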
KCNet: Kernel-based Canonicalization Network for entities in Recruitment Domain
Nidhi Goyal,Niharika Sachdeva,Anmol Goel,Jushaan Singh Kalra,Ponnurangam Kumaraguru
International Conference on Artificial Neural Networks, ICANN, 2021
@inproceedings{bib_KCNe_2021, AUTHOR = {Nidhi Goyal, Niharika Sachdeva, Anmol Goel, Jushaan Singh Kalra, Ponnurangam Kumaraguru}, TITLE = {KCNet: Kernel-based Canonicalization Network for entities in Recruitment Domain}, BOOKTITLE = {International Conference on Artificial Neural Networks}. YEAR = {2021}}
Online recruitment platforms have abundant user-generated content in the form of job postings and candidate and company profiles. This content, when ingested into knowledge bases, results in redundant, ambiguous, and noisy entities. These multiple (non-standardized) representations of the entities deteriorate the performance of downstream tasks such as job recommender systems, search systems, and question answering, making it imperative to canonicalize the entities to improve the performance of such tasks. Recent research discusses either statistical similarity measures or deep learning methods like word-embedding or siamese network-based representations for canonicalization. In this paper, we propose a Kernel-based Canonicalization Network (KCNet) that outperforms all the known statistical and deep learning methods. We also show that the use of side information, such as industry type and website URLs, further enhances the performance of the proposed method. Our experiments on 351,600 entities (companies, institutes, skills, and designations) from a popular online recruitment platform demonstrate that the proposed method improves the overall F1-score by 23% compared to the previous baselines, which results in coherent clusters of unique entities.
Efficient Representation of Interaction Patterns with Hyperbolic Hierarchical Clustering for Classification of Users on Twitter
Tanvi Karandikar,Puppala Avinash Prabhu,Avinash Tulasi,Arun Balaji Buduru,Ponnurangam Kumaraguru
International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, WI, 2021
@inproceedings{bib_Effi_2021, AUTHOR = {Tanvi Karandikar, Puppala Avinash Prabhu, Avinash Tulasi, Arun Balaji Buduru, Ponnurangam Kumaraguru}, TITLE = {Efficient Representation of Interaction Patterns with Hyperbolic Hierarchical Clustering for Classification of Users on Twitter}, BOOKTITLE = {International Joint Conferences on Web Intelligence and Intelligent Agent Technologies}. YEAR = {2021}}
Social media platforms play an important role in democratic processes. During the 2019 General Elections of India, political parties and politicians widely used Twitter to share their ideals, advocate their agenda and gain popularity. Twitter served as a ground for journalists, politicians and voters to interact. The organic nature of these interactions can be upended by malicious accounts on Twitter, which end up being suspended or deleted from the platform. Such accounts aim to modify the reach of content by inorganically interacting with particular handles. These interactions are a threat to the integrity of the platform, as such activity has the potential to affect entire results of democratic processes. In this work, we design a feature extraction framework which compactly captures potentially insidious interaction patterns. Our proposed features are designed to bring out communities amongst the users that work to boost the content of particular accounts. We use Hyperbolic Hierarchical Clustering (HypHC) which represents the features in the hyperbolic manifold to further separate such communities. HypHC gives the added benefit of representing these features in a lower dimensional space -- thus serving as a dimensionality reduction technique. We use these features to distinguish between different classes of users that emerged in the aftermath of the 2019 General Elections of India. Amongst the users active on Twitter during the elections, 2.8% of the users participating were suspended and 1% of the users were deleted from the platform. We demonstrate the effectiveness of our proposed features in differentiating between regular users (users who …
Diagnosing Web Data of ICTs to Provide Focused Assistance in Agricultural Adoptions
Ashwin Singh,Mallika Subramanian,Anmol Agarwal,Pratyush Pratap Priyadarshi,Shrey Gupta,Kiran Garimella,Sanjeev Kumar,Lokesh Garg,Ponnurangam Kumaraguru
Technical Report, arXiv, 2021
@inproceedings{bib_Diag_2021, AUTHOR = {Ashwin Singh, Mallika Subramanian, Anmol Agarwal, Pratyush Pratap Priyadarshi, Shrey Gupta, Kiran Garimella, Sanjeev Kumar, Lokesh Garg, Ponnurangam Kumaraguru}, TITLE = {Diagnosing Web Data of ICTs to Provide Focused Assistance in Agricultural Adoptions}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
The past decade has witnessed a rapid increase in technology ownership across rural areas of India, signifying the potential for ICT initiatives to empower rural households. In our work, we focus on the web infrastructure of one such ICT - Digital Green that started in 2008. Following a participatory approach for content production, Digital Green disseminates instructional agricultural videos to smallholder farmers via human mediators to improve the adoption of farming practices. Their web-based data tracker, CoCo, captures data related to these processes, storing the attendance and adoption logs of over 2.3 million farmers across three continents and twelve countries. Using this data, we model the components of the Digital Green ecosystem involving the past attendance-adoption behaviours of farmers, the content of the videos screened to them and their demographic features across five states in India. We use statistical tests to identify different factors which distinguish farmers with higher adoption rates to understand why they adopt more than others. Our research finds that farmers with higher adoption rates adopt videos of shorter duration and belong to smaller villages. The co-attendance and co-adoption networks of farmers indicate that they greatly benefit from past adopters of a video from their village and group when it comes to adopting practices from the same video. Following our analysis, we model the adoption of practices from a video as a prediction problem to identify and assist farmers who might face challenges in adoption in each of the five states. We experiment with different model architectures and achieve macro-f1 scores
What's kooking? characterizing India's emerging social network, Koo
Asmit Kumar Singh,Chirag Jain,Jivitesh Jain, Rishi Raj Jain,Shradha Sehgal,Tanisha Pandey,Ponnurangam Kumaraguru
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2021
@inproceedings{bib_What_2021, AUTHOR = {Asmit Kumar Singh, Chirag Jain, Jivitesh Jain, Rishi Raj Jain, Shradha Sehgal, Tanisha Pandey, Ponnurangam Kumaraguru}, TITLE = {What's kooking? characterizing India's emerging social network, Koo}, BOOKTITLE = {IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2021}}
Social media has grown exponentially in a short period, coming to the forefront of communications and online interactions. Despite their rapid growth, social media platforms have been unable to scale to different languages globally and remain inaccessible to many. In this paper, we characterize Koo, a multilingual micro-blogging site that rose in popularity in 2021 as an Indian alternative to Twitter. We collected a dataset of 4.07 million users, 163.12 million follower-following relationships, and their content and activity across 12 languages. We study user demographics along the lines of language, location, gender, and profession. The prominent presence of Indian languages in the discourse on Koo indicates the platform's success in promoting regional languages. We observe Koo's follower-following network to be much denser than Twitter's, comprising closely-knit linguistic communities. An N-gram analysis …
Truth and travesty intertwined: a case study of #SSR counterpublic campaign
Kumari Neha,Tushar Mohan,Arun Balaji Buduru,Ponnurangam Kumaraguru
IEEE International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2021
@inproceedings{bib_Trut_2021, AUTHOR = {Kumari Neha, Tushar Mohan, Arun Balaji Buduru, Ponnurangam Kumaraguru}, TITLE = {Truth and travesty intertwined: a case study of #SSR counterpublic campaign}, BOOKTITLE = {IEEE International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2021}}
Twitter has emerged as a prominent social media platform for activism and counterpublic narratives. Counterpublics leverage hashtags to build a diverse support network and share content on a global platform that counters the dominant narrative. This paper applies the framework of connective action to the counter-narrative campaign over the cause of death of #SushantSinghRajput. We combine descriptive network, modularity, and hashtag-based topical analysis to identify three major mechanisms underlying the campaign: generative role taking, hashtag-based narratives, and the formation of an alignment network towards a common cause. Using the case study of #SushantSinghRajput, we highlight how the connective action framework can be used to identify different strategies adopted by counterpublics for the emergence of connective action.
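The modularity component of such an analysis can be sketched with networkx. The graph below is a stock example standing in for a retweet/mention network, and greedy modularity maximization is one standard choice rather than necessarily the exact algorithm used in the paper.

```python
# Sketch of a modularity-based community step on a placeholder graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.karate_club_graph()  # stand-in for a retweet/mention network
communities = greedy_modularity_communities(G)
print("communities:", len(communities))
print("modularity:", modularity(G, communities))
```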
I'll be back: Examining Restored Accounts On Twitter
Arnav Kapoor,Puppala Avinash Prabhu,Ponnurangam Kumaraguru
Technical Report, arXiv, 2021
@inproceedings{bib_I'll_2021, AUTHOR = {Arnav Kapoor, Puppala Avinash Prabhu, Ponnurangam Kumaraguru}, TITLE = {I'll be back: Examining Restored Accounts On Twitter}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
Online social networks like Twitter actively monitor their platform to identify accounts that go against their rules. Twitter enforces account-level moderation, i.e. suspension of a Twitter account in severe cases of platform abuse. A point of note is that these suspensions are sometimes temporary and even incorrect. Twitter provides a redressal mechanism to 'restore' suspended accounts. We refer to all suspended accounts who later have their suspension reversed as 'restored accounts'. In this paper, we release the first-ever dataset and methodology to identify restored accounts. We inspect account properties and tweets of these restored accounts to get key insights into the effects of suspension. We build a prediction model to classify an account into normal, suspended or restored. We use SHAP (SHapley Additive exPlanations), a method to explain individual predictions, to interpret this model and identify important features. We show that profile features like the date of account creation and the ratio of retweets to total tweets are more important than content-based features like sentiment scores and Ekman emotion scores when it comes to classification of an account as normal, suspended or restored. We investigate restored accounts further in the pre-suspension and post-restoration phases. We see that the number of tweets per account drops by 53.95% in the post-restoration phase, signifying less 'spammy' behaviour after reversal of suspension. However, there was no substantial difference in the content of the tweets posted in the pre-suspension and post-restoration phases.
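A minimal sketch of the SHAP-based interpretation step is shown below, assuming a tree-based classifier and a placeholder feature matrix rather than the paper's exact profile and content features.

```python
# Sketch of the SHAP interpretation step: fit a tree-based classifier on
# placeholder features for the normal/suspended/restored task, then rank
# features by mean absolute SHAP value.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder features; the paper uses profile features (account age, retweet
# ratio, ...) and content features (sentiment, Ekman emotion scores, ...).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
sv = np.asarray(explainer.shap_values(X))  # array layout differs across shap versions
# Average |SHAP| over every axis except the feature axis (the only axis of size 10 here).
feat_axis = [ax for ax, size in enumerate(sv.shape) if size == X.shape[1]][0]
importance = np.abs(sv).mean(axis=tuple(ax for ax in range(sv.ndim) if ax != feat_axis))
print("features ranked by mean |SHAP| value:", np.argsort(importance)[::-1])
```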
Inter-modality Discordance for Multimodal Fake News Detection
Shivangi Singhal,Mudit Dhawan,Rajiv R Shah,Ponnurangam Kumaraguru
ACM Multimedia Asia, MM Asia, 2021
@inproceedings{bib_Inte_2021, AUTHOR = {Shivangi Singhal, Mudit Dhawan, Rajiv R Shah, Ponnurangam Kumaraguru}, TITLE = {Inter-modality Discordance for Multimodal Fake News Detection}, BOOKTITLE = {ACM Multimedia Asia}. YEAR = {2021}}
The paradigm shift in the consumption of news via online platforms has cultivated the growth of digital journalism. Contrary to traditional media, lowering entry barriers and enabling everyone to be part of content creation have disabled the concept of centralized gatekeeping in digital journalism. This in turn has triggered the production of fake news. Current studies have made a significant effort towards multimodal fake news detection with less emphasis on exploring the discordance between the different multimedia present in a news article. We hypothesize that fabrication of either modality will lead to dissonance between the modalities, resulting in misrepresented, misinterpreted and misleading news. In this paper, we inspect the authenticity of news coming from online media outlets by exploiting the relationship (discordance) between the textual and multiple visual cues. We develop an inter-modality discordance based fake news detection framework to achieve this goal. The modality-specific discriminative features are learned by employing the cross-entropy loss and a modified version of contrastive loss that explores the inter-modality discordance. To the best of our knowledge, this is the first work that leverages information from different components of the news article (i.e., headline, body, and multiple images) for multimodal fake news detection. We conduct extensive experiments on real-world datasets to show that our approach outperforms the state-of-the-art by an average F1-score of 6.3%.
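To make the loss design concrete, here is an illustrative discordance-style loss in PyTorch. It is not the paper's exact modified contrastive loss: it simply rewards agreement between text and image embeddings for real news, penalizes agreement above a margin for fake news, and is added to the usual cross-entropy term.

```python
# Illustrative (not the paper's exact) inter-modality discordance loss.
import torch
import torch.nn.functional as F

def discordance_loss(text_emb, image_emb, labels, margin=0.5):
    """labels: 1 = fake, 0 = real; embeddings: (batch, dim)."""
    cos = F.cosine_similarity(text_emb, image_emb)            # (batch,)
    real_term = (1 - labels) * (1 - cos)                       # real: modalities should agree
    fake_term = labels * torch.clamp(cos - margin, min=0.0)    # fake: penalize high agreement
    return (real_term + fake_term).mean()

# Toy usage with random tensors standing in for encoder outputs.
text_emb, image_emb = torch.randn(8, 128), torch.randn(8, 128)
labels = torch.randint(0, 2, (8,)).float()
logits = torch.randn(8, 2)                                     # from a classification head
total = F.cross_entropy(logits, labels.long()) + discordance_loss(text_emb, image_emb, labels)
print(total.item())
```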
“A Virus Has No Religion”: Analyzing Islamophobia on Twitter During the COVID-19 Outbreak
Mohit Chandra,Manvith Muthukuru Reddy,Shradha Sehgal,Saurabh Gupta,Arun Balaji Buduru,Ponnurangam Kumaraguru
ACM Conference on Hypertext and Social Media, HT&SM, 2021
@inproceedings{bib_“A_2021, AUTHOR = {Mohit Chandra, Manvith Muthukuru Reddy, Shradha Sehgal, Saurabh Gupta, Arun Balaji Buduru, Ponnurangam Kumaraguru}, TITLE = {“A Virus Has No Religion”: Analyzing Islamophobia on Twitter During the COVID-19 Outbreak}, BOOKTITLE = {ACM Conference on Hypertext and Social Media}. YEAR = {2021}}
CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences
Devansh Gautam,Kodali Prashant,Kshitij Gupta,Anmol Goel,Manish Srivastava,Ponnurangam Kumaraguru
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2021
@inproceedings{bib_CoMe_2021, AUTHOR = {Devansh Gautam, Kodali Prashant, Kshitij Gupta, Anmol Goel, Manish Srivastava, Ponnurangam Kumaraguru}, TITLE = {CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2021}}
Code-mixed languages are very popular in multilingual societies around the world, yet the resources needed to build robust systems for such languages lag behind. A major contributing factor is the informal nature of these languages, which makes it difficult to collect code-mixed data. In this paper, we propose our system for Task 1 of CALCS 2021: building a machine translation system from English to Hinglish in a supervised setting. Translating in the given direction can help expand the set of resources for several tasks by translating valuable datasets from high-resource languages. We propose to use mBART, a pre-trained multilingual sequence-to-sequence model, and fully utilize its pre-training by transliterating the roman Hindi words in the code-mixed sentences to Devanagari script. We evaluate how expanding the input by concatenating Hindi translations of the English sentences improves mBART's performance. Our system achieves a BLEU score of 12.22 on the test set. Further, we perform a detailed error analysis of our proposed systems and explore the limitations of the provided dataset and metrics.
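The sketch below shows the generic mBART translation interface from Hugging Face Transformers. The checkpoint and target language code are stand-ins (an off-the-shelf many-to-many model with hi_IN as the target); the paper's actual system fine-tunes mBART for English-to-Hinglish and adds the Devanagari transliteration preprocessing described above.

```python
# Generic mBART translation interface; checkpoint and target language are
# stand-ins, not the fine-tuned English -> Hinglish system from the paper.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(name)
model = MBartForConditionalGeneration.from_pretrained(name)

tokenizer.src_lang = "en_XX"
inputs = tokenizer("The weather is lovely today.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"],
    max_length=48,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```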
AbuseAnalyzer: Abuse Detection, Severity and Target Prediction for Gab Posts
Mohit Chandra,Ashwin Pathak,Eesha Dutta,Paryul Jain,Manish,Manish Srivastava,Ponnurangam Kumaraguru
Technical Report, arXiv, 2020
@inproceedings{bib_Abus_2020, AUTHOR = {Mohit Chandra, Ashwin Pathak, Eesha Dutta, Paryul Jain, Manish, Manish Srivastava, Ponnurangam Kumaraguru}, TITLE = {AbuseAnalyzer: Abuse Detection, Severity and Target Prediction for Gab Posts}, BOOKTITLE = {Technical Report}. YEAR = {2020}}
While the extensive popularity of online social media platforms has made information dissemination faster, it has also resulted in widespread online abuse of different types, such as hate speech, offensive language, and sexist and racist opinions. Detection and curtailment of such abusive content is critical for avoiding its psychological impact on victim communities, and thereby preventing hate crimes. Previous works have focused on classifying user posts into various forms of abusive behavior, but there has hardly been any focus on estimating the severity of abuse or its target. In this paper, we present a first-of-its-kind dataset of 7,601 posts from Gab which looks at online abuse from the perspective of presence of abuse, severity and target of abusive behavior. We also propose a system to address these tasks, obtaining an accuracy of ~80% for abuse presence, ~82% for abuse target detection, and ~64% for abuse severity detection.
Elites Tweet? Characterizing the Twitter Verified User Network
INDRANEIL ARUN PAUL,Abhinav Khattar,Ponnurangam Kumaraguru,Manish Gupta,Shaan Chopra
International Conference on Data Engineering Workshops, ICDEW, 2019
@inproceedings{bib_Elit_2019, AUTHOR = {INDRANEIL ARUN PAUL, Abhinav Khattar, Ponnurangam Kumaraguru, Manish Gupta, Shaan Chopra}, TITLE = {Elites Tweet? Characterizing the Twitter Verified User Network}, BOOKTITLE = {International Conference on Data Engineering Workshops}. YEAR = {2019}}
Social network and publishing platforms, such as Twitter, support the concept of verification. Verified accounts are deemed worthy of platform-wide public interest and are separately authenticated by the platform itself. There have been repeated assertions by these platforms about verification not being tantamount to endorsement. However, a significant body of prior work suggests that possessing a verified status symbolizes enhanced credibility in the eyes of the platform audience. As a result, such a status is highly coveted among public figures and influencers. Hence, we attempt to characterize the network of verified users on Twitter and compare the results to similar analysis performed for the entire Twitter network. We extracted the entire network of verified users on Twitter (as of July 2018) and obtained 231,246 English user profiles and 79,213,811 connections. Subsequently, in the network analysis, we found that the sub-graph of verified users mirrors the full Twitter users graph in some aspects such as possessing a short diameter. However, our findings contrast with earlier findings on multiple aspects, such as the possession of a power-law out-degree distribution, slight dissortativity, and a significantly higher reciprocity rate, as elucidated in the paper. Moreover, we attempt to gauge the presence of salient components within this sub-graph and detect the absence of homophily with respect to popularity, which again is in stark contrast to the full Twitter graph. Finally, we demonstrate stationarity in the time series of verified user activity levels. To the best of our knowledge, this work represents the first quantitative attempt at characterizing verified users on Twitter.
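The graph-level measurements mentioned above (reciprocity, assortativity, out-degree distribution) can be computed with networkx; the random directed graph below is only a placeholder for the verified-user follower graph.

```python
# Sketch of the graph measurements discussed above on a placeholder digraph.
import networkx as nx

G = nx.gnp_random_graph(2000, 0.005, seed=0, directed=True)  # stand-in follower graph

print("reciprocity:", nx.reciprocity(G))
print("degree assortativity:", nx.degree_assortativity_coefficient(G))

out_degrees = [d for _, d in G.out_degree()]
print("max out-degree:", max(out_degrees))
```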
Signals Matter: Understanding Popularity and Impact of Users on Stack Overflow
Arpit Merchan,Daksh Shah,Gurpreet Singh Bhatia,ANURAG GHOSH,Ponnurangam Kumaraguru
International Conference on World wide web, WWW, 2019
@inproceedings{bib_Sign_2019, AUTHOR = {Arpit Merchan, Daksh Shah, Gurpreet Singh Bhatia, ANURAG GHOSH, Ponnurangam Kumaraguru}, TITLE = {Signals Matter: Understanding Popularity and Impact of Users on Stack Overflow}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2019}}
Stack Overflow, a Q&A site on programming, awards reputation points and badges (game elements) to users on performing various actions. Situating our work in Digital Signaling Theory, we investigate the role of these game elements in characterizing social qualities (specifically, popularity and impact) of its users. We operationalize these attributes using common metrics and apply statistical modeling to empirically quantify and validate the strength of these signals. Our results are based on a rich dataset of 3,831,147 users and their activities spanning nearly a decade since the site's inception in 2008. We present evidence that certain non-trivial badges, reputation scores and age of the user on the site positively correlate with popularity and impact. Further, we find that the presence of costly to earn and hard to observe signals qualitatively differentiates highly impactful users from highly popular users.
What sets Verified Users apart?: Insights, Analysis and Prediction of Verified Users on Twitter
Indraneil Pau,Abhinav Khattar,Shaan Chopra,Ponnurangam Kumaraguru,Manish Gupta
WEB SCIENCE, WEBSCI, 2019
@inproceedings{bib_What_2019, AUTHOR = {Indraneil Pau, Abhinav Khattar, Shaan Chopra, Ponnurangam Kumaraguru, Manish Gupta}, TITLE = {What sets Verified Users apart?: Insights, Analysis and Prediction of Verified Users on Twitter}, BOOKTITLE = {WEB SCIENCE}. YEAR = {2019}}
Social network and publishing platforms, such as Twitter, support the concept of a secret proprietary verification process for handles they deem worthy of platform-wide public interest. In line with significant prior work which suggests that possessing such a status symbolizes enhanced credibility in the eyes of the platform audience, a verified badge is clearly coveted among public figures and brands. What are less obvious are the inner workings of the verification process and what being verified represents. This lack of clarity, coupled with the flak that Twitter received by extending the aforementioned status to political extremists in 2017, backed Twitter into publicly admitting that the process and what the status represented needed to be rethought. With this in mind, we seek to unravel the aspects of a user's profile which likely engender or preclude verification. The aim of the paper is two-fold: First, we test if discerning the verification status of a handle from profile metadata and content features is feasible. Second, we unravel the features which have the greatest bearing on a handle's verification status. We collected a dataset consisting of profile metadata of all 231,235 verified English-speaking users (as of July 2018), a control sample of 175,930 non-verified English-speaking users and all their 494 million tweets over a one-year collection period. Our proposed models are able to reliably identify verification status (area under the curve, AUC > 99%). We show that the number of public list memberships, presence of neutral sentiment in tweets and an authoritative language style are the most pertinent predictors of verification status. To the best of our knowledge, this work represents the first attempt at discerning and classifying verification-worthy users on …
Finding Your Social Space: Empirical Study of Social Exploration in Multiplayer Online Games
ARPITA CHANDRA,Zoheb Borbora,Ponnurangam Kumaraguru,Jaideep Srivastava
IEEE International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2019
@inproceedings{bib_Find_2019, AUTHOR = {ARPITA CHANDRA, Zoheb Borbora, Ponnurangam Kumaraguru, Jaideep Srivastava}, TITLE = {Finding Your Social Space: Empirical Study of Social Exploration in Multiplayer Online Games}, BOOKTITLE = {IEEE International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2019}}
Social dynamics are based on human needs for trust, support and resource sharing, irrespective of whether they operate in real life or in a virtual setting. Massively multiplayer online role-playing games (MMORPGs) serve as enablers of leisurely social activity and are important tools for social interactions. Past research has shown that socially dense gaming environments like MMORPGs can be used to study important social phenomena, which may operate in real life too. We describe the process of social exploration as entailing the following components: 1) finding the balance between personal and social time; 2) choosing between a large number of weak ties or a few strong social ties; and 3) finding a social group. In general, these are the major determinants of an individual's social life. This paper looks into the phenomenon of social exploration in an activity-based online social environment. We study this process through the lens of the following research questions: 1) What are the different social behavior types? 2) Is there a change in a player's social behavior over time? 3) Are certain social behaviors more stable than others? 4) Can longitudinal research of player behavior help shed light on the social dynamics and processes in the network? We use an unsupervised machine learning approach to arrive at four social behavior types: Lone Wolf, Pack Wolf of a Small Pack, Pack Wolf of a Large Pack and Social Butterfly. The types represent the degree of socialization of players in the game. Our research reveals that social behaviors change with time. While Lone Wolf and Pack Wolf of a Small Pack are more stable social behaviors, Pack Wolf of a Large Pack and Social Butterfly are more transient. We also observe that players progressively move from large groups with weak social ties to settle in small groups with stronger ties.
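A minimal sketch of the unsupervised step follows: cluster per-player social features into four groups with k-means. The synthetic features here are placeholders for measures such as social time, tie count, tie strength and group size; the paper's exact feature definitions are not reproduced.

```python
# Sketch: cluster hypothetical per-player social features into four behavior types.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder features: time spent socializing, number of ties, tie strength,
# size of group joined.
X = rng.gamma(shape=2.0, scale=1.0, size=(5000, 4))

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X)
)
print(np.bincount(labels))  # size of each behavior-type cluster
```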
ATTENTIONAL ROAD SAFETY NETWORKS
Sonu Gupta,Deepak Srivatsav,A V Subramanyam,Ponnurangam Kumaraguru
International Conference on Image Processing, ICIP, 2019
@inproceedings{bib_ATTE_2019, AUTHOR = {Sonu Gupta, Deepak Srivatsav, A V Subramanyam, Ponnurangam Kumaraguru}, TITLE = {ATTENTIONAL ROAD SAFETY NETWORKS}, BOOKTITLE = {International Conference on Image Processing}. YEAR = {2019}}
Road safety mapping using satellite images is a cost-effective but challenging problem for smart city planning. The scarcity of labeled data, misalignment and ambiguity make it hard to learn efficient embeddings to classify between safe and dangerous road segments. In this paper, we address these challenges using a region-guided attention network. In our model, we extract global features from a base network and augment them with local features obtained using the region-guided attention network. In addition, we perform domain adaptation for unlabeled target data. In order to bridge the gap between safe and dangerous samples from the source and target respectively, we propose a loss function based on within- and between-class covariance matrices. We conduct experiments on a public dataset of London to show that the algorithm achieves significant results, with a classification accuracy of 86.21%. We obtain an increase of 4% in accuracy for NYC using the domain adaptation network.
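As one plausible instantiation of a loss built from within- and between-class covariance matrices, the sketch below computes the two scatter matrices over a batch of embeddings and minimizes their trace ratio; the paper's exact formulation may differ.

```python
# One plausible within-/between-class scatter loss over a batch of embeddings
# (minimize within-class scatter relative to between-class scatter). Not the
# paper's exact loss; shown only to illustrate the idea.
import torch

def scatter_loss(features, labels, eps=1e-6):
    """features: (batch, dim); labels: (batch,) integer class ids."""
    overall_mean = features.mean(dim=0, keepdim=True)
    s_w = torch.zeros(features.size(1), features.size(1), device=features.device)
    s_b = torch.zeros_like(s_w)
    for c in labels.unique():
        fc = features[labels == c]
        mc = fc.mean(dim=0, keepdim=True)
        s_w = s_w + (fc - mc).t() @ (fc - mc)          # within-class scatter
        diff = mc - overall_mean
        s_b = s_b + fc.size(0) * diff.t() @ diff       # between-class scatter
    return torch.trace(s_w) / (torch.trace(s_b) + eps)

features = torch.randn(32, 64, requires_grad=True)
labels = torch.randint(0, 2, (32,))                    # 0 = safe, 1 = dangerous
loss = scatter_loss(features, labels)
loss.backward()
print(loss.item())
```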
A Distant Supervision Based Approach to Medical Persona Classification
P NIKHIL PRIYATAM,Ponnurangam Kumaraguru,Vasudeva Varma Kalidindi
Journal of biomedical informatics, JOBM, 2019
@inproceedings{bib_A_Di_2019, AUTHOR = {P NIKHIL PRIYATAM, Ponnurangam Kumaraguru, Vasudeva Varma Kalidindi}, TITLE = {A Distant Supervision Based Approach to Medical Persona Classification}, BOOKTITLE = {Journal of biomedical informatics}. YEAR = {2019}}
Identifying medical persona from a social media post is critical for drug marketing, pharmacovigilance and patient recruitment. Medical persona classification aims to computationally model the medical persona associated with a social media post. We present a novel deep learning model for this task which consists of two parts: Convolutional Neural Networks (CNNs), which extract highly relevant features from the sentences of a social media post, and average pooling, which aggregates the sentence embeddings to obtain a task-specific document embedding. We compare our approach against standard baselines, such as Term Frequency - Inverse Document Frequency (TF-IDF), averaged word embedding based methods and popular neural architectures, such as CNN-Long Short Term Memory (CNN-LSTM) and Hierarchical Attention Networks (HANs). Our model achieves an improvement of 19.7% for classification accuracy and 20.1% for micro F1 measure over the current state-of-the-art. We eliminate the need for manual labeling by employing a distant supervision based method to obtain labeled examples for training the models. We thoroughly analyze our model to discover cues that are indicative of a particular persona. Particularly, we use first derivative saliency to identify the salient words in a particular social media post.
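The described architecture, sentence-level CNNs followed by average pooling into a document embedding, can be sketched in a few lines of PyTorch; the dimensions, single kernel size and class count below are placeholders rather than the paper's configuration.

```python
# Minimal sketch of a sentence-level CNN encoder whose outputs are
# average-pooled into a document embedding, then classified.
import torch
import torch.nn as nn

class SentenceCNN(nn.Module):
    def __init__(self, emb_dim=100, n_filters=64, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size, padding=1)

    def forward(self, x):                                  # x: (sentences, words, emb_dim)
        h = torch.relu(self.conv(x.transpose(1, 2)))       # (sentences, filters, words)
        return h.max(dim=2).values                         # max-pool over words

class PersonaClassifier(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.encoder = SentenceCNN()
        self.fc = nn.Linear(64, n_classes)

    def forward(self, sentences):
        sent_embs = self.encoder(sentences)                # (sentences, filters)
        doc_emb = sent_embs.mean(dim=0)                    # average pooling over sentences
        return self.fc(doc_emb)

# One document = 6 sentences of 20 (pre-embedded) words each.
doc = torch.randn(6, 20, 100)
logits = PersonaClassifier()(doc)
print(logits.shape)  # torch.Size([5])
```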
Medical persona classification in social media
Nikhil Pattisapu,Manish Gupta,Ponnurangam Kumaraguru,Vasudeva Varma Kalidindi
IEEE International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2017
@inproceedings{bib_Medi_2017, AUTHOR = {Nikhil Pattisapu, Manish Gupta, Ponnurangam Kumaraguru, Vasudeva Varma Kalidindi}, TITLE = {Medical persona classification in social media}, BOOKTITLE = {IEEE International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2017}}
Identifying medical persona from a social media post is of paramount importance for drug marketing and pharmacovigilance. In this work, we propose multiple approaches to infer the medical persona associated with a social media post. We pose this as a supervised multi-label text classification problem. The main challenge is to identify the hidden cues in a post that are indicative of a particular persona. We first propose a large set of manually engineered features for this task. Further, we propose multiple neural network based architectures to extract useful features from these posts using pre-trained word embeddings. Our experiments on thousands of blogs and tweets show that the proposed approach results in gains of 7% and 5% in F-measure over the manual feature engineering based approach for blogs and tweets, respectively.
Understanding Coordinated Communities through the Lens of Protest-Centric Narratives: A Case Study on #CAA Protest
Kumari Neha,Vibhu Agrawal,Saurav Chhatani,Rajesh Sharma,Arun Balaji Buduru,Ponnurangam Kumaraguru
International Conference on Web and Social Media, ICWSM, 2024
@inproceedings{bib_Unde_2024, AUTHOR = {Kumari Neha, Vibhu Agrawal, Saurav Chhatani, Rajesh Sharma, Arun Balaji Buduru, Ponnurangam Kumaraguru}, TITLE = {Understanding Coordinated Communities through the Lens of Protest-Centric Narratives: A Case Study on #CAA Protest}, BOOKTITLE = {International Conference on Web and Social Media}. YEAR = {2024}}
Social media platforms, particularly Twitter, have emerged as vital media for organizing online protests worldwide. During protests, users on social media share different narratives, often coordinated, to voice collective opinions and obtain widespread reach. In this paper, we focus on the communities formed during a protest and the collective narratives they share, using the protest against the enactment of the Citizenship Amendment Act (#CAA) by the Indian Government as a case study. Since the #CAA protest led to divergent discourse in the country, we first classify the users into opposing stances, i.e., protesters (who opposed the Act) and counter-protesters (who supported it), in an unsupervised manner. Next, we identify the coordinated communities within the opposing stances and examine the collective narratives they share. We use content-based metrics to identify user coordination, including hashtags, mentions, and retweets. Our results suggest mentions as the strongest metric for coordination across the opposing stances. Next, we decipher the collective narratives in the opposing stances using an unsupervised narrative detection framework and find call-to-action, on-ground activity, grievance sharing, questioning, and skepticism narratives in the protest tweets. We analyze the strength of the different coordinated communities using network measures, and perform inauthentic activity analysis on the most coordinated communities on both sides. Our findings also suggest that the coordinated communities that were highly inauthentic showed the highest clustering coefficients, indicating a greater extent of coordination.
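One of the content-based coordination metrics (shared mentions) can be sketched as follows: link users who mention the same handle and read off the resulting components as candidate coordinated communities. The tweet records below are hypothetical, and the thresholding and weighting choices used in the paper are omitted.

```python
# Sketch of a shared-mention coordination graph over hypothetical tweet records.
from collections import defaultdict
from itertools import combinations
import networkx as nx

tweets = [
    {"user": "u1", "mentions": ["partyA"]},
    {"user": "u2", "mentions": ["partyA"]},
    {"user": "u3", "mentions": ["partyA", "newsX"]},
    {"user": "u4", "mentions": ["newsX"]},
]

by_mention = defaultdict(set)
for t in tweets:
    for m in t["mentions"]:
        by_mention[m].add(t["user"])

G = nx.Graph()
for users in by_mention.values():
    for a, b in combinations(sorted(users), 2):
        w = G.get_edge_data(a, b, {}).get("weight", 0)
        G.add_edge(a, b, weight=w + 1)  # edge weight = number of shared mentions

communities = list(nx.connected_components(G))
print(communities)  # candidate coordinated communities
```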