@inproceedings{bib_Pers_2025, AUTHOR = {Ayush Goyal and Khushi and Vikram Pudi}, TITLE = {Personalized Re-Ranking of Universities and Colleges Using NIRF and JoSAA Data}, BOOKTITLE = {International Conference on Artificial Intelligence and Data Engineering}, YEAR = {2025}}
The National Institutional Ranking Framework (NIRF) provides a comprehensive evaluation of colleges across India based on parameters such as teaching, learning, resources, research, and outreach. It integrates these parameters by giving a fixed priority to each of them. In today’s competitive academic landscape, choosing the right college is a crucial decision for students seeking higher education. This choice depends on several factors specific to each student, such as social background, location, economic status, and rank in the qualifying examination. This research introduces a novel approach to enhance the utility of NIRF data by creating user-specific rankings tailored to individual preferences. By incorporating user-defined criteria such as median salary, number of students placed, facilities for Physically Challenged Students (PCS), faculty strength, and research funding, the proposed system generates personalized rankings that align with the unique needs and aspirations of each student. Our system utilizes a novel skyline-based approach and analyzes the NIRF data alongside user preferences to generate dynamic rankings that adapt to evolving student priorities. We show multiple use-cases to demonstrate the utility of our approach in real-world scenarios, and study the way colleges are ranked for different user preferences. Our study shows that the college rankings so obtained can be quite different from the standard NIRF ranking.
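The abstract above does not spell out the skyline-based method, so the following is only a minimal sketch of a generic skyline (Pareto-dominance) query, not the authors' algorithm. It assumes that every criterion is "higher is better"; the college records, the field names `median_salary` and `faculty_strength`, and the functions `dominates` and `skyline` are hypothetical and chosen purely for illustration.

```python
from typing import List


def dominates(a: dict, b: dict, criteria: List[str]) -> bool:
    """Return True if `a` is at least as good as `b` on every chosen
    criterion and strictly better on at least one (higher is better)."""
    at_least_as_good = all(a[c] >= b[c] for c in criteria)
    strictly_better = any(a[c] > b[c] for c in criteria)
    return at_least_as_good and strictly_better


def skyline(colleges: List[dict], criteria: List[str]) -> List[dict]:
    """Keep only the colleges that no other college dominates."""
    return [
        c for c in colleges
        if not any(dominates(other, c, criteria)
                   for other in colleges if other is not c)
    ]


# Hypothetical records, not taken from NIRF or JoSAA data.
colleges = [
    {"name": "College A", "median_salary": 12.0, "faculty_strength": 200},
    {"name": "College B", "median_salary": 10.0, "faculty_strength": 250},
    {"name": "College C", "median_salary": 9.0, "faculty_strength": 180},
]

# A student who cares about both criteria keeps A and B; C is dominated by A.
both = skyline(colleges, ["median_salary", "faculty_strength"])
print([c["name"] for c in both])  # ['College A', 'College B']

# A student who cares only about faculty strength keeps B alone.
faculty_only = skyline(colleges, ["faculty_strength"])
print([c["name"] for c in faculty_only])  # ['College B']
```

Changing the list of criteria changes which colleges survive, which is one simple way a user-specific ranking can diverge from a fixed-priority ranking.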
@inproceedings{bib_PMCS_2024, AUTHOR = {Vani Sancheti and Lini Teresa Thomas and Vikram Pudi}, TITLE = {PMCS: Partition-Based Maximal Frequent Subgraph Mining using MCS}, BOOKTITLE = {Annual Computer Software and Applications Conference}, YEAR = {2024}}
Current algorithms for Maximal Frequent Subgraph (MFS) mining do not scale to large databases with more than 300k graphs. Frequency computation is commonly done using numerous subgraph isomorphism operations, which are computationally expensive. This paper explores a partition-based secondary memory algorithm, PMCS, that can make MFS mining for large databases viable. PMCS intelligently avoids redundant computations to perform frequency computation operations optimally. Its intermediate results are also small enough to fit in the main memory. Our results demonstrate that PMCS scales to databases of up to 1000k graphs with an average of 25-30 edges per graph.
@inproceedings{bib_Wiki_2023, AUTHOR = {Manoj Sirvi and Vikram Pudi}, TITLE = {Wikipedia Real-time Updates Recommendation System}, BOOKTITLE = {Wiki Workshop}, YEAR = {2023}}
With the rapid growth of Wikipedia, regular maintenance of its content requires corresponding growth in the laborious work of Wikipedians. We present a system that filters out most of the content and recommends reliable and worthy information that is a strong candidate for Wikipedia updates. Preliminary results show that our model achieves close to 98% sensitivity.
Evaluating Generalizability of Deep Learning Models Using Indian-COVID-19 CT Dataset
@inproceedings{bib_Eval_2023, AUTHOR = {Suba S and Nita Parekh and Ramesh Loganathan and Vikram Pudi and Chinnababu Sunkavalli}, TITLE = {Evaluating Generalizability of Deep Learning Models Using Indian-COVID-19 CT Dataset}, BOOKTITLE = {International Conference on Bioinformatics and Data Science}, YEAR = {2023}}
Computed tomography (CT) has been routinely used for the diagnosis of lung diseases and, recently, during the pandemic, for detecting the infectivity and severity of COVID-19 disease. One of the major concerns in using machine learning (ML) approaches for automatic processing of CT scan images in clinical settings is that these methods are trained on limited and biased subsets of publicly available COVID-19 data. This has raised concerns regarding the generalizability of these models on external datasets not seen by the model during training. To address some of these issues, in this work CT scan images from confirmed COVID-19 data obtained from one of the largest public repositories, COVIDx CT 2A, were used for training and internal validation of machine learning models. For the external validation we generated the Indian-COVID-19 CT dataset, an open-source repository containing 3D CT volumes and 12096 chest CT images from 288 COVID-19 patients from India. A comparative performance evaluation of four state-of-the-art machine learning models, viz., a lightweight convolutional neural network (CNN) and three other CNN-based deep learning (DL) models (VGG-16, ResNet-50 and Inception-v3), in classifying CT images into three classes, viz., normal, non-COVID pneumonia, and COVID-19, is carried out on these two datasets. Our analysis showed that the performance of all the models is comparable on the hold-out COVIDx CT 2A test set with 90%–99% accuracies (96% for CNN), while on the external Indian-COVID-19 CT dataset a drop in performance is observed for all the models (8%–19%). The traditional machine learning model, CNN, performed the best on the external dataset (accuracy 88%) in comparison to the deep learning models, indicating that a lightweight CNN generalizes better to unseen data. The data and code are made available at https://github.com/aleesuss/c19.
A Novel Approach for Climate Classification Using Agglomerative Hierarchical Clustering
SRI SANKETH UPPALAPATI,Vishal Garg,Vikram Pudi,Jyotirmay Mathur,Raj Gupta,Aviruch Bhatia
@inproceedings{bib_A_No_2023, AUTHOR = {SRI SANKETH UPPALAPATI and Vishal Garg and Vikram Pudi and Jyotirmay Mathur and Raj Gupta and Aviruch Bhatia}, TITLE = {A Novel Approach for Climate Classification Using Agglomerative Hierarchical Clustering}, BOOKTITLE = {Energy Informatics.Academy Conference}, YEAR = {2023}}
Climate classification plays a significant role in the development of building codes and standards. It guides the design of buildings’ envelope and systems by considering their location’s climate conditions. Various methods, such as ASHRAE Standard 169, Köppen, and Trewartha, utilize climate parameters such as temperature, humidity, solar radiation, and precipitation to classify climates. When establishing requirements in building codes and standards, it is crucial to validate the classification based on the building’s thermal loads. This paper introduces a novel methodology for classifying cities based on the number of similar days between them. It calculates similarity using daily mean temperature, relative humidity, and solar radiation by applying threshold values. A matrix of similar days is analyzed through agglomerative hierarchical clustering with different thresholds. A scoring system based on building thermal load, where lower scores signify better classification, is employed to select the best method. The method was tested using U.S. weather data, yielding a lower score of 54.5 compared to ASHRAE Standard 169’s score of 63.09. This suggests that the new approach results in less
@inproceedings{bib_Acad_2022, AUTHOR = {Muthyala Anurag Reddy and Vikram Pudi}, TITLE = {Academic Curriculum Generation using Wikipedia for External Knowledge}, BOOKTITLE = {Annual Workshop of the Australasian Language Technology Association}, YEAR = {2022}}
In this paper, we address the problem of automatic academic curriculum generation. A curriculum outlines definitive topics with their sub-topics and enables teachers and students to form an overall idea of the course outcomes and goals, and a plan of what to teach and learn to achieve those goals. Automatic curriculum generation is relevant in modern times with the ever-increasing, rapidly changing, digitally available academic content that is too large for manual processing by human teams. Using Wikipedia as an external knowledge base, along with a pipeline of standard components, we show that it is possible to generate human-interpretable 2-level topic hierarchies. We show that our approach works on publicly available textbooks, by first removing their title structure and then automatically regenerating a 2-level title structure that is on par.
Multilinguals at SemEval-2022 Task 11: Complex NER in Semantically Ambiguous Settings for Low Resource Languages
@inproceedings{bib_Mult_2022, AUTHOR = {Amit Pandey and Swayatta Daw and NARENDRA BABU UNNAM and Vikram Pudi}, TITLE = {Multilinguals at SemEval-2022 Task 11: Complex NER in Semantically Ambiguous Settings for Low Resource Languages}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2022}}
We leverage pre-trained language models to solve the task of complex NER for two low-resource languages: Chinese and Spanish. We use the technique of Whole Word Masking (WWM) to boost the performance of the masked language modeling objective on large and unsupervised corpora. We experiment with multiple neural network architectures, incorporating CRF, BiLSTMs, and Linear Classifiers on top of a fine-tuned BERT layer. All our models outperform the baseline by a significant margin, and our best performing model obtains a competitive position on the evaluation leaderboard for the blind test set.
@inproceedings{bib_CitR_2022, AUTHOR = {Amit Pandey and Avani Gupta and Vikram Pudi}, TITLE = {CitRet: A Hybrid Model for Cited Text Span Retrieval}, BOOKTITLE = {International Conference on Computational Linguistics}, YEAR = {2022}}
The paper aims to identify cited text spans in the reference paper related to the given citance in the citing paper. We refer to it as cited text span retrieval (CTSR). Most current methods attempt this task by relying on pre-trained off-the-shelf deep learning models like SciBERT. Though these models are pre-trained on large datasets, they underperform in out-of-domain settings. We introduce CitRet, a novel hybrid model for CTSR that leverages unique semantic and syntactic structural characteristics of scientific documents. This enables us to use significantly less data for fine-tuning: we use only 1040 documents. Our model augments mildly-trained SBERT-based contextual embeddings with pre-trained non-contextual Word2Vec embeddings to calculate semantic textual similarity. We demonstrate the performance of our model on the CL-SciSumm shared tasks. It improves the state-of-the-art results by over 15% on the F1 score evaluation.
Multilinguals at SemEval-2022 Task 11: Transformer Based Architecture for Complex NER
@inproceedings{bib_Mult_2022, AUTHOR = {Amit Pandey and Swayatta Daw and Vikram Pudi}, TITLE = {Multilinguals at SemEval-2022 Task 11: Transformer Based Architecture for Complex NER}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2022}}
We investigate the task of complex NER for the English language. The task is non-trivial due to the semantic ambiguity of the textual structure and the rarity of occurrence of such entities in the prevalent literature. Using pre-trained language models such as BERT, we obtain a competitive performance on this task. We qualitatively analyze the performance of multiple architectures for this task. All our models are able to outperform the baseline by a significant margin. Our best performing model beats the baseline F1-score by over 9%.
Long Tailed Entity Extraction of Model Names using Distant Supervision
Swayatta Daw,Vikram Pudi
European Conference on Information Retrieval Workshops, ECIR-W, 2022
@inproceedings{bib_Long_2022, AUTHOR = {Swayatta Daw and Vikram Pudi}, TITLE = {Long Tailed Entity Extraction of Model Names using Distant Supervision}, BOOKTITLE = {European Conference on Information Retrieval Workshops}, YEAR = {2022}}
Extraction of Competing Models using Distant Supervision and Graph Ranking
Swayatta Daw,Vikram Pudi
Association for the Advancement of Artificial Intelligence Workshop, AAAI-W, 2022
@inproceedings{bib_Extr_2022, AUTHOR = {Swayatta Daw and Vikram Pudi}, TITLE = {Extraction of Competing Models using Distant Supervision and Graph Ranking}, BOOKTITLE = {Association for the Advancement of Artificial Intelligence Workshop}, YEAR = {2022}}
We introduce the task of detecting competing model entities in scientific documents. We define competing models as those models that solve a particular task that is investigated in the target research document. The task is challenging because contextual information from the entire target document is required to predict the model entities. Hence, traditional sequence labelling approaches fail in such settings. Furthermore, model entities themselves are long-tailed in nature, i.e., their prevalence in scientific literature is limited, and labelled data for training supervised learning techniques is scarce. To address these bottlenecks, we combine an unsupervised graph ranking algorithm with a SciBERT-CRF based sequence labeller to predict the entities. We introduce a strong baseline using this pipeline. Also, to address the label scarcity of long-tailed model entities, we use distant supervision, leveraging an external Knowledge Base (KB) to generate synthetic training data. We address the problem of overfitting on small datasets for supervised NER baselines using a simple entity replacement technique. We introduce this model as a starting point for an end-to-end automated framework to extract relevant model names from research documents and link them with their respective cited papers. We believe this task will serve as an important starting point to map the research landscape of computer science in a scalable manner, needing minimal human intervention. The code and dataset are available at https://github.com/Swayatta/Competing-Models
Practice Makes a Solver Perfect: Data Augmentation for Math Word Problem Solvers
Vivek Kumar,Vikram Pudi,Rishabh Maheshwary
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2022
@inproceedings{bib_Prac_2022, AUTHOR = {Vivek Kumar and Vikram Pudi and Rishabh Maheshwary}, TITLE = {Practice Makes a Solver Perfect: Data Augmentation for Math Word Problem Solvers}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}, YEAR = {2022}}
Existing Math Word Problem (MWP) solvers have achieved high accuracy on benchmark datasets. However, prior works have shown that such solvers do not generalize well and rely on superficial cues to achieve high performance. In this paper, we first conduct experiments to show that this behaviour is mainly associated with the limited size and diversity of existing MWP datasets. Next, we propose several data augmentation techniques broadly categorized into Substitution and Paraphrasing based methods. By deploying these methods we increase the size of existing datasets fivefold. Extensive experiments on two benchmark datasets across three state-of-the-art MWP solvers show that the proposed methods increase the generalization and robustness of existing solvers. On average, the proposed methods significantly increase the state-of-the-art results by over five percentage points on benchmark datasets. Further, the solvers trained on the augmented dataset perform comparatively better on the challenge test set. We also show the effectiveness of the proposed techniques through ablation studies and verify the quality of augmented samples through human evaluation.
Multilinguals at SemEval-2022 Task 11: Transformer Based Architecture for Complex NER
Amit Pandey,Swayatta Daw,Vikram Pudi
Technical Report, arXiv, 2022
@inproceedings{bib_Mult_2022, AUTHOR = {Amit Pandey and Swayatta Daw and Vikram Pudi}, TITLE = {Multilinguals at SemEval-2022 Task 11: Transformer Based Architecture for Complex NER}, BOOKTITLE = {Technical Report}, YEAR = {2022}}
We investigate the task of complex NER for the English language. The task is non-trivial due to the semantic ambiguity of the textual structure and the rarity of occurrence of such entities in the prevalent literature. Using pretrained language models such as BERT, we obtain a competitive performance on this task. We qualitatively analyze the performance of multiple architectures for this task. All our models are able to outperform the baseline by a significant margin. Our best performing model beats the baseline F1-score by over 9%.
Cross-lingual Alignment of Knowledge Graph Triples with Sentences
Swayatta Daw,Sagare Shivprasad Rajendra,Tushar Abhishek,Vikram Pudi,Vasudeva Varma Kalidindi
International Conference on Natural Language Processing, ICON, 2021
@inproceedings{bib_Cros_2021, AUTHOR = {Swayatta Daw and Sagare Shivprasad Rajendra and Tushar Abhishek and Vikram Pudi and Vasudeva Varma Kalidindi}, TITLE = {Cross-lingual Alignment of Knowledge Graph Triples with Sentences}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2021}}
The pairing of natural language sentences with knowledge graph triples is essential for many downstream tasks like data-to-text generation, fact extraction from sentences (semantic parsing), and knowledge graph completion. Most existing methods solve these downstream tasks using neural end-to-end approaches that require a large amount of well-aligned training data, which is difficult and expensive to acquire. Recently, various unsupervised techniques have been proposed to alleviate this alignment step by automatically pairing the structured data (knowledge graph triples) with textual data. However, these approaches are not well suited for low resource languages, which pose two major challenges: (1) unavailability of pairs of triples and native text with the same content distribution and (2) limited Natural Language Processing (NLP) resources. In this paper, we address the unsupervised pairing of knowledge graph triples with sentences for low resource languages, selecting Hindi as the low resource language. We propose cross-lingual pairing of English triples with Hindi sentences to mitigate the unavailability of content overlap. We propose two novel approaches: NER-based filtering with Semantic Similarity and Key-phrase Extraction with Relevance Ranking. We use our best method to create a collection of 29224 well-aligned English triples and Hindi sentence pairs. Additionally, we have also curated 350 human-annotated golden test datasets for evaluation. We make the code and dataset publicly available.
Adversarial Examples for Evaluating Math Word Problem Solvers
Vivek Kumar,Rishabh Maheshwary,Vikram Pudi
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2021
@inproceedings{bib_Adve_2021, AUTHOR = {Vivek Kumar and Rishabh Maheshwary and Vikram Pudi}, TITLE = {Adversarial Examples for Evaluating Math Word Problem Solvers}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}, YEAR = {2021}}
Standard accuracy metrics have shown that Math Word Problem (MWP) solvers have achieved high performance on benchmark datasets. However, the extent to which existing MWP solvers truly understand language and its relation with numbers is still unclear. In this paper, we generate adversarial attacks to evaluate the robustness of state-of-the-art MWP solvers. We propose two methods, Question Reordering and Sentence Paraphrasing, to generate adversarial attacks. We conduct experiments across three neural MWP solvers over two benchmark datasets. On average, our attack method is able to reduce the accuracy of MWP solvers by over 40 percentage points on these datasets. Our results demonstrate that existing MWP solvers are sensitive to linguistic variations in the problem text. We verify the validity and quality of generated adversarial examples through human evaluation.
A Strong Baseline for Query Efficient Attacks in a Black Box Setting
Rishabh Maheshwary,SAKET MAHESHWARY,Vikram Pudi
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2021
@inproceedings{bib_A_St_2021, AUTHOR = {Rishabh Maheshwary and SAKET MAHESHWARY and Vikram Pudi}, TITLE = {A Strong Baseline for Query Efficient Attacks in a Black Box Setting}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}, YEAR = {2021}}
Existing black box search methods have achieved high success rates in generating adversarial attacks against NLP models. However, such search methods are inefficient as they do not consider the number of queries required to generate adversarial attacks. Also, prior attacks do not maintain a consistent search space while comparing different search methods. In this paper, we propose a query efficient attack strategy to generate plausible adversarial examples on text classification and entailment tasks. Our attack jointly leverages an attention mechanism and locality sensitive hashing (LSH) to reduce the query count. We demonstrate the efficacy of our approach by comparing our attack with four baselines across three different search spaces. Further, we benchmark our results across the same search space used in prior attacks. Compared to previously proposed attacks, we reduce the query count by 75% on average across all datasets and target models. We also demonstrate that our attack achieves a higher success rate than prior attacks in a limited query setting.
Generating natural language attacks in a hard label black box setting
Rishabh Maheshwary,SAKET MAHESHWARY,Vikram Pudi
American Association for Artificial Intelligence, AAAI, 2021
@inproceedings{bib_Gene_2021, AUTHOR = {Rishabh Maheshwary and SAKET MAHESHWARY and Vikram Pudi}, TITLE = {Generating natural language attacks in a hard label black box setting}, BOOKTITLE = {American Association for Artificial Intelligence}, YEAR = {2021}}
We study the important and challenging task of attacking natural language processing models in a hard label black box setting. We propose a decision-based attack strategy that crafts high quality adversarial examples on text classification and entailment tasks. Our proposed attack strategy leverages a population-based optimization algorithm to craft plausible and semantically similar adversarial examples by observing only the top label predicted by the target model. At each iteration, the optimization procedure allows word replacements that maximize the overall semantic similarity between the original and the adversarial text. Further, our approach does not rely on substitute models or any kind of training data. We demonstrate the efficacy of our proposed approach through extensive experimentation and ablation studies on five state-of-the-art target models across seven benchmark datasets. In comparison to attacks proposed in prior literature, we achieve a higher success rate with a lower word perturbation percentage, even in a highly restricted setting.
Temporal Analysis of Scientific Literature to Find Grand Challenges and Saturated Problems
Kritika Agrawal,Vikram Pudi
Bridging the Gap between Information Science, Information Retrieval and Data Science, BIRDS, 2020
@inproceedings{bib_Temp_2020, AUTHOR = {Kritika Agrawal and Vikram Pudi}, TITLE = {Temporal Analysis of Scientific Literature to Find Grand Challenges and Saturated Problems}, BOOKTITLE = {Bridging the Gap between Information Science, Information Retrieval and Data Science}, YEAR = {2020}}
As scientific communities grow and evolve, new techniques emerge and old ones decline. The tremendous number of research publications available online aim to solve many interesting problems. With time, some fields have been studied well and their research problems solved to a great extent. However, there are a few difficult research problems that are not yet completely solved and that interest many researchers. In this paper, we aim to find research fields that are saturated and research fields that are yet to be explored. We first extract research problems in a semi-supervised manner using a proven bootstrap framework from scientific literature of the last fifty years. We show how a simple statistics-based model on top of the extracted research problems can find the saturated fields and grand challenges in any domain of computer science.
Generating Natural Language Attacks in a Hard Label Black Box Setting
Rishabh Maheshwary,SAKET MAHESHWARY,Vikram Pudi
Technical Report, arXiv, 2020
@inproceedings{bib_Gene_2020, AUTHOR = {Rishabh Maheshwary and SAKET MAHESHWARY and Vikram Pudi}, TITLE = {Generating Natural Language Attacks in a Hard Label Black Box Setting}, BOOKTITLE = {Technical Report}, YEAR = {2020}}
We study the important and challenging task of attacking natural language processing models in a hard label black box setting. We propose a decision-based attack strategy that crafts high quality adversarial examples on text classification and entailment tasks. Our proposed attack strategy leverages a population-based optimization algorithm to craft plausible and semantically similar adversarial examples by observing only the top label predicted by the target model. At each iteration, the optimization procedure allows word replacements that maximize the overall semantic similarity between the original and the adversarial text. Further, our approach does not rely on substitute models or any kind of training data. We demonstrate the efficacy of our proposed approach through extensive experimentation and ablation studies on five state-of-the-art target models across seven benchmark datasets. In comparison to attacks proposed in prior literature, we achieve a higher success rate with a lower word perturbation percentage, even in a highly restricted setting.
A Context Aware Approach for Generating Natural Language Attacks
Rishabh Maheshwary,SAKET MAHESHWARY,Vikram Pudi
American Association for Artificial Intelligence, AAAI, 2020
@inproceedings{bib_A_Co_2020, AUTHOR = {Rishabh Maheshwary and SAKET MAHESHWARY and Vikram Pudi}, TITLE = {A Context Aware Approach for Generating Natural Language Attacks}, BOOKTITLE = {American Association for Artificial Intelligence}, YEAR = {2020}}
We study an important task of attacking natural language processing models in a black box setting. We propose an attack strategy that crafts semantically similar adversarial examples on text classification and entailment tasks. Our proposed attack finds candidate words by considering the information of both the original word and its surrounding context. It jointly leverages masked language modelling and next sentence prediction for context understanding. In comparison to attacks proposed in prior literature, we are able to generate high quality adversarial examples that do significantly better both in terms of success rate and word perturbation percentage.
CRICTRS: Embeddings based Statistical and Semi Supervised Cricket Team Recommendation System
Prazwal Chhabra,Rizwan Ali,Vikram Pudi
Technical Report, arXiv, 2020
@inproceedings{bib_CRIC_2020, AUTHOR = {Prazwal Chhabra and Rizwan Ali and Vikram Pudi}, TITLE = {CRICTRS: Embeddings based Statistical and Semi Supervised Cricket Team Recommendation System}, BOOKTITLE = {Technical Report}, YEAR = {2020}}
Team recommendation has always been a challenging aspect of team sports. Such systems aim to recommend the player combination best suited against the opposition players, resulting in an optimal outcome. In this paper, we propose a semi-supervised statistical approach to build a team recommendation system for cricket by modelling players as embeddings. To build these embeddings, we design a qualitative and quantitative rating system which also considers the strength of the opposition when evaluating player performance. The embeddings obtained describe the strengths and weaknesses of the players based on their past performances. We also address a critical aspect of team composition, namely the number of batsmen and bowlers in the team. The team composition changes over time, depending on different factors which are tough to predict, so we take this input from the user and use the player embeddings to decide the best possible team combination for the given team composition.
A feature fusion technique for improved non-intrusive load monitoring
RAGHUNATH REDDY,Vishal Garg,Vikram Pudi
Energy Informatics, EI, 2020
@inproceedings{bib_A_fe_2020, AUTHOR = {RAGHUNATH REDDY and Vishal Garg and Vikram Pudi}, TITLE = {A feature fusion technique for improved non-intrusive load monitoring}, BOOKTITLE = {Energy Informatics}, YEAR = {2020}}
Load identification is an essential step in Non-Intrusive Load Monitoring (NILM), a process of estimating the power consumption of individual appliances using only whole-house aggregate consumption. Such estimates can help consumers and utility companies improve load management and save power. Current state-of-the-art methods for load identification generally use either steady state or transient features. We hypothesize that these are complementary features and so a hybrid combination of them will result in an improved appliance signature. We propose a novel hybrid combination that has the advantage of being low-dimensional and can thus be easily integrated with existing classification models to improve load identification. Our improved hybrid features are then used for building appliance identification models using Naive Bayes, KNN, Decision Tree and Random Forest classifiers. The proposed NILM methodology is evaluated for robustness in changing environments. An automated data collection setup is established to capture aggregate data from 7 home appliances under varying voltages. Experimental results show that our proposed feature fusion based algorithms are more robust and outperform steady state and transient feature-based algorithms by at least 9% and 15%, respectively.
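The abstract does not say how the steady state and transient features are combined, so the sketch below assumes the simplest possible fusion, plain concatenation, followed by a 1-nearest-neighbour lookup. It is an illustration of why the two feature types can be complementary, not the paper's method; the appliance names, the feature values, and the functions `fuse` and `nearest_label` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def fuse(steady: Sequence[float], transient: Sequence[float]) -> List[float]:
    """Assumed fusion rule: concatenate the two feature vectors."""
    return list(steady) + list(transient)


def nearest_label(query: Sequence[float],
                  labelled: List[Tuple[List[float], str]]) -> str:
    """1-nearest-neighbour classification by Euclidean distance."""
    return min(labelled, key=lambda item: math.dist(query, item[0]))[1]


# Hypothetical signatures: both appliances draw the same steady state
# power (W) but differ in their switch-on transient peak (A).
training = [
    (fuse([60.0], [1.0]), "lamp"),
    (fuse([60.0], [5.0]), "fan"),
]

# Steady state power alone cannot separate the two appliances;
# the fused vector can, because the transient component differs.
query = fuse([60.0], [4.6])
print(nearest_label(query, training))  # fan
```

In practice the features would need scaling before a distance-based classifier is applied, since raw power and current values live on very different ranges.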
Mining Intellectual Influence Associations
Shah Tejas Vijay,Vikram Pudi
European Conference on Information Retrieval, ECIR, 2019
@inproceedings{bib_Mini_2019, AUTHOR = {Shah Tejas Vijay and Vikram Pudi}, TITLE = {Mining Intellectual Influence Associations.}, BOOKTITLE = {European Conference on Information Retrieval}, YEAR = {2019}}
Within the social system of science, citation practices characterize social functions like the conferral of recognition upon the work of others as well as the acknowledgement of one’s intellectual debt. However, the structure of intellectual influence is misrepresented when only the immediate citations and their cardinality are taken into consideration. Thus, in order to better understand the associative dissemination of influence and approximately construe the anatomy of this structure, complex interactions in the convoluted network of authors and papers need to be probed. Our study aims at understanding these heterogeneous complex interactions. For the bibliographic dataset of authors and publications, we define proxy scores that attempt to determine the associative influence of the cited author over the citing author. In order to harness structural connectivity of the network, we generate author vector representations using these influence scores. Furthermore, with a view to assess the competence of our proposed scores, we evaluate these representations and provide an empirical study of the results obtained with our algorithm against the baseline and also present a qualitative analysis.
Sequential variational autoencoders for collaborative filtering
Noveen Sachdeva,Giuseppe Manco,Ettore Ritacco,Vikram Pudi
International conference on Web search and Data Mining, WSDM, 2019
@inproceedings{bib_Sequ_2019, AUTHOR = {Noveen Sachdeva and Giuseppe Manco and Ettore Ritacco and Vikram Pudi}, TITLE = {Sequential variational autoencoders for collaborative filtering}, BOOKTITLE = {International conference on Web search and Data Mining}, YEAR = {2019}}
Variational autoencoders have proven successful in domains such as computer vision and speech processing. Their adoption for modeling user preferences is still largely unexplored, although it has recently started to gain attention in the literature. In this work, we propose a model which extends variational autoencoders by exploiting the rich information present in the past preference history. We introduce a recurrent version of the VAE, where instead of passing a subset of the whole history regardless of temporal dependencies, we pass the consumption sequence subset through a recurrent neural network. At each time-step of the RNN, the sequence is fed through a series of fully-connected layers, the output of which models the probability distribution of the most likely future preferences. We show that handling temporal information is crucial for improving the accuracy of the VAE: in fact, our model beats the current state-of-the-art by valuable margins because of its ability to capture temporal dependencies among the user-consumption sequence using the recurrent encoder, while still keeping the fundamentals of variational autoencoders intact.
Explainable Clustering Using Hyper-Rectangles for Building Energy Simulation Data
AVIRUCH BHATIA,VISHAL GARG,Philip Haves,Vikram Pudi
IOP Conference Series: Earth and Environmental Science, E&ES, 2019
@inproceedings{bib_Expl_2019, AUTHOR = {AVIRUCH BHATIA and VISHAL GARG and Philip Haves and Vikram Pudi}, TITLE = {Explainable Clustering Using Hyper-Rectangles for Building Energy Simulation Data}, BOOKTITLE = {IOP Conference Series: Earth and Environmental Science}, YEAR = {2019}}
Clustering has become a very popular machine learning technique for identifying groups of data points with common features in a set of data points. In several applications, there is a need to explain the clusters so that the user can understand the underlying commonalities. One such application is in the area of building energy simulation. There is a need to cluster solutions obtained by parametric energy simulation runs and explain the characteristics of each cluster for human consumption. This paper demonstrates how axis-aligned hyper-rectangle based clustering of building energy simulation data can help identify clusters and describe the governing rules for each cluster. We call these rules design strategies. Unlike distance-based clustering methods, which are unable to extract simple rules from the underlying commonalities in each cluster, this method yields such rules directly. The method is applied to identify design strategies from a parametric run of a simple five-zone rectangular building model. Based on a user-given threshold, low energy solutions are selected for clustering. Each axis-aligned hyper-rectangle cluster is a unique design strategy that can be easily communicated to the user.
Scalable, Semi-Supervised Extraction of Structured Information from Scientific Literature
Aakash Mittal,Kritika Agrawal,Vikram Pudi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Scal_2019, AUTHOR = {Aakash Mittal and Kritika Agrawal and Vikram Pudi}, TITLE = {Scalable, Semi-Supervised Extraction of Structured Information from Scientific Literature}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2019}}
As scientific communities grow and evolve, there is a high demand for improved methods for finding relevant papers, comparing papers on similar topics and studying trends in the research community. All these tasks involve the common problem of extracting structured information from scientific articles. In this paper, we propose a novel, scalable, semi-supervised method for extracting relevant structured information from the vast available raw scientific literature. We extract the fundamental concepts of “aim”, “method” and “result” from scientific articles and use them to construct a knowledge graph. Our algorithm makes use of domain-based word embedding and the bootstrap framework. Our experiments show that our system achieves precision and recall comparable to the state of the art. We also show the domain independence of our algorithm by analyzing the research trends of two distinct communities: computational linguistics and computer vision.
Explicit modelling of the implicit short term use preferences for music recommendation
KARTIK GUPTA,Noveen Sachdeva,Vikram Pudi
European Conference on Information Retrieval, ECIR, 2018
@inproceedings{bib_Expl_2018, AUTHOR = {KARTIK GUPTA and Noveen Sachdeva and Vikram Pudi}, TITLE = {Explicit modelling of the implicit short term use preferences for music recommendation}, BOOKTITLE = {European Conference on Information Retrieval}, YEAR = {2018}}
Recommender systems are a key component of music sharing platforms, which suggest musical recordings a user might like. People often have implicit preferences while listening to music, though these preferences might not always be the same while they listen to music at different times. For example, a user might be interested in listening to songs of only a particular artist at some time, and the same user might be interested in the top-rated songs of a genre at another time. In this paper we try to explicitly model the short term preferences of the user with the help of Last.fm tags of the songs the user has listened to. With a session defined as a period of activity surrounded by periods of inactivity, we introduce the concept of a subsession, which is that part of the session wherein the preference of the user does not change much. We assume the user preference might change within a session and a session might have multiple subsessions. We use our modelling of the user preferences to generate recommendations for the next song the user might listen to. Experiments on the user listening histories taken from Last.fm indicate that this approach beats the present methodologies in predicting the next recording a user might listen to.
Attentive neural architecture incorporating song features for music recommendation
Noveen Sachdeva,KARTIK GUPTA,Vikram Pudi
Conference on Recommender Systems, Recsys, 2018
@inproceedings{bib_Atte_2018, AUTHOR = {Noveen Sachdeva and KARTIK GUPTA and Vikram Pudi}, TITLE = {Attentive neural architecture incorporating song features for music recommendation}, BOOKTITLE = {Conference on Recommender Systems}, YEAR = {2018}}
Recommender systems are an integral part of music sharing platforms. Often the aim of these systems is to increase the time the user spends on the platform, and hence they have a high commercial value. Systems which aim at increasing the average time a user spends on the platform often need to recommend songs which the user might want to listen to next at each point in time. This is different from recommendation systems which try to predict the item which might be of interest to the user at some point in the user lifetime but not necessarily in the very near future. Prediction of the next song the user might like requires some kind of modeling of the user's interests at the given point in time. Attentive neural networks have exploited the sequence in which the items were selected by the user to model the implicit short-term interests of the user for the task of next item prediction; however, we feel that features of the songs occurring in the sequence could also convey important information about the short-term user interest which the items alone cannot. In this direction, we propose a novel attentive neural architecture which, in addition to the sequence of items selected by the user, uses the features of these items to better learn the user's short-term preferences and recommend the next song to the user.
Sentiment and Semantic Deep Hierarchical Attention Neural Network for Fine Grained News Classification
ALLAPARTHI SRI TEJA,YAPARLA GANESH,Vikram Pudi
International Conference on Big Knowledge, ICBK, 2018
@inproceedings{bib_Sent_2018, AUTHOR = {ALLAPARTHI SRI TEJA and YAPARLA GANESH and Vikram Pudi}, TITLE = {Sentiment and Semantic Deep Hierarchical Attention Neural Network for Fine Grained News Classification}, BOOKTITLE = {International Conference on Big Knowledge}, YEAR = {2018}}
The purpose of this study is to examine the differences between different types of news stories. Given the huge impact of social networks, online content plays an important role in forming or changing the opinions of people. Unlike traditional journalism, where only certain news organizations can publish content, online journalism has given even individuals the chance to publish. This has its own advantages, like individual empowerment, but has also given many malicious entities a chance to spread misinformation for their own benefit. As reported by many organizations in recent history, this has even influenced major events like the outcome of elections. Therefore, it is now of great importance to have some sort of automated classification of news stories. In this work, we propose a deep hierarchical attention neural architecture combining sentiment and semantic embeddings for more accurate fine grained classification of news stories. Experimental results show that the sentiment embedding along with semantic information outperforms several state-of-the-art methods in this task.
Sequential Variational Autoencoders for Collaborative Filtering
Noveen Sachdeva,Giuseppe Manco,Ettore Ritacco,Vikram Pudi
International conference on Web search and Data Mining, WSDM, 2018
@inproceedings{bib_Sequ_2018, AUTHOR = {Noveen Sachdeva and Giuseppe Manco and Ettore Ritacco and Vikram Pudi}, TITLE = {Sequential Variational Autoencoders for Collaborative Filtering}, BOOKTITLE = {International conference on Web search and Data Mining}, YEAR = {2018}}
Variational autoencoders have proven successful in domains such as computer vision and speech processing. Their adoption for modeling user preferences is still largely unexplored, although it has recently started to gain attention in the literature. In this work, we propose a model which extends variational autoencoders by exploiting the rich information present in the past preference history. We introduce a recurrent version of the VAE, where instead of passing a subset of the whole history regardless of temporal dependencies, we pass the consumption sequence subset through a recurrent neural network. At each time-step of the RNN, the sequence is fed through a series of fully-connected layers, the output of which models the probability distribution of the most likely future preferences. We show that handling temporal information is crucial for improving the accuracy of the VAE: in fact, our model beats the current state-of-the-art by valuable margins because of its ability to capture temporal dependencies among the user-consumption sequence using the recurrent encoder, while still keeping the fundamentals of variational autoencoders intact.
Decision tree ensemble for parts-of-speech tagging of resource-poor languages
G VAMSI KRISHNA,PRATIBHA RANI,Vikram Pudi,Dipti Mishra Sharma
Forum for Information Retrieval Evaluation, FIRE, 2018
@inproceedings{bib_Deci_2018, AUTHOR = {G VAMSI KRISHNA and PRATIBHA RANI and Vikram Pudi and Dipti Mishra Sharma}, TITLE = {Decision tree ensemble for parts-of-speech tagging of resource-poor languages}, BOOKTITLE = {Forum for Information Retrieval Evaluation}, YEAR = {2018}}
Ensemble POS taggers are a good choice to integrate and leverage benefits of various types of POS taggers. This can help the large number (6500+) of resource-poor languages which do not have much annotated training data by providing ways to integrate semi-supervised/unsupervised taggers with supervised taggers. In this paper we present our experiments of developing ensemble POS taggers using a decision tree. We integrate a semi-supervised data mining approach that uses context based lists (CBLs) for POS tagging with two supervised taggers: (1) a Support Vector Machine based POS tagger, called SVMTool, and (2) a Conditional Random Field based POS tagger. The results are enhanced semi-supervised ensemble POS taggers which outperform the base methods. In these POS taggers, we use a decision tree to decide when to rely on the output of the supervised tagger and when to rely on the semi-supervised CBL method. The CBL based tagger uses rich contextual information, which helps in tagging both existing and unseen words, and uses no domain knowledge, while supervised taggers give good performance for words present in the training model and can include domain based features. Hence, these algorithms have complementary strengths, and our ensemble is able to combine them. The enhanced performance of our new POS taggers over the base methods suggests that the ensemble combines the qualities of its component methods. Therefore, these new semi-supervised ensemble taggers are more suitable for resource-poor languages.
Dynamic Winner Prediction in Twenty20 Cricket: Based on Relative Team Strengths
Sasank Viswanadha,Kaustubh Sivalenka,Madan Gopal Jhawar,Vikram Pudi
The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Da, ECML PKDD, 2017
@inproceedings{bib_Dyna_2017, AUTHOR = {Sasank Viswanadha and Kaustubh Sivalenka and Madan Gopal Jhawar and Vikram Pudi}, TITLE = {Dynamic Winner Prediction in Twenty20 Cricket: Based on Relative Team Strengths}, BOOKTITLE = {The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Da}, YEAR = {2017}}
Predicting the outcome of a match has always been at the center of sports analytics. Indian Premier League (IPL), a professional Twenty20 (T20) cricket league in India, has established itself as one of the biggest tournaments in cricket history. In this paper, we propose a model to predict the winner at the end of each over in the second innings of an IPL cricket match. Our methodology not only incorporates the dynamically updating game context as the game progresses, but also includes the relative strength between the two teams playing the match. Estimating the relative strength between two teams involves modeling the individual participating players’ potentials. To model a player, we use his career as well as recent performance statistics. Using the various dynamic features, we evaluate several supervised learning algorithms to predict the winner of the match. Finally, using the Random Forest Classifier (RFC), we have achieved an accuracy of 65.79% - 84.15% over the course of second innings, with an overall accuracy of 75.68%.
Plug Load Identification using Regression based Nearest Neighbor Classifier
RAGHUNATH REDDY,NIRANJAN REDDY KEESARA,Vishal Garg,Vikram Pudi
ACM International Conference on Future Energy Systems, e-Energy, 2017
@inproceedings{bib_Plug_2017, AUTHOR = {RAGHUNATH REDDY and NIRANJAN REDDY KEESARA and Vishal Garg and Vikram Pudi}, TITLE = {Plug Load Identification using Regression based Nearest Neighbor Classifier}, BOOKTITLE = {ACM International Conference on Future Energy Systems}, YEAR = {2017}}
Energy utilization can be improved by precise plug load monitoring and control. Plug load energy consumption is nearly 30% of the total building energy consumption. Therefore, plug load identification is a key requirement for energy conservation in buildings. Intrusive load monitoring techniques identify loads precisely but have not yet been widely tested for their performance under changing operating conditions. Hence, the present research proposes a robust low-frequency intrusive load monitoring technique to identify loads accurately. A smart power strip using the proposed load identification technique is designed and developed. Linear regression is applied to the acquired data to capture the behavioral trends of a particular device more explicitly and concisely. Further, a weighted K-NN classifier is applied to the transformed data set for device identification. Experimental results show that the proposed algorithm performs better than the standard classifiers, and can offer tangible savings.
Mining Research Problems from Scientific Literature
AALLA CHANAKYA,Vikram Pudi
International Conference on Data Science and Advanced Analytics, DSAA, 2017
@inproceedings{bib_Mini_2017, AUTHOR = {AALLA CHANAKYA and Vikram Pudi}, TITLE = {Mining Research Problems from Scientific Literature}, BOOKTITLE = {International Conference on Data Science and Advanced Analytics}, YEAR = {2017}}
Extracting structured information from unstructured text is a critical problem. Over the past few years, various clustering algorithms have been proposed to solve this problem. In addition, various algorithms based on probabilistic topic models have been developed to find the hidden thematic structure from various corpora (e.g., publications, blogs, etc.). Both types of algorithms have been transferred to the domain of scientific literature to extract structured information to solve problems like data exploration, expert detection, etc. In order to remain domain-agnostic, these algorithms do not exploit the structure present in a scientific publication. The majority of researchers interpret a scientific publication as research conducted to report progress in solving some research problems. Following this interpretation, in this paper we present a different outlook on the same problem by modelling scientific publications around research problems. By associating a scientific publication with a research problem, exploring the scientific literature becomes more intuitive. In this paper, we propose an unsupervised framework to mine research problems from titles and abstracts of scientific literature. Our framework uses weighted frequent phrase mining to generate phrases and filters them to obtain high-quality phrases. These high-quality phrases are then used to segment the scientific publication into meaningful semantic units. After segmenting publications, we apply a number of heuristics to score the phrases and sentences to identify the research problems. In a post-processing step, we use a neighborhood based algorithm to merge different representations of the same problems. Experiments conducted on parts of the DBLP dataset show promising results.
Honest Mirror: Quantitative Assessment of Player Performances in an ODI Cricket Match
MADAN GOPAL JHANWAR,Vikram Pudi
Machine Learning and Data Mining for Sports Analytics, MLSA, 2017
@inproceedings{bib_Hone_2017, AUTHOR = {MADAN GOPAL JHANWAR and Vikram Pudi}, TITLE = {Honest Mirror: Quantitative Assessment of Player Performances in an ODI Cricket Match}, BOOKTITLE = {Machine Learning and Data Mining for Sports Analytics}, YEAR = {2017}}
Cricket is one of the most popular team sports in the world. Players have multiple roles in a game of cricket, predominantly as batsmen and bowlers. Over the generations, statistics such as batting and bowling averages, and strike and economy rates have been used to judge the performance of individual players. These measures, however, do not take into consideration the context of the game in which a player performed. Furthermore, these types of statistics are incapable of comparing the performance of players across different roles. In this paper, we present an approach to quantitatively assess the performances of individual players in a single match of One Day International (ODI) cricket. For this, we have developed a new measure, called the Work Index, which represents the amount of work that is yet to be done by a team to achieve its target. Our approach incorporates game situations and the team strengths to measure the player contributions. This not only helps us in evaluating the individual performances, but also enables us to compare players within and across various roles on a common scale. Using the player performances in a match, we predict the player of the match award for the ODI matches played between 2006 and 2016. We have achieved an accuracy of 86.80% for the top-3 positions, which is superior to baseline models and previous works, to the best of our knowledge.
Data Driven Feature Learning
SAKET MAHESHWARY,AMBIKA KAUL,Vikram Pudi
International Conference on Machine Learning, ICML-W, 2017
@inproceedings{bib_Data_2017, AUTHOR = {SAKET MAHESHWARY and AMBIKA KAUL and Vikram Pudi}, TITLE = {Data Driven Feature Learning}, BOOKTITLE = {International Conference on Machine Learning}, YEAR = {2017}}
We present a regression-based feature learning algorithm that generates new features from a set of available features (raw data points). Being data-driven, it requires no domain knowledge and is hence generic. Such a representation is learnt by mining pairwise feature associations, identifying the linear or non-linear relationship between each pair, applying regression and selecting those relationships that are stable. Our experimental evaluation on 20 datasets taken from UC Irvine and Gene Expression, across different domains, provides evidence that the features learnt through our model can substantially improve the overall prediction accuracy over the original feature space across 8 different classifiers without any domain knowledge.
WikiSeq: Mining Maximally Informative Simple Sequences from Wikipedia
GOUTAM NAIR,Vikram Pudi
American Association for Artificial Intelligence Workshops, AAAI-W, 2017
@inproceedings{bib_Wiki_2017, AUTHOR = {GOUTAM NAIR and Vikram Pudi}, TITLE = {WikiSeq: Mining Maximally Informative Simple Sequences from Wikipedia}, BOOKTITLE = {American Association for Artificial Intelligence Workshops}, YEAR = {2017}}
The problem of ordering documents in a large collection into a sequence that is efficient for learning (both human and machine) is of high practical significance, but has not yet been well-formulated. We formulate this problem as mining a maximally informative simple sequence of documents. The mined sequence should be maximally informative in the sense that the reader learns quickly by reading only a few documents, and it should be simple so that the reader is not overwhelmed while trying to learn the content. The task can be posed as: given that a reader wishes to read (at most) k documents, which documents should be selected from the repository, and in what order, so as to provide maximum information? We present the WikiSeq algorithm for this purpose. We also design a metric based on information-gain to help objectively evaluate WikiSeq, and conduct experiments to compare with indicative baselines. Finally, we provide case-studies to subjectively illustrate WikiSeq’s merits.
Paper2vec: Combining Graph and Text Information for Scientific Paper Representation
SOUMYAJIT GANGULY,Vikram Pudi
European Conference on Information Retrieval, ECIR, 2017
@inproceedings{bib_Pape_2017, AUTHOR = {SOUMYAJIT GANGULY and Vikram Pudi}, TITLE = {Paper2vec: Combining Graph and Text Information for Scientific Paper Representation}, BOOKTITLE = {European Conference on Information Retrieval}, YEAR = {2017}}
We present Paper2vec, a novel neural network embedding based approach for creating scientific paper representations which make use of both textual and graph-based information. An academic citation network can be viewed as a graph where individual nodes contain rich textual information. With the current trend of open access to most scientific literature, we presume that the full text of a scientific article contains a vital source of information which aids in various recommendation and prediction tasks concerning this domain. To this end, we propose an approach, Paper2vec, which comprises information from both modalities and results in a rich representation for scientific papers. Over the recent past, representation learning techniques using neural networks have been studied extensively. However, they are modeled independently for text and graph data. Paper2vec leverages recent research in the broader field of unsupervised feature learning from both graphs and text documents. We demonstrate the efficacy of our representations on three real-world academic datasets in two tasks, node classification and link prediction, where Paper2vec is able to outperform the state of the art by a considerable margin.
Controversy Detection Using Reactions on Social Media
ALLAPARTHI SRI TEJA,PRAKHAR PANDEY,Vikram Pudi
International Conference on Data Mining Workshops, ICDM-W, 2017
@inproceedings{bib_Cont_2017, AUTHOR = {ALLAPARTHI SRI TEJA and PRAKHAR PANDEY and Vikram Pudi}, TITLE = {Controversy Detection Using Reactions on Social Media}, BOOKTITLE = {International Conference on Data Mining Workshops}, YEAR = {2017}}
In this work we demonstrate a method to detect controversy on news issues. This is done by analyzing people’s reactions on social media to news articles reporting these issues. Detecting controversial news topics on the web is a relevant problem today. It helps to identify the issues upon which people have divided opinions and is especially useful on topics such as a presidential election, government reforms, climate change, etc. We use sentiment analysis and word matching to accomplish this task. We show the application of our method for detecting controversial topics during the 2016 US Presidential elections.
AutoLearn - Automated Feature Generation and Selection
AMBIKA KAUL,SAKET MAHESHWARY,Vikram Pudi
International Conference on Data Mining, ICDM, 2017
@inproceedings{bib_Auto_2017, AUTHOR = {AMBIKA KAUL and SAKET MAHESHWARY and Vikram Pudi}, TITLE = {AutoLearn - Automated Feature Generation and Selection}, BOOKTITLE = {International Conference on Data Mining}, YEAR = {2017}}
In recent years, the importance of feature engineering has been confirmed by the exceptional performance of deep learning techniques, that automate this task for some applications. For others, feature engineering requires substantial manual effort in designing and selecting features and is often tedious and non-scalable. We present AutoLearn, a regression-based feature learning algorithm. Being data-driven, it requires no domain knowledge and is hence generic. Such a representation is learnt by mining pairwise feature associations, identifying the linear or non-linear relationship between each pair, applying regression and selecting those relationships that are stable and improve the prediction performance. Our experimental evaluation on 18 UC Irvine and 7 Gene expression datasets, across different domains, provides evidence that the features learnt through our model can improve the overall prediction accuracy by 13.28%, compared to original feature space and 5.87% over other top performing models, across 8 different classifiers without using any domain knowledge.
Computational Core for Plant Metabolomics: A Case for Interdisciplinary Research
Vikram Pudi,PRATIBHA RANI,Abhijit Mitra,Indira Ghosh
International Conference on Big Data Analytics, BDA, 2017
@inproceedings{bib_Comp_2017, AUTHOR = {Vikram Pudi and PRATIBHA RANI and Abhijit Mitra and Indira Ghosh}, TITLE = {Computational Core for Plant Metabolomics: A Case for Interdisciplinary Research}, BOOKTITLE = {International Conference on Big Data Analytics}, YEAR = {2017}}
Computational Core for Plant Metabolomics (CCPM) is a web-based collaborative platform for researchers in the field of metabolomics to store, analyze and share their data. Metabolomics is a newly emerging field of ‘omics’ research concerned with the characterization of large numbers of metabolites using chromatography in conjunction with mass spectrometry and NMR. There is abundant volume and variety in the data, and unpredictable velocity. An interdisciplinary engagement such as this faces significant non-technical challenges solvable using a balanced approach to software management in a university setting to create an environment promoting collaborative contributions. In this paper we report on our experiences, challenges and methods in delivering a usable solution. CCPM provides a secure data repository with advanced tools for analysis including preprocessing, pretreatment, data filtration, statistical analysis, and pathway analysis functions; and also visualization, integration and sharing of data. As not all users are equally IT-savvy, it is essential that the user interface is robust, friendly and interactive, so that the user can submit and control various tasks running simultaneously without stopping or interfering with other tasks. In each stage of its pipeline architecture, users are also allowed to upload external data that has been partially processed up to the previous stage in other platforms. Use of open source software for development makes the maintenance and development of our modules easier than for others which depend on proprietary software.
Semisupervied Data Driven Word Sense Disambiguation for Resource-poor Languages
PRATIBHA RANI,Vikram Pudi,Dipti Mishra Sharma
International Conference on Natural Language Processing., ICON, 2017
@inproceedings{bib_Semi_2017, AUTHOR = {PRATIBHA RANI and Vikram Pudi and Dipti Mishra Sharma}, TITLE = {Semisupervied Data Driven Word Sense Disambiguation for Resource-poor Languages}, BOOKTITLE = {International Conference on Natural Language Processing.}, YEAR = {2017}}
In this paper, we present a generic semi-supervised Word Sense Disambiguation (WSD) method. Currently, the existing WSD methods extensively use domain resources and linguistic knowledge. Our proposed method extracts context based lists from a small sense-tagged and untagged training data without using domain knowledge. Experiments on Hindi and Marathi Tourism and Health domains show that it gives good performance without using any language specific linguistic information except the sense IDs present in the sense-tagged training set and works well even with small training data by handling the data sparsity issue. Other advantages are that domain expertise is not needed for crafting and selecting features to build the WSD model and it can handle the problem of non availability of matching contexts in sense-tagged training set. It also finds sense IDs of those test words which are not present in sense-tag
Injecting Word Embeddings with Another Language’s Resource : An Application of Bilingual Embeddings
PRAKHAR PANDEY,Vikram Pudi,Manish Srivastava
International Joint Conference on Natural Language Processing, IJCNLP, 2017
@inproceedings{bib_Inje_2017, AUTHOR = {PRAKHAR PANDEY, Vikram Pudi, Manish Srivastava}, TITLE = {Injecting Word Embeddings with Another Language’s Resource : An Application of Bilingual Embeddings}, BOOKTITLE = {International Joint Conference on Natural Language Processing}, YEAR = {2017}}
Word embeddings learned from a text corpus can be improved by injecting knowledge from external resources, while at the same time also specializing them for similarity or relatedness. These knowledge resources (like WordNet, Paraphrase Database) may not exist for all languages. In this work we introduce a method to inject word embeddings of a language with a knowledge resource of another language by leveraging bilingual embeddings. First we improve word embeddings of German, Italian, French and Spanish using resources of English and test them on a variety of word similarity tasks. Then we demonstrate the utility of our method by creating improved embeddings for Urdu and Telugu languages using Hindi WordNet, beating the previously established baseline for Urdu.
Competing Algorithm Detection from Research Papers
SOUMYAJIT GANGULY,Vikram Pudi
Conference on Data Science, CODS, 2016
@inproceedings{bib_Comp_2016, AUTHOR = {SOUMYAJIT GANGULY, Vikram Pudi}, TITLE = {Competing Algorithm Detection from Research Papers}, BOOKTITLE = {Conference on Data Science}, YEAR = {2016}}
We propose an unsupervised approach to extract all competing algorithms present in a given scholarly article. The algorithm names are treated as named entities and natural language processing techniques are used to extract them. All extracted entity names are linked with their respective original papers in the reference section by our novel entity-citation linking algorithm. Then the seen citation pairs are ranked based on the number of comparison-related cue-words present in the entity-citation context. We manually annotated a small subset of DBLP Computer Science conference papers and report both qualitative and quantitative results of our algorithm on it.
Predicting the Outcome of ODI Cricket Matches: A Team Composition Based Approach
MADAN GOPAL JHANWAR,Vikram Pudi
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databa, PKDD/ECML, 2016
@inproceedings{bib_Pred_2016, AUTHOR = {MADAN GOPAL JHANWAR, Vikram Pudi}, TITLE = {Predicting the Outcome of ODI Cricket Matches: A Team Composition Based Approach}, BOOKTITLE = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databa}, YEAR = {2016}}
With the advent of statistical modeling in sports, predicting the outcome of a game has been established as a fundamental problem. Cricket is one of the most popular team games in the world. With this article, we embark on predicting the outcome of a One Day International (ODI) cricket match using a supervised learning approach from a team composition perspective. Our work suggests that the relative team strength between the competing teams forms a distinctive feature for predicting the winner. Modeling the team strength boils down to modeling individual players’ batting and bowling performances, forming the basis of our approach. We use career statistics as well as the recent performances of a player to model him. Player-independent factors have also been considered in order to predict the outcome of a match. We show that the k-Nearest Neighbor (kNN) algorithm yields better results as compared to other classifiers.
Mining Keystroke Timing Pattern for User Authentication
SAKET MAHESHWARY,Vikram Pudi
International Workshop on New Frontiers in Mining Complex Patterns, NFMCP, 2016
@inproceedings{bib_Mini_2016, AUTHOR = {SAKET MAHESHWARY, Vikram Pudi}, TITLE = {Mining Keystroke Timing Pattern for User Authentication}, BOOKTITLE = {International Workshop on New Frontiers in Mining Complex Patterns}, YEAR = {2016}}
In this paper we investigate the problem of user authentication based on keystroke timing patterns. We propose a simple, robust and non-parameterized nearest neighbor regression based feature ranking algorithm for anomaly detection. Our approach successfully handles drawbacks like outlier detection and scale variation, and prevents overfitting. Apart from using existing keystroke timing features from the dataset, like dwell time and flight time, other features, namely bigram time and inversion ratio time, are engineered as well. The efficiency and effectiveness of our method is demonstrated through extensive comparisons with other state-of-the-art techniques using the CMU keystroke dynamics benchmark dataset, on which it attains a better average equal error rate (EER) than other proposed techniques. We achieved an average equal error rate of 0.051 for the user authentication task.
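As a rough illustration of the two existing timing features named above, the following sketch derives dwell time and flight time from raw key events. The `(key, press, release)` event layout and the release-to-press definition of flight time are assumptions made for this example; they are one common convention, not necessarily the exact definitions used in the paper or the benchmark.

```python
from typing import List, Tuple

# Hypothetical event layout for this sketch: (key, press_time, release_time),
# with times in seconds. The real benchmark stores its data differently.
KeyEvent = Tuple[str, float, float]


def dwell_times(events: List[KeyEvent]) -> List[float]:
    """Dwell time: how long each key is held down (release minus press)."""
    return [release - press for _, press, release in events]


def flight_times(events: List[KeyEvent]) -> List[float]:
    """Flight time: gap between releasing one key and pressing the next."""
    return [nxt[1] - cur[2] for cur, nxt in zip(events, events[1:])]
```

For a typed string of n keys this yields n dwell times and n - 1 flight times, which could then be concatenated into one feature vector per typing sample.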
A semi-supervised associative classification method for POS tagging
PRATIBHA RANI,Vikram Pudi,Dipti Mishra Sharma
International Journal of Data Science and Analytics, IJDSA, 2016
@inproceedings{bib_A_se_2016, AUTHOR = {PRATIBHA RANI, Vikram Pudi, Dipti Mishra Sharma}, TITLE = {A semi-supervised associative classification method for POS tagging}, BOOKTITLE = {International Journal of Data Science and Analytics}, YEAR = {2016}}
We present here a data mining approach for part-of-speech (POS) tagging, an important natural language processing (NLP) task, which is a classification problem. We propose a semi-supervised associative classification method for POS tagging. Existing methods for building POS taggers require extensive domain and linguistic knowledge and resources. Our method uses a combination of a small POS tagged corpus and untagged text data as training data to build the classifier model using association rules. Our tagger works well with very little training data also. The use of semi-supervised learning provides the advantage of not requiring a large high-quality annotated corpus. These properties make it especially suitable for resource-poor languages. Our experiments on various resource-rich, resource-moderate and resource-poor languages show good performance without using any language-specific linguistic information. We note that inclusion of such features in our method may further improve the performance. Results also show that for smaller training data sizes our tagger performs better than state-of-the-art conditional random field (CRF) tagger using same features as our tagger.
Author2Vec: Learning Author Representations by Combining Content and Link Information
GANESH J,SOUMYAJIT GANGULY,Manish Gupta,Vasudeva Varma Kalidindi,Vikram Pudi
International Conference on World wide web, WWW, 2016
@inproceedings{bib_Auth_2016, AUTHOR = {GANESH J, SOUMYAJIT GANGULY, Manish Gupta, Vasudeva Varma Kalidindi, Vikram Pudi}, TITLE = {Author2Vec: Learning Author Representations by Combining Content and Link Information}, BOOKTITLE = {International Conference on World wide web}, YEAR = {2016}}
In this paper, we consider the problem of learning representations for authors from bibliographic co-authorship networks. Existing methods for deep learning on graphs, such as DeepWalk, suffer from the link sparsity problem as they focus on modeling the link information only. We hypothesize that capturing both the content and link information in a unified way will help mitigate the sparsity problem. To this end, we present a novel model, 'Author2Vec', which learns low-dimensional author representations such that authors who write similar content and share similar network structure are closer in vector space. Such embeddings are useful in a variety of applications such as link prediction, node classification, recommendation and visualization. The author embeddings we learn are empirically shown to outperform DeepWalk by 2.35% and 0.83% for the link prediction and clustering tasks, respectively.
PLUG LOAD IDENTIFICATION IN EDUCATIONAL BUILDINGS USING MACHINE LEARNING ALGORITHMS
RAGHUNATH REDDY,NIRANJAN REDDY KEESARA,Vikram Pudi,Vishal Garg
International Building Performance Simulation Association, IBPSA, 2015
@inproceedings{bib_PLUG_2015, AUTHOR = {RAGHUNATH REDDY, NIRANJAN REDDY KEESARA, Vikram Pudi, Vishal Garg}, TITLE = {PLUG LOAD IDENTIFICATION IN EDUCATIONAL BUILDINGS USING MACHINE LEARNING ALGORITHMS}, BOOKTITLE = {International Building Performance Simulation Association}, YEAR = {2015}}
Plug loads account for 20% to 30% of building energy consumption and show an increasing trend. Automatic plug load identification is one of the techniques for effectively managing plug load consumption. There are several studies on Non-Intrusive Load Monitoring (NILM) but limited studies on Intrusive Load Monitoring (ILM). ILM is a technique that uses a low-end power meter on every plug load to monitor its power consumption. In this paper, machine learning techniques are applied to low-frequency ILM data to identify the plug load and its state (ON/Sleep). The results show that the identification accuracies are close to 98%.
Dispersion based Similarity for Mining Similar Papers in Citation Network
SAKSHAM SINGHAL,Vikram Pudi
International Conference on Data Mining Workshops, ICDM-W, 2015
@inproceedings{bib_Disp_2015, AUTHOR = {SAKSHAM SINGHAL, Vikram Pudi}, TITLE = {Dispersion based Similarity for Mining Similar Papers in Citation Network}, BOOKTITLE = {International Conference on Data Mining Workshops}, YEAR = {2015}}
Measuring “similarity” has been established as a fundamental problem and has been widely studied. In this paper we propose a novel approach for establishing similarity in the context of a citation network. With the rapidly growing size of academic literature, the problem of finding similar research papers has become a challenging task. Research papers in a citation network often form communities based on an underlying concept. Our research shows that a dispersion-based similarity measure can be used as a strong measure for finding similar papers based on similar connectivity in those communities and the structural relevance of the citation network. Our results show that our approach works better than other conventional link-based similarity measures, both quantitatively and qualitatively. One of the direct benefits of this research is to support the highly specialised information needs of a scholarly researcher working in a specialised field of research.
Plug Load Identification using Regression based Nearest Neighbor Classifier
RAGHUNATH REDDY,NIRANJAN REDDY KEESARA,Vishal Garg,Vikram Pudi
ACM International Conference on Future Energy Systems, e-Energy, 2015
@inproceedings{bib_Plug_2015, AUTHOR = {RAGHUNATH REDDY, NIRANJAN REDDY KEESARA, Vishal Garg, Vikram Pudi}, TITLE = {Plug Load Identification using Regression based Nearest Neighbor Classifier}, BOOKTITLE = {ACM International Conference on Future Energy Systems}, YEAR = {2015}}
Energy utilization can be improved by precise plug load monitoring and control. Plug load energy consumption is nearly 30% of the total building energy consumption. Therefore, plug load identification is a key requirement for energy conservation in buildings. Intrusive load monitoring techniques identify loads precisely but have not been tested widely so far for their performance in changing operating conditions. Hence, the present research proposes a robust low-frequency intrusive load monitoring technique to identify loads accurately. A smart power strip using the proposed load identification technique is designed and developed. Linear regression is applied to the acquired data to capture the behavioral trends of a particular device more explicitly and concisely. Further, a weighted K-NN classifier is applied to the transformed data set for device identification. Experimental results show that the proposed algorithm performs better than the standard classifiers, and can offer tangible savings.
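The two-stage idea described above (a regression step that summarizes a device's readings, followed by a weighted nearest-neighbour vote) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the choice of (slope, intercept) as the transformed features, the inverse-distance weighting, and the device labels in the usage note are all hypothetical.

```python
from collections import defaultdict
from math import dist
from typing import List, Sequence, Tuple


def line_fit(readings: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept of readings against sample index."""
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(readings))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weighted_knn(train: List[Tuple[Sequence[float], str]],
                 query: Sequence[float], k: int = 3) -> str:
    """Classify query by an inverse-distance-weighted vote of its k nearest neighbours."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        # Closer neighbours count more; the small constant avoids division by zero.
        votes[label] += 1.0 / (dist(features, query) + 1e-9)
    return max(votes, key=votes.get)
```

In use, each window of power readings would be reduced with `line_fit` and the resulting pair passed to `weighted_knn` together with labelled pairs from known devices, for example `weighted_knn([((0.0, 60.0), "monitor"), ((0.0, 5.0), "charger")], line_fit(window), k=1)`.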
Maximum Entropy Based Associative Regression for Sparse Datasets
Chivukula Sreevallabha Aneesh,Vikram Pudi
International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, WI, 2014
@inproceedings{bib_Maxi_2014, AUTHOR = {Chivukula Sreevallabha Aneesh, Vikram Pudi}, TITLE = {Maximum Entropy Based Associative Regression for Sparse Datasets}, BOOKTITLE = {International Joint Conferences on Web Intelligence and Intelligent Agent Technologies}, YEAR = {2014}}
We propose a supervised learning technique defining significant frequent patterns for associative regression. Assuming frequent patterns quantify correlations in dataset, we constrain the Generalized Iterative Scaling (GIS) convergence algorithm for Maximum Entropy (ME) models. We have used the combinations of ME parameters and GIS probabilities as discriminative weights to frequent patterns. The weighted frequent patterns then order the predictive analytics output. Experiments are conducted on sparse numeric datasets. Results suggest that condensed representations of frequent patterns allow parametric models suitable for class association rule mining.
TagMiner: A Semisupervised Associative POS Tagger Effective for Resource Poor Languages
PRATIBHA RANI,Vikram Pudi,Vasudeva Varma Kalidindi
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databa, PKDD/ECML, 2014
@inproceedings{bib_TagM_2014, AUTHOR = {PRATIBHA RANI, Vikram Pudi, Vasudeva Varma Kalidindi}, TITLE = {TagMiner: A Semisupervised Associative POS Tagger Effective for Resource Poor Languages}, BOOKTITLE = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databa}, YEAR = {2014}}
We present here TagMiner, a data mining approach for part-of-speech (POS) tagging, an important natural language processing (NLP) classification task. It is a semi-supervised associative classification method for POS tagging. Existing methods for building POS taggers require extensive domain and linguistic knowledge and resources. Our method uses a combination of a small POS-tagged corpus and raw untagged text data as training data to build the classifier model using association rules. Our tagger works well with very little training data also. The use of semi-supervised learning provides the advantage of not requiring a large high-quality tagged corpus. These properties make it especially suitable for resource-poor languages. Our experiments on various resource-rich, resource-moderate and resource-poor languages show good performance without using any language-specific linguistic information. We note that inclusion of such features in our method may further improve the performance. Results also show that for smaller training data sizes our tagger performs better than a state-of-the-art CRF tagger using the same features as our tagger.
A Vectorized Implementation for Maximum Entropy Based Associative Regression
Chivukula Sreevallabha Aneesh,Vikram Pudi
International Conference on Soft Computing and Machine Intelligence, ISCMI, 2014
@inproceedings{bib_A_Ve_2014, AUTHOR = {Chivukula Sreevallabha Aneesh, Vikram Pudi}, TITLE = {A Vectorized Implementation for Maximum Entropy Based Associative Regression}, BOOKTITLE = {International Conference on Soft Computing and Machine Intelligence}, YEAR = {2014}}
We propose a supervised learning technique for associative regression. Assuming frequent patterns quantify correlations in dataset, we constrain the Sequential Conditional Generalized Iterative Scaling (SCGIS) convergence algorithm for Maximum Entropy (ME) models. We also assume prior probabilities on the ME model as a control on the step size in SCGIS. We have used the combinations of ME parameters and SCGIS probabilities as discriminative weights to frequent patterns. The weighted frequent patterns then predict the associative regression values. Experiments have been conducted on sparse numeric datasets, to find the regression error of our proposal. Our technique is comparable to the standard regression algorithms. As a concept, the proposed associative regression is useful as a parametric model in class association rule mining.
FAR-HD: A fast and efficient algorithm for mining fuzzy association rules in large high-dimensional datasets
ASHISH MANGALAMPALLI,Vikram Pudi
International Conference on Fuzzy Systems, FUZZ, 2013
@inproceedings{bib_FAR-_2013, AUTHOR = {ASHISH MANGALAMPALLI, Vikram Pudi}, TITLE = {FAR-HD: A fast and efficient algorithm for mining fuzzy association rules in large high-dimensional datasets}, BOOKTITLE = {International Conference on Fuzzy Systems}, YEAR = {2013}}
Fuzzy Association Rule Mining (ARM) has been extensively used in relational or transactional datasets having a small-to-medium number of attributes/dimensions. The mined fuzzy association rules (patterns) are not only used for manual analysis by domain experts, but are also leveraged to drive further mining tasks like classification and clustering which automate decision-making. Such fuzzy association rules can also be derived from high-dimensional numerical datasets, like image datasets, in order to train fuzzy associative classifiers or clustering algorithms. Traditional Fuzzy ARM algorithms are not able to mine rules from them efficiently, since such algorithms are meant to deal with datasets with far fewer attributes/dimensions. Hence, in this paper we propose FAR-HD, a Fuzzy ARM algorithm designed specifically for large high-dimensional datasets. FAR-HD processes fuzzy frequent itemsets in a DFS manner using a two-phased multiple-partition tidlist-based strategy. It also uses a byte-vector representation of tidlists, with the tidlists stored in the main memory in a compressed form (using a fast generic compression method). Additionally, FAR-HD uses Fuzzy Clustering to convert each numerical vector of the original input dataset to a fuzzy-cluster-based representation, which is ultimately used for the actual Fuzzy ARM process. FAR-HD has been compared experimentally with Fuzzy Apriori (7-15 times faster), which is the most popular Fuzzy ARM algorithm, and with a Fuzzy ARM algorithm (1.1-4 times faster) which we proposed earlier and which is designed to work with very large but traditional (with fewer attributes) datasets.
BINGR: Binary Search based Gaussian Regression
HARSHIT DUBEY,SAKET MADHUKAR BHARAMBE,Vikram Pudi
International Conference on Knowledge Discovery and Information Retrieval, KDIR, 2012
@inproceedings{bib_BING_2012, AUTHOR = {HARSHIT DUBEY, SAKET MADHUKAR BHARAMBE, Vikram Pudi}, TITLE = {BINGR: Binary Search based Gaussian Regression}, BOOKTITLE = {International Conference on Knowledge Discovery and Information Retrieval}, YEAR = {2012}}
Regression is the study of functional dependency of one variable with respect to other variables. In this paper we propose a novel regression algorithm, BINGR, for predicting the dependent variable, having the advantage of low computational complexity. The algorithm is interesting because instead of directly predicting the value of the response variable, it recursively narrows down the range in which the response variable lies. BINGR reduces the computation order to logarithmic, which is much better than that of existing standard algorithms. As BINGR is parameterless, it can be employed by any naive user. Our experimental study shows that our technique is as accurate as the state of the art, and faster by an order of magnitude.
BINER: Binary Search Based Efficient Regression
SAKET MADHUKAR BHARAMBE,HARSHIT DUBEY,Vikram Pudi
International Conference on Machine Learning and Data Mining, MLDM, 2012
@inproceedings{bib_BINE_2012, AUTHOR = {SAKET MADHUKAR BHARAMBE, HARSHIT DUBEY, Vikram Pudi}, TITLE = {BINER: Binary Search Based Efficient Regression}, BOOKTITLE = {International Conference on Machine Learning and Data Mining}, YEAR = {2012}}
Regression is the study of functional dependency of one numeric variable with respect to another. In this paper, we present a novel, efficient, binary search based regression algorithm having the advantage of low computational complexity. These desirable features make BINER a very attractive alternative to existing approaches. The algorithm is interesting because instead of directly predicting the value of the response variable, it recursively narrows down the range in which the response variable lies. Our empirical experiments with several real-world datasets show that our algorithm outperforms current state-of-the-art approaches and is faster by an order of magnitude.
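To make the range-narrowing idea concrete, here is a minimal sketch of a regressor that repeatedly halves the candidate response range instead of predicting a value directly. The abstract does not say how the algorithm decides which half to keep, so the nearest-centroid rule, the `min_size` stopping threshold, and the final averaging step are all assumptions of this example rather than the published BINER procedure.

```python
from math import dist
from statistics import mean
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (feature vector, response value)


def centroid(rows: List[Sample]) -> List[float]:
    """Per-dimension mean of the feature vectors in rows."""
    return [mean(column) for column in zip(*(features for features, _ in rows))]


def predict(train: List[Sample], query: Sequence[float], min_size: int = 2) -> float:
    """Narrow the response range by halving until few training rows remain."""
    if not train or min_size < 1:
        raise ValueError("need training data and min_size >= 1")
    rows = sorted(train, key=lambda row: row[1])  # order by response value
    while len(rows) > min_size:
        mid = len(rows) // 2
        lower, upper = rows[:mid], rows[mid:]
        # Assumed rule: keep the half whose feature centroid is closer to the query.
        if dist(centroid(lower), query) <= dist(centroid(upper), query):
            rows = lower
        else:
            rows = upper
    return mean(response for _, response in rows)
```

Because each pass discards about half of the remaining rows, the number of narrowing steps grows logarithmically with the size of the training set, which is the property the binary-search framing relies on.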
Investigating Usage of Text Segmentation and Inter-passage Similarities to Improve Text Document Clustering
SHASHANK PALIWAL,Vikram Pudi
International Conference on Machine Learning and Data Mining, MLDM, 2012
@inproceedings{bib_Inve_2012, AUTHOR = {SHASHANK PALIWAL, Vikram Pudi}, TITLE = {Investigating Usage of Text Segmentation and Inter-passage Similarities to Improve Text Document Clustering}, BOOKTITLE = {International Conference on Machine Learning and Data Mining}, YEAR = {2012}}
Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods rely on representing text documents using the simple Bag-of-Words (BOW) model. A document is an organized structure consisting of various text segments or passages. Such single-term analysis of the text treats the whole document as a single semantic unit and thus ignores other semantic units like sentences and passages. In this paper, we attempt to take advantage of the underlying subtopic structure of text documents and investigate whether clustering of text documents can be improved if text segments of two documents are utilized while calculating similarity between them. We concentrate on examining the effects of combining the suggested inter-document similarities (based on inter-passage similarities) with traditional inter-document similarities, following a simple approach for the same. Experimental results on standard data sets suggest improvement in clustering of text documents.
RNN Based Sampling Technique for Effective Active Learning
GAURAV MAHESHWARI,BHANUKIRAN VINZAMURI,Vikram Pudi
International Conference on Machine Learning and Data Mining, MLDM, 2011
@inproceedings{bib_RNN__2011, AUTHOR = {GAURAV MAHESHWARI, BHANUKIRAN VINZAMURI, Vikram Pudi}, TITLE = {RNN Based Sampling Technique for Effective Active Learning}, BOOKTITLE = {International Conference on Machine Learning and Data Mining}, YEAR = {2011}}
In this paper, we address the problem of active learning using the notion of influence sets based on Reverse Nearest Neighbor (RNN). Active learning is an area of machine learning which emphasizes achieving optimal classification performance using as few labeled samples as possible. Reverse nearest neighbors have been used effectively in the past in domains such as clustering and outlier detection. In this paper, we devise a new sampling method for instances based on the knowledge provided by RNN influence sets. To demonstrate the effectiveness of our sampling method, we compare its performance against existing sampling methods on a few real-life datasets. The experimental results show that our technique outperforms existing methods, particularly on multi-class datasets.
Detecting Correlations between Hot Days in News Feeds
RAGHVENDRA MALL,NAHIL JAIN,Vikram Pudi
International Conference on Knowledge Discovery and Information Retrieval, KDIR, 2011
@inproceedings{bib_Dete_2011, AUTHOR = {RAGHVENDRA MALL, NAHIL JAIN, Vikram Pudi}, TITLE = {Detecting Correlations between Hot Days in News Feeds}, BOOKTITLE = {International Conference on Knowledge Discovery and Information Retrieval}, YEAR = {2011}}
We use text mining mechanisms to analyze Hot days in news feeds. We build upon the earlier work used to detect Hot topics and assume that we have already attained the Hot days. In this paper we identify the most relevant documents of a topic on a Hot day. We construct a similarity-based technique for identifying and ranking these documents. Our aim is to automatically detect chains of hot correlated events over time. We develop a scheme using similarity measures like cosine similarity and KL-divergence to find correlation between these Hot days. For the ‘U.S. Presidential Elections’, the presidential debates, which spanned a week, were one such event.
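The two similarity measures named above can be illustrated on simple bag-of-words term counts. The following is a minimal sketch, assuming each Hot day is represented as a `Counter` of term frequencies; that representation and the additive `smoothing` constant used to keep the KL-divergence finite are assumptions of this example, not details taken from the paper.

```python
from collections import Counter
from math import log, sqrt


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def kl_divergence(a: Counter, b: Counter, smoothing: float = 1e-6) -> float:
    """KL(a || b) over the joint vocabulary, with additive smoothing."""
    vocab = a.keys() | b.keys()
    total_a = sum(a.values()) + smoothing * len(vocab)
    total_b = sum(b.values()) + smoothing * len(vocab)
    divergence = 0.0
    for term in vocab:
        p = (a[term] + smoothing) / total_a
        q = (b[term] + smoothing) / total_b
        divergence += p * log(p / q)
    return divergence
```

Cosine similarity is symmetric and bounded between 0 and 1 for count vectors, whereas KL-divergence is asymmetric and unbounded, so the two measures capture different notions of how closely the vocabulary of one day tracks another.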
Fuzzy associative rule-based approach for pattern mining and identification and pattern-based classification
ASHISH MANGALAMPALLI,Vikram Pudi
International Conference on World wide web, WWW, 2011
@inproceedings{bib_Fuzz_2011, AUTHOR = {ASHISH MANGALAMPALLI, Vikram Pudi}, TITLE = {Fuzzy associative rule-based approach for pattern mining and identification and pattern-based classification}, BOOKTITLE = {International Conference on World wide web}, YEAR = {2011}}
Associative Classification leverages Association Rule Mining (ARM) to train Rule-based classifiers. The classifiers are built on high-quality Association Rules mined from the given dataset. Associative Classifiers are very accurate because Association Rules encapsulate all the dominant and statistically significant relationships between items in the dataset. They are also very robust, as noise in the form of insignificant and low-frequency itemsets is eliminated during the mining and training stages. Moreover, the rules are easy to comprehend, thus making the classifier transparent. Conventional Associative Classification and Association Rule Mining (ARM) algorithms are inherently designed to work only with binary attributes, and expect any quantitative attributes to be converted to binary ones using ranges, like “Age = [25, 60]”. In order to mitigate this constraint, Fuzzy logic is used to convert quantitative attributes to fuzzy binary attributes, like “Age = Middle-aged”, so as to eliminate any loss of information arising due to sharp partitioning, especially at partition boundaries, and then generate Fuzzy Association Rules using an appropriate Fuzzy ARM algorithm. These Fuzzy Association Rules can then be used to train a Fuzzy Associative Classifier. In this paper, we also show how Fuzzy Associative Classifiers so built can be used in a wide variety of domains and datasets, like transactional datasets and image datasets.
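The contrast between a sharp range such as “Age = [25, 60]” and a fuzzy attribute such as “Age = Middle-aged” can be shown with a small membership-function sketch. The trapezoidal shape and the breakpoints 20, 30, 55 and 65 are hypothetical values chosen for illustration; the abstract does not specify how memberships are defined.

```python
def crisp_middle_aged(age: float) -> float:
    """Sharp partition: full membership inside [25, 60], none outside."""
    return 1.0 if 25 <= age <= 60 else 0.0


def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Trapezoidal membership: rises on (a, b), is 1 on [b, c], falls on (c, d)."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def fuzzy_middle_aged(age: float) -> float:
    """Fuzzy partition with assumed breakpoints at 20, 30, 55 and 65."""
    return trapezoid(age, 20.0, 30.0, 55.0, 65.0)
```

Under the crisp rule a 24-year-old and a 25-year-old fall on opposite sides of the boundary, while the fuzzy version gives them nearby partial memberships, which is the loss of information at partition boundaries that the fuzzy conversion is meant to avoid.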
A feature-pair-based associative classification approach to look-alike modeling for conversion-oriented user-targeting in tail campaigns
ASHISH MANGALAMPALLI,Adwait Ratnaparkhi,Andrew O. Hatch,Abraham Bagherjeiran,Rajesh Parekh,Vikram Pudi
International Conference on World wide web, WWW, 2011
@inproceedings{bib_A_fe_2011, AUTHOR = {ASHISH MANGALAMPALLI, Adwait Ratnaparkhi, Andrew O. Hatch, Abraham Bagherjeiran, Rajesh Parekh, Vikram Pudi}, TITLE = {A feature-pair-based associative classification approach to look-alike modeling for conversion-oriented user-targeting in tail campaigns}, BOOKTITLE = {International Conference on World wide web}, YEAR = {2011}}
Online advertising offers significantly finer granularity, which has been leveraged in state-of-the-art targeting methods, like Behavioral Targeting (BT). Such methods have been further complemented by recent work in Look-alike Modeling (LAM), which helps in creating models that are customized according to each advertiser’s requirements and each campaign’s characteristics, and which show ads to users who are most likely to convert on them, not just click them. In Look-alike Modeling, given data about converters and non-converters obtained from advertisers, we would like to train models automatically for each ad campaign. Such custom models would help target more users who are similar to the set of converters the advertiser provides. The advertisers get more freedom to define their preferred sets of users, which should be used as a basis to build custom targeting models. In behavioral data, the number of conversions (positive class) per campaign is very small (conversions per impression for the advertisers in our data set are much less than 10^-4), giving rise to a highly skewed training dataset, which has most records pertaining to the negative class. Campaigns with very few conversions are called tail campaigns, and those with many conversions are called head campaigns. Creation of Look-alike Models for tail campaigns is very challenging and tricky using popular classifiers like Linear SVM and GBDT, because of the very small number of positive class examples such campaigns contain. In this paper, we present an Associative Classification (AC) approach to LAM for tail campaigns. Pairs of features are used to derive rules to build a Rule-based Associative Classifier, with the rules being sorted by frequency-weighted log-likelihood ratio (F-LLR). The top k rules, sorted by F-LLR, are then applied to any test record to score it. Individual features can also form rules by themselves, though the number of such rules in the top k rules and the whole rule-set is very small.
Our algorithm is based on Hadoop, and is thus very efficient in terms of speed.
DISC: Data-Intensive Similarity Measure for Categorical Data
DESAI ADITYA MAKARAND,HIMANSHU SINGH,Vikram Pudi
Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, 2011
@inproceedings{bib_DISC_2011, AUTHOR = {DESAI ADITYA MAKARAND, HIMANSHU SINGH, Vikram Pudi}, TITLE = {DISC: Data-Intensive Similarity Measure for Categorical Data}, BOOKTITLE = {Pacific-Asia Conference on Knowledge Discovery and Data Mining}, YEAR = {2011}}
The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, regression and search are major data mining techniques which compute the similarities between instances, and hence the choice of a particular similarity measure can turn out to be a major cause of success or failure of the algorithm. The notion of similarity or distance for categorical data is not as straightforward as for continuous data and hence is a major challenge. This is due to the fact that different values taken by a categorical attribute are not inherently ordered, and hence a notion of direct comparison between two categorical values is not possible. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. In this paper we present a new similarity measure for categorical data, DISC (Data-Intensive Similarity Measure for Categorical Data). DISC captures the semantics of the data without any help from a domain expert for defining the similarity. In addition to these, it is generic and simple to implement. These desirable features make it a very attractive alternative to existing approaches. Our experimental study compares it with 14 other similarity measures on 24 standard real datasets, out of which 12 are used for classification and 12 for regression, and shows that it is more accurate than all its competitors.
An efficient algorithm for ranking research papers based on citation network
ADITYA PRATAP SINGH,KUMAR SHUBHANKAR,Vikram Pudi
Conference on Data Mining and Optimization, DMO, 2011
@inproceedings{bib_An_e_2011, AUTHOR = {ADITYA PRATAP SINGH, KUMAR SHUBHANKAR, Vikram Pudi}, TITLE = {An efficient algorithm for ranking research papers based on citation network}, BOOKTITLE = {Conference on Data Mining and Optimization}, YEAR = {2011}}
In this paper we propose an efficient method to rank research papers from various fields of research published in various conferences over the years. This ranking method is based on the citation network. The importance of a research paper is captured well by the peer vote, which in this case is the research paper being cited in other research papers. Using a modified version of the PageRank algorithm, we rank the research papers, assigning each of them an authoritative score. Using the scores of the research papers calculated by the above-mentioned method, we formulate scores for conferences and authors and rank them as well. We have introduced a new metric in the algorithm which takes into account the time factor in ranking the research papers, to reduce the bias against recent papers, which get less time for being studied and consequently cited by researchers as compared to older papers. Often a researcher is more interested in finding the top conferences in a particular year rather than the overall conference ranking. Considering the year of publication of the papers, in addition to the paper scores we also calculated the year-wise score of each conference by a slight modification of the above-mentioned algorithm.
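For orientation, the sketch below runs the standard PageRank power iteration on a small citation graph, where an edge from a citing paper to a cited paper acts as the peer vote. It deliberately shows only the unmodified baseline: the paper's modifications, including its time-aware metric, are not specified in the abstract and are not reproduced here. The `damping` value of 0.85, the fixed iteration count, and the dictionary-based graph format are assumptions of this example.

```python
from typing import Dict, List


def pagerank(citations: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Plain PageRank; citations maps each paper to the papers it cites."""
    papers = set(citations) | {c for cited in citations.values() for c in cited}
    if not papers:
        return {}
    n = len(papers)
    score = {paper: 1.0 / n for paper in papers}
    for _ in range(iterations):
        # Papers that cite nothing spread their score evenly over all papers.
        dangling = sum(score[p] for p in papers if not citations.get(p))
        new_score = {p: (1.0 - damping) / n + damping * dangling / n for p in papers}
        for paper, cited in citations.items():
            for target in cited:
                new_score[target] += damping * score[paper] / len(cited)
        score = new_score
    return score
```

With `pagerank({"A": ["C"], "B": ["C"], "C": []})`, paper C ends up with the highest score because both other papers cite it. Conference or author scores could then be aggregated from paper scores, though the aggregation rule is again left unspecified by the abstract.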
A frequent keyword-set based algorithm for topic modeling and clustering of research papers
KUMAR SHUBHANKAR,ADITYA PRATAP SINGH,Vikram Pudi
Conference on Data Mining and Optimization, DMO, 2011
@inproceedings{bib_A_fr_2011, AUTHOR = {KUMAR SHUBHANKAR, ADITYA PRATAP SINGH, Vikram Pudi}, TITLE = {A frequent keyword-set based algorithm for topic modeling and clustering of research papers}, BOOKTITLE = {Conference on Data Mining and Optimization}. YEAR = {2011}}
In this paper we introduce a novel and efficient approach to detect topics in a large corpus of research papers. With the rapidly growing size of academic literature, the problem of topic detection has become a very challenging task. We present a unique approach that uses closed frequent keyword-sets to form topics. Our approach also provides a natural method to cluster the research papers into hierarchical, overlapping clusters using topics as the similarity measure. To rank the research papers in a topic cluster, we devise a modified PageRank algorithm that assigns an authoritative score to each research paper by considering the sub-graph in which the research paper appears. We test our algorithms on the DBLP dataset and experimentally show that our algorithms are fast, effective and scalable.
An Efficient Algorithm for Topic Ranking and Modeling Topic Evolution
KUMAR SHUBHANKAR,ADITYA PRATAP SINGH,Vikram Pudi
International Conference on Database and Expert Systems Applications, DEXA, 2011
@inproceedings{bib_An_E_2011, AUTHOR = {KUMAR SHUBHANKAR, ADITYA PRATAP SINGH, Vikram Pudi}, TITLE = {An Efficient Algorithm for Topic Ranking and Modeling Topic Evolution}, BOOKTITLE = {International Conference on Database and Expert Systems Applications}. YEAR = {2011}}
In this paper we introduce a novel and efficient approach to detect and rank topics in a large corpus of research papers. With the rapidly growing size of academic literature, the problem of topic detection and topic ranking has become a challenging task. We present a unique approach that uses closed frequent keyword-sets to form topics. We devise a modified time-independent PageRank algorithm that assigns an authoritative score to each topic by considering the sub-graph in which the topic appears, producing a ranked list of topics. The use of the citation network and the introduction of time invariance in the topic ranking algorithm reveal very interesting results. Our approach also provides a clustering technique for the research papers using topics as the similarity measure. We extend our algorithms to study various aspects of topic evolution, which gives interesting insight into trends in research areas over time. Our algorithms also detect hot topics and landmark topics over the years. We test our algorithms on the DBLP dataset and show that our algorithms are fast, effective and scalable.
Utilizing Term Proximity based Features to Improve Text Document Clustering.
SHASHANK PALIWAL,Vikram Pudi
International Conference on Knowledge Discovery and Information Retrieval, KDIR, 2011
@inproceedings{bib_Util_2011, AUTHOR = {SHASHANK PALIWAL, Vikram Pudi}, TITLE = {Utilizing Term Proximity based Features to Improve Text Document Clustering.}, BOOKTITLE = {International Conference on Knowledge Discovery and Information Retrieval}. YEAR = {2011}}
Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods rely on representing text documents using the simple Bag-of-Words (BOW) model which assumes that terms of a text document are independent of each other. Such single term analysis of the text completely ignores the underlying (semantic) structure of a document. In the literature, sufficient efforts have been made to enrich BOW representation using phrases and n-grams like bi-grams and tri-grams. These approaches take into account dependency only between adjacent terms or a continuous sequence of terms. However, while some of the dependencies exist between adjacent words, others are more distant. In this paper, we make an effort to enrich traditional document vector by adding the notion of term-pair features. A Term-Pair feature is a pair of two terms of the same document such that they may be adjacent to each other or distant. We investigate the process of term-pair selection and propose a methodology to select potential term-pairs from the given document. Utilizing term proximity between distant terms also allows some flexibility for two documents to be similar if they are about similar topics but with varied writing styles. Experimental results on standard web document data set show that the clustering performance is substantially improved by adding term-pair features.
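To make the notion of a term-pair feature concrete, here is a minimal sketch that counts unordered pairs of distinct terms co-occurring within a fixed number of positions of each other. The abstract does not describe the paper's actual selection methodology, so the window-based rule, the function name `term_pair_features` and the default window size are assumptions for illustration only.

```python
from collections import Counter
from typing import List


def term_pair_features(tokens: List[str], window: int = 5) -> Counter:
    """Count unordered pairs of distinct terms that occur within `window` positions."""
    pairs: Counter = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at most `window` tokens, so pairs may be adjacent or distant.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


# Hypothetical toy document.
doc = "data mining finds patterns in large data sets".split()
features = term_pair_features(doc, window=3)
```

The resulting counts could be appended to a bag-of-words vector as extra dimensions before clustering; how the pairs are weighted and pruned is left open here.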
Fuzzy association rule mining algorithm for fast and efficient performance on very large datasets
ASHISH MANGALAMPALLI,Vikram Pudi
International Conference on Fuzzy Systems, FUZZ, 2010
@inproceedings{bib_Fuzz_2010, AUTHOR = {ASHISH MANGALAMPALLI, Vikram Pudi}, TITLE = {Fuzzy association rule mining algorithm for fast and efficient performance on very large datasets}, BOOKTITLE = {International Conference on Fuzzy Systems}. YEAR = {2010}}
Fuzzy association rules use fuzzy logic to convert numerical attributes to fuzzy attributes, like “Income = High”, thus maintaining the integrity of information conveyed by such numerical attributes. On the other hand, crisp association rules use sharp partitioning to transform numerical attributes to binary ones like “Income = [100 K and above]”, and can potentially introduce loss of information due to these sharp ranges. Fuzzy Apriori and its different variations are the only popular fuzzy association rule mining (ARM) algorithms available today. Like the crisp version of Apriori, fuzzy Apriori is a very slow and inefficient algorithm for very large datasets (in the order of millions of transactions). Hence, we have come up with a new fuzzy ARM algorithm meant for fast and efficient performance on very large datasets. As compared to fuzzy Apriori, our algorithm is 8-19 times faster for the very large standard real-life dataset we have used for testing with various mining workloads, both typical and extreme ones. A novel combination of features like two-phased multiple-partition tidlist-style processing, byte-vector representation of tidlists, and fast compression of tidlists contribute a lot to the efficiency in performance. In addition, unlike most two-phased ARM algorithms, the second phase is totally different from the first one in the method of processing (individual itemset processing as opposed to simultaneous itemset processing at each k-level), and is also many times faster. Our algorithm also includes an effective preprocessing technique for converting a crisp dataset to a fuzzy dataset.
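The contrast between sharp partitioning and fuzzy attributes can be illustrated with a small sketch. The linear ramp membership function and the breakpoints (80 K and 120 K) below are assumptions chosen for illustration; the abstract does not specify how memberships are computed in the paper.

```python
def crisp_high(income: float, threshold: float = 100_000.0) -> int:
    """Sharp partition: an income is either in 'Income = [100 K and above]' or not."""
    return 1 if income >= threshold else 0


def fuzzy_high(income: float, low: float = 80_000.0, high: float = 120_000.0) -> float:
    """Fuzzy membership in 'Income = High': rises linearly from 0 at `low` to 1 at `high`."""
    if income <= low:
        return 0.0
    if income >= high:
        return 1.0
    return (income - low) / (high - low)


# Two nearly identical incomes fall on opposite sides of the crisp boundary,
# while their fuzzy memberships stay almost equal.
just_below = 99_999.0
just_above = 100_000.0
crisp_gap = crisp_high(just_above) - crisp_high(just_below)
fuzzy_gap = fuzzy_high(just_above) - fuzzy_high(just_below)
```

Under these assumed breakpoints, the crisp encoding separates the two incomes completely, whereas the fuzzy encoding assigns them almost the same degree of membership, which is the information loss at partition boundaries that the abstract refers to.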
SEAR Scalable, Efficient, Accurate, Robust kNN - based Regression
DESAI ADITYA MAKARAND,HIMANSHU SINGH,Vikram Pudi
International Conference on Knowledge Discovery and Information Retrieval, KDIR, 2010
@inproceedings{bib_SEAR_2010, AUTHOR = {DESAI ADITYA MAKARAND, HIMANSHU SINGH, Vikram Pudi}, TITLE = {SEAR Scalable, Efficient, Accurate, Robust kNN - based Regression}, BOOKTITLE = {International Conference on Knowledge Discovery and Information Retrieval}. YEAR = {2010}}
Regression algorithms are used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships. Statistical approaches, although popular, are not generic in that they require the user to make an intelligent guess about the form of the regression equation. In this paper we present a new regression algorithm SEAR – Scalable, Efficient, Accurate kNN-based Regression. In addition to this, SEAR is simple and outlier-resilient. These desirable features make SEAR a very attractive alternative to existing approaches. Our experimental study compares SEAR with fourteen other algorithms on five standard real datasets, and shows that SEAR is more accurate than all its competitors.
A Robust Active Learning Framework Using Itemset Based Dynamic Rule Sampling.
BHANUKIRAN VINZAMURI,Vikram Pudi
India Joint International Conference on Data Science & Management of Data, COMAD/CODS, 2010
@inproceedings{bib_A_Ro_2010, AUTHOR = {BHANUKIRAN VINZAMURI, Vikram Pudi}, TITLE = {A Robust Active Learning Framework Using Itemset Based Dynamic Rule Sampling.}, BOOKTITLE = {India Joint International Conference on Data Science & Management of Data}. YEAR = {2010}}
Active learning is a rapidly growing field of machine learning which aims at reducing the labeling effort of the oracle (human expert) in acquiring informative training samples in domains where the cost of labeling is high. Associative classification is a well established prediction method which possesses the advantages of high accuracy and faster learning rates in classification. In this paper, we propose a novel algorithm which unifies associative classification with active learning. The algorithm has two major procedures: rule generation and rule pruning. The algorithm selects unlabeled instances from the pool of available samples and uses a unique dynamic rule sampling procedure for updating the model. The rules are dynamically sampled class association rules (CAR) which are generated using the mined minimal infrequent itemsets. The results derived over 10 datasets from the UCI-ML repository for our approach have been compared with those from the ACTIVE-DECORATE algorithm. We also analyze our sampling method against the state-of-the-art sampling frameworks and show that our method performs better.
Mining Landmark Papers
ANNU TULI,Vikram Pudi
Workshop on Emerging Research Trends in Artificial Intelligence, ERTAI, 2010
@inproceedings{bib_Mini_2010, AUTHOR = {ANNU TULI, Vikram Pudi}, TITLE = {Mining Landmark Papers}, BOOKTITLE = {Workshop on Emerging Research Trends in Artificial Intelligence}. YEAR = {2010}}
In recent years, the number of electronic journal articles is growing faster than ever before; information is generated faster than people can deal with it. In order to handle this problem, many electronic periodical databases have proposed keyword search methods to decrease the effort and time spent by users in searching the journal's archives. However, the users still have to deal with a huge number of search results. In this paper, we present the problem of mining landmark papers. We treat papers that introduce important key phrases for the first time as landmark papers. Our approach combines simple ideas from text mining, information extraction and information retrieval to identify landmark papers. We show that existing related techniques such as first story detection, mining hot topics and theme mining do not effectively handle the landmark paper mining problem. Our approach is simpler and more direct for this task. We experimentally evaluate our approach on a large dataset of papers in the database or data mining areas downloaded using DBLP.
ProMax: A Profit Maximizing Recommendation System for Market Baskets
LYDIA MANIKONDA,ANNU TULI,Vikram Pudi
Workshop on Emerging Research Trends in Artificial Intelligence, ERTAI, 2010
@inproceedings{bib_ProM_2010, AUTHOR = {LYDIA MANIKONDA, ANNU TULI, Vikram Pudi}, TITLE = {ProMax: A Profit Maximizing Recommendation System for Market Baskets}, BOOKTITLE = {Workshop on Emerging Research Trends in Artificial Intelligence}. YEAR = {2010}}
Most data mining research has focused on developing algorithms to discover statistically significant patterns in large datasets. However, utilizing the resulting patterns in decision support is more of an art, and the utility of such patterns is often questioned. In this paper we formalize a technique that utilizes data mining concepts to recommend an optimal set of items to customers in a retail store based on the contents of their market baskets. The recommended set of items maximizes the expected profit of the store and is decided based on patterns learnt from past transactions. In addition to concepts of clustering and frequent itemsets, the proposed method also combines the idea of knapsack problems to decide on the items to recommend. We empirically compare our approach with existing methods on both real and synthetic datasets and show that our method yields better profits while being faster and simpler.
PERICASA
RAGHVENDRA MALL,PRAKHAR JAIN,Vikram Pudi,Bipin Indurkhya
International Conference on COGNITIVE INFORMATICS, ICCI, 2010
@inproceedings{bib_PERI_2010, AUTHOR = {RAGHVENDRA MALL, PRAKHAR JAIN, Vikram Pudi, Bipin Indurkhya}, TITLE = {PERICASA}, BOOKTITLE = {International Conference on COGNITIVE INFORMATICS}. YEAR = {2010}}
This paper presents a novel architecture, PERICASA: PERturbed frequent Itemset based classification for Computational Auditory Scene Analysis (CASA). A novel approach for perception of sound waves has been developed. Our aim is to develop a classifier which can correctly identify sound waves from noisy sound mixtures, i.e. to solve the classical ‘Cocktail Party Problem’. The architecture is based on Gestalt principles of grouping like Pragnanz, Proximity, Common Fate and Similarity. These grouping cues are incorporated into a new classification approach which is based on a concept called Perturbed Frequent Itemsets. The primary idea is that the more easily we can identify different feature values, the easier it is to identify the sound wave.
FPrep: Fuzzy clustering driven efficient automated pre-processing for fuzzy association rule mining
ASHISH MANGALAMPALLI,Vikram Pudi
International Conference on Fuzzy Systems, FUZZ, 2010
@inproceedings{bib_FPre_2010, AUTHOR = {ASHISH MANGALAMPALLI, Vikram Pudi}, TITLE = {FPrep: Fuzzy clustering driven efficient automated pre-processing for fuzzy association rule mining}, BOOKTITLE = {International Conference on Fuzzy Systems}. YEAR = {2010}}
Conventional Association Rule Mining (ARM) algorithms usually deal with datasets with binary values, and expect any numerical values to be converted to binary ones using sharp partitions, like Age = 25 to 60. In order to mitigate this constraint, Fuzzy logic is used to convert quantitative values of attributes to binary ones, so as to eliminate any loss of information arising due to sharp partitioning, especially at partition boundaries, and then generate fuzzy association rules. But, before any fuzzy ARM algorithm can be used, the original dataset (with crisp attributes) needs to be transformed into a form with fuzzy attributes. This paper describes a methodology, called FPrep, to do this pre-processing, which first involves using fuzzy clustering to generate fuzzy partitions, and then uses these partitions to get a fuzzy version (with fuzzy records) of the original dataset. Ultimately, the fuzzy data (fuzzy records) are represented in a standard manner such that they can be used as input to any kind of fuzzy ARM algorithm, irrespective of how it works and processes fuzzy data. We also show that FPrep is much faster than other such comparable transformation techniques, which in turn depend on non-fuzzy techniques, like hard clustering (CLARANS and CURE). Moreover, we illustrate the quality of the fuzzy partitions generated using FPrep, and the number of frequent itemsets generated by a fuzzy ARM algorithm when preceded by FPrep.
FACISME: Fuzzy associative classification using iterative scaling and maximum entropy
ASHISH MANGALAMPALLI,Vikram Pudi
International Conference on Fuzzy Systems, FUZZ, 2010
@inproceedings{bib_FACI_2010, AUTHOR = {ASHISH MANGALAMPALLI, Vikram Pudi}, TITLE = {FACISME: Fuzzy associative classification using iterative scaling and maximum entropy}, BOOKTITLE = {International Conference on Fuzzy Systems}. YEAR = {2010}}
All associative classifiers developed till now are crisp in nature, and thus use sharp partitioning to transform numerical attributes to binary ones like “Income = [100K and above]”. On the other hand, the novel fuzzy associative classification algorithm called FACISME, which we propose in this paper, uses fuzzy logic to convert numerical attributes to fuzzy attributes, like “Income = High”, thus maintaining the integrity of information conveyed by such numerical attributes. Moreover, FACISME is based on maximum entropy, and uses iterative scaling, both of which lend a very strong theoretical foundation to the algorithm. Entropy is one of the best measures of information, and maximum-entropy-based algorithms do not assume independence of parameters in the classification process. Thus, FACISME provides very good accuracy, and can work with all types of datasets (irrespective of size and type of attributes – numerical or binary) and domains.
Evolutionary Clustering using Frequent Itemsets
RAVI SHANKAR PRASAD,G V R KIRAN,Vikram Pudi
SIGKDD Workshop on Novel Data Stream Pattern Mining Techniques, StreamKDD, 2010
@inproceedings{bib_Evol_2010, AUTHOR = {RAVI SHANKAR PRASAD, G V R KIRAN, Vikram Pudi}, TITLE = {Evolutionary Clustering using Frequent Itemsets}, BOOKTITLE = {SIGKDD Workshop on Novel Data Stream Pattern Mining Techniques}. YEAR = {2010}}
Evolutionary clustering is an emerging research area addressing the problem of clustering dynamic data. An evolutionary clustering should take care of two conflicting criteria: preserving the current cluster quality and not deviating too much from the recent history. In this paper we propose an algorithm for evolutionary clustering using frequent itemsets. A frequent itemset based approach for evolutionary clustering is natural and it automatically satisfies the two criteria of evolutionary clustering. We provide theoretical as well as experimental proofs to support our claims. We performed experiments on our approach using different datasets and the results show that our approach is comparable to most of the existing algorithms for evolutionary clustering.
PAGER: Parameterless, Accurate, Generic, Efficient kNN-Based Regression
HIMANSHU SINGH,DESAI ADITYA MAKARAND,Vikram Pudi
International Conference on Database and Expert Systems Applications, DEXA, 2010
@inproceedings{bib_PAGE_2010, AUTHOR = {HIMANSHU SINGH, DESAI ADITYA MAKARAND, Vikram Pudi}, TITLE = {PAGER: Parameterless, Accurate, Generic, Efficient kNN-Based Regression}, BOOKTITLE = {International Conference on Database and Expert Systems Applications}. YEAR = {2010}}
The problem of regression is to estimate the value of a dependent numeric variable based on the values of one or more independent variables. Regression algorithms are used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships. Although this problem has been studied extensively, most of these approaches are not generic in that they require the user to make an intelligent guess about the form of the regression equation. In this paper we present a new regression algorithm PAGER – Parameterless, Accurate, Generic, Efficient kNN-based Regression. PAGER is also simple and outlier-resilient. These desirable features make PAGER a very attractive alternative to existing approaches. Our experimental study compares PAGER with 12 other algorithms on 4 standard real datasets, and shows that PAGER is more accurate than its competitors.
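For readers unfamiliar with kNN-based regression, the baseline idea can be sketched as follows: predict the target of a query point as the mean target of its k nearest training points. This is the textbook method only, not PAGER itself; in particular, the fixed `k` below is a user-supplied parameter, which PAGER is described as avoiding. The function name `knn_regress` and the toy data are assumptions for this example.

```python
import math
from typing import List, Sequence, Tuple


def knn_regress(train: List[Tuple[Sequence[float], float]],
                query: Sequence[float], k: int = 3) -> float:
    """Predict the mean target of the k training rows closest to `query` (Euclidean)."""
    if not train:
        raise ValueError("training set must not be empty")
    # Sort rows by distance to the query and keep the k closest ones.
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical toy data: one feature, target roughly twice the feature value.
train = [((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0), ((10.0,), 20.0)]
prediction = knn_regress(train, (2.0,), k=3)
```

Note that no regression equation is assumed anywhere in this sketch, which is the sense in which kNN-based approaches are generic.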
Specialty mining
HANUMA KUMAR ANUMANULA,Rohit Paravastu,Vikram Pudi
International Conference on Big Data Analysis and Knowledge Discovery, BDAKD, 2010
@inproceedings{bib_Spec_2010, AUTHOR = {HANUMA KUMAR ANUMANULA, Rohit Paravastu, Vikram Pudi}, TITLE = {Specialty mining}, BOOKTITLE = {International Conference on Big Data Analysis and Knowledge Discovery}. YEAR = {2010}}
In this paper, we consider the problem of mining the special properties of a given record in a relational dataset. In our formulation, a property is a combination of multiple attribute-value pairs. The support of a property is the number of records that satisfy it. We consider a property as special if its support occurs to us as a shock and the measure of this shock factor is more than a user-defined threshold η. We provide a way to define this notion of shock based on entropy. We also output the shock factor for records in the dataset in a convenient, easily-interpretable manner. An illustrated example is provided on how users can interpret the results. Experiments on real and synthetic data sets reveal interesting properties of data records that cannot be mined using traditional approaches.
Frequent Itemset based Hierarchical Document Clustering using Wikipedia as External Knowledge
G V R KIRAN,K RAVI SHANKAR,Vikram Pudi
International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, KES, 2010
@inproceedings{bib_Freq_2010, AUTHOR = {G V R KIRAN, K RAVI SHANKAR, Vikram Pudi}, TITLE = {Frequent Itemset based Hierarchical Document Clustering using Wikipedia as External Knowledge}, BOOKTITLE = {International Conference on Knowledge-Based and Intelligent Information & Engineering Systems}. YEAR = {2010}}
High dimensionality is a major challenge in document clustering. Some of the recent algorithms address this problem by using frequent itemsets for clustering. But most of these algorithms neglect the semantic relationship between the words. On the other hand, there are algorithms that take care of the semantic relations between the words by making use of external knowledge contained in WordNet, Mesh, Wikipedia, etc., but do not handle the high dimensionality. In this paper we present an efficient solution that addresses both these problems. We propose a hierarchical clustering algorithm using closed frequent itemsets that uses Wikipedia as external knowledge to enhance the document representation. We evaluate our methods based on F-Score on standard datasets and show our results to be better than existing approaches.
Gear: Generic, efficient, accurate kNN-based regression
DESAI ADITYA MAKARAND,HIMANSHU SINGH,Vikram Pudi
International Conference on Knowledge Discovery and Information Retrieval, KDIR, 2010
@inproceedings{bib_Gear_2010, AUTHOR = {DESAI ADITYA MAKARAND, HIMANSHU SINGH, Vikram Pudi}, TITLE = {Gear: Generic, efficient, accurate kNN-based regression}, BOOKTITLE = {International Conference on Knowledge Discovery and Information Retrieval}. YEAR = {2010}}
Regression algorithms are used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships. Statistical approaches, although popular, are not generic in that they require the user to make an intelligent guess about the form of the regression equation. In this paper we present a new regression algorithm GEAR – Generic, Efficient, Accurate kNN-based Regression. In addition to this, GEAR is simple and outlier-resilient. These desirable features make GEAR a very attractive alternative to existing approaches. Our experimental study compares GEAR with fourteen other algorithms on five standard real datasets, and shows that GEAR is more accurate than all its competitors.
PERFICT: Perturbed Frequent Itemset based Classification Technique
RAGHVENDRA MALL,PRAKHAR JAIN,Vikram Pudi
International Conference on Tools with Artificial Intelligence, ICTAI, 2010
@inproceedings{bib_PERF_2010, AUTHOR = {RAGHVENDRA MALL, PRAKHAR JAIN, Vikram Pudi}, TITLE = {PERFICT: Perturbed Frequent Itemset based Classification Technique}, BOOKTITLE = {International Conference on Tools with Artificial Intelligence}. YEAR = {2010}}
This paper presents Perturbed Frequent Itemset based Classification Technique (PERFICT), a novel associative classification approach based on perturbed frequent itemsets. Most of the existing associative classifiers work well on transactional data where each record contains a set of boolean items. They are not very effective in general for relational data that typically contains real valued attributes. In PERFICT, we handle real attributes by treating items as (attribute, value) pairs, where the value is not the original one, but is perturbed by a small amount and is a range based value. We also propose our own similarity measure which captures the nature of real valued attributes and provides effective weights for the itemsets. The probabilistic contributions of different itemsets are taken into consideration during classification. Some of the applications where such a technique is useful are signal classification, medical diagnosis and handwriting recognition. Experiments conducted on the UCI Repository datasets show that PERFICT is highly competitive in terms of accuracy in comparison with popular associative classification methods.
UACI: Uncertain associative classifier for object class identification in images
LYDIA MANIKONDA,ASHISH MANGALAMPALLI,Vikram Pudi
International Conference of Image and Vision Computing, IVCNZ, 2010
@inproceedings{bib_UACI_2010, AUTHOR = {LYDIA MANIKONDA, ASHISH MANGALAMPALLI, Vikram Pudi}, TITLE = {UACI: Uncertain associative classifier for object class identification in images}, BOOKTITLE = {International Conference of Image and Vision Computing}. YEAR = {2010}}
Uncertainty is inherently present in many real-world domains like images. Analyses of such uncertain data using traditional certain-data-oriented techniques do not achieve best possible accuracy. UACI introduces the concept of representing images in the form of a probabilistic or uncertain model using interest points in images. This model is an uncertain-data-based adaptation of Bag of Words, with each image not only represented by the visual words that it contains, but also their respective probabilities of occurrence in the image. UACI uses an Associative Classification approach to leverage latent frequent patterns in images for the identification of object classes. Unlike most image classifiers, which rely on positive and negative class sets (generally very vague) for training, UACI uses only positive class images for training. We empirically compare UACI with three other state-of-the-art image classifiers, and show that UACI performs much better than the other classifying approaches.
Variations and Trends in Hot Topics in News Feeds.
RAGHVENDRA MALL,NEERAJ BAGDIA,Vikram Pudi
International Conference on Management of Data, COMAD, 2009
@inproceedings{bib_Vari_2009, AUTHOR = {RAGHVENDRA MALL, NEERAJ BAGDIA, Vikram Pudi}, TITLE = {Variations and Trends in Hot Topics in News Feeds.}, BOOKTITLE = {International Conference on Management of Data}. YEAR = {2009}}
We describe improved mechanisms to accurately classify days when news for topics receives an unexpectedly high amount of coverage. We further investigate the factors which influence this classification using ‘Presidential Elections’ as the topic of interest. This helps in bringing out useful trends and relations between days with hot topics by varying variables like history window size, van-ratio etc. We also propose a statistical scheme to approximate major events related to the topic. We then try to approximate the chain of events related to the major events. This can support a news alert service and also serve the purpose of automatically tracking news which follows up major events.
Uniqueness Mining
Rohit Paravastu,HANUMA KUMAR ANUMANULA,Vikram Pudi
International Conference on Database Systems for Advanced Applications, DASFAA, 2008
@inproceedings{bib_Uniq_2008, AUTHOR = {Rohit Paravastu, HANUMA KUMAR ANUMANULA, Vikram Pudi}, TITLE = {Uniqueness Mining}, BOOKTITLE = {International Conference on Database Systems for Advanced Applications}. YEAR = {2008}}
In this paper we consider the problem of extracting the special properties of any given record in a dataset. We are interested in determining what makes a given record unique or different from the majority of the records in a dataset. In the real world, records typically represent objects or people and it is often worthwhile to know what special properties are present in each object or person, so that we can make the best use of them. This problem has not been considered earlier in the research literature. We approach this problem using ideas from clustering, attribute oriented induction (AOI) and frequent itemset mining. Most of the time consuming work is done in a preprocessing stage and the online computation of the uniqueness of a given record is instantaneous.
FISH: A Practical System for Fast Interactive Image Search in Huge Databases
PRADHEE TANDON,PIYUSH NIGAM,Vikram Pudi,Jawahar C V
International Conference on Image and Video Retrieval, CIVR, 2008
@inproceedings{bib_FISH_2008, AUTHOR = {PRADHEE TANDON, PIYUSH NIGAM, Vikram Pudi, Jawahar C V}, TITLE = {FISH: A Practical System for Fast Interactive Image Search in Huge Databases}, BOOKTITLE = {International Conference on Image and Video Retrieval}. YEAR = {2008}}
The problem of search and retrieval of images using relevance feedback has attracted tremendous attention in recent years from the research community. A real-world-deployable interactive image retrieval system must (1) be accurate, (2) require minimal user-interaction, (3) be efficient, (4) be scalable to large collections (millions) of images, and (5) support multi-user sessions. For good accuracy, we need effective methods for learning the relevance of image features based on user feedback, both within a user-session and across sessions. Efficiency and scalability require a good index structure for retrieving results. The index structure must allow for the relevance of image features to continually change with fresh queries and user-feedback. The state-of-the-art methods available today each address only a subset of these issues. In this paper, we build a complete system FISH - Fast Image Search in Huge databases. In FISH, we integrate selected techniques available in the literature, while adding a few of our own. We perform extensive experiments on real datasets to demonstrate the accuracy, efficiency and scalability of FISH. Our results show that the system can easily scale to millions of images while maintaining interactive response time.
REBMEC: Repeat Based Maximum Entropy Classifier for Biological Sequences.
PRATIBHA RANI,Vikram Pudi
International Conference on Management of Data, COMAD, 2008
@inproceedings{bib_REBM_2008, AUTHOR = {PRATIBHA RANI, Vikram Pudi}, TITLE = {REBMEC: Repeat Based Maximum Entropy Classifier for Biological Sequences.}, BOOKTITLE = {International Conference on Management of Data}. YEAR = {2008}}
An important problem in biological data analysis is to predict the family of a newly discovered sequence like a protein or DNA sequence, using the collection of available sequences. In this paper we tackle this problem and present REBMEC, a Repeat Based Maximum Entropy Classifier of biological sequences. Maximum entropy models are known to be theoretically robust and yield high accuracy, but are slow. This makes them useful as benchmarks to evaluate other classifiers. Specifically, REBMEC is based on the classical Generalized Iterative Scaling (GIS) algorithm and incorporates repeated occurrences of subsequences within each sequence. REBMEC uses maximal frequent subsequences as features but can support other types of features as well. Our extensive experiments on two collections of protein families show that REBMEC performs as well as existing state-of-the-art probabilistic classifiers for biological sequences without using domain-specific background knowledge such as multiple alignment, data transformation and complex feature extraction methods. The design of REBMEC is based on generic ideas that can apply to other domains where data is organized as collections of sequences.
RBNBC: Repeat Based Naive Bayes Classifier for Biological Sequences
PRATIBHA RANI,Vikram Pudi
International Conference on Data Mining, ICDM, 2008
@inproceedings{bib_RBNB_2008, AUTHOR = {PRATIBHA RANI, Vikram Pudi}, TITLE = {RBNBC: Repeat Based Naive Bayes Classifier for Biological Sequences}, BOOKTITLE = {International Conference on Data Mining}. YEAR = {2008}}
In this paper, we present RBNBC, a Repeat Based Naive Bayes Classifier of bio-sequences that uses maximal frequent subsequences as features. RBNBC’s design is based on generic ideas that can apply to other domains where the data is organized as collections of sequences. Specifically, RBNBC uses a novel formulation of Naive Bayes that incorporates repeated occurrences of subsequences within each sequence. Our extensive experiments on two collections of protein families show that it performs as well as existing state-of-the-art probabilistic classifiers for bio-sequences. This is surprising as it is a pure data mining based generic classifier that does not require domain-specific background knowledge. We note that domain-specific ideas could further increase its performance.
Efficient search with changing similarity measures on large multimedia datasets
NATARAJ J,Vikram Pudi,Jawahar C V
International Conference on MultiMedia Modeling, MMM, 2007
@inproceedings{bib_Effi_2007, AUTHOR = {NATARAJ J, Vikram Pudi, Jawahar C V}, TITLE = {Efficient search with changing similarity measures on large multimedia datasets}, BOOKTITLE = {International Conference on MultiMedia Modeling}. YEAR = {2007}}
In this paper, we consider the problem of finding the k most similar objects given a query object, in large multimedia datasets. We focus on scenarios where the similarity measure itself is not fixed, but is continuously being refined with user feedback. Conventional database techniques for efficient similarity search are not effective in this environment as they take a specific similarity/distance measure as input and build index structures tuned for that measure. Our approach works effectively in this environment as validated by the experimental study where we evaluate it over a wide range of datasets. The experiments show it to be efficient and scalable. In fact, on all our datasets, the response times were within a few seconds, making our approach suitable for interactive applications.
Using prefix-trees for efficiently computing set joins
J RAVINDRANATH CHOWDARY,Vikram Pudi
International Conference on Database Systems for Advanced Applications, DASFAA, 2005
@inproceedings{bib_Usin_2005, AUTHOR = {J RAVINDRANATH CHOWDARY, Vikram Pudi}, TITLE = {Using prefix-trees for efficiently computing set joins}, BOOKTITLE = {International Conference on Database Systems for Advanced Applications}. YEAR = {2005}}
Joins on set-valued attributes (set joins) have numerous database applications. In this paper we propose PRETTI (PREfix Tree based seT joIn), a suite of set join algorithms for containment, overlap and equality join predicates. Our algorithms use prefix trees and inverted indices. These structures are constructed on-the-fly if they are not already precomputed. This feature makes our algorithms usable for relations without indices and when joining intermediate results during join queries with more than two relations. Another feature of our algorithms is that results are output continuously during their execution and not just at the end. Experiments on real life datasets show that the total execution time of our algorithms is significantly less than that of previous approaches, even when the indices required by our algorithms are not precomputed.
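To make the containment predicate concrete, here is a minimal sketch of a set containment join that builds an inverted index on the fly. It is not the PRETTI algorithm itself, which also relies on prefix trees; the function name `containment_join` and the toy relations `r` and `s` are assumptions made for illustration.

```python
from collections import defaultdict


def containment_join(r, s):
    """Return pairs (r_id, s_id) where set r[r_id] is a subset of set s[s_id].

    r and s map record ids to sets of items. Hypothetical helper for
    illustration only; not the PRETTI algorithm from the paper.
    """
    # Build the inverted index on the fly: item -> ids of S records containing it.
    index = defaultdict(set)
    for s_id, items in s.items():
        for item in items:
            index[item].add(s_id)

    all_s_ids = set(s)
    results = []
    for r_id, items in r.items():
        # Start from every S record; an empty set is contained in all of them.
        candidates = set(all_s_ids)
        for item in items:
            # Keep only S records that contain this item as well.
            candidates &= index.get(item, set())
            if not candidates:
                break
        results.extend((r_id, s_id) for s_id in sorted(candidates))
    return results


# Toy relations (assumed data, not from the paper).
r = {"r1": {"a", "b"}, "r2": {"c"}}
s = {"s1": {"a", "b", "c"}, "s2": {"a", "c"}}
pairs = containment_join(r, s)
```

Intersecting posting lists item by item lets the loop stop as soon as no candidate remains. Results for each record of `r` are available before the next record is processed, which loosely mirrors the continuous output described in the abstract.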
ACME: An associative classifier based on maximum entropy principle
RISI VARDHAN THONANGI,Vikram Pudi
Algorithmic Learning Theory, ALT, 2005
@inproceedings{bib_ACME_2005, AUTHOR = {RISI VARDHAN THONANGI, Vikram Pudi}, TITLE = {ACME: An associative classifier based on maximum entropy principle}, BOOKTITLE = {Algorithmic Learning Theory}. YEAR = {2005}}
Generalized Closed Itemsets for Association Rule Mining
Vikram Pudi,Jayant R. Haritsa
International Conference on Data Engineering, ICDE, 2003
@inproceedings{bib_Gene_2003, AUTHOR = {Vikram Pudi, Jayant R. Haritsa}, TITLE = {Generalized Closed Itemsets for Association Rule Mining}, BOOKTITLE = {International Conference on Data Engineering}. YEAR = {2003}}
The output of boolean association rule mining algorithms is often too large for manual examination. For dense datasets, it is often impractical to even generate all frequent itemsets. The closed itemset approach handles this information overload by pruning “uninteresting” rules following the observation that most rules can be derived from other rules. In this paper, we propose a new framework, namely, the generalized closed (or g-closed) itemset framework. By allowing for a small tolerance in the accuracy of itemset supports, we show that the number of such redundant rules is far more than what was previously estimated. Our scheme can be integrated into both levelwise algorithms (Apriori) and two-pass algorithms (ARMOR). We evaluate its performance by measuring the reduction in output size as well as in response time. Our experiments show that incorporating g-closed itemsets provides significant performance improvements on a variety of databases.
Reducing rule covers with deterministic error bounds
Vikram Pudi,Jayant R. Haritsa
Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, 2003
@inproceedings{bib_Redu_2003, AUTHOR = {Vikram Pudi, Jayant R. Haritsa}, TITLE = {Reducing rule covers with deterministic error bounds}, BOOKTITLE = {Pacific-Asia Conference on Knowledge Discovery and Data Mining}. YEAR = {2003}}
The output of boolean association rule mining algorithms is often too large for manual examination. For dense datasets, it is often impractical to even generate all frequent itemsets. The closed itemset approach handles this information overload by pruning “uninteresting” rules following the observation that most rules can be derived from other rules. In this paper, we propose a new framework, namely, the generalized closed (or g-closed) itemset framework. By allowing for a small tolerance in the accuracy of itemset supports, we show that the number of such redundant rules is far more than what was previously estimated. Our scheme can be integrated into both levelwise algorithms (Apriori) and two-pass algorithms (ARMOR). We evaluate its performance by measuring the reduction in output size as well as in response time. Our experiments show that incorporating g-closed itemsets provides significant performance improvements on a variety of databases.
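The effect of a support tolerance on the number of retained itemsets can be seen on toy data. The sketch below uses the standard definition of a closed itemset (no proper superset has the same support) and then relaxes it with a `tolerance` parameter. That relaxation is an assumed simplification for illustration, not the paper's formal g-closed definition; the function names and the transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force support counting; suitable for toy data only."""
    items = sorted(set().union(*transactions))
    supports = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_support:
                supports[frozenset(candidate)] = count
    return supports


def closed_itemsets(supports, tolerance=0):
    """Keep itemsets with no proper superset whose support is within `tolerance`.

    tolerance=0 gives ordinary closed itemsets. A positive tolerance is an
    assumed simplification of the g-closed idea, not the paper's definition.
    """
    kept = {}
    for itemset, count in supports.items():
        redundant = any(
            itemset < other and count - other_count <= tolerance
            for other, other_count in supports.items()
        )
        if not redundant:
            kept[itemset] = count
    return kept


# Toy transactions (assumed data).
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a"}]
supports = frequent_itemsets(transactions, min_support=2)
exact = closed_itemsets(supports)
relaxed = closed_itemsets(supports, tolerance=1)
```

On this data, seven frequent itemsets reduce to three closed itemsets, and to a single itemset once a tolerance of one transaction is allowed.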
ARMOR: Association Rule Mining based on ORacle.
Vikram Pudi,Jayant R. Haritsa
International Conference on Data Mining Workshops, ICDM-W, 2003
@inproceedings{bib_ARMO_2003, AUTHOR = {Vikram Pudi, Jayant R. Haritsa}, TITLE = {ARMOR: Association Rule Mining based on ORacle.}, BOOKTITLE = {International Conference on Data Mining Workshops}. YEAR = {2003}}
In this paper, we first focus our attention on the question of how much space remains for performance improvement over current association rule mining algorithms. Our strategy is to compare their performance against an “Oracle algorithm” that knows in advance the identities of all frequent itemsets in the database and only needs to gather their actual supports to complete the mining process. Our experimental results show that current mining algorithms do not perform uniformly well with respect to the Oracle for all database characteristics and support thresholds. In many cases there is a substantial gap between the Oracle’s performance and that of the current mining algorithms. Second, we present a new mining algorithm, called ARMOR, that is constructed by making minimal changes to the Oracle algorithm. ARMOR consistently performs within a factor of two of the Oracle on both real and synthetic datasets over practical ranges of support specifications.
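The Oracle idea, counting supports for itemsets whose identities are already known, can be sketched as a single pass over the database. This is a naive stand-in under stated assumptions: the function name `oracle_supports` and the data are hypothetical, and the paper's Oracle may use more efficient data structures than the plain dictionary shown here.

```python
def oracle_supports(transactions, known_frequent):
    """Count supports of itemsets known in advance, in one pass over the data.

    Hypothetical illustration; not the Oracle implementation from the paper.
    """
    counts = {frozenset(itemset): 0 for itemset in known_frequent}
    for transaction in transactions:
        for itemset in counts:
            # An itemset is supported by a transaction that contains all its items.
            if itemset <= transaction:
                counts[itemset] += 1
    return counts


# Toy database and a pre-supplied list of frequent itemsets (assumed data).
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
counts = oracle_supports(transactions, [{"a"}, {"a", "b"}])
```

Because no candidate generation is needed, the remaining cost is support counting alone, which is what makes the Oracle a useful baseline for comparison.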