Nisansa Dilushan de Silva




Graduate Student
Department of Computer and Information Science,
University of Oregon
 

Lecturer (On study leave)
Department of Computer Science & Engineering,
University of Moratuwa
 
I am a doctoral candidate at the Department of Computer and Information Science, University of Oregon, OR, USA. My research interests are in the area of Artificial Intelligence, specifically in the sub-domains of Natural Language Processing, Machine Learning, and Data Mining.

I am currently on study leave from my position as a lecturer at the Dept. of Computer Science & Engineering, University of Moratuwa, Sri Lanka.

I received a BSc (Hons) in Computer Science & Engineering from the University of Moratuwa in 2011 with a First Class distinction. A pdf of the academic transcript is available here.

I received an MSc in Computer & Information Science from the University of Oregon in 2016 with a GPA of 4.13. A pdf of the academic transcript is available here.



Curriculum Vitae


Publications

|| Citations: 151 || h-index: 8 || i10-index: 4 ||


Book chapters

Relational Databases and Biomedical Big Data

Bioinformatics in MicroRNA Research
2017

In various biomedical applications that collect, handle, and manipulate data, the amounts of data tend to build up and venture into the range identified as big data. In such occurrences, a design decision has to be made as to what type of database would be used to handle this data. More often than not, the default and classical solution to this in the biomedical domain, according to past research, is relational databases. While this used to be the norm for a long while, it is evident that there is a trend to move away from relational databases in favor of other types and paradigms of databases. However, it is still of paramount importance to understand the interrelation that exists between biomedical big data and relational databases. This chapter reviews the pros and cons of using relational databases to store biomedical big data that previous studies have discussed and used.

N. H. Nisansa D de Silva, "Relational Databases and Biomedical Big Data," Bioinformatics in MicroRNA Research, pp. 69--81, 2017, [pdf] [bib]
Citation count: 1



Journal Papers

Word Vector Embeddings and Domain Specific Semantic based Semi-Supervised Ontology Instance Population

ICTer
2018

An ontology defines a set of representational primitives which model a domain of knowledge or discourse. With rising fields such as information extraction and knowledge management, the role of ontology has become a driving factor of many modern day systems. Ontology population, on the other hand, is an inherently problematic process, as it needs manual intervention to prevent conceptual drift. Semantically sensitive word embedding has become a popular topic in natural language processing for its capability to cope with semantic challenges. Incorporating domain specific semantic similarity with word embeddings could potentially improve the performance in terms of semantic similarity in specific domains. Thus, in this study we propose a novel way of semi-supervised ontology population with word embeddings and domain specific semantic similarity as the basis. We built several models, including traditional benchmark models and new types of models based on word embeddings. Finally, we ensembled them together to come up with a synergistic model which outperformed the best-performing candidate model by 33%.

V. Jayawardana, D. Lakmal, N. de Silva, A. Perera, K. Sugathadasa, B. Ayesha, and M. Perera, "Word Vector Embeddings and Domain Specific Semantic based Semi-Supervised Ontology Instance Population," ICTer, vol. 11, no. 1, 2018, [pdf] [bib]
Citation count: 1


Concept and Attention-Based CNN for Question Retrieval in Multi-View Learning

ACM Transactions on Intelligent Systems and Technology (TIST)
2018

Question retrieval, which aims to find similar versions of a given question, is playing a pivotal role in various question answering (QA) systems. This task is quite challenging, mainly in regard to five aspects: synonymy, polysemy, word order, question length, and data sparsity. In this article, we propose a unified framework to simultaneously handle these five problems. We use the word combined with corresponding concept information to handle the synonymy problem and the polysemous problem. Concept embedding and word embedding are learned at the same time from both the context-dependent and context-independent views. To handle the word-order problem, we propose a high-level feature-embedded convolutional semantic model to learn question embedding by inputting concept embedding and word embedding. Due to the fact that the lengths of some questions are long, we propose a value-based convolutional attentional method to enhance the proposed high-level feature-embedded convolutional semantic model in learning the key parts of the question and the answer. The proposed high-level feature-embedded convolutional semantic model nicely represents the hierarchical structures of word information and concept information in sentences with their layer-by-layer convolution and pooling. Finally, to resolve data sparsity, we propose using the multi-view learning method to train the attention-based convolutional semantic model on question-answer pairs. To the best of our knowledge, we are the first to propose simultaneously handling the above five problems in question retrieval using one framework. Experiments on three real question-answering datasets show that the proposed framework significantly outperforms the state-of-the-art solutions.

P. Wang, L. Ji, J. Yan, D. Dou, N. de Silva, Y. Zhang, and L. Jin, "Concept and Attention-Based CNN for Question Retrieval in Multi-View Learning," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 9, no. 4, pp. 41, 2018, [pdf] [bib]
Citation count: 2


Online Reasoning for Semantic Error Detection in Text

Journal on Data Semantics
2017

Identifying incorrect content (i.e., semantic error) in text is a difficult task because of the ambiguous nature of written natural language and the many factors that can make a statement semantically erroneous. Current methods identify semantic errors in a sentence by determining whether it contradicts the domain to which the sentence belongs. However, because these methods are constructed on expected logic contradictions, they cannot handle new or unexpected semantic errors. In this paper, we propose a new method for detecting semantic errors that is based on logic reasoning. Our proposed method converts text into logic clauses, which are later analyzed against a domain ontology by an automatic reasoner to determine its consistency. This approach can provide a complete analysis of the text, since it can analyze a single sentence or sets of multiple sentences. When there are multiple sentences to analyze, in order to avoid the high complexity of reasoning over a large set of logic clauses, we propose rules that reduce the set of sentences to analyze, based on the logic relationships between sentences. In our evaluation, we have found that our proposed method can identify a significant percentage of semantic errors and, in the case of multiple sentences, it does so without significant computational cost. We have also found that both the quality of the information extraction output and modeling elements of the ontology (i.e., property domain and range) affect the capability of detecting errors.

F. Gutierrez, D. Dou, N. de Silva, and S. Fickas, "Online Reasoning for Semantic Error Detection in Text," Journal on Data Semantics, 2017, [pdf] [bib]
Citation count: 1


OmniSearch: a semantic search system based on the Ontology for MIcroRNA Target (OMIT) for microRNA-target gene interaction data

Journal of biomedical semantics
2016

As a special class of non-coding RNAs (ncRNAs), microRNAs (miRNAs) perform important roles in numerous biological and pathological processes. The realization of miRNA functions depends largely on how miRNAs regulate specific target genes. It is therefore critical to identify, analyze, and cross-reference miRNA-target interactions to better explore and delineate miRNA functions. Semantic technologies can help in this regard. We previously developed a miRNA domain-specific application ontology, Ontology for MIcroRNA Target (OMIT), whose goal was to serve as a foundation for semantic annotation, data integration, and semantic search in the miRNA field. In this paper we describe our continuing effort to develop the OMIT, and demonstrate its use within a semantic search system, OmniSearch, designed to facilitate knowledge capture of miRNA-target interaction data. Important changes in the current version OMIT are summarized as: (1) following a modularized ontology design (with 2559 terms imported from the NCRO ontology); (2) encoding all 1884 human miRNAs (vs. 300 in previous versions); and (3) setting up a GitHub project site along with an issue tracker for more effective community collaboration on the ontology development. The OMIT ontology is free and open to all users, accessible at: http://purl.obolibrary.org/obo/omit.owl. The OmniSearch system is also free and open to all users, accessible at: http://omnisearch.soc.southalabama.edu/index.php/Software.

J. Huang, F. Gutierrez, H. Strachan, D. Dou, W. Huang, B. Smith, J. Blake, K. Eilbeck, D. Natale, Y. Lin, B. Wu, N. de Silva, and others, "OmniSearch: a semantic search system based on the Ontology for MIcroRNA Target (OMIT) for microRNA-target gene interaction data," Journal of biomedical semantics, vol. 7, no. 1, pp. 25, 2016, [pdf] [bib]
Citation count: 25


The development of non-coding RNA ontology

International journal of data mining and bioinformatics
2016

Identification of non-coding RNAs (ncRNAs) has been significantly improved over the past decade. On the other hand, semantic annotation of ncRNA data is facing critical challenges due to the lack of a comprehensive ontology to serve as common data elements and data exchange standards in the field. We developed the Non-Coding RNA Ontology (NCRO) to handle this situation. By providing a formally defined ncRNA controlled vocabulary, the NCRO aims to fill a specific and highly needed niche in semantic annotation of large amounts of ncRNA biological and clinical data.

J. Huang, K. Eilbeck, B. Smith, J. Blake, D. Dou, W. Huang, D. Natale, A. Ruttenberg, J. Huan, M. Zimmermann, Gu. Jiang, Y. Lin, B. Wu, H. Strachan, Ni. de Silva, and others, "The development of non-coding RNA ontology," International journal of data mining and bioinformatics, vol. 15, no. 3, pp. 214--232, 2016, [pdf] [bib]
Citation count: 13


The Potential of Mobile Network Big Data as a Tool in Colombo's Transportation and Urban Planning

Information Technologies & International Development
2016

Rapid urban population growth is straining transportation systems. A big data-centric approach to transportation management is already a reality in many developed economies, with transportation systems being fed a large quantity of sensor data. Developing countries, by contrast, rely heavily on infrequent and expensive surveys. With mobile phone use becoming ubiquitous, even in developing countries, there is potential to leverage data from citizens' mobile phone use for transportation planning. Such data can allow planners to produce insights quickly, without waiting for the proliferation of sensors. Using mobile network big data (MNBD) from Sri Lanka, our article explores this potential, producing mobility-related insights for the capital city of Colombo. MNBD-based insights cannot produce all the insights needed, but the high frequency and spatial resolution of the insights that they do provide can complement existing infrequent surveys. For resource-constrained developing economies, even an incremental advance in their ability to produce timely and actionable knowledge can improve existing transportation and urban planning. However, more research will be required before such techniques can be mainstreamed.

S. Lokanathan, G. Kreindler, N. H. Nisansa de Silva, Y. Miyauchi, D. Dhananjaya, and R. Samarajiva, "The Potential of Mobile Network Big Data as a Tool in Colombo's Transportation and Urban Planning," Information Technologies \& International Development, vol. 12, no. 2, pp. pp--63, 2016, [pdf] [bib]
Citation count: 4



Conference Papers

Identifying Relationships Among Sentences in Court Case Transcripts Using Discourse Relations

Advances in ICT for Emerging Regions (ICTer), 2018 Eighteenth International Conference on
2018

Case Law has a significant impact on the proceedings of legal cases. Therefore, the information that can be obtained from previous court cases is valuable to lawyers and other legal officials when performing their duties. This paper describes a methodology of applying discourse relations between sentences when processing text documents related to the legal domain. In this study, we developed a mechanism to classify the relationships that can be observed among sentences in transcripts of United States court cases. First, we defined relationship types that can be observed between sentences in court case transcripts. Then we classified pairs of sentences according to the relationship type by combining a machine learning model and a rule-based approach. The results obtained through our system were evaluated using human judges. To the best of our knowledge, this is the first study where discourse relationships between sentences have been used to determine relationships among sentences in legal court case transcripts.

G. Ratnayaka, T. Rupasinghe, N. de Silva, M. Warushavithana, V. Gamage, and A. Perera, "Identifying Relationships Among Sentences in Court Case Transcripts Using Discourse Relations," in Advances in ICT for Emerging Regions (ICTer), 2018 Eighteenth International Conference on, IEEE, 2018, pp. 13--20, [pdf] [bib]

Legal Document Retrieval using Document Vector Embeddings and Deep Learning

arXiv preprint arXiv:1805.10685
2018

Domain specific information retrieval has been a prominent and ongoing research area in the field of natural language processing. Many researchers have incorporated different techniques to overcome the technical and domain specificity and provide a mature model for various domains of interest. The main bottleneck in these studies is the heavy involvement of domain experts, which makes the entire process time consuming and cumbersome. In this study, we have developed three novel models which are compared against a gold standard generated via the online repositories provided specifically for the legal domain. The three models incorporate vector space representations of the legal domain, where document vector generation was done via two different mechanisms and as an ensemble of the two. This study covers the research carried out in representing legal case documents in different vector spaces, whilst incorporating semantic word measures and natural language processing techniques. The ensemble model built in this study shows a significantly higher accuracy level, which indeed proves the need to incorporate domain specific semantic similarity measures into the information retrieval process. This study also shows the impact of varying the distribution of word similarity measures against varying document vector dimensions, which can lead to improvements in the process of legal information retrieval.

K. Sugathadasa, B. Ayesha, N. de Silva, A. Perera, V. Jayawardana, D. Lakmal, and M. Perera, "Legal Document Retrieval using Document Vector Embeddings and Deep Learning," arXiv preprint arXiv:1805.10685, 2018, [pdf] [bib]
Citation count: 4


Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity

IEEE International Conference on Industrial and Information Systems (ICIIS)
2017

Semantic similarity measures are an important part of Natural Language Processing tasks. However, semantic similarity measures built for general use do not perform well within specific domains. Therefore, in this study we introduce a domain specific semantic similarity measure created by the synergistic union of word2vec, a word embedding method used for semantic similarity calculation, and lexicon based (lexical) semantic similarity methods. We prove that this proposed methodology outperforms both word embedding models trained on a generic corpus and word embedding models trained on a domain specific corpus which do not use lexical semantic similarity methods to augment the results. Further, we prove that text lemmatization can improve the performance of word embedding methods.

K. Sugathadasa, B. Ayesha, N. de Silva, A. Perera, V. Jayawardana, D. Lakmal, and M. Perera, "Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity," IEEE International Conference on Industrial and Information Systems (ICIIS), 2017, [pdf] [bib]
Citation count: 10

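As an illustration of the core idea of this paper, the following is a minimal Python sketch, not the authors' implementation, that blends word2vec cosine similarity with a WordNet-based lexical similarity. It assumes gensim and NLTK (with the WordNet corpus) are installed; the model file name is hypothetical.

    # A minimal sketch: blending embedding similarity with lexicon similarity.
    from gensim.models import Word2Vec     # assumes gensim is installed
    from nltk.corpus import wordnet as wn  # assumes the WordNet corpus is downloaded

    def combined_similarity(model, w1, w2, alpha=0.5):
        """Blend word2vec cosine similarity with WordNet path similarity."""
        vec_sim = model.wv.similarity(w1, w2)  # cosine similarity in embedding space
        s1, s2 = wn.synsets(w1), wn.synsets(w2)
        if not s1 or not s2:                   # fall back when WordNet lacks a word
            return vec_sim
        lex_sim = max(a.path_similarity(b) or 0.0 for a in s1 for b in s2)
        return alpha * vec_sim + (1 - alpha) * lex_sim

    # model = Word2Vec.load("legal_word2vec.model")  # hypothetical file name
    # print(combined_similarity(model, "plaintiff", "petitioner"))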

Semi-Supervised Instance Population of an Ontology using Word Vector Embeddings

Advances in ICT for Emerging Regions (ICTer), 2017 Seventeenth International Conference on
2017

Word lists that contain closely related sets of words are a critical requirement in machine understanding and processing of natural languages. Creating and maintaining such closely related word lists is a critical and complex process that requires human input and is carried out manually in the absence of tools. We describe a supervised learning mechanism which employs a word ontology to expand word lists containing closely related sets of words. The approach described in this paper uses two novel supervised learning techniques that complement each other for the purpose of expanding existing lists of related words. Expanding the concept variable lists of the RelEx2Frame component of the OpenCog Artificial General Intelligence Framework using WordNet is used as a proof of concept. This project would enable OpenCog applications to attempt to understand words that they were not able to understand before due to the limited size of existing lists of related words.

V. Jayawardana, D. Lakmal, N. de Silva, A. Perera, K. Sugathadasa, B. Ayesha, and M. Perera, "Semi-Supervised Instance Population of an Ontology using Word Vector Embeddings," in Advances in ICT for Emerging Regions (ICTer), 2017 Seventeenth International Conference on, IEEE, Sep. 2017, [pdf] [bib]
Citation count: 8


Discovering Inconsistencies in PubMed Abstracts Through Ontology-Based Information Extraction

Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
2017

Searching for a cure for cancer is one of the most vital pursuits in modern medicine, and microRNA research plays a key role in that aspect. Keeping track of the shifts and changes in established knowledge in the microRNA domain is very important. In this paper, we introduce an Ontology-Based Information Extraction method to detect occurrences of inconsistencies in microRNA research paper abstracts. We propose a method that first uses the Ontology for MIcroRNA Targets (OMIT) to extract triples from the abstracts. Then we introduce a new algorithm to calculate the oppositeness of these candidate relationships. Finally, we present the discovered inconsistencies in an easy to read manner to be used by medical professionals. To the best of our knowledge, this study is the first ontology-based information extraction model introduced to find shifts in established knowledge in the medical domain using research paper abstracts. We downloaded 36,877 abstracts from the PubMed database and found 102 inconsistencies relevant to the microRNA domain.

N. de Silva, D. Dou, and J. Huang, "Discovering Inconsistencies in PubMed Abstracts Through Ontology-Based Information Extraction," in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2017, pp. 362--371, [pdf] [bib]
Citation count: 9


Deriving a Representative Vector for Ontology Classes with Instance Word Vector Embeddings

arXiv preprint arXiv:1706.02909
2017

Selecting a representative vector for a set of vectors is a very common requirement in many algorithmic tasks. Traditionally, the mean or median vector is selected. Ontology classes are sets of homogeneous instance objects that can be converted to a vector space by word vector embeddings. This study proposes a methodology to derive a representative vector for ontology classes whose instances were converted to the vector space. We start by deriving five candidate vectors which are then used to train a machine learning model that calculates a representative vector for the class. We show that our methodology outperforms the traditional mean and median vector representations.

V. Jayawardana, D. Lakmal, N. de Silva, A. Perera, K. Sugathadasa, and B. Ayesha, "Deriving a Representative Vector for Ontology Classes with Instance Word Vector Embeddings," arXiv preprint arXiv:1706.02909, 2017, [pdf] [bib]
Citation count: 14

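For context, the two traditional representations this paper uses as baselines, the mean and median vectors of a set of instance embeddings, can be sketched in a few lines of Python (an illustration only, not the paper's code; the toy data is random):

    import numpy as np

    def mean_vector(vectors):
        """Element-wise mean of a set of instance vectors."""
        return np.mean(vectors, axis=0)

    def median_vector(vectors):
        """Element-wise median of a set of instance vectors."""
        return np.median(vectors, axis=0)

    instance_vecs = np.random.rand(10, 300)  # e.g., 10 instances, 300-d embeddings
    rep_mean, rep_median = mean_vector(instance_vecs), median_vector(instance_vecs)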

SAFS3 algorithm: Frequency statistic and semantic similarity based semantic classification use case

Advances in ICT for Emerging Regions (ICTer), 2015 Fifteenth International Conference on
2015

Sentiment analysis on movie reviews is a topic of interest for artists and businessmen alike, for the purpose of gauging the reception of an artwork or understanding trends in the market for the benefit of future productions. In this study we introduce an algorithm (SAFS3) to classify documents into multiple classes. This paper then evaluates the SAFS3 algorithm through the use case of analysing a set of reviews from Rotten Tomatoes. The novel algorithm results in an accuracy of 53.6%. The SAFS3 algorithm outperforms the benchmark for this context as well as the set of generic machine learning algorithms commonly used for tasks of this nature.

N. H. N. D. de Silva, "SAFS3 algorithm: Frequency statistic and semantic similarity based semantic classification use case," in Advances in ICT for Emerging Regions (ICTer), 2015 Fifteenth International Conference on, IEEE, Aug. 2015, pp. 77--83, [pdf] [bib]
Citation count: 10


Comparison between performance of various database systems for implementing a language corpus

International Conference: Beyond Databases, Architectures and Structures
2015

Data storage and information retrieval are some of the most important aspects in the development of a language corpus. Currently most corpora use either relational databases or indexed file systems. When selecting a data storage system, the most important factors to consider are the speeds of data insertion and information retrieval. Other than the aforementioned two approaches, there are currently various database systems with different strengths that can be more useful. This paper compares the performance of data storage and retrieval mechanisms which use relational databases, graph databases, column store databases and indexed file systems for various steps such as inserting data into the corpus and retrieving information from it, and tries to suggest an optimal storage architecture for a language corpus.

D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. H. N. D. de Silva, and G. Dias, "Comparison between performance of various database systems for implementing a language corpus," in International Conference: Beyond Databases, Architectures and Structures, Springer, May 2015, pp. 82--91, [pdf] [bib]
Citation count: 1


Implementing a Corpus for Sinhala Language

Symposium on Language Technology for South Asia 2015
2015

This paper presents the project we carried out to develop a continuously updating, dynamic corpus that covers a wide range of topics for the Sinhala language. It introduces the technologies we have used in the project and discusses its design features.

D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. H. N. D. De Silva, and G. Dias, "Implementing a Corpus for Sinhala Language," in Symposium on Language Technology for South Asia 2015, 2015, [pdf] [bib]
Citation count: 2


Sentence similarity measuring by vector space model

Advances in ICT for Emerging Regions (ICTer), 2014 International Conference on
2014

In Natural Language Processing and Text Mining related work, one of the important aspects is measuring sentence similarity. When measuring the similarity between sentences, there are three major branches which can be followed: one procedure measures the similarity based on the semantic structure of sentences, while the other procedures are based on syntactic similarity measures and hybrid measures. Syntactic similarity based methods take into account the co-occurring words in strings. Semantic similarity measures consider the semantic similarity between words based on a semantic net. Most of the time, the easiest way to calculate sentence similarity is using syntactic measures, which do not consider the grammatical structure of sentences. However, there are sentences which have the same meaning with different words. By considering both semantic and syntactic similarity we can improve the quality of the similarity measure rather than depending only on semantic or syntactic similarity. This paper follows a sentence similarity measuring algorithm developed based on both syntactic and semantic similarity measures. The algorithm measures sentence similarity by adhering to a vector space model generated for the word nodes in the sentences. In this implementation we consider two types of relationships: the relationship between verbs in the sentence pairs and the relationship between nouns in the sentence pairs. One of the major advantages of this method is that it can be used for variable length sentences. In the experiments and results section we have included the results gained with this algorithm for a selected set of sentence pairs and compared them with actual human ratings for the similarity of the sentence pairs.

U. L. D. N. Gunasinghe, W. A. M. De Silva, N. H. N. D. de Silva, A. S. Perera, W. A. D. Sashika, and W. D. T. P. Premasiri, "Sentence similarity measuring by vector space model," in Advances in ICT for Emerging Regions (ICTer), 2014 International Conference on, IEEE, Dec. 2014, pp. 185--189, [pdf] [bib]
Citation count: 4

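As a point of reference, the purely syntactic end of the spectrum that the abstract contrasts against can be sketched as a bag-of-words cosine similarity in Python; this illustration deliberately omits the semantic (WordNet-based) component that the paper's hybrid method adds:

    from collections import Counter
    import math

    def cosine_similarity(s1, s2):
        """Bag-of-words cosine similarity between two sentences."""
        v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
        dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
        norm1 = math.sqrt(sum(c * c for c in v1.values()))
        norm2 = math.sqrt(sum(c * c for c in v2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

    print(cosine_similarity("The court accepted the appeal",
                            "The appeal was accepted by the court"))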

Novel approach for perception analysis in a learning environment

Teaching, Assessment and Learning (TALE), 2014 International Conference on
2014

With the advancement of information technology, we have an opportunity to change the learning experience of students. E-learning systems such as Learning Management Systems (LMS), Course Management Systems and Student Management Systems have emerged as a result of this, aiding a vast variety of education institutes around the globe. Yet, quality assurance processes are still dependent on conventional student feedback mechanisms that have inherent drawbacks. This has become a major barrier to improving the quality of the learning experience of students. We have introduced a solution for this through a crowdsourcing based perception capturing and analysis platform that can be adapted to many scenarios seamlessly. The design and implementation of the backend analysis algorithms are discussed along with the results gained from this project.

S. Ponnamperuma, A. Gunawardana, T. Shanika, P. Pathirana, S. Markus, and N. H. Nisansa Dilushan de Silva, "Novel approach for perception analysis in a learning environment," in Teaching, Assessment and Learning (TALE), 2014 International Conference on, IEEE, Dec. 2014, pp. 148--154, [pdf] [bib]

Using Mobile Network Big Data for Informing Transportation and Urban Planning in Colombo

Available at SSRN
2014

Road congestion is proving to be an increasing problem for countries experiencing rapid growth. Data are needed to identify the choke points and prioritize additions and enhancements. A data-centric approach to transportation management based on sensor data is already a reality in many developed economies, with transportation systems being fed a multitude of sensor data such as loop detectors, axle counters, parking occupancy monitors, CCTV, integrated public transport card readers as well as GPS data from phones and from public and private transport (Amini, Bouillet, Calabrese, Gasparini, & Verscheure, 2011). Developing economies, however, are more reliant on traditional forms of data collection such as questionnaires. Such survey based methods administered at peak hours can be very costly, not only in terms of personnel and processing, but also in terms of traffic disruption. Other less intrusive methods (e.g., automatic traffic recorders) do not yield information such as routes taken and parking. Mobile network Big Data has enormous potential for traffic planning. Because the data streams are continuously flowing, the effects of changes in traffic channels such as one-way schemes and new roads can potentially be easily tracked. Though additional costs of data storage may be involved, BTS [Base Transceiver Station] hand-off data can even serve as a sensor of traffic speed and of disruptions. As the proportion of GPS enabled smartphones increases, it may be possible to achieve the same objective from a smaller sample without collecting masses of BTS hand-off data. Hence the primary research question addressed by this paper is whether mobile network Big Data can inform transportation planning for the city of Colombo, Sri Lanka. To do this we attempt to understand where the daytime commuting population of Colombo comes from, thereby creating Origin Destination (OD) matrices that explicate the flow between different areas.

S. Lokanathan, N. de Silva, G. Kreindler, Y. Miyauchi, and D. Dhananjaya, "Using Mobile Network Big Data for Informing Transportation and Urban Planning in Colombo," Available at SSRN, 2014, [pdf] [bib]
Citation count: 7


Building a WordNet for Sinhala

Proceedings of the Seventh Global WordNet Conference
2014

Sinhala is one of the official languages of Sri Lanka and is used by over 19 million people. It belongs to the Indo-Aryan branch of the Indo-European languages and its origins date back at least 2000 years. It has developed into its current form over a long period of time with influences from a wide variety of languages including Tamil, Portuguese and English. As for any other language, a WordNet is extremely important for Sinhala to take it into the digital era. This paper is based on the project to develop a WordNet for Sinhala based on the English (Princeton) WordNet. It describes how we overcame the challenges in adding Sinhala specific characteristics, which were deemed important by Sinhala language experts, to the WordNet while keeping the structure of the original English WordNet. It also presents the details of the crowdsourcing system we developed as a part of the project, consisting of a NoSQL database in the backend and a web-based frontend. We conclude by discussing the possibility of adapting this architecture for other languages and the road ahead for the Sinhala WordNet and Sinhala NLP.

T. I. I. Wijesiri, M. P. Gallage, D. G. B. P. Gunathilaka, D. M. M. Lakjeewa, D. C. Wimalasuriya, G. Dias, R. Paranavithana, and N. de Silva, "Building a WordNet for Sinhala," in Proceedings of the Seventh Global WordNet Conference, Jan. 2014, pp. 100--108, [pdf] [bib]
Citation count: 14


Semi-supervised algorithm for concept ontology based word set expansion

Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on
2013

Word lists that contain closely related sets of words are a critical requirement in machine understanding and processing of natural languages. Creating and maintaining such closely related word lists is a critical and complex process that requires human input and is carried out manually in the absence of tools. We describe a supervised learning mechanism which employs a word ontology to expand word lists containing closely related sets of words. The approach described in this paper uses two novel supervised learning techniques that complement each other for the purpose of expanding existing lists of related words. Expanding the concept variable lists of the RelEx2Frame component of the OpenCog Artificial General Intelligence Framework using WordNet is used as a proof of concept. This project would enable OpenCog applications to attempt to understand words that they were not able to understand before due to the limited size of existing lists of related words.

N. H. N. D. De Silva, A. S. Perera, and M. K. D. T. Maldeniya, "Semi-supervised algorithm for concept ontology based word set expansion," in Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on, IEEE, Dec. 2013, pp. 125--131, [pdf] [bib]
Citation count: 9


Document analysis based Automatic Concept Map generation for Enterprises

International Conference on Advances in ICT for Emerging Regions
2013

Ever growing knowledge bases of enterprises present the demanding challenge of properly organizing information to enable fast retrieval of related and intended information. Document repositories of enterprises consist of large collections of documents of varying size, format and writing style. This diversified and unstructured nature of documents restricts the possibility of developing uniform techniques for extracting important concepts and relationships for summarization, structured representation and fast retrieval. The documented textual content is used as the input for the construction of a concept map. Here, a rule based approach is used to extract concepts and the relationships among them. Sentence level breakdown enables these rules to identify those concepts and relationships. The rules are based on elements in the phrase structure tree of a sentence. To improve the accuracy and relevance of the extracted concepts and relationships, special features such as titles, bold and upper case text are used. This paper discusses how to overcome the above mentioned challenges by utilizing high level natural language processing techniques, document pre-processing techniques and developing an easily understandable and extractable compact representation of concept maps. Each document in the repository is converted to a concept map representation to capture the concepts and relationships described in the said document. This organization would represent a summary of the document. These individual concept maps are utilized to generate concept maps that represent sections of the repository or the entire document repository. The paper also discusses how statistical techniques are used to calculate certain metrics which facilitate certain requirements of the solution. Principal component analysis is used in ranking the documents by importance. The concept map is visualized using force directed graphs which represent concepts by nodes and relationships by edges.

E. L. Karannagoda, H. M. T. C. Herath, K. N. J. Fernando, M. W. I. D. Karunarathne, N. H. N. D. de Silva, and A. S. Perera, "Document analysis based Automatic Concept Map generation for Enterprises," in International Conference on Advances in ICT for Emerging Regions, Dec. 2013, [pdf] [bib]
Citation count: 1


Enabling effective synoptic assessment via algorithmic constitution of review panels

Teaching, Assessment and Learning for Engineering (TALE), 2013 IEEE International Conference on
2013

This paper presents an algorithmic tool that was used to create panels of experts for the synoptic assessment of a software engineering project course that is targeted towards fostering innovation and creativity in software engineering students. The success of synoptic assessments depends on the ability to formulate expert evaluation panels. Yet many industry experts are busy professionals, and hence, the process of constituting appropriately balanced evaluation panels for project demonstrations is a significant challenge. The discussion includes the outcomes of using the algorithm to automate the panel composition and scheduling process for synoptic assessments of project demonstrations for a batch of 100 students following the Bachelor of Engineering (Honors) degree program in the Department of Computer Science and Engineering at the University of Moratuwa in Sri Lanka. We describe the challenges we faced and our approach towards addressing the issues as well as our encouraging successes.

N. H. Nisansa Dilushan de Silva, S. Weerawarana, and A. Perera, "Enabling effective synoptic assessment via algorithmic constitution of review panels," in Teaching, Assessment and Learning for Engineering (TALE), 2013 IEEE International Conference on, IEEE, Aug. 2013, pp. 776--781, [pdf] [bib]

Automating the Composition and Scheduling Process for Synoptic Assessment Panels

9th SDC-SLAIHEE Higher Education Conference 2013
2013

The Bachelor of Engineering (Honours) program of the Department of Computer Science and Engineering at the University of Moratuwa has a compulsory software engineering project course in the 5th semester. This course has been designed to foster creativity and software engineering rigor. Since the design of this course (CS3202) straddles several program ILOs at a 5th semester level, along with a strong emphasis on creativity, a synoptic assessment approach was selected in the course evaluation framework. This involved constituting expert evaluation panels. Previously, the compilation and scheduling of evaluator panels aligned with the heterogeneous technology profiles of the student projects was done manually. However, it was a highly tedious and time consuming task which was further complicated by the limited number of evaluators and conflicting time constraints. The objective of the current research was to evaluate the efficiency of automating the process of constituting synoptic assessment panels to evaluate the above mentioned student projects (n=101). An action research methodology was followed in the study. In the research 'planning' phase, data on student project technology profiles, competency and availability of the evaluators, and course dependent restrictions were gathered. In the research 'action' phase, an algorithm was devised with the following primary objective: "Each student will be assigned a 'best-fit' panel of evaluators considering the technologies used in the student's project." The research 'observation' phase showed that 120 of the total 141 feasible student project profiles were successfully matched by the algorithm to the areas of technology expertise of the evaluators, resulting in an 85.12% success rate. Thus it can be concluded that this approach is a very important improvement over the manual assignment of panels. Future work is to implement an online application, and our recommendation is that other educators could use this application for a similar purpose.

N. H. N. D. de Silva, S. M. Weerawarana, and A. S. Perera, "Automating the Composition and Scheduling Process for Synoptic Assessment Panels," in 9th SDC-SLAIHEE Higher Education Conference 2013, Jun. 2013, [pdf] [bib]

SeMap-mapping dependency relationships into semantic frame relationships

Engineering Research Unit Research Seminar
2011

We describe the refactoring process of the RelEx2Frame component of the OpenCog AGI Framework, a method for expanding concept variables used in RelEx, and the automatic generation of a common sense knowledge base, specifically with relation to concept relationships. The well-known Drools rule engine is used instead of hand-coded rules; an asynchronous concurrent architecture and an indexing mechanism are designed to improve the performance of the re-factored RelEx2Frame. A WordNet-aided supervised learning mechanism is applied to expand concept variables. Association mining is used on semantic frames acquired through processing an instance of Wikipedia in order to generate a common sense knowledge base.

N. H. N. D. de Silva, C. S. N. J. Fernando, M. K. D. T. Maldeniya, D. N. C. Wijeratne, A. S. Perera, and B. Goertzel, "SeMap-mapping dependency relationships into semantic frame relationships," in Engineering Research Unit Research Seminar, Dec. 2011, [pdf] [bib]
Citation count: 8


Text Normalization in Social Media by using Spell Correction and Dictionary Based Approach

Systems Learning
2012

Daily, a massive amount of textual information is gathered into social media. These texts comprise a challenging style, as they are formed with both slang and formal words, which has become an obstacle for processing texts in social media. In this paper we address this issue by introducing a pre-processing pipeline for social media text. Our proposed solution focuses on English texts from the popular micro-blogging site Twitter. Our major source is a set of common slang words which we gathered by incorporating various other sources. Apart from that, we resolve derivations of slang words by following a spell correction based approach.

E. Mapa, L. Wattaladeniya, C. Chathuranga, S. Dassanayake, N. de Silva, U. Kohomban, and D. Maldeniya, "Text Normalization in Social Media by using Spell Correction and Dictionary Based Approach," Systems Learning, vol. 1, pp. 1--6, 2012, [pdf] [bib]
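The dictionary-based part of such a pipeline can be sketched in Python as follows (an illustration only; the slang map shown is a hypothetical stand-in for the gathered slang-word source described in the paper):

    # Dictionary-based slang normalization with a hypothetical slang map.
    SLANG = {"u": "you", "gr8": "great", "thx": "thanks", "2moro": "tomorrow"}

    def normalize(tweet):
        """Replace known slang tokens with their formal equivalents."""
        return " ".join(SLANG.get(token.lower(), token) for token in tweet.split())

    print(normalize("thx u r gr8"))  # unknown tokens such as "r" pass through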


Workshop Papers

Shift-of-Perspective Identification Within Legal Cases

Proceedings of the 3rd Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts
2019

Arguments, counter-arguments, facts, and evidence obtained via previous court case transcripts are of essential need for individuals handling legal scenarios. Therefore, the process of automatic information extraction from court case transcripts can be considered to be of significant importance. This study is focused on the identification of sentences in court case transcripts which convey different perspectives on the same topic or entity. We combined several approaches based on semantic analysis, open information extraction, and sentiment analysis to achieve our objective. Our methodology was then evaluated with the help of human judges. The outcomes of the evaluation demonstrate that our system is successful in detecting situations where two sentences deliver different opinions on the same topic or entity. The proposed methodology can be used to facilitate other information extraction tasks related to the legal domain, such as the detection of counter-arguments and the identification of opponent parties in a court case.

G. Ratnayaka, T. Rupasinghe, N. de Silva, V. Gamage, M. Warushavithana, and A. Perera, "Shift-of-Perspective Identification Within Legal Cases," in Proceedings of the 3rd Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts, 2019, pp. to appear, [pdf] [bib]

Fast Approach to Build an Automatic Sentiment Annotator for Legal Domain using Transfer Learning

Proceedings of the 9th workshop on computational approaches to subjectivity, sentiment and social media analysis
2018

This study proposes a novel way of identifying the sentiment of the phrases used in the legal domain. The added complexity of the language used in law, and the inability of the existing systems to accurately predict the sentiments of words in law are the main motivations behind this study. This is a transfer learning approach, which can be used for other domain adaptation tasks as well. The proposed methodology achieves an improvement of over 6% compared to the source model's accuracy in the legal domain.

V. Gamage, M. Warushavithana, N. de Silva, A. Perera, G. Ratnayaka, and T. Rupasinghe, "Fast Approach to Build an Automatic Sentiment Annotator for Legal Domain using Transfer Learning," in Proceedings of the 9th workshop on computational approaches to subjectivity, sentiment and social media analysis, 2018, pp. 260-265, [pdf] [bib]


Preprints

Natural Language Processing for Government: Problems and Potential

LIRNEasia
2019

Natural Language Processing (NLP) is a broad umbrella of technologies used for computationally studying large amounts of text and extracting meaning - both syntactic and semantic information. Software using NLP technologies, if engineered for that purpose, generally has the advantage of being able to process large amounts of text at rates greater than humans. A large number of the functions of a government today revolve around vast amounts of text data - from interactions with citizens to examining archives to passing orders, acts, and bylaws. Under ideal conditions, NLP technologies can assist in the processing of these texts, thus potentially providing significant improvements in speed and efficiency to various departments of government. Many proposals and examples exist illustrating how this can be done for multiple domains - from registering public complaints, to conversing with citizens, to tracking policy changes across bills and Acts. This whitepaper seeks to examine both the current state of the art of NLP and illustrate government-oriented use cases that are feasible among resource-rich languages.

Y. Wijeratne, N. de Silva, and Y. Shanmugarajah, "Natural Language Processing for Government: Problems and Potential," LIRNEasia, 2019, [pdf] [bib]

Logic Rules Powered Knowledge Graph Embedding

arXiv preprint arXiv:1903.03772
2019

Large scale knowledge graph embedding has attracted much attention from both academia and industry in the field of Artificial Intelligence. However, most existing methods concentrate solely on fact triples contained in the given knowledge graph. Inspired by the fact that logic rules can provide a flexible and declarative language for expressing rich background knowledge, it is natural to integrate logic rules into knowledge graph embedding, to transfer human knowledge to entity and relation embedding, and to strengthen the learning process. In this paper, we propose a novel logic rule-enhanced method which can be easily integrated with any translation based knowledge graph embedding model, such as TransE. We first introduce a method to automatically mine the logic rules and corresponding confidences from the triples. Then, to put both triples and mined logic rules within the same semantic space, all triples in the knowledge graph are represented as first-order logic. Finally, we define several operations on the first-order logic and minimize a global loss over both the mined logic rules and the transformed first-order logics. We conduct extensive experiments for link prediction and triple classification on three datasets: WN18, FB166, and FB15K. Experiments show that the rule-enhanced method can significantly improve the performance of several baselines. The highlight of our model is that the filtered Hits@1, which is a pivotal evaluation in the knowledge inference task, has a significant improvement (up to 700% improvement).

P. Wang, D. Dou, F. Wu, N. de Silva, and L. Jin, "Logic Rules Powered Knowledge Graph Embedding," arXiv preprint arXiv:1903.03772, 2019, [pdf] [bib]
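For context, the translation-based scoring that TransE, the base model named in the abstract, relies on can be sketched as follows (an illustration of plain TransE scoring, not of the rule-enhanced method itself; the embeddings are toy values):

    import numpy as np

    def transe_score(h, r, t, norm=1):
        """TransE plausibility score: a triple (h, r, t) is more plausible
        when the translated head h + r lies close to the tail t."""
        return np.linalg.norm(h + r - t, ord=norm)

    dim = 50
    h, r, t = (np.random.rand(dim) for _ in range(3))  # toy embeddings
    print(transe_score(h, r, t))  # lower means more plausible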

Subject Specific Stream Classification Preprocessing Algorithm for Twitter Data Stream

arXiv preprint arXiv:1705.09995
2017

The micro-blogging service Twitter is a lucrative source for data mining applications on global sentiment. But due to the omnifariousness of the subjects mentioned in each data item, it is inefficient to run a data mining algorithm on the raw data. This paper discusses an algorithm to accurately classify the entire stream into a given number of mutually exclusive, collectively exhaustive streams, upon each of which the data mining algorithm can be run separately, yielding more relevant results with high efficiency.

N. de Silva, D. Maldeniya, and C. Wijeratne, "Subject Specific Stream Classification Preprocessing Algorithm for Twitter Data Stream," arXiv preprint arXiv:1705.09995, 2017, [pdf] [bib]
Citation count: 3


Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language

2015

Sinhala, despite its several millennia long history, remains a resource poor language. The objective of this study was to explore the possibility of enhancing the text classification process of a resource poor language by means of data and tools from a resource rich language. However, it was discovered that if the feature space is based on an n-gram model, Sinhala, being a highly inflected language, naturally performs better than English, which is a weakly inflected language. This result held true even when Sinhala was only utilizing basic lexical level models and English was utilizing advanced semantic level models.

N. de Silva, "Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language," , 2015, [pdf] [bib]
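An n-gram feature space of the kind the study refers to can be sketched as follows (a character-trigram illustration; the study's exact feature configuration may differ):

    from collections import Counter

    def char_ngrams(text, n=3):
        """Character n-gram counts for a piece of text."""
        text = text.replace(" ", "_")  # mark word boundaries
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    print(char_ngrams("text classification", n=3).most_common(5))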



Data Sets

SigmaLaw DataSets

Large Legal Text Corpus and Word Embeddings dataset

This dataset is comprised of data gathered for and created in the process of the paper Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity. It contains a large legal data text corpus, several word2vec embedding models of the words in the said corpus, and a set of legal domain gazetteer lists.

The entire dataset is hosted at OSF. Direct links to the files are as follows:

  1. Legal Case Corpus: This corpus contains 39,155 legal cases including 22,776 taken from the United States Supreme Court. For the convenience of future researchers, we have also included 29,404 cases after some preprocessing. A map (key) is included for the folder numbering in the provided zip file.
  2. Legal Domain Word2Vec models: Two word2vec models trained on the above corpus are included: one trained on raw legal text and one trained on the same text after lemmatization. A minimal loading sketch is given after this list.
  3. Legal Domain gazetteer lists: A number of gazetteer lists built by a legal professional to indicate domain specific semantic groupings are included.
  4. Word2Vec results: Finally the results obtained by this paper using the trained word2vec models are included. [100x100] [100x200] [100x210] [100x500]
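A minimal loading sketch for the word2vec models above, assuming gensim is installed and using a hypothetical file name in place of the actual OSF download:

    from gensim.models import Word2Vec

    model = Word2Vec.load("legal_raw_text.w2v")        # hypothetical file name
    print(model.wv.most_similar("plaintiff", topn=5))  # nearest legal terms
    print(model.wv.similarity("court", "tribunal"))    # cosine similarity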

Citing this DataSet:

If you are using our Large Legal Text Corpus and Word Embeddings dataset, please cite this paper:

K. Sugathadasa, B. Ayesha, N. de Silva, A. Perera, V. Jayawardana, D. Lakmal, and M. Perera, "Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity," IEEE International Conference on Industrial and Information Systems (ICIIS), 2017, [pdf] [bib]

Legal Information Retrieval dataset

This dataset is comprised of data gathered for and created in the process of the paper Legal Document Retrieval using Document Vector Embeddings and Deep Learning. Other than the files provided here, it uses the large legal text corpus dataset mentioned above, out of which it takes a set of raw cases. In addition, this dataset contains a mention map, edge list, and the output of legal text ranking.

The entire dataset is hosted at OSF. Direct links to the files are as follows:

  1. Text Corpus and Map: This corpus contains 2,500 cases extracted from the large legal text corpus dataset given above. Along with these case files, a mention map is provided which indicates which cases have cited which other cases within the corpus.
  2. Edge List: This is the edge list of the citation graph generated by the above mention map (see the sketch after this list).
  3. Outputs: Finally the results obtained by this paper are included in the text rank form and as a serialized file.
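For illustration, deriving such an edge list from a mention map can be sketched as follows (the mention-map format and case identifiers shown are assumptions):

    # Hypothetical excerpt of a mention map: case -> list of cited cases.
    mention_map = {
        "case_001": ["case_042", "case_317"],
        "case_042": ["case_317"],
    }

    edge_list = [(src, dst) for src, cited in mention_map.items() for dst in cited]
    for src, dst in edge_list:
        print(f"{src} -> {dst}")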

Citing this DataSet:

If you are using our Legal Information Retrieval dataset, please cite this paper:

K. Sugathadasa, B. Ayesha, N. de Silva, A. Perera, V. Jayawardana, D. Lakmal, and M. Perera, "Legal Document Retrieval using Document Vector Embeddings and Deep Learning," arXiv preprint arXiv:1805.10685, 2018, [pdf] [bib]

Legal Ontology Building dataset

This dataset is comprised of data gathered for and created in the process of the paper Deriving a Representative Vector for Ontology Classes with Instance Word Vector Embeddings. Other than the files provided here, it uses the large legal text corpus dataset mentioned above, out of which it takes a set of raw cases. In addition, this dataset contains the created ontology, a gazetteer list, and the result vectors.

The entire dataset is hosted at OSF. Direct links to the files are as follows:

  1. Legal Ontology: This is the limited legal ontology built for the purpose of this study.
  2. Case Files: This corpus contains X cases extracted from the large legal text corpus dataset given above.
  3. Legal Domain gazetteer lists: A set of gazetteer lists, built by a legal professional and through data collection, is included.
  4. Results: Finally the result vectors obtained by this paper are included.

Citing this DataSet:

If you are using our Legal Ontology Building dataset, please cite this paper:

V. Jayawardana, D. Lakmal, N. de Silva, A. Perera, K. Sugathadasa, and B. Ayesha, "Deriving a Representative Vector for Ontology Classes with Instance Word Vector Embeddings," arXiv preprint arXiv:1706.02909, 2017, [pdf] [bib]

Legal Ontology Population dataset

This dataset is comprised of data gathered for and created in the process of the paper Word Vector Embeddings and Domain Specific Semantic based Semi-Supervised Ontology Instance Population. Other than the files provided here, it uses the large legal text corpus dataset, out of which it takes a set of raw cases, and the small legal ontology from the Legal Ontology Building dataset. However, we do not include that ontology in this dataset; please download it from above. The domain specific semantics are based on the result models built by the large legal text corpus study. This dataset contains the class instances produced by the proposed models, a gazetteer list of legal words, and the result vectors.

The entire dataset is hosted at OSF. Direct links to the files are as follows:

  1. Class instances by 5 models: These are the instances to be used to populate the classes in the ontology according to the 5 proposed models.
  2. Legal words: This is a set of gazetteer lists of words in the legal domain prepared with the help of a legal professional.
  3. Results: Finally the result vectors obtained by this paper are included.

Citing this DataSet:

If you are using our Legal Ontology Population dataset, please cite one or both of these papers:

V. Jayawardana, D. Lakmal, N. de Silva, A. Perera, K. Sugathadasa, B. Ayesha, and M. Perera, "Word Vector Embeddings and Domain Specific Semantic based Semi-Supervised Ontology Instance Population," ICTer, vol. 11, no. 1, 2018, [pdf] [bib]
V. Jayawardana, D. Lakmal, N. de Silva, A. Perera, K. Sugathadasa, B. Ayesha, and M. Perera, "Semi-Supervised Instance Population of an Ontology using Word Vector Embeddings," in Advances in ICT for Emerging Regions (ICTer), 2017 Seventeenth International Conference on, IEEE, Sep. 2017, [pdf] [bib]

PubMed DataSet

The PubMed dataset contains the data collected for and produced by the project 'Discovering Inconsistencies in PubMed Abstracts Through Ontology-Based Information Extraction'. It includes three levels of data: source files, intermediate outputs, and output files. The source files contain a list of PubMed IDs and 36,877 PubMed abstracts. The intermediate outputs contain the full Stanford CoreNLP results and the OLLIE triple extractor results for all of the above abstracts, as well as the finalized set of triples compatible with the OMIT ontology. Finally, the output files consist of a medical dictionary built from the above PubMed abstract corpus, the discovered raw inconsistencies, and the discovered final inconsistencies in expanded form.
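
To give an intuition for the final step of this pipeline, the toy sketch below flags contradictory (subject, relation, object) triples. Both the relation names and the contradiction table are invented for illustration; they are not taken from the OMIT ontology or from the project's actual Java implementation.

    # Toy sketch: flag triples over the same (subject, object) pair whose
    # relations contradict each other. The relation pairs below are invented
    # examples, not drawn from the OMIT ontology.
    CONTRADICTORY = {("upregulates", "downregulates"),
                     ("downregulates", "upregulates")}

    def find_inconsistencies(triples):
        seen = {}
        found = []
        for subj, rel, obj in triples:
            for prior in seen.get((subj, obj), []):
                if (prior, rel) in CONTRADICTORY:
                    found.append((subj, prior, rel, obj))
            seen.setdefault((subj, obj), []).append(rel)
        return found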

The entire dataset is hosted at OSF. Direct links to the files are as follows:

Source Files (collected data):

Intermediate Outputs:

  • PubMed abstracts parsed with Stanford Parser: tar.gz files
  • OLLIE triples created from PubMed abstracts: tar.gz file (27.3MB)
  • Created final triples: tar.gz file (2.9MB)

Output Files:

  • Created medical dictionary: text file (4.8MB)
  • Discovered raw inconsistencies: text file (65kB)
  • Discovered Final inconsistencies (expanded): text file (63kB)

Source Code:

Java was used as the implementation language for the entire project. The Java source files are available from the GitHub organization OMIT-PubMed-Inconsistencies, which contains the source code for the following projects:

Citing PubMed DataSet paper:

If you are using our PubMed DataSet, please cite this paper:

N. de Silva, D. Dou, and J. Huang, "Discovering Inconsistencies in PubMed Abstracts Through Ontology-Based Information Extraction," in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2017, pp. 362--371, [pdf] [bib]

SinMin DataSet

SinMin contains texts of different genres and styles of modern and old Sinhala. The main sources of electronic copies of texts for the corpus are online Sinhala newspapers, online Sinhala news sites, Sinhala school textbooks available online, online Sinhala magazines, Sinhala Wikipedia, Sinhala fiction available online, the Mahawansa, Sinhala blogs, Sinhala subtitles, and the Sri Lankan gazette.

The entire Sinhala text corpus is hosted at OSF in compressed and uncompressed versions. Direct links to the compressed files are as follows:

Citing SinMin DataSet paper:

If you are using our SinMin DataSet, please cite this paper:

D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. H. N. D. De Silva, and G. Dias, "Implementing a Corpus for Sinhala Language," in Symposium on Language Technology for South Asia 2015, 2015, [pdf] [bib]

SiClaEn DataSet

The SiClaEn dataset contains a Reuters English News DataSet and a Sinhala News DataSet. The Sinhala News DataSet was collected from bilingual Sinhala and English news sources such as AdaDerana and NewsFirst. The Reuters English News DataSet has 7,103 sentences in 383 posts, and the Sinhala News DataSet has 5,221 sentences in 471 posts. All data are categorized into the following topics: business, entertainment, politics, science & technology, and sports.
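
For a sense of how such a topic-labelled corpus is typically used, the sketch below trains a generic TF-IDF plus Naive Bayes baseline. It assumes each post is available as a (text, topic) pair; this is a common baseline shown for illustration, not the method evaluated in the paper.

    # Generic baseline sketch: TF-IDF features with a Naive Bayes classifier.
    # Assumes `texts` is a list of post strings and `topics` the matching
    # labels (business, entertainment, politics, science & technology, sports).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def train_baseline(texts, topics):
        model = make_pipeline(TfidfVectorizer(), MultinomialNB())
        model.fit(texts, topics)
        return model  # model.predict(["..."]) then yields a topic label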

The entire Sinhala and English text corpus is hosted at OSF. Direct links to the files are as follows:

English Data (Reuters):

Sinhala Data:

Citing SiClaEn DataSet paper:

If you are using our SiClaEn DataSet, please cite this paper:

N. de Silva, "Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language," 2015, [pdf] [bib]

Lecture Slides

University of Moratuwa

CS4460 Advanced Algorithms (2012-2013)

Note: The slides of this course are based on the slides by Prof. Sanath Jayasena, who taught this course before me. Dr. Chinthana Wimalasuriya co-taught this subject with me and conducted lectures 1-5. Here I have listed only the lectures which I conducted (i.e., lectures 6-12).
Advanced Design and Analysis Techniques
Graph Algorithms
Other Topics

CS4490 Bioinformatics (2012-2013)

Note: The slides of this course are based on the slides by Dr. Mahendra Piraveenan, who taught this course before me. Dr. Charith Chitraranjan co-taught this subject with me and conducted lectures 1-7 and 11-12. Here I have listed only the lectures which I conducted (i.e., lectures 8-10).

CS2212 Programming Challenge II (2012-2013)

Note: This course was conducted as a series of workshops.

CS3202 Software Engineering Project (2012)

Note: This course was conducted as a series of workshops.

CS4622 Machine Learning Sem 7 (2013)

Note: The slides of this course are loosely based on the slides by Dr. Upali Kohomban, who taught this course before me. Dr. Chinthana Wimalasuriya co-taught this subject with me and conducted lectures 1-5 and 10-12. Here I have listed only the lectures which I conducted (i.e., lectures 6-9).

CS4522 Advanced Algorithms Sem 7 (2013)

Note: The slides of this course are based on the slides by Prof. Sanath Jayasena, who taught this course before me. Prof. Sanath Jayasena and Dr. Adeesha Wijayasiri co-taught this subject with me and conducted lectures 1-5, 8, and 11-12. Here I have listed only the lectures which I conducted (i.e., lectures 6-7 and 9-10).

CS4742 Bioinformatics Sem 7 (2013)

Note: The slides of this course are based on the slides by Dr. Mahendra Piraveenan, who taught this course before me. Dr. Charith Chitraranjan co-taught this subject with me and conducted lectures 1-7 and 10-12. Here I have listed only the lectures which I conducted (i.e., lectures 8-9).


Northshore College of Business and Technology

Data Schemas and Applications

Note: The slides of this course are based on the slides by Mr. Prakash Chatterjee, who teaches this course at the University of the West of England, Bristol.

Other Interests

Toastmasters club
I am a member of the University of Moratuwa Toastmasters club (Club number 598998, Area 3, Division J, District 82). I have been an active member since 2012 and am currently serving as the Immediate Past President. As the President of the 2013-2014 board, I led the team in making our club a President's Distinguished Club. I was the Sergeant-at-Arms on the 2012-2013 board. The scripts of the speeches I have delivered can be downloaded in PDF format from the links below.

Competent Communicator

Advanced Communicator Bronze

The Entertaining Speaker

Technical Presentations



Classical music
Classical music has been a lifelong interest of mine. Not only do I own a large collection of classical music recordings, but I also make it a habit to attend whenever there is a major classical music event in Colombo, Sri Lanka. Beyond simply enjoying classical music, I maintain a blog that explains selected classical pieces to lay readers who are new to the world of classical music. The blog is written in both Sinhala and English so that it can reach local as well as international audiences. Listed below are links to a few posts from my classical music blog.

Contact

Mailing Address: N. H. N. D. de Silva
1420 Villard St. Apt 307,
Eugene, OR, 97403

Office Location: Room 360,
Deschutes Hall,
University of Oregon

Phone: (+1)-541-346 3935

Email: nisansa@cs.uoregon.edu

Research Profiles: Google Scholar   ResearchGate   AcademicTree   Erdős Number

Social media: Twitter   Facebook   Google Plus   LinkedIn

Blogs: Music   Technology   Philosophy


Please refer to my public calendar for office hours, classes, and events.

