Using Semantic Clustering for Detecting Bengali Multiword Expressions

Tanmoy Chakraborty


Multiword Expressions (MWEs), a known nuisance for both linguistics and NLP, blur the line between
syntax and semantics. The meaning of an MWE cannot be derived by simply combining the meanings of its
constituents. In this study, we propose a novel approach called "semantic clustering" as an instrument for
extracting MWEs, especially for resource-constrained languages like Bengali. The approach first
locates clusters of synonymous noun tokens present in the document. These clusters in turn help measure
the similarity between the constituent words of a candidate phrase using a vector space model. Finally,
the suitability of the phrase as an MWE is judged against a predefined threshold.
In this experiment, we apply the semantic clustering approach only to noun-noun bigram MWEs; however,
we believe that it can be extended to any type of MWE. We compare our approach with the state-of-the-art
statistical approach. The evaluation results show that semantic clustering outperforms all other
competing methods. As a byproduct of this experiment, we have started developing a standard lexicon in
Bengali that serves as a productive Bengali linguistic thesaurus.
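
The three steps described in the abstract (synonym clusters, vector-space similarity between the two constituents, threshold decision) can be illustrated with a minimal sketch. It is not the implementation used in the study. Everything specific here is an assumption made for illustration: the toy English clusters, the binary cluster-membership encoding, the choice of cosine similarity, the threshold value of 0.5, the rule that a candidate is accepted when its score is at or above the threshold, and the function names cluster_vector, cosine, score_candidate, and is_mwe_candidate.

```python
import math

# Hypothetical synonym clusters (English placeholders, not data from the study).
# A word may belong to more than one cluster.
CLUSTERS = [
    {"market", "bazaar", "mart"},
    {"bazaar", "fair"},
    {"river", "stream"},
]


def cluster_vector(word, clusters):
    """Encode a word as a binary vector: component i is 1.0 if the word is in cluster i."""
    return [1.0 if word in cluster else 0.0 for cluster in clusters]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_candidate(first, second, clusters):
    """Similarity between the two constituents of a noun-noun bigram candidate."""
    return cosine(cluster_vector(first, clusters), cluster_vector(second, clusters))


def is_mwe_candidate(first, second, clusters, threshold=0.5):
    """Threshold decision.

    Assumption: a candidate is accepted when its score is at or above the
    threshold. The abstract states only that a predefined threshold is used,
    not its value or the direction of the comparison.
    """
    return score_candidate(first, second, clusters) >= threshold
```

For example, is_mwe_candidate("market", "bazaar", CLUSTERS) compares two nouns that share one cluster, while score_candidate("market", "river", CLUSTERS) compares two nouns that share none. A word that appears in no cluster maps to the zero vector and scores 0.0 against every other word.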

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.