Context-based Bengali Next Word Prediction: A Comparative Study of Different Embedding Methods

Authors

  • Mahir Mahbub, Institute of Information Technology, University of Dhaka, Dhaka-1000, Bangladesh
  • Suravi Akhter, Institute of Information Technology, University of Dhaka, Dhaka-1000, Bangladesh
  • Ahmedul Kabir, Institute of Information Technology, University of Dhaka, Dhaka-1000, Bangladesh
  • Zerina Begum, Institute of Information Technology, University of Dhaka, Dhaka-1000, Bangladesh

Keywords

Context-based next word prediction, word embedding, sequence model, word2vec, fastText

Abstract

Next word prediction is a helpful feature in many typing systems: suggestions offered while typing speed up the writing of digital documents. Researchers have therefore long sought to improve the capability of such prediction systems. Knowledge of the meaning of words, together with a contextual understanding of the word sequence, can improve next word prediction. Advances in Natural Language Processing (NLP) have made this reasoning applicable in real scenarios through techniques such as word embedding and sequence modeling. Word embedding can capture various relations among words and encode their meaning, while sequence modeling can capture contextual information. In this paper, we investigate which embedding method works better for Bengali next word prediction. The embeddings compared are word2vec skip-gram, word2vec CBOW, fastText skip-gram and fastText CBOW. We applied them in a deep learning sequence model based on LSTM, trained on a large corpus of Bengali text. The results reveal useful insights about how contextual and sequential information is captured, which will help in implementing a context-based Bengali next word prediction system.
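To make the comparison in the abstract concrete, the following is a minimal sketch of how the training examples differ between skip-gram and CBOW, how fastText-style character n-grams are formed, and how a token sequence can be framed as (history, next word) examples for a sequence model. It is an illustration only and not the authors' implementation: the function names, the window and history sizes, and the three-word sample sentence are all assumptions introduced here.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Return up to `window` tokens on each side of position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: each center word predicts each of its context words."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: the whole context window jointly predicts the center word."""
    examples = []
    for i, center in enumerate(tokens):
        ctx = context_window(tokens, i, window)
        if ctx:  # skip tokens that have no context at all
            examples.append((ctx, center))
    return examples


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """fastText-style subword units: character n-grams of '<word>'.

    Note: this slices Unicode code points, not grapheme clusters, so a
    Bengali conjunct may be split across n-grams.
    """
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


def next_word_examples(tokens: List[str], history: int = 3) -> List[Tuple[List[str], str]]:
    """Frame a sentence as (preceding words, next word) pairs for a sequence model."""
    return [(tokens[max(0, i - history):i], tokens[i])
            for i in range(1, len(tokens))]


if __name__ == "__main__":
    # Illustrative sentence (assumption, not from the paper's corpus).
    sentence = ["আমি", "ভাত", "খাই"]
    print(skipgram_pairs(sentence, window=1))
    print(cbow_examples(sentence, window=1))
    print(char_ngrams(sentence[1], 3, 3))
    print(next_word_examples(sentence, history=2))
```

In a full pipeline, pairs like these would train the embedding, and the (history, next word) examples would be mapped through that embedding before being fed to an LSTM; the sketch stops short of both steps.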

Published

2023-04-04