The Nest

    Khasi Language Gets a Technology Boost

    18 May 2025 by The Nest

    Natural language processing (NLP) is the application of computational techniques to the analysis and synthesis of human language, both speech and text. Developing a corpus, a collection of machine-readable text sampled to be representative of a particular language, is an essential step in building NLP systems for that language. Such corpora exist for languages such as English, German, Chinese, Hindi, Bengali and Punjabi. However, not all of these corpora are easily accessible. In English, one of the most widely used corpora is the British National Corpus (BNC), which is popular among researchers because of its accessibility. For Khasi, no such publicly available corpus has existed, and the language is therefore regarded as resource-poor as far as NLP is concerned.

    A major contribution to this field has been made with the release of the “Tham Khasi annotated corpus”, which is freely accessible through the European Language Resources Association (ELRA) at http://catalog.elra.info/en-us/repository/browse/ELRA-W0321/. The corpus is manually tagged using a BIS (Bureau of Indian Standards) POS (part-of-speech) tagset formulated for Khasi, so that its tagging is standardised with that of other Indian languages. The corpus was developed by Dr. Medari Janai Tham, who was recently awarded a Ph.D. by the Department of Computer Science and Engineering, Assam Don Bosco University, for her thesis ‘Shallow Parsing for Khasi’, under the supervision of Prof. Pushpak Bhattacharyya of IIT Bombay. Details of the corpus, including the annotation scheme and the development of the Khasi NLP tools, are given in research papers published as part of her Ph.D. and available at https://grammarkhasi.in, which is also the companion website of the book “Ka Grammar Khasi Da Ka Jingdro” by the same author, published by Macmillan Education, India.
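To give a feel for what a manually POS-tagged corpus offers a researcher, here is a minimal Python sketch that reads sentences written as `word/TAG` pairs and counts how often each tag occurs. Everything in it is an assumption made for illustration: the `word/TAG` layout, the function names `parse_tagged_line` and `tag_frequencies`, and the three toy sentences are not taken from the Tham Khasi annotated corpus, and the tag labels (`PR_PRP`, `V_VM`, `N_NN`, `RD_PUNC`) follow the general BIS naming style and may differ from the labels in the Khasi tagset.

```python
from collections import Counter

# Toy sentences for illustration only. The words, the tags and the
# word/TAG layout are assumptions, NOT data from the Tham corpus.
TAGGED_LINES = [
    "Nga/PR_PRP bam/V_VM ja/N_NN ./RD_PUNC",
    "Phi/PR_PRP dih/V_VM um/N_NN ./RD_PUNC",
    "Nga/PR_PRP dih/V_VM um/N_NN ./RD_PUNC",
]


def parse_tagged_line(line):
    """Split one 'word/TAG word/TAG ...' line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains a slash is still handled.
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"malformed token: {token!r}")
        pairs.append((word, tag))
    return pairs


def tag_frequencies(lines):
    """Count how many tokens carry each POS tag across all lines."""
    counts = Counter()
    for line in lines:
        for _word, tag in parse_tagged_line(line):
            counts[tag] += 1
    return counts


if __name__ == "__main__":
    for tag, n in tag_frequencies(TAGGED_LINES).most_common():
        print(tag, n)
```

Counts like these are the raw material from which statistical taggers estimate their probabilities, which is one reason a consistently annotated corpus matters.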
The scholar's other contributions include the BIS Khasi tagset, a hybrid Khasi POS tagger, an HMM (hidden Markov model) Khasi POS tagger, an NLTK Khasi POS tagger, an HMM Khasi shallow parser, a Khasi shallow parser using a bidirectional gated recurrent unit, and a seminar report on ‘Towards Standardization of Khasi language for Computational Purposes’, available on the above-mentioned website. Some of the NLP tools for Khasi are available online at https://medaritham.pythonanywhere.com, where users and researchers can enter any Khasi sentence and check the output of the taggers and parser.
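For readers curious how an HMM POS tagger works in principle, the sketch below trains a tiny bigram hidden Markov model on a handful of tagged sentences and decodes a new sentence with the Viterbi algorithm. It is a generic textbook illustration, not the author's tagger: the toy training sentences, the BIS-style tag labels, the add-one smoothing and the function names `train_hmm` and `viterbi_tag` are all assumptions made here.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # pseudo-tag marking the start of a sentence

# Toy training data for illustration only; NOT from the Tham corpus.
TRAIN = [
    [("Nga", "PR_PRP"), ("bam", "V_VM"), ("ja", "N_NN"), (".", "RD_PUNC")],
    [("Phi", "PR_PRP"), ("dih", "V_VM"), ("um", "N_NN"), (".", "RD_PUNC")],
    [("Nga", "PR_PRP"), ("dih", "V_VM"), ("um", "N_NN"), (".", "RD_PUNC")],
]


def train_hmm(sentences):
    """Collect tag-to-tag and tag-to-word counts from tagged sentences."""
    transitions = defaultdict(Counter)  # previous tag -> next-tag counts
    emissions = defaultdict(Counter)    # tag -> word counts
    vocab = set()
    for sentence in sentences:
        prev = START
        for word, tag in sentence:
            word = word.lower()
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi_tag(words, transitions, emissions, vocab):
    """Return the most probable tag sequence as (word, tag) pairs."""
    if not words:
        return []
    tags = sorted(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words

    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions[START], tag, n_tags)
                 + log_prob(emissions[tag], words[0].lower(), n_words))
        best[0][tag] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i].lower(), n_words)
            prev_tag, score = max(
                ((p, best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags))
                 for p in tags),
                key=lambda item: item[1],
            )
            best[i][tag] = (score + emit, prev_tag)

    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    path.reverse()
    return list(zip(words, path))


if __name__ == "__main__":
    model = train_hmm(TRAIN)
    print(viterbi_tag(["Phi", "bam", "um", "."], *model))
```

The smoothing keeps unseen words and unseen tag transitions from receiving zero probability. A tagger trained on a real annotated corpus would need far more data and a more careful treatment of unknown words than this sketch provides.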

