Performance benchmarking of efficient Tiny-LLMs (Phi-3, Gemma-2B, TinyLlama-1.1B) on Bengali reasoning and comprehension tasks

Zahid Hasan

doi:https://doi.org/10.61577/jmla.2025.100003

Original Article

Abstract

Recent progress in large language models (LLMs) has significantly advanced natural language processing, particularly in reasoning, comprehension, and text generation. However, as these models continue to grow in scale and architectural complexity, their practical deployment becomes increasingly challenging for low-resource languages such as Bengali. While state-of-the-art systems like GPT-4, Gemini Ultra, and Claude demonstrate strong reasoning capabilities, their reliance on high-end computational infrastructure restricts accessibility and limits adoption in resource-constrained environments. To address this limitation, compact alternatives known as Tiny Language Models (Tiny-LLMs) have gained increasing attention. Typically containing fewer than four billion parameters, these models aim to retain meaningful reasoning ability while remaining deployable on consumer-grade hardware. In this study, we present a systematic benchmark of three representative Tiny-LLMs, namely Microsoft Phi-3 Mini (3.8B), Google Gemma-2B, and TinyLlama-1.1B, focusing specifically on Bengali language tasks. Our evaluation covers logical reasoning, reading comprehension, and factual knowledge retrieval in Bengali. All experiments were conducted using 4-bit quantization to reflect realistic low-power deployment scenarios. The results indicate that Phi-3 Mini achieves the strongest overall reasoning performance, with an average score of 4.6 out of 5, producing coherent and logically structured Bengali responses. Gemma-2B, while slightly weaker in deep reasoning, delivers the lowest inference latency (2.85 seconds), making it suitable for latency-sensitive and interactive applications. TinyLlama-1.1B demonstrates the lowest computational cost but exhibits limitations in contextual understanding and multi-step reasoning. Qualitative analysis further highlights a clear trade-off between reasoning depth and computational efficiency across the evaluated models.
Overall, this study provides one of the first focused evaluations of Tiny-LLMs for Bengali reasoning and comprehension, offering a practical foundation for future research on scalable and locally deployable Bengali AI systems.
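
To give a sense of why 4-bit quantization matters for consumer-grade hardware, the following minimal sketch estimates the storage needed for model weights alone. It is an illustration under stated assumptions, not the study's measurement procedure: the parameter counts are the nominal figures implied by the model names (released checkpoints may differ somewhat), each weight is assumed to occupy exactly 4 bits, and overhead from quantization scales, activations, and the KV cache is ignored. The function name `weight_memory_gb` is hypothetical.

```python
# Rough lower bound on weight memory under 4-bit quantization.
# Assumptions (not from the study): nominal parameter counts taken from the
# model names, exactly `bits_per_weight` bits per parameter, and no overhead
# for quantization scales, activations, or the KV cache.

NOMINAL_PARAMS = {
    "Phi-3 Mini": 3.8e9,
    "Gemma-2B": 2.0e9,
    "TinyLlama-1.1B": 1.1e9,
}


def weight_memory_gb(num_params: float, bits_per_weight: int = 4) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, params in NOMINAL_PARAMS.items():
    fp16 = weight_memory_gb(params, bits_per_weight=16)
    int4 = weight_memory_gb(params, bits_per_weight=4)
    print(f"{name}: ~{fp16:.2f} GB at 16-bit, ~{int4:.2f} GB at 4-bit")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight storage by a factor of four, which is roughly what brings a model of a few billion parameters within reach of a typical laptop or a low-end GPU. Actual memory use at inference time will be higher than this estimate.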

Keywords

Large language models; Tiny-LLMs; Bengali-speaking community; Parameter pruning; IndicQA; Compact language models

Corresponding Author

Dr. Zahid Hasan

Department of Business Administration (BBA), Presidency University, Edusmart AI Laboratory, Dhaka, Bangladesh

zh0569@gmail.com

Article History

Received Date: 15 January 2025

Revised Date: 05 February 2025

Accepted Date: 12 February 2025