Resources

Resource	Description	Publication
ILSiC	A benchmark dataset designed for the task of Legal Statute Identification (LSI) in the Indian legal domain. It focuses on identifying the relevant legal statutes from queries written by laypeople, which are typically informal and differ significantly from the structured language found in court judgments. The dataset contains layperson-generated legal queries mapped to relevant statutes from more than 500 Indian laws, enabling research on how well NLP models can connect everyday legal problems with the appropriate legal provisions.	ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople, EACL 2026 [Link]
IL-PCSR	Dataset for joint identification of both relevant precedents (prior case judgements) as well as relevant legal statutes for a given situation (query). While precedent identification and statute identification have long been studied separately, this is the first dataset for addressing both tasks together.	IL-PCSR: Legal Corpus for Prior Case and Statute Retrieval, EMNLP 2025 [Link]
MARRO	A dataset of Indian Supreme Court judgements and a set of UK Supreme Court judgements, in which the rhetorical role of every sentence is labeled by law students.	MARRO: Multi-headed Attention for Rhetorical Role Labeling in Legal Documents, Artificial Intelligence and Law 2025. [Link]
IL-TUR	It contains monolingual (English, Hindi) and multi-lingual (9 Indian languages) domain-specific tasks from the point of view of understanding and reasoning over Indian legal documents.	IL-TUR: Benchmark for Indian Legal Text Understanding and Reasoning, ACL 2024. [Link]
Legal Statute Identification	Identifies relevant statutes given the natural language (English) description of a situation. Experiments on Indian and European cases and statutes.	Legal Statute Identification: A Case Study using State-of-the-Art Datasets and Methods, SIGIR 2024. [Link]
InLegalTrans-En2Indic-1B [1000+ downloads]	A fine-tuned version of the IndicTrans2 model specifically tailored for translating Indian legal texts from English to Indian languages.	MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages, ACM TALLIP 2024. [Link]
MILPaC	The first parallel corpus for evaluating Machine Translation systems on translating legal text from English to nine Indian languages. Can also be used to evaluate MT systems on translating from one Indian language to another.	MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages, ACM TALLIP 2024. [Link]
CustomInLawBERT, InCaseLawBERT, InLegalBERT [1.8 million downloads]	BERT-based language models pre-trained extensively over Indian legal text. These foundational models can be fine-tuned for many task-specific applications.	Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law, ICAIL 2023. [Link]
MILDSum	A novel dataset of 3,122 Indian court judgments in English along with their summaries in both English and Hindi, drafted by legal practitioners. Can be used for training/evaluating models for cross-lingual summarization and translation in the legal domain.	MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of Indian Legal Case Judgments, EMNLP 2023. [Link]
TransDocAnalyser	The first hybrid (containing both handwritten and printed text) dataset for semi-structured document analysis, consisting of Indian legal documents (First Information Reports from several police stations). Can be used for document image segmentation, handwriting recognition, etc.	TransDocAnalyser: A framework for semi-structured offline handwritten documents analysis with an application to legal domain, ICDAR 2023. [Link]
Legal Case Document Similarity	Two datasets for the task of estimating the semantic similarity between two court case judgements, in the range [0, 1]. The datasets contain case document-pairs and a similarity value assigned by Law experts.	Legal Case Document Similarity: You Need Both Network and Text, Information Processing and Management 2022. [Link]
LeSICiN	Identifies relevant Indian Penal Code (IPC) Sections, given the natural language (English) description of a situation.	LeSICiN: A Heterogeneous Graph-based Approach for Automatic Legal Statute Identification from Indian Legal Documents, AAAI 2022. [Link]
Summarization of court case judgements	Three datasets for summarizing legal case judgements; implementations of several summarization algorithms and pretrained models for summarizing legal case judgements.	A Comparative Study of Summarization Algorithms applied to Legal Case Judgements, ECIR 2019. [Link]; Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation, AACL-IJCNLP 2022. [Link]
Catchphrase Identification	A supervised algorithm for extracting legal catchphrases from court case judgements.	A Sequence Labeling Model for Catchphrase Identification from Legal Case Documents. Artificial Intelligence and Law 2021. [Link]
Automatic Charge Identification from Facts	Identifies charges/crimes in Indian Penal Code, given the natural language (English) description of a situation.	Automatic Charge Identification from Facts: A Few Sentence-Level Charge Annotations is All You Need, COLING 2020. [Link]
AILA	Dataset for two tasks -- (1) Identifying relevant prior cases for a given situation, (2) Identifying most relevant statutes for a given situation. The datasets are based on legal documents (cases, statutes) from the Indian judicial system.	Overview of the FIRE 2019 AILA track: Artificial Intelligence for Legal Assistance, FIRE 2019. [Link]
Identifying the rhetorical role of sentences in court case judgements	A dataset of 50 case judgments of the Indian Supreme Court, where the rhetorical role of every sentence is labeled (by Law students), and implementation of our proposed model for identifying rhetorical role of sentences.	Identification of Rhetorical Roles of Sentences in Indian Legal Judgments, JURIX 2019. [Link]
IRLeD	Dataset for two tasks -- (1) Catchphrase extraction from Indian legal documents, (2) Identifying prior cases relevant to a given case.	Overview of the FIRE 2017 IRLeD Track: Information Retrieval from Legal Documents, FIRE 2017. [Link]
Automatic Catchphrase Identification	An unsupervised algorithm for extracting legal catchphrases from court case judgements.	Automatic Catchphrase Identification from Legal Court Case Documents, CIKM 2017. [Link]

AILaw-Lab

Indian Institute of Technology Kharagpur

Datasets and AI model implementations developed by our group