AILaw-Lab

Indian Institute of Technology Kharagpur

Resource Description Publication
IL-PCSR Dataset for jointly identifying relevant precedents (prior case judgements) and relevant legal statutes for a given situation (query). Precedent identification and statute identification have long been studied separately; this is the first dataset to address both tasks together. Accepted at EMNLP 2025 (to appear).
MARRO A dataset of more Indian Supreme Court judgements and a set of UK Supreme Court judgements, where the rhetorical role of every sentence is labeled (by Law students). MARRO: Multi-headed Attention for Rhetorical Role Labeling in Legal Documents, Artificial Intelligence and Law 2025. [Link]
IL-TUR It contains monolingual (English, Hindi) and multi-lingual (9 Indian languages) domain-specific tasks from the point of view of understanding and reasoning over Indian legal documents. IL-TUR: Benchmark for Indian Legal Text Understanding and Reasoning, ACL 2024. [Link]
Legal Statute Identification Identifies relevant statutes given the natural language (English) description of a situation. Experiments on Indian and European cases and statutes. Legal Statute Identification: A Case Study using State-of-the-Art Datasets and Methods, SIGIR 2024. [Link]
InLegalTrans-En2Indic-1B A fine-tuned version of the IndicTrans2 model specifically tailored for translating Indian legal texts from English to Indian languages. MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages, ACM TALLIP 2024. [Link]
MILPaC The first parallel corpus for evaluating Machine Translation systems on translating legal text from English to nine Indian languages. Can also be used to evaluate MT systems on translating from one Indian language to another. MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages, ACM TALLIP 2024. [Link]
CustomInLawBERT
InCaseLawBERT
InLegalBERT [1.75 million downloads]
BERT-based language models pre-trained extensively over Indian legal text. These foundational models can be fine-tuned for many task-specific applications. Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law, ICAIL 2023. [Link]
MILDSum A novel dataset of 3,122 Indian court judgments in English along with their summaries in both English and Hindi, drafted by legal practitioners. Can be used for training/evaluating models for cross-lingual summarization and translation in the legal domain. MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of Indian Legal Case Judgments, EMNLP 2023. [Link]
TransDocAnalyser First hybrid (containing both handwritten and printed text) semi-structured document analysis dataset consisting of Indian legal documents (First Information Reports from several police stations). Can be used for document image segmentation, handwriting recognition, etc. TransDocAnalyser: A framework for semi-structured offline handwritten documents analysis with an application to legal domain, ICDAR 2023. [Link]
Legal Case Document Similarity Two datasets for the task of estimating the semantic similarity between two court case judgements, in the range [0, 1]. The datasets contain case document-pairs and a similarity value assigned by Law experts. Legal Case Document Similarity: You Need Both Network and Text, Information Processing and Management 2022. [Link]
LeSICiN Identifies relevant Indian Penal Code (IPC) Sections, given the natural language (English) description of a situation. LeSICiN: A Heterogeneous Graph-based Approach for Automatic Legal Statute Identification from Indian Legal Documents, AAAI 2022. [Link]
Summarization of court case judgements Three datasets for summarizing legal case judgements; implementations of several summarization algorithms and pretrained models for summarizing legal case judgements. A Comparative Study of Summarization Algorithms applied to Legal Case Judgements, ECIR 2019. [Link] Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation, AACL-IJCNLP 2022. [Link]
Catchphrase Identification A supervised algorithm for extracting legal catchphrases from court case judgements. A Sequence Labeling Model for Catchphrase Identification from Legal Case Documents, Artificial Intelligence and Law 2021. [Link]
Automatic Charge Identification from Facts Identifies charges/crimes under the Indian Penal Code, given the natural language (English) description of a situation. Automatic Charge Identification from Facts: A Few Sentence-Level Charge Annotations is All You Need, COLING 2020. [Link]
AILA Dataset for two tasks -- (1) Identifying relevant prior cases for a given situation, (2) Identifying most relevant statutes for a given situation. The datasets are based on legal documents (cases, statutes) from the Indian judicial system. Overview of the FIRE 2019 AILA track: Artificial Intelligence for Legal Assistance, FIRE 2019. [Link]
Identifying the rhetorical role of sentences in court case judgements A dataset of 50 case judgments of the Indian Supreme Court, where the rhetorical role of every sentence is labeled (by Law students), and an implementation of our proposed model for identifying the rhetorical roles of sentences. Identification of Rhetorical Roles of Sentences in Indian Legal Judgments, JURIX 2019. [Link]
IRLeD Datasets for two tasks: (1) Catchphrase extraction from Indian legal documents, (2) Identifying prior cases relevant to a given case. Overview of the FIRE 2017 IRLeD Track: Information Retrieval from Legal Documents, FIRE 2017. [Link]
Automatic Catchphrase Identification An unsupervised algorithm for extracting legal catchphrases from court case judgements. Automatic Catchphrase Identification from Legal Court Case Documents, CIKM 2017. [Link]
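
Several of the resources listed here (for example IL-PCSR, AILA and IRLeD) describe tasks in which a system receives a situation (query) and must identify the relevant prior cases and/or statutes. The sketch below shows one generic way to score such output when a system returns a ranked candidate list per task. It is an illustration only: the identifiers (`case_12`, `statute_A`, ...), the dictionary layout and the choice of precision@k and average precision are all assumptions made for this example, and do not reflect the file format or the official evaluation protocol of any dataset above.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked candidates that are in the gold set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision values taken at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical gold labels for ONE query: relevant precedents and statutes.
gold: Dict[str, Set[str]] = {
    "precedents": {"case_12", "case_40"},
    "statutes": {"statute_A"},
}

# Hypothetical system output: one ranked candidate list per task.
ranked: Dict[str, List[str]] = {
    "precedents": ["case_12", "case_07", "case_40", "case_03"],
    "statutes": ["statute_B", "statute_A", "statute_C"],
}

# Score both tasks for the same query.
scores = {
    task: {
        "P@2": precision_at_k(ranked[task], gold[task], k=2),
        "AP": average_precision(ranked[task], gold[task]),
    }
    for task in gold
}

for task, task_scores in scores.items():
    print(task, task_scores)
```

Keeping both tasks under one query record makes it straightforward to report per-task scores side by side, which is one reasonable way to compare a joint model against two separately trained ones.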