The Web Conference 2024 Tutorial:
Recent Advances in Generative Information Retrieval

1 CAS Key Lab of Network Data Science and Technology, ICT, CAS, University of Chinese Academy of Sciences; 2 Leiden University; 3 Shandong University; 4 University of Amsterdam

Tuesday, May 14, 13:30 - 17:00 (SST) @ Resorts World Convention Centre, Pisces 4

About this tutorial

Generative retrieval (GR) has recently become a highly active area of information retrieval (IR). In contrast to the traditional "index-retrieve-then-rank" pipeline, the GR paradigm aims to consolidate all information within a corpus into a single model. Typically, a sequence-to-sequence model is trained to map a query directly to the identifiers of its relevant documents (docids). This tutorial offers an introduction to the core concepts of the GR paradigm and a comprehensive overview of recent advances in its foundations and applications.
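To make the query-to-docid mapping described above concrete, here is a minimal sketch of one common way to restrict decoding to identifiers that exist in the corpus: a beam search constrained by a prefix tree over docids. It is an illustration under stated assumptions, not the tutorial's own implementation. The docids, the `<eos>` marker, and in particular `toy_score` (a stand-in for the log-probabilities a trained sequence-to-sequence model would produce) are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

END = "<eos>"  # hypothetical end-of-docid marker


def build_prefix_tree(docids: Sequence[Sequence[str]]) -> Dict:
    """Build a nested-dict prefix tree over docid token sequences."""
    root: Dict = {}
    for tokens in docids:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def allowed_next(tree: Dict, prefix: Sequence[str]) -> List[str]:
    """Return the tokens that may follow `prefix` in some valid docid."""
    node = tree
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []
    return list(node.keys())


def constrained_beam_search(
    query: str,
    tree: Dict,
    score_fn: Callable[[str, Tuple[str, ...], str], float],
    beam_size: int = 2,
    max_len: int = 8,
) -> List[Tuple[Tuple[str, ...], float]]:
    """Beam search that only expands prefixes present in the tree."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    finished: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok in allowed_next(tree, prefix):
                new_score = score + score_fn(query, prefix, tok)
                if tok == END:
                    finished.append((prefix, new_score))  # complete docid
                else:
                    candidates.append((prefix + (tok,), new_score))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]


def toy_score(query: str, prefix: Tuple[str, ...], token: str) -> float:
    """Hypothetical scorer: a real system would use model log-probabilities."""
    return 0.0 if token == END or token in query else -1.0


# Hypothetical word-based docids for a three-document corpus.
DOCIDS = [("neural", "ranking"), ("neural", "retrieval"), ("query", "logs")]
tree = build_prefix_tree(DOCIDS)
results = constrained_beam_search("neural retrieval models", tree, toy_score)
for docid, score in results:
    print(" ".join(docid), score)
```

Because every expansion is drawn from the prefix tree, each returned sequence is guaranteed to be a docid of some document in the corpus, regardless of how the scorer behaves.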

We start with preliminaries covering the foundations and problem formulations of GR. We then turn to recent progress in docid design, training approaches, inference strategies, and applications of GR. We end by outlining remaining challenges and issuing a call for future GR research. The tutorial is intended for both researchers and industry practitioners interested in developing novel GR solutions or applying them in real-world scenarios.

Slides

Section 1: Introduction
Section 2: Preliminaries
Section 3: Docid designs
Section 4: Training approaches
Section 5: Inference strategies
Section 6 & 7: Applications, challenges & opportunities

Schedule

Time | Section | Presenter
13:30 - 13:50 | Section 1: Introduction | Maarten de Rijke
13:50 - 14:20 | Section 2: Definition & Preliminaries | Zhaochun Ren
14:20 - 15:00 | Section 3: Docid designs | Yubao Tang
15:00 - 15:15 | 15-minute coffee break |
15:15 - 15:55 | Section 4: Training approaches | Weiwei Sun
15:55 - 16:15 | Section 5: Inference strategies | Weiwei Sun
16:15 - 16:35 | Section 6: Applications | Yubao Tang
16:35 - 16:50 | Section 7: Challenges & Opportunities | Maarten de Rijke
16:50 - 17:00 | Q & A | All

Reading List

The tutorial extensively covers papers highlighted in bold.


Section 3: Docid designs

3.1 Pre-defined docids

3.1.1 A single docid represents a document
3.1.1.1 Number-based docids

Unstructured atomic integers


Naively structured strings


Semantically structured strings


Product quantization strings


3.1.1.2 Word-based docids

Titles


URLs


Pseudo queries


Important terms


3.1.2 Multiple docids represent a document

3.1.2.1 Single type


3.1.2.2 Diverse types

3.2 Learnable docids

3.2.1 Repeatable learnable docids

3.2.2 Unique learnable docids


Section 4: Training approaches

4.1 Stationary scenarios

4.1.1 Supervised learning

4.1.2 Pre-training

4.1.3 Pairwise optimization

4.1.4 Listwise optimization

4.1.5 Multiple optimization


4.2 Dynamic scenarios


4.3 GR & QA


4.4 Large-scale corpora



Section 5: Inference strategies

5.1 A single docid represents a document

Constrained beam search with prefix tree


Constrained greedy search with inverted index


5.2 Multiple docids represent a document

Constrained beam search with FM-index


Aggregation functions



Section 6: Applications

6.1 Knowledge-intensive language tasks (KILT)


6.2 Multi-hop retrieval


6.3 Recommendation


6.4 Code retrieval


BibTeX

@inproceedings{tang-2023-recent,
      author = {Tang, Yubao and Zhang, Ruqing and Guo, Jiafeng and de Rijke, Maarten},
      booktitle = {SIGIR-AP 2023: 1st International ACM SIGIR Conference on Information Retrieval in the Asia Pacific},
      month = {November},
      publisher = {ACM},
      title = {Recent Advances in Generative Information Retrieval},
      year = {2023}
}