Bridging Search and Recommendation in Generative Retrieval: Does One Task Help the Other?

Gustavo Penha (Spotify, Netherlands), Ali Vardasbi (Spotify, Netherlands), Enrico Palumbo (Spotify, Italy), Marco de Nadai (Spotify, Denmark), Hugues Bouchard (Spotify, Spain)
{gustavop, aliv, enricop, mdenadai, hb}@spotify.com


Abstract.

Generative retrieval for search and recommendation is a promising paradigm for retrieving items, offering an alternative to traditional methods that depend on external indexes and nearest-neighbor searches. Instead, generative models directly associate inputs with item IDs. Given the breakthroughs of Large Language Models (LLMs), these generative systems can play a crucial role in centralizing a variety of Information Retrieval (IR) tasks in a single model that performs tasks such as query understanding, retrieval, recommendation, explanation, re-ranking, and response generation. Despite the growing interest in such a unified generative approach for IR systems, the advantages of using a single, multi-task model over multiple specialized models are not well established in the literature. This paper investigates whether and when such a unified approach can outperform task-specific models in the IR tasks of search and recommendation, broadly co-existing in multiple industrial online platforms, such as Spotify, YouTube, and Netflix. Previous work shows that (1) the latent representations of items learned by generative recommenders are biased towards popularity, and (2) content-based and collaborative-filtering-based information can improve an item’s representations. Motivated by this, our study is guided by two hypotheses: [H1] the joint training regularizes the estimation of each item’s popularity, and [H2] the joint training regularizes the item’s latent representations, where search captures content-based aspects of an item and recommendation captures collaborative-filtering aspects. Our extensive experiments with both simulated and real-world data support both [H1] and [H2] as key contributors to the effectiveness improvements observed in the unified search and recommendation generative models over the single-task approaches.

Generative Retrieval, Generative Recommendation, Joint Search and Recommendation, Multi-task Learning

Published in: Proceedings of the 18th ACM Conference on Recommender Systems (RecSys '24), October 14–18, 2024, Bari, Italy. DOI: 10.1145/3640457.3688123. ISBN: 979-8-4007-0505-2/24/10.

1. Introduction

Table 1. Examples of input and output for the generative recommendation (GenR) and generative retrieval (GenS) tasks.

Generative recommendation (GenR)
  Input:  tokens for the history items, e.g. $[\phi(item_1), \phi(item_2)]$
  Output: tokens for the target item, e.g. $\phi(item_3)$

Generative retrieval (GenS)
  Input:  tokens for the textual query, e.g. tokenize("brazilian jazz")
  Output: tokens for the relevant item, e.g. $\phi(item_3)$

The recent breakthroughs of Large Language Models (LLMs) have significantly influenced the field of Information Retrieval (IR). In search, where the task is to rank a set of relevant documents for a query, LLMs can be employed to learn better textual representations in both sparse (Formal et al., 2021; Nguyen et al., 2023; Lassance et al., 2024) and dense retrieval (Izacard et al., 2021; Karpukhin et al., 2020; BehnamGhader et al., 2024), re-rank items directly (Nogueira and Cho, 2019; Qin et al., 2023; Pradeep et al., 2023b), and support evaluation (Faggioli et al., 2023; Gilardi et al., 2023). Similarly, in recommendation systems, where the objective is to rank a set of relevant items for a user, LLMs can improve user and item textual representations (Wu et al., 2021; Liu et al., 2024), generate explanations for recommendations (Silva et al., 2024), and perform conversational recommendation (Penha and Hauff, 2020; Wang et al., 2023a).

However, employing LLMs to retrieve items (we use the terms item and document interchangeably in this paper to refer to the units of information that can be recommended and searched for) for a given query or user is challenging. Architectures based on LLMs such as Cross-encoders (Nogueira and Cho, 2019; Qin et al., 2023; Pradeep et al., 2023b) lack efficiency for large sets of items, since their self-attention mechanisms process the tokens of the query and the tokens of the documents together. Although Bi-encoders (Karpukhin et al., 2020) and Two-tower models (Yi et al., 2019) can employ LLMs for retrieval, they cannot leverage the attention across the query and document tokens (Lin et al., 2022). Generative retrieval (De Cao et al., 2020; Tay et al., 2022) has recently been introduced to perform retrieval directly within a pre-trained LLM. This method learns to predict unique document identifiers (IDs) for input queries, bypassing the need for separate text-based encodings of queries and documents. Thus, generative retrieval might be effectively applied in a multi-task learning framework (Metzler et al., 2021) to develop a unified model composed of a single LLM, rather than multiple task-specific models. Research indicates that a single model addressing both search and recommendation could enhance the effectiveness of both tasks (Zamani and Croft, 2018, 2020). However, such a multi-task approach remains under-explored in the generative IR literature.

This paper seeks to understand the circumstances under which a unified generative model for search and recommendation is beneficial for both tasks. To do so, we study how and when the search task can improve the effectiveness of the recommendation task and vice-versa. Table 1 shows examples of input and output for the recommendation task, in which a generative recommender generates item IDs based on past user interactions, and for the search task, in which a generative retrieval model generates item IDs for a given query. A joint generative model is trained for both tasks.

Previous work shows that (1) the latent representations of items learned by generative models are biased towards item popularity (measured by an item's number of interactions in the recommendation task, or by the number of queries leading to it in the search task) (Liu et al., 2023), and (2) content-based and collaborative-filtering-based information can improve an item's representations (Yu et al., 2012; Thorat et al., 2015; Parthasarathy and Sathiya Devi, 2023). Based on those observations, we formulate two guiding hypotheses for our experiments: [H1] the joint training regularizes the estimation of each item's popularity, and [H2] the joint training regularizes the item's latent representations. Our main findings based on simulated and real-world data are:

  • Improvements in the effectiveness of the joint generative model over the task-specific ones are evident in the simulated data for both hypotheses, provided that there is low KL divergence (Kullback and Leibler, 1951) between the popularity distributions of items in search and recommendation for [H1] and that the distribution of item co-occurrences aligns with the other task for [H2], thereby making the regularization effects helpful.

  • The joint training of generative retrieval models for both recommendation and search proves more effective than the task-specific models for both tasks across three real-world datasets, showing an average increase of 16% in R@30. Our follow-up analyses suggest that the regularization effect on the item’s latent representation (i.e., [H2]) is the primary reason the predictions of the joint generative model differ from those of the task-specific models.

2. Related Work

Generative Retrieval

Traditional information retrieval methods first encode all documents into a sparse or dense latent representation space and then find the closest documents to the query at test time. In contrast, generative retrieval models (De Cao et al., 2020; Tay et al., 2022; Zhang et al., 2023; Li et al., 2023; Wang et al., 2023c; Yang et al., 2023; Wang et al., 2022; Zeng et al., 2023) learn to directly map queries to document IDs, enabling end-to-end pipelines within a single LLM. One of the first retrieval methods to use a Transformer-based language model (LM) is GENRE (De Cao et al., 2020), which performs entity retrieval by directly outputting entity IDs for a query. GENRE predicts the entity name in an auto-regressive fashion, e.g. "Leonardo → Leonardo da → Leonardo da Vinci", using constrained beam search. Generative retrieval was popularized in IR by DSI (Tay et al., 2022), which uses a LM to predict relevant documents for a given query, where each document is represented by an ID (e.g. doc_01) rather than its text, leading to fewer tokens per document and enabling control over which tokens are shared across documents. Since weights are tied to document tokens, generative retrieval requires each document to be explicitly linked to model weights in the LM head.
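To make the decoding step concrete, below is a minimal sketch of ID-constrained decoding in the spirit of GENRE and DSI, assuming a Huggingface T5 model whose vocabulary has been extended with hypothetical <item_i> tokens; it is an illustration under those assumptions, not the exact setup of the cited papers.

```python
# A minimal sketch, under our own assumptions, of GENRE/DSI-style decoding:
# beam search is constrained so that only valid item-ID tokens can be emitted.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

# Hypothetical collection: one new atomic token per document (cf. Section 4).
item_tokens = [f"<item_{i}>" for i in range(100)]
tokenizer.add_tokens(item_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
item_token_ids = tokenizer.convert_tokens_to_ids(item_tokens)

def allowed_tokens(batch_id, generated_ids):
    # First decoder step (only the start token so far): emit an item token;
    # afterwards: close the sequence.
    if len(generated_ids) == 1:
        return item_token_ids
    return [tokenizer.eos_token_id]

query = tokenizer("brazilian jazz", return_tensors="pt")
beams = model.generate(
    **query,
    num_beams=10,
    num_return_sequences=10,  # top-10 valid item IDs
    prefix_allowed_tokens_fn=allowed_tokens,
)
```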

Tying documents to specific weights in the model requires documents to have been observed at training time in order to be predicted. For this reason, generative retrieval faces challenges in scalability and in the ingestion of new documents (Kishore et al., 2023; Mehta et al., 2022). Moreover, generative retrieval struggles to outperform and replace existing retrieval approaches when used for larger sets of items (Pradeep et al., 2023a). Recent studies indeed show that additional techniques are required to obtain competitive generative retrieval approaches (Zeng et al., 2023, 2024). In this paper, we focus on the effects that an additional retrieval or recommendation task might have when employing multi-task learning. We expect our results to be agnostic to architectural choices and other explorations in the literature aimed at improving the effectiveness and scalability of generative retrieval models.

Generative Recommendation

While search models retrieve documents based on queries, recommender systems retrieve items based on a user's past interactions. These systems may use textual metadata (content-based), historical interactions (collaborative filtering), or both. Traditional models such as the Two-tower model (Yi et al., 2019) learn embeddings for both users and items, and use neighbor search to retrieve a candidate set for recommendation, typically followed by a re-ranking step. In contrast, generative recommender systems can directly retrieve items from a collection for a given user in a single step. Early approaches relied only on user interactions, using technologies ranging from LSTMs (Hidasi et al., 2015) to Transformers (Kang and McAuley, 2018a) to learn a mapping between users and items. More recently, LLMs have gained popularity for their ability to mix item IDs and natural language (Fan et al., 2023; Lin et al., 2023). A notable example is P5 (Geng et al., 2022), which combines text and IDs in a pre-trained model to handle multiple recommendation tasks, such as sequential recommendation and rating prediction, within a multi-task setting. One prompt in P5 could be, for example, "Given the purchase history list of user_15466: 4110 → 4467 → 4468 → 4472. Find the next item.". While P5 uses random IDs, more sophisticated content-based ways to create semantic IDs have also been proposed (Rajput et al., 2023; Hua et al., 2023), paralleling semantic strategies used to represent document IDs for search. While generative search and recommendation have become popular but separate research directions (Li et al., 2024), we study here the effect of learning both in the same model.

Joint Search and Recommendation

Search and recommendation tasks share similarities, both in terms of the fundamental problem (Belkin and Croft, 1992), which is to identify objects that satisfy users' needs, and in the way models match queries or users (potentially with contextual information) with documents or items (Xu et al., 2018). In some domains and platforms, such as Spotify, YouTube, and Netflix, items can be accessed through textual search queries or through recommendations based on a user's historical interactions and current context. In such cases, a joint search and recommendation model could benefit from the signals coming from both ways of interacting with items (Si et al., 2023; Wang et al., 2012).

Gong et al. (2023) argued that in industrial applications, the data collected from either search or recommendation scenarios does not fully capture users' intents, and showed that a multi-task learning approach for seven IR tasks (e.g. Query-Item Relevance Prediction and Content Search Ranking) is more effective in an A/B test for the cold-start problem. Unlike personalized search systems (Dou et al., 2007), where users' previous preferences are used to improve search, and models that learn the transitions between search and recommendation surfaces (Shi et al., 2024; Yao et al., 2021), we focus here on the problem of learning better item representations with both search and recommendation data. The literature has also hypothesized that sharing training data from both tasks enhances the item representations (Zamani and Croft, 2018, 2020; Zhao et al., 2022b). Although there is evidence for the improved effectiveness of joint learning over individual models, it remains unknown whether this applies to generative models, and the underlying reasons each task could benefit from the other are not established. In this paper, we analyze how search and recommendation tasks could benefit from each other in generative retrieval, investigating the learned popularity distribution of items and the shared item representations in generative models.

3. Research Hypotheses

Table 2. Motivational examples for our two hypotheses.

[H1] Popularity
  Search (S) example: $Pop^{Train}_{S} = \{5\%, 40\%, 50\%, 5\%\}$
  Recommendation (R) example: $Pop^{Train}_{R} = \{20\%, 0\%, 50\%, 10\%\}$; $Pop^{Test}_{R} = \{10\%, 30\%, 40\%, 10\%\}$
  Effect of R+S joint training: $Pop^{Train}_{S+R} = \{12.5\%, 20\%, 50\%, 7.5\%\}$ gets closer to $Pop^{Test}_{R}$.

[H2] Latent rep. (S → R)
  Search (S) example: $q_1$ "medieval setting" and $q_2$ "historical fiction", both with relevant items $\{item_1, item_2\}$.
  Recommendation (R) example: $u_1$: $\{item_3 \rightarrow item_2 \rightarrow item_1\}$.
  Effect of R+S joint training: the embedding of $item_1$ using S+R becomes closer to $item_2$.

[H2] Latent rep. (R → S)
  Search (S) example: $q_3$ "ancient history", with relevant items $\{item_1, item_2, item_3\}$.
  Recommendation (R) example: $u_2$: $\{item_3 \rightarrow item_2\}$; $u_3$: $\{item_1 \rightarrow item_2\}$.
  Effect of R+S joint training: the embedding of $item_2$ using S+R becomes closer to $item_1$ and $item_3$, and consequently to the query $q_3$.
[Figure 1: popularity bias of the latent item representations learned by generative retrieval models, for both search and recommendation, on the ML dataset.]

Before defining our first hypothesis, let us discuss our motivation to investigate the popularity of items in generative retrieval. Previous work shows that item IDs in generative recommendation systems tend to exhibit a bias towards more popular items (Liu et al., 2023). In Figure 1 we show that this bias exists in both search and recommendation tasks, using the latent representations of items (learned by generative retrieval models) in ML, a dataset derived from MovieLens 25M (Harper and Konstan, 2015) (see Section 5). Note that many search datasets, such as MSMarco (Nguyen et al., 2016) and BEIR (Thakur et al., 2021), do not have unequal distributions of queries to documents. Considering the importance of item popularity in generative retrieval, our first hypothesis, [H1], is the following:

Hypothesis 1 (Popularity).

A joint model for search and recommendation regularizes the estimation of each item’s popularity.

To give a concrete example, let us analyze the first row of Table 2, which shows our motivational examples illustrating the main hypotheses. Consider a recommendation dataset comprising four items with a training popularity distribution of $Pop^{Train}_{R} = \{20\%, 0\%, 50\%, 10\%\}$ (third column of Table 2), indicating, for example, that the first item appears in 20% of the instances. Let us assume the training distribution differs from the distribution at test time, $Pop^{Test}_{R} = \{10\%, 30\%, 40\%, 10\%\}$. This represents a shift from training to test conditions (Quiñonero-Candela et al., 2022). Suppose that the search data's training set distribution is $Pop^{Train}_{S} = \{5\%, 40\%, 50\%, 5\%\}$, for example because users behave differently across surfaces. A model trained on both datasets (see the last column of the first row of Table 2) would retrieve the second item more frequently than a model trained solely on $Pop^{Train}_{R}$, leading to better effectiveness on its own shifted test set. For the regularization coming from [H1] to increase the effectiveness of the joint model in real-world applications, we hypothesize that the popularity distributions of search and recommendation have to be related but not equal (e.g. low KL divergence), and the individual datasets have to be insufficient for learning a robust estimation of the test set popularity of each item (e.g. a shift between train and test distributions exists).
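To make the arithmetic of this example explicit, the snippet below computes the joint training distribution (assuming, for illustration, an equal mix of the two training sets, as in Table 2) and compares how far each training distribution is from the recommendation test distribution; the helper names are ours.

```python
import numpy as np

def kl(p, q, eps=1e-9):
    # KL(p || q), smoothing q so that the 0% entry does not divide by zero.
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float) + eps; q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

pop_r_train = [0.20, 0.00, 0.50, 0.10]  # Pop_R^Train from the example
pop_s_train = [0.05, 0.40, 0.50, 0.05]  # Pop_S^Train
pop_r_test = [0.10, 0.30, 0.40, 0.10]   # Pop_R^Test (shifted)

# Assumed: the joint training data mixes both tasks in equal proportion.
pop_joint = [(r + s) / 2 for r, s in zip(pop_r_train, pop_s_train)]

print(kl(pop_r_test, pop_r_train))  # very large: R-only never saw item 2
print(kl(pop_r_test, pop_joint))    # much smaller: joint data closer to test
```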

Next, we discuss our second hypothesis, motivated by previous work on content-based and collaborative-filtering hybrids (Yu et al., 2012; Thorat et al., 2015; Parthasarathy and Sathiya Devi, 2023) and on joint search and recommendation models (Zamani and Croft, 2018, 2020):

Hypothesis 2 (Latent representation).

A joint model for search and recommendation regularizes the item’s latent representations, meaning that the patterns learned to position items across the latent space in one task can beneficially influence the other.

For example, consider the $S \rightarrow R$ direction (search helping recommendation), shown in the second row of Table 2, where recommendation is the target task (we refer to the target task as the one we are currently evaluating the joint model on). Suppose a scenario in which R aims to predict $item_1$ for user $u_1$, who previously interacted with $\{item_3 \rightarrow item_2\}$. The recommendation model might struggle to make this prediction if the pairs $\{item_3, item_1\}$ and $\{item_2, item_1\}$ have rarely (or never) been observed in the training dataset $D^{Train}_{rec}$. However, the search task could fill this data scarcity if $D^{Train}_{search}$ contains semantically similar queries that lead to those items. Consider, for example, that $item_1$ is relevant for the queries {"medieval setting", "historical fiction"}, and $item_2$ is relevant for the same set of queries. Consequently, the joint model may learn an embedding space where $item_1$ and $item_2$ are closer, leading to correctly predicting $item_1$ as a likely next item after $item_2$.

Let us now examine an example in the reverse direction, $R \rightarrow S$ (third row of Table 2), where the regularization coming from the recommendation objective might help the search task. Consider a test query $q_3$ "ancient history", for which the relevant documents are $\{item_1, item_2, item_3\}$, and suppose a learned retrieval model correctly retrieves $\{item_1, item_3\}$ (they are close to the query in the semantic space) but misses $item_2$. If the pairs $\{item_1, item_2\}$ and $\{item_2, item_3\}$ are prevalent in $D^{Train}_{rec}$, the joint model will push $item_2$ closer to $item_1$ and $item_3$, and consequently to the query, enhancing its retrieval accuracy.

4. Joint Search and Recommendation Generative Model

In generative retrieval, a function $\phi$ maps each item in the collection to its respective identifier, which might contain one or more tokens. The vocabulary of the underlying pre-trained LM is composed of the tokens that represent textual natural language and of the tokens used to represent the items in the collection. We use atomic IDs for $\phi$ (future work may explore replacing these atomic IDs with semantic IDs (Tay et al., 2022; Hua et al., 2023; Rajput et al., 2023), based on content or collaborative embeddings, to scale to a larger set of items), and thus we have one additional token per item in the vocabulary. Generative models are trained auto-regressively with teacher forcing, employing a cross-entropy loss between the predicted ID tokens and the ground-truth ID tokens. To perform retrieval, beam search is applied and the top K valid item IDs are returned.
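A minimal sketch of this setup, under assumed names and a hypothetical collection size: the vocabulary of a pre-trained LM is extended with one atomic token per item, and a single teacher-forced training step computes the cross-entropy loss over the ID tokens (Huggingface does so when `labels` is passed).

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

# Atomic IDs: one additional vocabulary token per item in the collection.
n_items = 1000  # hypothetical collection size
tokenizer.add_tokens([f"<item_{i}>" for i in range(n_items)], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

def phi(item):
    # phi maps an item to its (single-token) identifier.
    return f"<item_{item}>"

# One training step on a search instance (cf. Table 1); passing `labels`
# makes the model compute the teacher-forced cross-entropy loss.
inputs = tokenizer("brazilian jazz", return_tensors="pt")
labels = tokenizer(phi(3), return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss
loss.backward()
```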

Let $\mathcal{D_S} = \{(Q_i, \{item_1, item_2, ..., item_k\})\}_{i=1}^{N}$ be a search dataset comprised of relevance labels for queries, where $Q$ is the query and $\{item_1, item_2, ..., item_k\}$ are the items that are relevant for this query. To train a generative model on this dataset, each query turns into $k$ input-output pairs of the following format:

[(Q,ϕ(item1)),,(Q,ϕ(itemk))]𝑄italic-ϕ𝑖𝑡𝑒subscript𝑚1𝑄italic-ϕ𝑖𝑡𝑒subscript𝑚𝑘[(Q,~{}\phi(item_{1})),...,(Q,~{}\phi(item_{k}))][ ( italic_Q , italic_ϕ ( italic_i italic_t italic_e italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) , … , ( italic_Q , italic_ϕ ( italic_i italic_t italic_e italic_m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) ]

We refer to a generative model trained on $\mathcal{D_S}$ as GenS. Let $\mathcal{D_R} = \{(U_i = \{item_1, item_2, ..., item_{t-1}\}, item_t)\}_{i=1}^{M}$ be a recommendation dataset comprised of user interactions split into history and target pairs, where the history contains the previous interactions of the user sorted by time, and the target is the user's last interacted item. To train a generative model on this dataset, each user turns into one pair of the following format:

(concat([ϕ(item1),ϕ(item2),,ϕ(itemt1)]history),ϕ(itemt))target,(concat(\underbrace{[~{}\phi(item_{1}),~{}\phi(item_{2}),...,~{}\phi(item_{t-1%})]}_{history}),\underbrace{~{}\phi(item_{t}))}_{target},( italic_c italic_o italic_n italic_c italic_a italic_t ( under⏟ start_ARG [ italic_ϕ ( italic_i italic_t italic_e italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , italic_ϕ ( italic_i italic_t italic_e italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) , … , italic_ϕ ( italic_i italic_t italic_e italic_m start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) ] end_ARG start_POSTSUBSCRIPT italic_h italic_i italic_s italic_t italic_o italic_r italic_y end_POSTSUBSCRIPT ) , under⏟ start_ARG italic_ϕ ( italic_i italic_t italic_e italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) end_ARG start_POSTSUBSCRIPT italic_t italic_a italic_r italic_g italic_e italic_t end_POSTSUBSCRIPT ,

where $concat(.)$ is the concatenation of the item IDs with a space token. We refer to a generative model trained on $\mathcal{D_R}$ as GenR, and to a single generative retrieval model trained on both $\mathcal{D_R}$ and $\mathcal{D_S}$ as GenR+S. See Table 1 for examples.
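The following toy instantiation shows how $\mathcal{D_S}$ and $\mathcal{D_R}$ turn into (input, output) pairs under these definitions; the items, queries, and names are our own illustrations.

```python
# A toy instantiation of the definitions above; items and queries are ours.
def phi(item):
    return f"<item_{item}>"

# D_S: a query with k relevant items yields k (input, output) pairs.
d_s = [("brazilian jazz", [1, 2])]
gens_pairs = [(q, phi(i)) for q, items in d_s for i in items]
# -> [("brazilian jazz", "<item_1>"), ("brazilian jazz", "<item_2>")]

# D_R: a user history (sorted by time) yields one (history, target) pair,
# with the history IDs concatenated with a space token.
d_r = [([3, 2], 1)]  # user interacted with items 3 then 2; target is item 1
genr_pairs = [(" ".join(phi(i) for i in hist), phi(t)) for hist, t in d_r]
# -> [("<item_3> <item_2>", "<item_1>")]

genrs_pairs = gens_pairs + genr_pairs  # training set for GenR+S
```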

5. Experimental setup

In this section, we describe the datasets, the evaluation protocol, and the implementation details we use to analyze the effect of joint training of search and recommendation for generative retrieval.

Simulated datasets

We create three simulated datasets (available, together with their statistics, at https://anonymous.4open.science/r/simulated-deatasets-recsys24-DB90/stats.md) to analyze our hypotheses [H1] and [H2]. For each dataset, we hold the target task data constant and alter the data of the other task based on the specific parameter we want to analyze:

(i) SIM1: This dataset is used to test hypothesis [H1]. We modify the KL divergence between the popularity distributions of items to examine how differences in item popularity affect the model's performance. Specifically, we begin by defining a popularity distribution for the items based on Zipf's law (George, [n. d.]). Then, we incrementally shuffle the item probability distribution in the search dataset while maintaining a constant distribution in the recommendation dataset. This introduces increasing divergence between the two popularity distributions. The only learnable signal in both datasets is the popularity bias, allowing us to analyze the impact of one task on the other in a controlled manner, according to how different the popularity distributions are (see the sketch after this list).

(ii) SIM2: We vary the percentage of queries leading to the items in the user history that match (i.e., are the same as) the queries leading to the target items (% of q. match) to test [H2] in the $S \rightarrow R$ direction. We simulate a recommendation dataset composed of five clusters, each containing six items that frequently co-occur in user interaction data. User histories are created by randomly sampling an initial item and pairing it with a second item from the same cluster, repeating this until the desired number of interactions per user is reached. This setup encodes the co-occurrence of items within specific clusters as the learnable information in the dataset. When generating the search data, we modify how queries are distributed over items in the same clusters. For example, with % of q. match at 100%, all queries for items within a cluster are the same, whereas at 50% only half of the queries are the same, with the remainder being randomly selected.

(iii) SIM3: This dataset tests hypothesis [H2] in the $R \rightarrow S$ direction by altering the percentage of item pairs in the recommendation data that also appear together in relevant query lists (% pairs in qrels). We simulate a search scenario using a subset of ten randomly selected topics from TREC-DL22 (Craswell et al., 2023), providing realistic text for the queries. For each topic, we generate five paraphrased queries using OpenAI GPT-4. One of the paraphrases is designated as the test query, while the remaining four serve as training instances. The recommendation dataset is constructed with varying percentages of recommendation item pairs appearing in relevant groups of documents (% pairs in qrels). For example, consider a user with the interactions $\{item_1, item_2, item_3\}$ and a search dataset with one query-to-relevance list $(q, \{item_2, item_3\})$. Here, the percentage of relevant pairs is 33.33%: out of the three pairs coming from all combinations of items from the user interactions ($\{item_1, item_2\}$, $\{item_1, item_3\}$, and $\{item_2, item_3\}$), only one pair ($\{item_2, item_3\}$) appears in a set of relevant documents for a query.
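Below is the sketch of the SIM1 construction referenced in item (i). The exact sampling details are our own assumptions: popularity follows a Zipf-like law, and shuffling a growing fraction of the search-side probabilities increases the KL divergence from the fixed recommendation-side distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 50

ranks = np.arange(1, n_items + 1)
pop_rec = 1.0 / ranks
pop_rec /= pop_rec.sum()  # Zipf-like popularity, held fixed for R

def shuffled_search_pop(frac):
    # Permute the probabilities of a fraction `frac` of the items.
    pop = pop_rec.copy()
    idx = rng.choice(n_items, size=int(frac * n_items), replace=False)
    pop[idx] = rng.permutation(pop[idx])
    return pop

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

for frac in [0.0, 0.25, 0.5, 1.0]:
    print(frac, kl(pop_rec, shuffled_search_pop(frac)))  # KL grows with frac
```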

For both recommendation and search tasks, we generate datasets of comparable sizes of training instances. Each model is run five times, and we report the average recall (R@10) values to ensure statistical reliability. We describe in more detail the sampling procedure of each dataset when discussing the results of the simulated datasets.

Real-world datasets

We then use three real-world datasets, detailed in Table 3, to evaluate the effectiveness of the joint search and recommendation generative model. (i) ML: Derived from the MovieLens 25M (Harper and Konstan, 2015) dataset, employing user interactions for recommendation, and genres, tags, and genome-tags (Vig et al., 2012) as search queries. (ii) MPD: Constructed using the Million Playlist Dataset (Zamani et al., 2019; https://research.atspotify.com/2020/09/the-million-playlist-dataset-remastered/), where playlists serve as the recommendation dataset and unique playlist titles are used as search queries. In both ML and MPD we keep only items that appear in both datasets and ensure that all items in the test set appear in the training set. In addition to these public datasets, we created (iii) Podcasts to provide an even more realistic assessment of joint search and recommendation effectiveness. This dataset consists of a sample of real query logs and user interactions from Spotify (from 2023-07 to 2023-10), with podcast shows as the units of retrieval. In this dataset, 84% of the items in the search data appear in the recommendation data, and 30% of the items in the recommendation data appear in the search data, illustrating how behaviour can vary across different surfaces and samples. We filtered the search dataset to contain mostly broad queries that lead to successful search interactions, as the logs have a high percentage of narrow, known-entity queries (Penha et al., 2023).

Table 3. Dataset statistics for the recommendation and search tasks, and comparison of their popularity distributions.

           Recommendation                            Search                                              Popularity distributions
           # items  # users  Density  Gini index     # items  # queries  Avg. rel. per query  Gini index   KS dist(S,R)  KLD(S||R)  KLD(R||S)
ML         20k      163k     0.00750  0.74           20k      55k        6.86                 0.50         0.42          0.91       0.81
MPD        21k      85k      0.00055  0.52           21k      28k        36.44                0.58         0.22          0.14       0.15
Podcasts   43k      211k     0.00030  0.73           15k      4k         10.27                0.56         0.46          1.49       1.55

Evaluation

In the recommendation task (R), we use the user history to predict the last item (the target) in an ordered sequence of $t$ items. During training, we construct the user's history up to $t-3$ interactions and simulate additional training instances by sequentially predicting the last item and removing it from the sequence until only two items remain. The test data observes all interactions up to $t-1$ to predict the item at position $t$. The validation set predicts the item at position $t-1$ from the previous ones, while the training data does not see the last two interactions of each user. For the search task (S), we maintain distinct sets of queries for training, validation, and testing. Given our focus on the retrieval task, we rely on recall metrics: R@10 for the simulated datasets (we use a cutoff of 10 due to their small size, with fewer than 100 entities) and R@30 for the larger real-world datasets. We assess the statistical significance of our results using paired Student's t-tests with a 95% confidence interval.
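A sketch of this splitting protocol on a single hypothetical user sequence (function and variable names are ours):

```python
# Sketch of the leave-last-out protocol above for one hypothetical user.
def split_user(sequence):
    test = (sequence[:-1], sequence[-1])    # history up to t-1, target at t
    valid = (sequence[:-2], sequence[-2])   # target at position t-1
    train, hist = [], list(sequence[:-2])   # training never sees the last two
    while len(hist) > 2:
        train.append((hist[:-1], hist[-1])) # predict last item, then drop it
        hist = hist[:-1]
    return train, valid, test

train, valid, test = split_user([1, 2, 3, 4, 5, 6])
# train = [([1, 2, 3], 4), ([1, 2], 3)]; valid = ([1, 2, 3, 4], 5);
# test = ([1, 2, 3, 4, 5], 6)
```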

Implementation Details

We use Huggingface's implementation of T5 (Raffel et al., 2020) (google/flan-t5-base) and train all generative models for 5 epochs with a learning rate of 0.002 and a batch size of 128. We use AdamW (Loshchilov and Hutter, 2017) with a weight decay of 0.01. To increase the number of distinct items retrieved by the generative retrieval models, we resort to diversified beam search (Vijayakumar et al., 2016) with a diversity penalty of 0.25, where the number of beam groups equals half the number of desired returned items.
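Concretely, this decoding configuration can be expressed with Huggingface's group (diverse) beam search as follows, where K is the number of items to return and the model and query are placeholders:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

K = 30  # number of items to return
inputs = tokenizer("brazilian jazz", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=K,
    num_beam_groups=K // 2,   # number of groups = half the returned items
    num_return_sequences=K,
    diversity_penalty=0.25,
)
```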

The reference models for the search task are BM25 (Robertson and Walker, 1994) (PyTerrier (Macdonald et al., 2021) implementation with default hyperparameters) and a Bi-encoder (Song et al., 2020) (SentenceTransformers (Reimers and Gurevych, 2019) implementation using the pre-trained sentence-transformers/all-mpnet-base-v2). The reference models for the recommendation task are SASRec (Kang and McAuley, 2018b) and BERT4Rec (Sun et al., 2019), using RecBole implementations (Zhao et al., 2022a), and DiffRec (Wang et al., 2023b), using the authors' GitHub code. Although such neural baselines could be referred to as generative recommendation approaches, they do not leverage natural language text tokens and thus cannot handle the search dataset. PopR returns the most frequent items according to the recommendation training data, and PopS does so according to the search training data.

6. Results for Simulated Datasets

This section discusses the results from the simulated datasets. We begin by analyzing the item popularity under [H1] with the first simulated dataset (SIM1), followed by exploring the regularization effect on item representation for [H2] with SIM2 and SIM3.

6.1. SIM1: Item Popularity

We assess [H1] by training generative retrieval models on the datasets from the first simulation, each with a varying degree of KL divergence between the item popularity distributions. Figure 2 provides evidence supporting [H1]: learning both tasks within a single generative model regularizes the learned popularity distribution of the items. However, we note that there is no shift between train and test popularity distributions in this simulation. Additionally, the only signal in the simulated data is the popularity of each entity, unlike real-world datasets, where similar entities also co-occur more frequently.

[Figure 2: effectiveness on SIM1 as the KL divergence between the search and recommendation popularity distributions increases.]

6.2. SIM2: Item Representation (S \rightarrow R)

This section assesses our second hypothesis [H2], focusing on recommendation as the target task. The results for simulation SIM2, presented in Table 4, show that a mismatch of queries across clusters negatively affects the regularization, decreasing the effectiveness of GenR+S compared to GenR. Increasing the number of matching queries within clusters, however, improves performance relative to the baseline. This finding supports [H2], suggesting that the regularization is beneficial particularly when there is a high degree of similarity between the search and recommendation data in terms of item co-occurrences (% of q. match). To visually assess the impact of joint training, we compare, in Figure 3, the item embeddings obtained from a recommendation-only model with those obtained by a joint model that incorporates both search (at 100% of q. match) and recommendation data. The joint model groups items according to their underlying clusters more effectively than the model trained solely on recommendation data.

[Figure 3: item embeddings of the recommendation-only model compared to the joint model (at 100% of q. match) on SIM2.]
Table 4. Results (R@10) for SIM2 (S → R) and SIM3 (R → S).

Sample   Model    % of q. match   R@10 (S → R)    Model    % pairs in qrels   R@10 (R → S)
100%     GenR     -               0.907           GenS     -                  0.580
         GenR+S   0%              0.842           GenR+S   0%                 0.240
         GenR+S   25%             0.880           GenR+S   25%                0.280
         GenR+S   50%             0.938           GenR+S   50%                0.280
         GenR+S   75%             0.944           GenR+S   75%                0.460
         GenR+S   100%            0.985           GenR+S   100%               0.640
65%      GenR     -               0.717           GenS     -                  0.320
         GenR+S   0%              0.732           GenR+S   0%                 0.200
         GenR+S   25%             0.752           GenR+S   25%                0.260
         GenR+S   50%             0.708           GenR+S   50%                0.220
         GenR+S   75%             0.903           GenR+S   75%                0.400
         GenR+S   100%            0.897           GenR+S   100%               0.380

6.3. SIM3: Item Representation (R \rightarrow S)

In this section, we validate our second hypothesis focusing on search as the target task. Table 4 shows the results for simulation SIM3. They indicate that the joint model performs more effectively than the model trained solely on the search data (GenS) when the % pairs in qrels exceeds a certain threshold. This finding supports [H2], suggesting that the multi-task learning objective can serve as an effective regularizer for item representations. This regularization is particularly beneficial when there is a high degree of similarity between the co-occurrences of items in the search and recommendation data.

7. Results for Real-World Datasets

In this section, we present the results on real-world data in a question-and-answer format to provide a structured understanding of the insights derived from the datasets. Table 5 shows the main results for the three real-world datasets (Head indicates the effectiveness for the top 1% most popular items in the training set, while Torso is the remaining set of items). Table 5 also contains baselines as a reference: PopR, PopS, SASRec (Kang and McAuley, 2018b), BERT4Rec (Sun et al., 2019), and DiffRec (Wang et al., 2023b) for the recommendation task, and BM25 (Robertson and Walker, 1994) and Bi-encoder (Song et al., 2020) for the search task. The aim of this research is not to provide a new generative model that is more effective than the non-generative counterparts, but to bridge the gap in understanding when and why jointly training on search and recommendation benefits a generative model.

What is the effectiveness of the joint model compared to the task-specific models?

Table 5. Main results (R@30) for the three real-world datasets. Head is the top 1% most popular items in the training set; Torso is the remainder.

Recommendation
                   ML                          MPD                         Podcasts
                   All     Head    Torso       All     Head    Torso       All     Head    Torso
Generative retrieval methods
GenR               0.103   0.267   0.057       0.067   0.269   0.043       0.112   0.334   0.018
GenR+S (improv.)   0.119   0.307   0.066       0.055   0.333   0.021       0.149   0.345   0.067
                   (+16%)  (+15%)  (+16%)      (-18%)  (+24%)  (-50%)      (+33%)  (+3%)   (+262%)
Popularity and neural-based methods (for reference)
PopR               0.069   0.312   0.000       0.028   0.259   0.000       0.073   0.247   0.000
PopS               0.062   0.264   0.005       0.026   0.241   0.000       0.029   0.091   0.003
SASRec             0.207   0.356   0.164       0.231   0.462   0.205       0.256   0.546   0.134
BERT4Rec           0.247   0.401   0.203       0.151   0.460   0.115       0.260   0.518   0.151
DiffRec            0.211   0.528   0.121       0.215   0.335   0.201       0.280   0.442   0.212

Search
                   ML                          MPD                         Podcasts
                   All     Head    Torso       All     Head    Torso       All     Head    Torso
Generative retrieval methods
GenS               0.020   0.053   0.000       0.028   0.040   0.002       0.080   0.148   0.004
GenR+S (improv.)   0.023   0.060   0.001       0.033   0.044   0.009       0.106   0.171   0.034
                   (+16%)  (+12%)  (-)         (+18%)  (+11%)  (+373%)     (+33%)  (+16%)  (+855%)
Sparse and dense methods (for reference)
BM25               0.091   0.060   0.109       0.032   0.017   0.065       0.304   0.237   0.383
Bi-encoder         0.081   0.062   0.092       0.032   0.017   0.065       0.269   0.212   0.336

In most scenarios, the joint model is more effective than the task-specific models for search and recommendation tasks. One exception is observed in the MPD dataset, where GenR+S only outperforms the recommendation-specific model (GenR) for Head items (R@30 0.333). These results underscore that the search and recommendation tasks mutually benefit each other in generative retrieval, with an average increase of 16% in R@30.

What are the discrepancies between task-specific and joint model predictions?

Table 6. Change in the average popularity of predicted items when moving from the task-specific to the joint model.

            Recommendation: Δ Pop_S (GenR → GenR+S)   Search: Δ Pop_R (GenS → GenR+S)
ML          -2.72%                                    +33.23%
MPD         +0.92%                                    -7.36%
Podcasts    -17.46%                                   -13.33%

Motivated by [H1], we explore the popularity of the items retrieved by the generative models. Specifically, we compute the increase in popularity of the predicted items when transitioning from a task-specific model to the joint model. For example, if the target task is recommendation, we measure the increase in popularity, according to the search data, of the items retrieved by GenR+S compared to those retrieved by GenR. The results, shown in Table 6, indicate that joint training does not always lead to an increase in the average popularity of the predicted items (according to the added task). This could indicate that the effect on the item's latent representation coming from [H2] is entangled with, and thus interfering with, the popularity distribution.
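A sketch of how this Δ Pop measurement can be computed, under assumed inputs (a popularity lookup for the added task's data and the ranked lists retrieved by the two models):

```python
# Sketch of the Δ Pop computation; `pop` maps an item to its popularity in
# the added task's data, and `preds_*` are the ranked lists retrieved by the
# two models for the same test instances. Names are our assumptions.
def avg_pop(preds, pop):
    items = [item for ranked in preds for item in ranked]
    return sum(pop.get(item, 0.0) for item in items) / len(items)

def delta_pop(preds_single, preds_joint, pop):
    # Relative change in average popularity, task-specific -> joint model.
    base = avg_pop(preds_single, pop)
    return (avg_pop(preds_joint, pop) - base) / base
```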

Table 7. Changes in the two item-representation metrics when the joint model's predictions differ from the task-specific models'.

            Recommendation: Δ # matches of history → target queries   Search: Δ item pairs from R in rel. pairs for queries
ML          +183.09%                                                  +136.47%
MPD         +210.09%                                                  +213.01%
Podcasts    +170.92%                                                  +314.21%

Building on [H2], we examine the effect of regularization on items’ representations within the joint model. Here, we focus on the differences between the predictions of the task-specific and joint models. We measure two metrics: one for the recommendation task and another for the search task.

For the recommendation task, for each predicted list from GenR+S that differs from GenR's, given the ground-truth target item $item_t$ and the history $\{item_1, ..., item_{t-1}\}$, we count all the queries that have both $\{item_i, item_t\}$ ($i < t$) in their relevance set (Δ # matches of history → target queries). For example, if the user history includes $\{item_1, item_2\}$ and the target is $item_3$, and in the search data there are respectively 2 and 3 unique queries that have the items $\{item_1, item_3\}$ and $\{item_2, item_3\}$ in their corresponding relevance set, this value would be 5.

The second metric, Δ item pairs from R in rel. pairs for queries, applicable to the search task (second column of Table 7), calculates the average number of co-occurrences in the recommendation data of the pairs of relevant items for a given query. For example, if the relevant items for a query are $\{item_1, item_3, item_5\}$ and they co-occur in the recommendation data as follows: $\{item_1, item_3\}$: 3, $\{item_1, item_5\}$: 4, $\{item_3, item_5\}$: 2, the average score for the query would be $(3+4+2)/3$. The findings, detailed in Table 7, reveal that when the joint model predicts different items than the task-specific models, both metrics increase significantly. This indicates that the regularization of the items' latent representations significantly alters the joint model's predictions, providing strong support for our second hypothesis.
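Both metrics can be sketched as follows, mirroring the worked examples above; the data structures (qrels as a query-to-relevant-items mapping, histories as lists of interacted items) are our assumptions:

```python
from collections import Counter
from itertools import combinations

def history_target_matches(history, target, qrels):
    # Count (query, history item) matches: queries whose relevance set
    # contains both a history item and the target.
    return sum(
        1
        for relevant in qrels.values()  # qrels: query -> set of relevant items
        for item in history
        if item in relevant and target in relevant
    )

def avg_pair_cooccurrence(relevant_items, histories):
    # Average co-occurrence in R of the pairs of relevant items for a query.
    counts = Counter(
        pair
        for hist in histories
        for pair in combinations(sorted(set(hist)), 2)
    )
    pairs = list(combinations(sorted(set(relevant_items)), 2))
    return sum(counts[p] for p in pairs) / len(pairs)
```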

[Figure 4: effectiveness of the joint model when the number of training instances per item in the added task is limited.]

What is the effect of removing the popularity information from the added task?

Although disentangling the effects of both hypotheses in the real-world datasets is challenging, we attempt to remove the popularity bias by limiting the number of training instances per item. Restricting the training data to a fixed number of instances per item effectively removes the popularity bias in the dataset, although it also modifies the item representations and co-occurrences for frequently occurring items. The results displayed in Figure 4 show a decrease in effectiveness under these conditions, supporting [H1]: the regularization effect on the learned popularity distribution contributes to the gains of the joint model.
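A sketch of this intervention, assuming training instances of the added task are (input, target item) pairs:

```python
from collections import defaultdict

def cap_per_item(instances, max_per_item):
    # Keep at most `max_per_item` training instances per target item,
    # flattening the popularity distribution of the added task's data.
    kept, seen = [], defaultdict(int)
    for inp, target in instances:
        if seen[target] < max_per_item:
            kept.append((inp, target))
            seen[target] += 1
    return kept
```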

What is the effect of the redundancy of co-occurring items across the search and recommendation datasets?

To further understand when the joint model is more effective, we split the test instances based on redundant and non-redundant pairs of items. A pair of items is redundant if it appears both in the search training data, as relevant for a single query, and in the recommendation training data, as part of the same user's interaction history. A non-redundant pair does not appear in the training set of the target task but appears in the other task's training data. While redundant pairs can reinforce useful relationships between items, non-redundant pairs can fill in patterns that the original task's dataset did not contain.
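A sketch of this split, assuming the co-occurring item pairs of each task's training data have already been extracted:

```python
def split_pairs(pairs_search_train, pairs_rec_train, target_task="rec"):
    # Redundant: pairs observed in both training sets; non-redundant: absent
    # from the target task's data but present in the other task's data.
    search, rec = set(pairs_search_train), set(pairs_rec_train)
    redundant = search & rec
    non_redundant = (search - rec) if target_task == "rec" else (rec - search)
    return redundant, non_redundant
```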

The statistics for the two types of pairs on the Podcasts dataset, shown in Table 8, indicate that the number of pairs originating from the search dataset that were not present in the recommendation training data is relatively low (only 56 non-redundant pairs). This suggests that the observed gains on the Podcasts dataset for the recommendation task are likely not due to learning new relationships between pairs (that were not present in the recommendation data) but rather to reinforcing existing ones (a 31.81% improvement for redundant pairs). Conversely, for the search task, the gains of the joint model over the task-specific one are higher for non-redundant pairs. This indicates that the regularization effect of [H2] can be beneficial for both redundant and non-redundant pairs of items.

Table 8. Redundant and non-redundant item pairs for the Podcasts dataset.

                                        Recommendation    Search
Redundant # pairs                       19336             4146
Improv. of GenR+S for redundant         31.81%            61.39%
Non-redundant # pairs                   56                11222
Improv. of GenR+S for non-redundant     0.00%             86.76%

8. Conclusion

In this paper, we have explored the impact of multi-task learning in generative retrieval models, focusing on the search and recommendation tasks. We investigated two hypotheses that might explain the effectiveness improvements of multi-task learning for search and recommendation. The first is that a joint model can achieve a more robust estimation of each item's popularity, i.e., its frequency of appearance in user interactions and queries. The second is that the joint learning objective provides a beneficial regularization effect on the items' latent representations.

Using simulated datasets, we provided evidence supporting both hypotheses. We also identified scenarios where improvements might not manifest, e.g., when the items' popularity distributions diverge strongly across tasks or when the items' co-occurrences do not align across tasks. Additionally, we used three different real-world datasets to show that the multi-task generative retrieval model yields effectiveness improvements over the task-specific models, with an average increase of +16% in R@30. Our analysis detailed the conditions under which the joint model outperforms task-specific models, examining changes in predictions, the impact of removing popularity bias from the training data, and the effect of item-pair redundancy across tasks.

We believe this research marks a significant step towards developing unified LLMs for a broad range of IR tasks, shedding light on the specific contributions of the search and recommendation tasks to generative retrieval. For future research, we plan to explore the effect of integrating additional tasks, such as generating explanations, within a unified multi-task LLM for IR, and to examine ID strategies (how to represent each item as a set of tokens) in the multi-task learning framework.
