Lectures

Advanced Information Retrieval Models Ounis, Iadh
The development of effective information retrieval (IR) models has been at the forefront of IR research in the past three decades. Essentially, a retrieval model specifies how documents and queries should be represented and how relevance should be decided. IR models have undergone several major paradigm shifts over the history of the field, moving from simple but theoretically founded models to more complex machine-learned ranking models, through sophisticated probabilistic models. This lecture covers a number of advanced and effective search models, including field-based models, proximity models and state-of-the-art learning to rank approaches, focussing on their intuitions, assumptions, their salient characteristics and their uses.

 

Building digital archive systems for historians Jieh Hsiang
The maturity of digitization technologies has provided historians with an unprecedented amount of historical archives in digital form.   Generally speaking, however, the systems that are built for using these digital archives have not taken into consideration the special needs of historical research.
Conventional wisdom dictates that a retrieval system should yield high precision for a casual user while providing high recall for scholarly use. In order to ensure high precision, a retrieval system assumes that documents are independent from (or even competing with) each other so that those deemed more relevant to the query will emerge on top of the resulting list. High recall is then achieved through enhancements such as query refinement or a thesaurus to get as many potentially relevant documents as possible.
We noticed from our interaction with historians that they consider high recall desirable but not crucial. While there is always the concern of missing something important, many historians are also worried about being overwhelmed by the large number of query returns resulting from high recall. A more fundamental issue that had been overlooked, however, is that historians usually consider documents as related, not independent as assumed by document retrieval systems. Indeed, a historian rarely looks at a single document alone, but rather at a group of documents, searching for properties that the documents collectively possess. In this sense historical study is the study of context: context among documents; context between documents and the intangible societal, cultural and historical factors; or even context observed from missing objects. To respond to this challenge, we suggest that a digital archives system for scholarly use should provide the user with the collective meanings of a query result set, and not just high recall.
The underlying design philosophy of the digital archive systems that we describe in this talk is to treat a set of query returns as a meaningful sub-collection. Instead of only ranking the documents (as is done in the precision/recall model), the proposed methodology returns a list of documents when issued a query, together with a choice of textual contexts of the returned sub-collection as well as visual mechanisms to observe, explain, and refine the contexts. In other words, in addition to providing search and retrieval, the system also allows the user to observe, analyze, explore, and discover textual contexts among the retrieved documents. The contexts and their visualization are expressed in such a way that they can be interpreted by the user. Thus, the historian can observe and explore relationships among documents at an unprecedented scale and with unprecedented ease. Through exploring and analyzing the contexts, the historian can also discover research issues that had not been noticed before. What we are advocating, then, is the system as a digital platform that evolves from providing information to discovering research problems. We emphasize that the system is an observation environment and is not meant to replace the historian’s decision-making process. A digital system can provide textual context, that is, observable contexts within the documents, but not contexts that associate the documents with the external world. The interpretations of the findings and the narratives are still within the scholar’s realm.
In this talk we shall give a sampling of digital archive systems designed under this principle. Although the historical materials in our systems are mostly in Chinese and are text-based (mainly Taiwanese history, Chinese history, and Chinese classical writings), the principle is the same regardless of the language or nature of the archives. We shall introduce three types of textual contexts: metadata contexts, statistical contexts, and expanding contexts, and will discuss why they are the kind of “explainable textual context” that is useful and friendly to historians. Some typical contexts are chronological distribution, geographical distribution, term frequencies and co-occurrences of people and places, relevance factors, appositional term analysis, holistic comparisons of query returns, land transition graphs, and co-citation diagrams. Future directions will also be outlined.

 

Evaluation I: system oriented Tetsuya Sakai
In this lecture, I will talk about some evaluation measures for modern information access tasks, how to choose from available measures, and how to report experimental results in technical papers. 
Keywords: average precision; normalised discounted cumulative gain; expected reciprocal rank; normalised cumulative utility measures; diversified search measures; time-biased gain; U-measure; statistical significance tests, p-values, effect sizes, confidence intervals.
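As a rough illustration of two of the measures listed above, the following minimal Python sketch computes average precision and normalised discounted cumulative gain for a single ranked list. The function names are chosen here for illustration, and the DCG variant used (linear gain with a log2(rank + 1) discount) is only one of several formulations in the literature; the lecture may use a different one.

```python
import math


def average_precision(ranked_rels, num_relevant):
    """Average precision for one ranked list.

    ranked_rels: binary relevance flags (1/0) in rank order.
    num_relevant: total number of relevant documents for the topic.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / num_relevant if num_relevant else 0.0


def dcg(gains):
    """Discounted cumulative gain (assumed variant: gain / log2(rank + 1))."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains, all_gains, k=None):
    """nDCG at cutoff k: DCG of the ranking divided by DCG of an ideal ranking.

    ranked_gains: graded gain values of the returned documents, in rank order.
    all_gains: gain values of all judged documents for the topic.
    """
    k = k or len(ranked_gains)
    ideal_dcg = dcg(sorted(all_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```

For example, `average_precision([1, 0, 1, 0], 2)` averages the precision values at ranks 1 and 3, and `ndcg(gains, gains)` equals 1.0 whenever `gains` is already sorted in decreasing order.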

 

Introduction to Information Retrieval Ounis, Iadh
Information Retrieval (IR) is the science behind effective search systems such as Google and Bing. Search systems are increasingly prevalent, enabling us to quickly find information relevant to all aspects of our daily life. This lecture will provide an overview of a classical search system, describe where such systems are used, and discuss the challenges that these systems face when attempting to answer users' queries accurately and quickly. In particular, the lecture will describe how large corpora of documents are processed and represented using structures that can be searched efficiently. It will also discuss the core text processing and retrieval techniques that enable search engines to find relevant documents for a user query. The lecture will conclude with current topics in information retrieval, as well as a summary of current resources that are useful for learning about and developing IR systems.
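One common example of such a structure is the inverted index, which maps each term to the documents that contain it. The sketch below is a minimal illustration only, assuming a toy tokenizer and Boolean AND matching; the function names are hypothetical and it is not meant to represent the specific system covered in the lecture.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Toy tokenizer: lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index mapping each term to the set of doc ids containing it.

    docs: dict of doc_id -> document text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

A query is answered by intersecting the posting sets of its terms, so only documents containing the query terms are touched rather than the whole collection. Real systems add term weighting and ranking on top of this lookup.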

 

Linguistic Information Retrieval and Interactive Writing Environment Chang, Jason S.
In this talk, I will introduce two systems for linguistic information retrieval: Linggle and WriteAhead, each with a different design philosophy and operating environment. Linggle is a Web-scale linguistic search engine that retrieves short phrases in response to a given query. Unlike a typical concordancer, Linggle accepts queries containing keywords, wildcards, wildcard parts of speech (PoS), and synonymous words, and returns short phrases with frequency counts. In our approach, we augment the Google Web 1T corpus with full query indexing that supports syntactic and semantic queries. The method involves converting each phrase in the data collection to all potential queries for indexing, and using the MapReduce framework to precompute the top-ranking results of all queries. At run time, only minimal query processing is required for some query operators, resulting in lightning-speed retrieval: Linggle shows the search results as the user types.
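The idea of converting each phrase to all potential queries and precomputing top results can be sketched roughly as follows. This is a simplified, hypothetical illustration and not Linggle's actual implementation: it handles only word-level wildcards (written here as `_`, an assumed notation), ignores part-of-speech and synonym operators, and uses plain in-memory Python in place of MapReduce.

```python
from collections import defaultdict
from itertools import product

WILDCARD = "_"  # assumed wildcard symbol for this sketch


def expand_queries(phrase):
    """Generate every query that matches the phrase by wildcarding some tokens.

    The all-wildcard query is skipped, since it would match every phrase.
    """
    tokens = phrase.split()
    queries = set()
    for mask in product([False, True], repeat=len(tokens)):
        if all(mask):
            continue
        queries.add(" ".join(WILDCARD if m else t for t, m in zip(tokens, mask)))
    return queries


def build_query_index(ngram_counts, top_k=3):
    """Precompute, for each potential query, its top_k phrases by frequency.

    ngram_counts: dict of phrase -> frequency count.
    """
    index = defaultdict(list)
    for phrase, count in ngram_counts.items():
        for query in expand_queries(phrase):
            index[query].append((phrase, count))
    # Sort by descending count, breaking ties alphabetically, and keep top_k.
    return {
        query: sorted(hits, key=lambda hit: (-hit[1], hit[0]))[:top_k]
        for query, hits in index.items()
    }
```

With results precomputed this way, answering a query such as `play a _` at run time reduces to a single dictionary lookup, which is the trade-off the abstract describes: more indexing work up front in exchange for very fast retrieval.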
The second linguistic IR application, WriteAhead, is an Interactive Writing Environment (IWE), much like a programmer's Integrated Development Environment (IDE). An IWE provides writing suggestions for writers, while an IDE provides a code editor, a debugger, and intelligent code completion for computer programmers. WriteAhead does two things really well. First, it examines the unfinished sentence you just typed and automatically gives you tips in the form of grammar patterns (accompanied by examples similar to those found in a good dictionary) for continuing or editing your sentence. Second, WriteAhead automatically ranks suggestions relevant to the text around the cursor, so you spend less time looking at tips and can focus more on writing your text.
To help second language (L2) learners in the IWE, we introduce a method for learning to find complementation patterns from a corpus, with the aim of helping L2 learners write fluently and avoid errors. In our approach, phrase chunks are transformed into patterns for statistical analysis and filtering. The method involves learning regular expression templates from dictionary examples, extracting and filtering grammar patterns using the templates, and selecting authentic examples from an academic corpus. We present an interactive writing environment, WriteAhead, that automatically displays examples grouped by grammar pattern to prompt users as they write. Alternatively, when the user mouses over a word during self-editing, WriteAhead displays the grammar patterns and examples most relevant to the word and its surrounding context to help the user edit. Additionally, we also provide translations of the examples for  Preliminary experiments and evaluation show that WriteAhead with the acquired patterns and examples has the potential to promote better writing and improve writing skills in the long run.

 

Multimedia IR Winston Hsu
This short lecture aims to give participants broad coverage of the foundations and recent developments of content-based and semantic-based image and video retrieval on large-scale image and video collections. We will present a balanced review of the area, covering topics of both practical and theoretical interest. Some live demos will be arranged to aid explanation. We will also include open research issues and emerging opportunities for large-scale multimedia collections, such as those from fast-growing user-contributed photos, egocentric videos, etc. Beyond fundamental low-level features, this tutorial covers the latest developments in local features, coding methods, efficient feature indexing, hash learning, scalable semantic detection, mobile visual search, etc. Furthermore, we will briefly review the recent large improvements in visual analytics brought by deep convolutional networks (DCN), explain DCN’s technical core for image and video analytics, and demonstrate its capabilities for current challenging application needs. Finally, we will relate these techniques to promising applications and open problems arising from large-scale and personal visual data analytics.

 

Search Interface Hideo Joho
Search is a complex collaboration between search engines and searchers. This lecture looks at human factors in Information Retrieval, with a focus on interactive features of search user interfaces. First, it provides background knowledge such as the major challenges faced by search engine users and their behavioral patterns in search. Second, the lecture looks at major components such as the query box and search result presentation, and extends these to query expansion techniques and result diversification methods. Third, the lecture introduces some advanced topics such as multimodal interfaces and collaborative search interfaces. Finally, the lecture discusses future directions of search interfaces. By the end of this lecture, participants should have gained both basic and advanced knowledge of the interaction between search engines and end-users.

 

Search User Behavior Modeling Yiqun Liu
When users interact with information retrieval (IR) systems, they leave rich implicit feedback in the form of clicks, mouse movements, etc. This feedback contains valuable information about users and about IR systems. Analyzing and interpreting user interactions and modeling user search behavior has become an important research topic in Web search studies. It enables us to better understand users, perform user simulations, improve search algorithms and build quality metrics. In the first half of this lecture, I will introduce existing efforts to understand the cognitive behavior of search users, focusing on the characteristics of their querying, click-through and examination behaviors. In the second half of the lecture, I will talk about some recent trends in search behavior modeling. In particular, IR systems are becoming more and more heterogeneous: they deal with information of various media types, structures and semantics; run on multiple devices and support a variety of short- and long-term search tasks; and serve users with different backgrounds and preferences. We will focus on how to construct user behavior models that support effective IR methods in such heterogeneous environments.