Sunday, 12 April 2015

Week 4 Guiding Questions

What are some of the general challenges in building a web search engine?
– Scalability
– Low-quality information and spam
– Dynamics of the Web

What is a crawler?
A crawler fetches pages from the Web and prepares them for the indexer.

How can we implement a simple crawler?
1. Start with a set of “seed pages” in a priority queue
2. Fetch pages from the web
3. Parse fetched pages for hyperlinks; add them to the queue
4. Follow the hyperlinks in the queue (a minimal sketch follows below)
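
A minimal sketch of these steps in Python (my own illustration, not from the notes): it uses a plain FIFO queue rather than a priority queue, assumes the requests and BeautifulSoup libraries are available, and omits robots.txt handling.

import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # seed pages go into the frontier first
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=5)   # fetch the page from the web
        except requests.RequestException:
            continue                                  # skip pages that fail to fetch
        pages[url] = response.text                    # hand the raw page to the indexer
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):    # parse for hyperlinks
            frontier.append(urljoin(url, link["href"]))   # add them to the queue
        time.sleep(1)                                 # politeness delay
    return pages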

What is focused crawling?
Targeted crawling: for example, crawl only pages about automobiles.

What is incremental crawling?
Incremental (or repeated) crawling re-crawls only new or updated pages, to minimize resource overhead.

What kind of pages should have a higher priority for recrawling in incremental crawling?
1. frequently updated pages 
2. frequently accessed pages

What can we do if the inverted index doesn’t fit in any single machine?
Store data on multiple machines. Process data in parallel. 

What’s the basic idea of Google File System (GFS)?
The basic idea is a distributed file system: very large files are split into chunks that are replicated and stored across many machines, while a master node keeps track of where the chunks are, so data can be stored and accessed at a scale no single machine could handle.

How does MapReduce work?
The input is split into parts, and each part is fed to a 'map' function that emits intermediate key-value pairs. The framework then collects and sorts these pairs by key, and all the values sharing a key are fed into a 'reduce' function, which generates the final key-value pairs.

What are the two key functions that a programmer needs to implement when programming with a MapReduce framework?
Map and Reduce

How can we use MapReduce to build an inverted index in parallel?
1. Map: parse each document and, for every word in it, emit a key-value pair whose key is the word and whose value holds the document ID and the word's count in that document.
2. Reduce: for each word, concatenate (and sort) all of its (doc ID, count) pairs into a single inverted-index entry, i.e. its postings list (see the sketch below).
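
A rough Python sketch of the two functions (a toy, single-machine stand-in for a real MapReduce framework; the function names below are illustrative, not part of any framework API):

from collections import defaultdict, Counter

def map_fn(doc_id, text):
    # emit one (word, (doc_id, count)) pair per distinct word in the document
    for word, count in Counter(text.lower().split()).items():
        yield word, (doc_id, count)

def reduce_fn(word, postings):
    # concatenate all (doc_id, count) pairs for this word into one postings list
    return word, sorted(postings)

def build_inverted_index(docs):
    grouped = defaultdict(list)                    # "shuffle": group pairs by key (word)
    for doc_id, text in docs.items():
        for word, posting in map_fn(doc_id, text):
            grouped[word].append(posting)
    return dict(reduce_fn(w, p) for w, p in grouped.items())   # one reduce call per word

index = build_inverted_index({1: "text retrieval and text mining", 2: "web search and retrieval"})
# index["retrieval"] -> [(1, 1), (2, 1)]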

What is anchor text? Why is it useful for improving search accuracy?
Anchor text is the clickable text that describes a link to a page. It provides good evidence for ranking the page being pointed to: if the query matches the anchor text, the linked page is likely to be relevant.

What is a hub page? What is an authority page?
A hub page is a directory-like page that points to many other related pages. An authority page is a page that many good hubs point to; its authority score indicates how likely the page is to be an authoritative source on the topic.

What kind of web pages tend to receive high scores from PageRank?
Pages with many in-links receive high scores, especially when those in-links come from pages that themselves have high PageRank scores.

How can we interpret PageRank from the perspective of a random surfer “walking” on the web?
The PageRank of page di is the average (long-run) probability that a random surfer visits page di.

How exactly do you compute PageRank scores?
By modeling the random surfer: build a transition matrix M where Mij is the probability of going from di to dj by following a link, and assume that with a small probability the surfer instead jumps to a random page. The PageRank scores are the stationary probabilities of this random walk, obtained by iterating p(dj) = sum over i of [ lambda/N + (1 − lambda) · Mij ] · p(di) until the scores converge, where N is the number of pages and lambda is the random-jump probability.
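
A small power-iteration sketch in Python (the graph format and the 0.85 damping value are common conventions, not taken from the notes):

def pagerank(out_links, damping=0.85, iterations=50):
    pages = list(out_links)
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}                        # start uniform
    for _ in range(iterations):
        new_scores = {p: (1.0 - damping) / n for p in pages}    # random-jump part
        for p, links in out_links.items():
            if links:
                share = damping * scores[p] / len(links)        # follow a random out-link
                for q in links:
                    new_scores[q] += share
            else:
                for q in pages:                                 # dangling page: jump anywhere
                    new_scores[q] += damping * scores[p] / n
        scores = new_scores
    return scores

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))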

How does the HITS algorithm work?
HITS computes two scores for every page in a query-specific subgraph: a hub score h(d) and an authority score a(d). The two scores reinforce each other iteratively: a page's authority score is the sum of the hub scores of the pages linking to it, and a page's hub score is the sum of the authority scores of the pages it links to. The scores are normalized after every iteration and the process is repeated until convergence (in matrix form, a = A^T · h and h = A · a, where A is the adjacency matrix of the subgraph).

What’s the basic idea of learning to rank?
The basic idea is to learn a ranking function that combines many features (e.g., BM25 score, PageRank, anchor-text match) by fitting the function to training data that already has relevance judgments.

How can logistic regression be used to combine multiple features for improving ranking accuracy of a search engine?
Model p(R=1|Q,D) as a logistic function of a weighted sum of features Xi(Q,D). First estimate the Betas by maximizing the likelihood of the training data. Once the Betas are known, we plug in the Xi(Q,D) values computed for a new query and a new document to generate a score for D w.r.t. Q.
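
A hedged sketch of the scoring step in Python; the feature list and Beta values below are invented for illustration, not learned from real data:

import math

betas = [-1.0, 2.5, 0.8, 1.2]        # beta0 (intercept) plus one beta per feature

def relevance_score(features):
    # features = [X1(Q,D), X2(Q,D), X3(Q,D)], e.g. BM25 score, PageRank, anchor-text match
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], features))
    return 1.0 / (1.0 + math.exp(-z))    # logistic function gives p(R=1|Q,D)

print(relevance_score([1.8, 0.3, 1.0]))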

What is content-based information filtering?
– Look at what items U likes, and then check if X is similar
       Item similarity => content-based filtering

How can we use a linear utility function to evaluate a filtering system?
Example with coefficients (3, −2): the system gains 3 points for every good (relevant) item it delivers and loses 2 points for every bad (non-relevant) item it delivers; items that are not delivered contribute nothing. The goal is therefore to deliver as many good articles as possible while minimizing the delivery of bad articles.
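
As a worked (made-up) example with these coefficients: U = 3 · (# good items delivered) − 2 · (# bad items delivered), so delivering 5 good items and 2 bad items gives U = 3·5 − 2·2 = 11, while delivering 2 good items and 5 bad items gives U = 3·2 − 2·5 = −4.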

How should we set the coefficients in such a linear utility function?
They need to be set depending on the specific application (e.g., 3 and −2).

How can we extend a retrieval system to perform content-based information filtering?
• “Reuse” retrieval techniques to score documents
• Use a score threshold for filtering decision
• Learn to improve scoring with traditional feedback
• New approaches to threshold setting and learning

What is exploration-exploitation tradeoff?
Exploitation means serving the user based on what the system has already learned about the user's preferences (not deviating much), while exploration means deliberately delivering some items the system is less certain about in order to learn more. The tradeoff: too much exploitation prevents the system from learning, while too much exploration hurts short-term utility.

How does the beta-gamma threshold learning algorithm work?
Roughly: it sets the score threshold by interpolating between the zero-utility threshold (the score at which the expected utility becomes zero) and the threshold that is empirically optimal on the training data:
threshold = alpha · theta_zero + (1 − alpha) · theta_optimal, with alpha = beta + (1 − beta) · e^(−N · gamma)
where N is the number of training examples and beta, gamma are parameters in [0, 1]. With little training data, alpha is close to 1 and the threshold stays near the zero-utility point, which is typically lower and lets more documents through (exploration); as more training data arrives, the threshold moves toward the empirically optimal one (exploitation).

What is the basic idea of collaborative filtering?
- Look at who likes X, and then check if U is similar
   User similarity => collaborative filtering

How does the memory-based collaborative filtering algorithm work?
It stores all users' rating information. When we are considering a particular (active) user, we retrieve the users most similar to that user (e.g., by correlating their past ratings) and combine those users' ratings, weighted by similarity, to predict the preference of the active user.
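
A minimal Python sketch of the idea, using a simplified cosine similarity between users' rating vectors (the ratings below are invented toy data):

import math

ratings = {                                    # ratings[user][item] = rating
    "u1": {"A": 5, "B": 3, "C": 4},
    "u2": {"A": 4, "B": 3, "C": 5},
    "u3": {"A": 1, "B": 5},
}

def cosine_sim(r1, r2):
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    norm1 = math.sqrt(sum(v * v for v in r1.values()))
    norm2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (norm1 * norm2)

def predict(active_user, item):
    # weighted average of other users' ratings of the item, weighted by similarity
    num = den = 0.0
    for other, r in ratings.items():
        if other == active_user or item not in r:
            continue
        sim = cosine_sim(ratings[active_user], r)
        num += sim * r[item]
        den += abs(sim)
    return num / den if den else None

print(predict("u3", "C"))                      # predict u3's preference for item C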

What is the “cold start” problem in collaborative filtering?
The “cold start” problem happens when there is not a sufficiently large number of user preferences (ratings) available, e.g., for a new user or a new item, so no reliable prediction can be made.

Key Phrases/Concepts
- Scalability, efficiency
- Spam
- Crawler, focused crawling, incremental crawling
- Google File System (GFS)
- MapReduce
- Link analysis, anchor text
- PageRank, HITS
- Learning to rank, features, logistic regression
- Content-based filtering
- Collaborative filtering
- Beta-gamma threshold learning
- Linear utility
- User profile
- Exploration-exploitation tradeoff
- Memory-based collaborative filtering
- Cold start

Friday, 3 April 2015

Week 3 Guiding Questions

Given a table of relevance judgments in the form of three columns (query, document, and binary relevance judgments), how can we estimate p(R=1|q,d)?
We can estimate the probability as the number of rows where the pair (q, d) appears with R=1 divided by the total number of rows containing the pair (q, d).
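
For example (made-up numbers): if the pair (q1, d1) appears in five rows of the table and four of those rows have R=1, then p(R=1|q1,d1) = 4/5 = 0.8.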

How should we interpret the query likelihood conditional probability p(q|d)?
It is interpreted as probability that a user who likes d would pose query q.

What is a Statistical Language Model?
- A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
- Context-dependent
- Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model

What is a Unigram Language Model?
A Unigram Language Model generates text by generating each word INDEPENDENTLY, so the probability of any word sequence is simply the product of the probabilities of its words.

How many parameters are there in a unigram language model?
The number of parameters equals the vocabulary size: one probability per word, and they must sum to one.
Parameters: {p(wi )} p(w1 )+…+p(wN )=1 (N is voc. size)

How do we compute the maximum likelihood estimate of the Unigram Language Model (based on a text sample)?
The maximum likelihood estimate of p(w|d) is the count of word w in the document divided by the total number of words in the document (the document length), i.e. p(w|d) = c(w,d)/|d|.
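
A tiny illustration in Python (toy document):

from collections import Counter

doc = "text mining is text retrieval plus text analysis".split()
p_ml = {w: c / len(doc) for w, c in Counter(doc).items()}   # p(w|d) = c(w,d) / |d|
print(p_ml["text"])                                         # 3/8 = 0.375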

What is a background language model?
It represents the general frequency of words in the language (e.g., English) as a whole.

What is a collection language model?
It represents the frequency of words in a chosen collection (e.g., computer science papers).

What is a document language model?
It represents the frequency of words in a chosen document (e.g., a text mining paper).

Why do we need to smooth a document language model in the query likelihood retrieval model?
In order to assign a non-zero probability to words that have not been observed in the document, we have to take away some probability mass from the words that are observed in the document and give it to the unseen words.

Smoothing a language model is an attempt to try to recover the model for the whole article. 

What would happen if we don’t do smoothing?
If there is no smoothing, words that did not occur in the document get zero probability, which means there is no chance of generating any query containing a word that is not in the document; the query likelihood for such queries becomes zero.

When we smooth a document language model using a collection language model as a reference language model, what is the probability assigned to an unseen word in a document?
The probability assigned to an unseen word in a document will be proportional to the probability of the word in the collection. 

How can we prove that the query likelihood retrieval function implements TF-IDF weighting if we use a collection language model smoothing?
Sketch: write the smoothed document model as p(w|d) = p_seen(w|d) for words seen in d, and p(w|d) = a_d · p(w|C) for unseen words. The log query likelihood can then be rewritten as
log p(q|d) = sum over matched query words w of c(w,q) · log[ p_seen(w|d) / (a_d · p(w|C)) ] + |q| · log a_d + sum over all query words w of c(w,q) · log p(w|C)
The last term does not depend on the document, so it can be ignored for ranking. In the first term, p_seen(w|d) grows with the count of w in d (TF weighting), and the division by p(w|C) penalizes words that are common in the collection (an IDF-like effect). The middle term |q| · log a_d behaves like document length normalization, since a_d is smaller for longer documents.

How does linear interpolation (Jelinek-Mercer) smoothing work? What is the formula?
The idea of Jelinek-Mercer smoothing is to linearly interpolate the maximum likelihood estimate with the collection language model, so that unobserved words no longer have zero probability. The interpolation is controlled by a "fixed" smoothing parameter.

Jelinek-Mercer Smoothing formula
p(w|d) = (1 − lambda) · c(w,d)/|d| + lambda · p(w|C), with 0 ≤ lambda ≤ 1 (lambda is the fixed smoothing parameter)

How does Dirichlet Prior smoothing work? What is the formula?
The idea of Dirichlet Prior (Bayesian) smoothing is likewise to interpolate the maximum likelihood estimate with the collection language model so that unobserved words do not get zero probability, but the interpolation is controlled by a "dynamic" smoothing parameter that depends on document length: long documents get smaller coefficients on the collection model, i.e. less smoothing.

Dirichlet Prior (Bayesian) Smoothing formula
p(w|d) = ( c(w,d) + mu · p(w|C) ) / ( |d| + mu ), with mu > 0; equivalently a linear interpolation whose collection-model coefficient is mu/(|d| + mu)

What are the similarity and difference between Jelinek-Mercer smoothing and Dirichlet Prior smoothing?
Similarity: both use the maximum likelihood estimate, rely on the collection language model, and take a linear interpolation of the two. Difference: Jelinek-Mercer uses a "fixed" coefficient (lambda) as the smoothing parameter, while Dirichlet Prior (Bayesian) uses a "dynamic" coefficient (mu/(|d| + mu)) that depends on document length.
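
A small Python sketch contrasting the two formulas (the counts and parameter values are arbitrary examples):

def jelinek_mercer(count_w_d, doc_len, p_w_C, lam=0.5):
    # fixed coefficient lambda, independent of document length
    return (1 - lam) * count_w_d / doc_len + lam * p_w_C

def dirichlet_prior(count_w_d, doc_len, p_w_C, mu=2000):
    # coefficient mu / (|d| + mu) shrinks as the document gets longer -> less smoothing
    return (count_w_d + mu * p_w_C) / (doc_len + mu)

# same relative word statistics, short vs. long document
print(jelinek_mercer(3, 100, 0.001), dirichlet_prior(3, 100, 0.001))
print(jelinek_mercer(300, 10000, 0.001), dirichlet_prior(300, 10000, 0.001))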

What is relevance feedback?
Users make explicit relevance judgments on the initial results
(judgments are reliable, but users don’t want to make extra effort)

What is pseudo-relevance feedback?
Pseudo/Blind/Automatic Feedback
Top-k initial results are simply assumed to be relevant 
(judgments aren’t reliable, but no user activity is required)
Eg. Top 10 results will be assumed relevant.

What is implicit feedback?
User-clicked docs are assumed to be relevant; skipped ones non-relevant 
(judgments aren’t completely reliable, but no extra effort from users)

How does Rocchio work?
Rocchio is the feedback method for the Vector Space Model. It moves the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant documents.
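
The standard form of the Rocchio update (alpha, beta, gamma are tunable weights; Dr and Dn are the sets of relevant and non-relevant feedback documents):

q_new = alpha · q_original + (beta / |Dr|) · (sum of vectors in Dr) − (gamma / |Dn|) · (sum of vectors in Dn)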

Why do we need to ensure that the original query terms have sufficiently large weights in feedback?
To avoid "overfitting": the feedback sample is relatively small, so we should not trust it too much, and the original query terms remain the most direct expression of what the user wants, since the user typed them in.

What is the KL-divergence retrieval function?
The Kullback-Leibler (KL) Divergence Retrieval Model generalizes the query likelihood function by replacing the raw query word frequencies with a query language model. It measures the divergence between two distributions: the query language model and the document language model smoothed with the collection language model.

How is it related to the query likelihood retrieval function?
The KL-divergence retrieval model is a generalization of query likelihood: when the query language model is simply the empirical distribution of the query words, KL-divergence ranking is essentially identical to query likelihood ranking.

What is the basic idea of the two-component mixture model for feedback?
The basic idea is to model the feedback documents as a mixture of two components: a topic (feedback) language model and the background/collection language model. The background component "explains away" the very common words, so the learned topic model concentrates on discriminative, topic-specific words.

Key Phrases/Concepts
- p(R=1|q,d) ; query likelihood, p(q|d)
- Statistical Language Model; Unigram Language Model
- Maximum likelihood estimate
- Background language model, collection language model, document language model
- Smoothing of Unigram Language Models
- Relation between query likelihood and TF-IDF weighting
- Linear interpolation (i.e., Jelinek-Mercer) smoothing
- Dirichlet Prior smoothing
- Relevance feedback, pseudo-relevance feedback, implicit feedback
- Rocchio
- Kullback-Leibler divergence (KL-divergence) retrieval function
- Mixture language model

Monday, 30 March 2015

Week 2 Guiding Questions

What is the typical architecture of a text retrieval system?
A typical architecture of a text retrieval system includes
              - tokenizer -> indexer (offline)
              - tokenizer -> scorer (online)
              - feedback process (offline and online)

What is an inverted index?
An inverted index maps each term in the vocabulary to the list of documents (postings) that contain it. It is the dominant indexing method for supporting basic search algorithms.

Why is it desirable to compress an inverted index?
Inverted index compression is desirable because it improves speed: I/O takes more time than CPU, and a smaller index means less I/O (decompression is cheap by comparison), not to mention the saving in disk space.

How can we create an inverted index when the collection of documents does not fit into the memory?
When the collection does not fit into memory, we use sort-based methods on the disk.  
Sort-based methods:
– Step 1: Collect local (termID, docID, freq) tuples
– Step 2: Sort local tuples (to make “runs”)
– Step 3: Pair-wise merge runs
– Step 4: Output inverted file

How can we leverage an inverted index to score documents quickly?
– Exploit the inverted index to accumulate scores for documents matching a query term
– Exploit Zipf's law to avoid touching the many documents that do not match any query term (see the sketch below)
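
A term-at-a-time scoring sketch in Python, assuming the inverted index maps each term to a postings list of (doc_id, term frequency) pairs; the TF·IDF accumulation here is only a stand-in for whatever weighting the retrieval function actually uses:

from collections import defaultdict

def score_query(query_terms, inverted_index, idf):
    scores = defaultdict(float)                      # only matching documents get touched
    for term in query_terms:
        for doc_id, tf in inverted_index.get(term, []):
            scores[doc_id] += tf * idf.get(term, 0.0)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

index = {"web": [(1, 2), (3, 1)], "search": [(1, 1), (2, 4)]}
idf = {"web": 1.5, "search": 0.8}
print(score_query(["web", "search"], index, idf))    # doc 1 ranks first, then doc 2, then doc 3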

Why is evaluation so critical for research and application development in text retrieval?
Reason 1: Assess the actual utility of a TR system
Reason 2: Compare different systems and methods

How does Cranfield evaluation methodology work?
The Cranfield evaluation methodology works by building reusable test collections and defining measures.

How do we evaluate a set of retrieved documents?
Precision and Recall are the basic measures for evaluating a set of retrieved documents.

Precision: to what extent are the retrieved results all relevant?
Recall: have all the relevant documents in the collection been retrieved (completeness of coverage)?

How do you compute precision, recall, and F1?
Below are the formulas to compute precision, recall and F1:

Precision = (# relevant documents retrieved) / (# documents retrieved)
Recall = (# relevant documents retrieved) / (# relevant documents in the collection)
F1 = 2 · Precision · Recall / (Precision + Recall)

How do we evaluate a ranked list of search results?
We evaluate a ranked list by looking at precision and recall at different cutoffs (wherever the user might stop). We can also compare precision-recall (PR) curves, which show the trade-off a TR system achieves relative to the systems it is compared with. To summarize a ranked list with a single number we use average precision, which applies to one query (in contrast to mean average precision) and is sensitive to the rank of every relevant document, which makes it a good summary measure.

How do you compute average precision? 
We compute average precision by summing the precision values at the ranks where relevant documents appear (counting 0 for relevant documents that are never retrieved) and dividing by the total number of relevant documents in the collection.
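
A made-up example: suppose the collection contains 3 relevant documents and the ranked list places them at ranks 1, 3 and 6. The precisions at those ranks are 1/1, 2/3 and 3/6, so AP = (1 + 2/3 + 0.5) / 3 ≈ 0.72. If the third relevant document were never retrieved, it would contribute 0 and AP would drop to (1 + 2/3) / 3 ≈ 0.56.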

How do you compute mean average precision (MAP) and geometric mean average precision (gMAP)?
MAP is computed by getting the arithmetic mean of average precision over a set of queries
gMAP is computed by getting the geometric mean of average precision over a set of queries

gMAP is affected more by low values (in contrast to MAP), i.e., by the queries that do not perform well. If the goal is to improve the search engine on difficult queries, gMAP is preferred; the choice between MAP and gMAP depends on the application.

What is mean reciprocal rank?
Mean reciprocal rank is the average reciprocal rank over a set of topics. Reciprocal rank = 1/r
where r is the rank position of the single relevant doc.

Why is MAP more appropriate than precision at k documents when comparing two retrieval methods?
MAP is more appropriate because it combines precision and recall and is sensitive to the rank of every relevant document, whereas precision at k documents ignores how the relevant documents are ranked within and beyond the cutoff.

Why is precision at k documents more meaningful than average precision from a user’s perspective?
Precision at k documents is more meaningful from a user's perspective because it directly reflects what the user sees: the fraction of relevant documents among the top k results actually examined. Average precision depends on the total number of relevant documents in the collection, which the user never sees, so its value can be harder to interpret.

How can we evaluate a ranked list of search results using multi-level relevance judgments?
We can evaluate a ranked list of search results using multi-level relevance judgments by using nDCG.
Main idea of nDCG @k documents
– Measure the total utility of the top k documents (the cutoff) to a user
– Utility of a lowly ranked document is discounted
– Normalized to ensure comparability across queries 

How do you compute normalized discounted cumulative gain (nDCG)?
You compute nDCG@k as DCG@k divided by IdealDCG@k, i.e. the DCG of the ideal ranking of the top k documents.
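
A small Python sketch of the computation (the gain values are illustrative multi-level relevance grades):

import math

def dcg(gains):
    # rank 1 is not discounted; rank i (i >= 2) is discounted by log2(i)
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

def ndcg_at_k(ranked_gains, all_judged_gains, k):
    ideal = sorted(all_judged_gains, reverse=True)[:k]   # best possible ordering
    return dcg(ranked_gains[:k]) / dcg(ideal)

# grades 0-2 for the top 5 results, out of judged grades [2, 2, 1, 1, 0, 0]
print(ndcg_at_k([2, 1, 0, 2, 1], [2, 2, 1, 1, 0, 0], k=5))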

Why is normalization necessary in nDCG? Does MAP need a similar normalization?
Normalization is necessary to ensure comparability across queries: different queries have different numbers of highly relevant documents, so raw DCG values are not directly comparable. MAP does not need a similar extra normalization because average precision is already divided by the total number of relevant documents and lies between 0 and 1.

Why is it important to perform a statistical significance test when we compare the retrieval accuracies of two search engine systems?
We need a statistical significance test to check whether the observed difference between two systems could have happened by chance. It looks at the variance of performance across queries (e.g., the distribution of average precision or precision at k) rather than only the averages, because averages alone may be misleading.
   
Key Phrases/Concepts
- Inverted index; postings
- Binary coding; unary coding; gamma-coding; d-gap
- Zipf’s law
- Cranfield evaluation methodology
- Precision; recall
- Average precision; mean average precision (MAP); geometric mean average precision (gMAP)
- Reciprocal rank; mean reciprocal rank
- F-measure
- Normalized discounted cumulative gain (nDCG)
- Statistical significance test

Saturday, 21 March 2015

Week 1 Guiding Questions

What does a computer have to do in order to understand a natural language sentence?
          In order to understand a natural language sentence, computers need to do the following:
                 1. Lexical Analysis (part-of-speech tagging) - noun, verb, etc
                 2. Syntactic Analysis (parsing) - noun phrase, verb phrase, etc 
                 3. Semantic Analysis - use of symbols to denote structure and give partial meaning to the sentence.
                      Examples
                         - Entity/relation extraction 
                         - Word sense disambiguation 
                         - Sentiment analysis 
                 4. Inference - infer some extra information based on text understanding  
                 5. Pragmatic Analysis (speech act) - understand the speech act of a sentence  

What is ambiguity?
                 Ambiguity means that the same word or sentence can carry different meanings (the meaning is overloaded).
                 
Why is natural language processing (NLP) difficult for computers?
                 NLP is difficult for computers due to ambiguities
                    1. Word-level ambiguity (eg. “design” can be a noun or a verb)
                    2. Syntactic ambiguity (eg “A man saw a boy with a telescope.”) 
                    3. Anaphora resolution (eg. “John persuaded Bill to buy a TV for himself.”)
                    4. Presupposition (“He has quit smoking.” implies that he smoked before.)

What is bag-of-words representation? Why do modern search engines use this simple representation of text?
                Bag-of-words representation keeps all the individual (possibly duplicated) words but ignores their order. Modern search engines use bag-of-words because it is simple, efficient, and works well enough in practice: queries are usually short, and matching the query words in a document is usually good evidence of relevance, while deeper NLP is expensive and error-prone at Web scale.

What are the two modes of text information access? Which mode does a Web search engine such as Google support?
                Two modes of text information access

                1. Pull Mode (search engines)

                           - Users take initiative 
                           - Ad hoc information            

                2. Push Mode (recommender systems)

                           - Systems take initiative 
                           - Stable information need or system has good knowledge about a user’s need 

                Web search engine such as Google uses the Pull Mode.

                Push Mode example is news subscription depending on interest.

When is browsing more useful than querying to help a user find relevant information?
                Browsing works well when the user wants to explore information in a certain topic, doesn’t know what keywords to use, or can’t conveniently enter a query

Why is a text retrieval task defined as a ranking task?
                 A text retrieval task is defined as a ranking task because it involves returning a ranked list of documents relevant to the given query.

What is a retrieval model?
                 A retrieval model is the formalization of relevance. It is a computational definition of relevance.  

What are the two assumptions made by the Probability Ranking Principle?
Probability Ranking Principle is the optimal strategy under the following two assumptions:
      1. The utility of a document (to a user) is independent of the utility of any other document
      2. A user would browse the results sequentially

What is the Vector Space Retrieval Model? How does it work?
Vector Space Retrieval Model is a framework which includes representation of a set of documents as vectors in a common vector space. 

It works by representing both queries and documents as term vectors; relevance is then measured by the similarity (e.g., the dot product) between the query vector and the document vector.



How do we define the dimensions of the Vector Space Model?
We define the dimensions of the Vector Space Model using each word in our vocabulary, which is basically the Bag of Words (BOW) representation.

What are some different ways to place a document as a vector in the vector space?
We can place a document as a vector in the vector space by some strategies below: 
       1. Using a 0/1 Bit Vector representation where 1 denotes query term presence and 0 denotes query term absence in the document
       2. Using Term Frequency where it denotes the number of times the query term occurs in a document. 
       3. Using TF-IDF weighting, which gives more weight to a word that occurs many times in the document but infrequently in the whole collection.

What is Term Frequency (TF)?
Term Frequency is the number of times a query term occurs in a document. 

What is TF Transformation?
TF Transformation applies a sublinear function to the raw term count so that terms with very high counts do not dominate the overall weighting (diminishing returns from repeated occurrences).

What is Document Frequency (DF)?
Document frequency is the count of documents that contain a particular term. 


What is Inverse Document Frequency (IDF)?
Inverse Document Frequency rewards terms that occur in few documents; a common form is IDF(w) = log((M+1)/k), where M is the total number of documents and k is the document frequency of w. The higher a term's IDF, the more discriminative that term is for ranking.

What is TF-IDF Weighting?
TF-IDF weighting is giving more preference to a word which occurred many times in the document but infrequently in the whole collection.

Why do we need to penalize long documents in text retrieval?
We should penalize long documents because they naturally have better chances to match any query.

What is pivoted document length normalization?
Pivoted document length normalization is using the average document length as a 'pivot'(ie. reference point) to determine whether to penalize or reward a long document.  

What are the main ideas behind the retrieval function BM25?
BM25 uses a sublinear TF transformation (with an upper bound), combined with IDF weighting and document length normalization. The TF transformation
- captures the intuition of "diminishing return" from higher TF
- avoids dominance by one single term over all others (see the sketch below)
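
A sketch of the core BM25 term weight in Python (k1 = 1.2 and b = 0.75 are conventional defaults, and the IDF form below is one common variant; neither is prescribed by the notes):

import math

def bm25_term_weight(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    norm = 1 - b + b * doc_len / avg_doc_len            # pivoted length normalization
    tf_part = (k1 + 1) * tf / (tf + k1 * norm)          # sublinear, upper-bounded TF
    idf_part = math.log((n_docs + 1) / df)              # rarer terms weigh more
    return tf_part * idf_part

# a document's score is the sum of this weight over all matched query terms
print(bm25_term_weight(tf=3, doc_len=120, avg_doc_len=100, df=50, n_docs=100000))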

Key Phrases/Concepts
- Part-of-speech tagging; syntactic analysis; semantic analysis; ambiguity
-“Bag of words” representation
- Push, pull, querying, browsing
- Probability Ranking Principle
- Relevance
- Vector Space Model
- Term Frequency (TF)
- Document Frequency (DF); Inverse Document Frequency (IDF)
- TF Transformation
- Pivoted length normalization
- Dot product
- BM25


Friday, 20 March 2015

Overview

These are some personal answers to the guiding questions presented under each weekly topic of Coursera's Text Retrieval and Search Engines course. You may use this blog as one of your reviewers in preparation for the quizzes.

Coursera and other terms may be registered trademarks or trademarks of their respective owners. The contents of this blog were produced independently and have not been authorised, sponsored, or otherwise approved by Coursera or any of its affiliated educational institutions.

Thanks to ChengXiang Zhai for sharing his expert knowledge on the topic.