We can estimate the probability by dividing the number of times the (query, document) pair was judged relevant (R=1) by the total number of judgments for that pair: p(R=1|q,d) ≈ count(q,d,R=1) / count(q,d).
How should we interpret the query likelihood conditional probability p(q|d)?
It is interpreted as the probability that a user who likes d would pose query q.
- A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
- Context-dependent
- Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
What is a Unigram Language Model?
A unigram language model generates text by generating each word INDEPENDENTLY; the probability of any word sequence is then simply the product of the probabilities of its words.
How many parameters are there in a unigram language model?
The number of parameters equals the vocabulary size N: one probability per word.
Parameters: {p(wi)}, with p(w1) + … + p(wN) = 1 (N is the vocabulary size)
How do we compute the maximum likelihood estimate of the Unigram Language Model (based on a text sample)?
The maximum likelihood estimate is the count of the word in the document divided by the total number of words in the document (the document length): p(w|d) = c(w,d) / |d|.
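A minimal sketch in Python of the estimate p(w|d) = c(w,d)/|d|, using a toy document:

```python
from collections import Counter

def mle_unigram(doc_words):
    """Maximum likelihood estimate: p(w|d) = c(w, d) / |d|."""
    counts = Counter(doc_words)
    total = len(doc_words)
    return {w: c / total for w, c in counts.items()}

doc = "text mining is text analysis".split()
p = mle_unigram(doc)
# p["text"] is 2/5; the probabilities sum to 1 over the document's vocabulary
```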
What is a background language model?
It represents the frequency of words in general usage (e.g., in English as a whole).
What is a collection language model?
It represents frequency of words in a chosen collection (eg. computer science papers).
What is a document language model?
It represents the frequency of words in a single document (e.g., a text mining paper).
Why do we need to smooth a document language model in the query likelihood retrieval model?
In order to assign a non-zero probability to words that have not been observed in the document, we take some probability mass away from the words that are observed in the document and redistribute it to the unseen words.
Smoothing a language model is an attempt to recover the model of the whole "article" (the larger body of text the document is a sample of), not just the model of the observed text.
What would happen if we don’t do smoothing?
Without smoothing, any word that does not occur in the document has zero probability, so a query containing even one such word would receive a query likelihood score of zero, and the document could never be retrieved for that query.
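A small sketch of this failure mode, using an unsmoothed maximum likelihood document model over toy data:

```python
from collections import Counter

def query_likelihood(query_words, doc_words):
    """Unsmoothed query likelihood: product of MLE probabilities p(w|d)."""
    counts = Counter(doc_words)
    total = len(doc_words)
    prob = 1.0
    for w in query_words:
        prob *= counts[w] / total  # counts[w] is 0 for any word absent from the doc
    return prob

doc = "text mining and text retrieval".split()
ok = query_likelihood(["text", "mining"], doc)        # 2/5 * 1/5 = 0.08
zero = query_likelihood(["text", "clustering"], doc)  # 0.0: "clustering" is unseen
```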
When we smooth a document language model using a collection language model as a reference language model, what is the probability assigned to an unseen word in a document?
The probability assigned to an unseen word in a document will be proportional to the probability of the word in the collection.
How can we prove that the query likelihood retrieval function implements TF-IDF weighting if we use a collection language model smoothing?
Rewriting the log query likelihood under smoothing shows that the score is a sum, over the query words that occur in the document, of log(1 + (1 − λ)·p_ml(w|d) / (λ·p(w|C))) plus a document-independent constant. Each matched-word term grows with c(w,d) (a TF effect) and shrinks as p(w|C) grows (an IDF effect: matching a common word counts less than matching a rare one), so the ranking behaves like TF-IDF weighting.
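A sketch of the standard decomposition, assuming Jelinek-Mercer smoothing with parameter λ (words unseen in d have p(w|d) = λ·p(w|C), which lets every query word be factored through λ·p(w|C)):

```latex
\log p(q|d)
  = \sum_{\substack{w \in q \\ c(w,d) > 0}} c(w,q)\,
      \log\!\left(1 + \frac{(1-\lambda)\, p_{ml}(w|d)}{\lambda\, p(w|C)}\right)
  \;+\; \sum_{w \in q} c(w,q)\, \log\bigl(\lambda\, p(w|C)\bigr)
```

The second sum does not depend on the document, so only the first sum matters for ranking: p_ml(w|d) = c(w,d)/|d| supplies the TF part, dividing by p(w|C) supplies the IDF part, and the log gives TF sublinearity.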
How does linear interpolation (Jelinek-Mercer) smoothing work? What is the formula?
The idea of Jelinek-Mercer smoothing is to linearly interpolate the maximum likelihood estimate with the collection language model, so that unobserved words no longer get zero probability. The interpolation is controlled by a fixed smoothing parameter λ between 0 and 1.
Jelinek-Mercer smoothing formula: p(w|d) = (1 − λ)·p_ml(w|d) + λ·p(w|C)
How does Dirichlet Prior smoothing work? What is the formula?
The idea of Dirichlet prior (Bayesian) smoothing is the same interpolation between the maximum likelihood estimate and the collection language model, but the smoothing parameter is dynamic: the interpolation coefficient depends on the document length, so long documents get smaller coefficients and therefore less smoothing.
Dirichlet prior (Bayesian) smoothing formula: p(w|d) = (c(w,d) + μ·p(w|C)) / (|d| + μ), which is linear interpolation with coefficient λ = μ/(|d| + μ).
What are the similarity and difference between Jelinek-Mercer smoothing and Dirichlet Prior smoothing?
The similarity is that both interpolate the maximum likelihood estimate with the collection language model (a linear interpolation of the two). The difference is that Jelinek-Mercer uses a fixed coefficient λ, while Dirichlet prior (Bayesian) smoothing uses a dynamic coefficient μ/(|d| + μ) that depends on the document length.
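The two formulas can be sketched side by side; the demo also checks that Dirichlet smoothing is exactly linear interpolation with coefficient μ/(|d| + μ). The collection probabilities here are made up:

```python
from collections import Counter

def jm_smooth(w, doc_counts, doc_len, p_coll, lam=0.5):
    """Jelinek-Mercer: p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|C)."""
    return (1 - lam) * doc_counts[w] / doc_len + lam * p_coll[w]

def dirichlet_smooth(w, doc_counts, doc_len, p_coll, mu=2000):
    """Dirichlet prior: p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    return (doc_counts[w] + mu * p_coll[w]) / (doc_len + mu)

doc = "text mining and text retrieval".split()
counts, n = Counter(doc), len(doc)
p_coll = {"text": 0.01, "mining": 0.001, "the": 0.05}  # made-up collection stats

# Dirichlet smoothing equals JM smoothing with lam = mu / (|d| + mu):
mu = 2000
lam = mu / (n + mu)
assert abs(dirichlet_smooth("text", counts, n, p_coll, mu)
           - jm_smooth("text", counts, n, p_coll, lam)) < 1e-12
```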
What is relevance feedback?
Users make explicit relevance judgments on the initial results
(judgments are reliable, but users don’t want to make extra effort)
What is pseudo (blind/automatic) feedback?
The top-k initial results are simply assumed to be relevant
(judgments aren’t reliable, but no user activity is required)
E.g., the top 10 results will be assumed relevant.
What is implicit feedback?
User-clicked docs are assumed to be relevant; skipped ones non-relevant
(judgments aren’t completely reliable, but no extra effort from users)
How does Rocchio work?
Rocchio is the feedback method for the vector space model: move the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant documents.
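A minimal pure-Python sketch of that update; the weights α, β, γ below are conventional defaults, an assumption rather than values from the notes:

```python
def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant)."""
    def centroid(docs):
        n = len(docs)
        return [sum(col) / n for col in zip(*docs)] if n else [0.0] * len(q)
    r, nr = centroid(rel_docs), centroid(nonrel_docs)
    new_q = [alpha * qi + beta * ri - gamma * nri
             for qi, ri, nri in zip(q, r, nr)]
    return [max(w, 0.0) for w in new_q]  # negative weights are usually clipped to 0

q = [1.0, 0.0, 0.0]                        # original query vector (toy, 3 terms)
rel = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]   # judged-relevant document vectors
nonrel = [[1.0, 0.0, 1.0]]                 # judged-non-relevant document vector
new_q = rocchio(q, rel, nonrel)
```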
Why do we need to ensure that the original query terms have sufficiently large weights in feedback?
To avoid "overfitting": the feedback sample is relatively small, so we should not trust it too much, and the original query terms remain the most important signal because the user typed them in directly.
What is the KL-divergence retrieval function?
The Kullback-Leibler (KL) divergence retrieval model generalizes the query term frequencies in the query likelihood function into a query language model. It measures the divergence between two distributions: the query model and the document language model (smoothed with the collection language model). With a maximum likelihood query model, the KL-divergence function is essentially identical to the query likelihood retrieval function.
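A sketch of the scoring function; the toy document model below is assumed to be already smoothed (every word has non-zero probability):

```python
import math
from collections import Counter

def kl_score(query_model, doc_model):
    """score = sum_w p(w|theta_q) * log p(w|theta_d).
    This is -D(theta_q || theta_d) up to the query-entropy constant,
    so it ranks documents identically to negative KL divergence."""
    return sum(p_q * math.log(doc_model[w])
               for w, p_q in query_model.items() if p_q > 0)

# With a maximum likelihood query model, the score equals the average
# per-word log query likelihood, so the ranking matches query likelihood:
query = ["text", "mining"]
q_model = {w: c / len(query) for w, c in Counter(query).items()}
doc_model = {"text": 0.3, "mining": 0.1, "the": 0.6}  # toy smoothed doc model
score = kl_score(q_model, doc_model)
loglik = sum(math.log(doc_model[w]) for w in query)   # log p(q|d)
```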
What is the basic idea of the two-component mixture model for feedback?
The basic idea is to assume the feedback documents are generated by a mixture of a topic (feedback) language model and the background language model. The background component explains the very common words, which are usually not topical, so the learned topic model discriminates against them and concentrates on the genuinely relevant words.
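A minimal EM sketch of estimating the topic component of this mixture; the collection probabilities, the noise weight λ, and the toy feedback text are all made up for illustration:

```python
from collections import Counter

def feedback_mixture_em(fb_words, p_coll, lam=0.9, iters=30):
    """EM for the two-component mixture: each word occurrence in the feedback
    documents is generated by lam * p(w|C) + (1 - lam) * p(w|theta_F).
    Returns the estimated topic model theta_F."""
    counts = Counter(fb_words)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}  # uniform initialization
    for _ in range(iters):
        # E-step: probability each occurrence of w came from the topic model
        z = {w: (1 - lam) * theta[w] /
                ((1 - lam) * theta[w] + lam * p_coll[w]) for w in vocab}
        # M-step: re-estimate theta_F from the fractional counts
        norm = sum(counts[w] * z[w] for w in vocab)
        theta = {w: counts[w] * z[w] / norm for w in vocab}
    return theta

# "the" is the most frequent feedback word, but the background explains it,
# so the topic model ends up favoring "mining" and "text" instead:
fb_words = ["the"] * 5 + ["mining"] * 3 + ["text"] * 2
p_coll = {"the": 0.1, "mining": 0.001, "text": 0.002}  # made-up collection stats
theta = feedback_mixture_em(fb_words, p_coll)
```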
Key Phrases/Concepts
- p(R=1|q,d) ; query likelihood, p(q|d)
- Statistical Language Model; Unigram Language Model
- Maximum likelihood estimate
- Background language model, collection language model, document language model
- Smoothing of Unigram Language Models
- Relation between query likelihood and TF-IDF weighting
- Linear interpolation (i.e., Jelinek-Mercer) smoothing
- Dirichlet Prior smoothing
- Relevance feedback, pseudo-relevance feedback, implicit feedback
- Rocchio
- Kullback-Leibler divergence (KL-divergence) retrieval function
- Mixture language model

