In order to understand a natural language sentence, computers need to do the following:
1. Lexical Analysis (part-of-speech tagging) - noun, verb, etc
2. Syntactic Analysis (parsing) - noun phrase, verb phrase, etc
3. Semantic Analysis - use of symbols to denote structure and give partial meaning to the sentence.
Examples
- Entity/relation extraction
- Word sense disambiguation
- Sentiment analysis
4. Inference - infer some extra information based on text understanding
5. Pragmatic Analysis (speech act) - understand the speech act of a sentence
What is ambiguity?
Ambiguity is overloading the same word (or sentence) with different meanings.
Why is natural language processing (NLP) difficult for computers?
NLP is difficult for computers because of several kinds of ambiguity:
1. Word-level ambiguity (e.g. “design” can be a noun or a verb)
2. Syntactic ambiguity (e.g. “A man saw a boy with a telescope.”)
3. Anaphora resolution (e.g. “John persuaded Bill to buy a TV for himself.”, where “himself” could refer to John or Bill)
4. Presupposition (“He has quit smoking.” implies that he smoked before.)
What is bag-of-words representation? Why do modern search engines use this simple representation of text?
The bag-of-words representation keeps every individual word occurrence (including duplicates) but ignores the order of the words. Search engines use this simple representation because a document that contains one or more matches of the query words is usually a relevant match, and the representation is cheap to build and index.
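The idea can be sketched in a few lines of Python (a minimal illustration; the tokenizer here is just lowercasing and whitespace splitting, which real engines refine considerably):

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word.
    Word order is discarded entirely; only counts remain."""
    return Counter(text.lower().split())

doc = "the quick brown fox jumps over the lazy fox"
bow = bag_of_words(doc)
# bow["fox"] is 2, bow["the"] is 2; the original ordering is gone.
```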
What are the two modes of text information access? Which mode does a Web search engine such as Google support?
Two modes of text information access
1. Pull Mode (search engines)
- Users take initiative
- Ad hoc information
2. Push Mode (recommender systems)
- Systems take initiative
- Stable information need or system has good knowledge about a user’s need
A Web search engine such as Google supports the pull mode.
An example of the push mode is a news subscription tailored to the user’s interests.
When is browsing more useful than querying to help a user find relevant information?
Browsing works well when the user wants to explore information on a certain topic, doesn’t know what keywords to use, or can’t conveniently enter a query.
Why is a text retrieval task defined as a ranking task?
A text retrieval task is defined as a ranking task because it involves returning a ranked list of documents relevant to the given query.
What is a retrieval model?
A retrieval model is the formalization of relevance. It is a computational definition of relevance.
What are the two assumptions made by the Probability Ranking Principle?
Probability Ranking Principle is the optimal strategy under the following two assumptions:
1. The utility of a document (to a user) is independent of the utility of any other document
2. A user would browse the results sequentially
What is the Vector Space Retrieval Model? How does it work?
The Vector Space Retrieval Model is a framework that represents documents and queries as vectors in a common vector space.
Each term defines a dimension of a document/query vector, and relevance is decided by the similarity between the query vector and the document vector.
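A minimal sketch of this similarity computation, using the dot product over sparse term-weight dictionaries (the query and document terms below are made up for illustration):

```python
def dot_similarity(query_vec, doc_vec):
    """Score a document by the dot product of the query and document
    vectors, each given as a dict mapping term -> weight.
    Terms absent from the document contribute 0."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

# With 0/1 bit vectors, the dot product is simply the number of
# query terms the document contains.
q = {"news": 1, "presidential": 1, "campaign": 1}
d = {"news": 1, "about": 1, "campaign": 1}
dot_similarity(q, d)  # -> 2.0 (two shared terms: "news", "campaign")
```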

How do we define the dimensions of the Vector Space Model?
We define the dimensions of the Vector Space Model using each word in the vocabulary, which is essentially the bag-of-words (BOW) representation.
What are some different ways to place a document as a vector in the vector space?
We can place a document as a vector in the vector space by some strategies below:
1. Using a 0/1 bit vector representation, where 1 denotes the presence and 0 the absence of a query term in the document.
2. Using Term Frequency (TF), the number of times the query term occurs in the document.
3. Using TF-IDF weighting, which gives more weight to a word that occurs many times in the document but infrequently in the whole collection.
What is Term Frequency (TF)?
Term Frequency is the number of times a query term occurs in a document.
What is TF Transformation?
TF Transformation is the process of transforming raw term counts sublinearly so that very high counts cannot dominate the overall weighting.
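Two common sublinear transformations, sketched below (the logarithmic form and the BM25-style form; `k` is the usual tuning parameter, shown here with a typical default):

```python
import math

def log_tf(tf):
    """Logarithmic TF transformation: grows without bound,
    but with diminishing return as tf increases."""
    return math.log(1 + tf)

def bm25_tf(tf, k=1.2):
    """BM25-style TF transformation: upper-bounded by k + 1,
    so no single term can dominate the score."""
    return (k + 1) * tf / (tf + k)

# Going from tf=1 to tf=2 helps much more than from tf=100 to tf=101.
```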
What is Document Frequency (DF)?
Document frequency is the count of documents that contain a particular term.
What is Inverse Document Frequency (IDF)?
Inverse Document Frequency rewards a term that does not occur in many documents. Typically, the higher a matched term’s IDF, the more that term contributes to the document’s relevance score.
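One common IDF formula is log((M + 1) / df(w)), where M is the collection size; the exact form varies across systems, so treat this as one representative choice:

```python
import math

def idf(df, num_docs):
    """IDF(w) = log((M + 1) / df(w)): rare terms (small df) get
    large weights; a term appearing in every document gets a
    weight near zero."""
    return math.log((num_docs + 1) / df)

idf(1, 1000)     # rare term -> large weight
idf(1000, 1000)  # term in every document -> close to 0
```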
What is TF-IDF Weighting?
TF-IDF weighting gives more weight to a word that occurs many times in the document but infrequently in the whole collection.
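A sketch of TF-IDF scoring in Python. The `doc_freq` mapping and collection size are assumed to be precomputed elsewhere, and IDF uses the log((M + 1) / df) form from above; both are illustrative assumptions rather than a fixed standard:

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_text, doc_freq, num_docs):
    """Score = sum over matched query terms of TF(w, d) * IDF(w),
    using raw counts for TF and log((M + 1) / df(w)) for IDF."""
    tf = Counter(doc_text.lower().split())
    return sum(
        tf[w] * math.log((num_docs + 1) / doc_freq[w])
        for w in query_terms
        if tf[w] and doc_freq.get(w)
    )
```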
Why do we need to penalize long documents in text retrieval?
We should penalize long documents because they naturally have a better chance of matching any query.
What is pivoted document length normalization?
Pivoted document length normalization uses the average document length as a ‘pivot’ (i.e., a reference point) to determine whether to penalize or reward a document: documents longer than the average are penalized, and shorter ones are rewarded.
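The standard normalizer has the form 1 - b + b * |d| / avdl, sketched below (`b` is the usual length-normalization parameter, shown with a typical default):

```python
def pivoted_normalizer(doc_len, avg_doc_len, b=0.75):
    """Pivoted length normalizer: 1 - b + b * |d| / avdl.
    Equals 1 exactly at the pivot (the average length); values > 1
    penalize longer documents, values < 1 reward shorter ones.
    b in [0, 1] controls how strong the normalization is."""
    return 1 - b + b * doc_len / avg_doc_len
```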
What are the main ideas behind the retrieval function BM25?
BM25 uses a sublinear TF transformation that
- captures the intuition of “diminishing return” from higher TF
- avoids dominance by a single term over all others
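The pieces above (sublinear TF, IDF weighting, pivoted length normalization) combine into the BM25 scoring function. The sketch below is a simplified, self-contained version that recomputes statistics from a small in-memory collection; a real system would precompute df, avdl, and the index:

```python
import math
from collections import Counter

def bm25(query, doc, docs, k=1.2, b=0.75):
    """Simplified BM25 score of one document (a list of tokens)
    for a query, against a collection `docs` used for IDF and
    average document length."""
    M = len(docs)
    avdl = sum(len(d) for d in docs) / M
    tf = Counter(doc)
    score = 0.0
    for w in set(query):
        if tf[w] == 0:
            continue
        df = sum(1 for d in docs if w in d)
        idf = math.log((M + 1) / df)
        # Pivoted length normalization folded into the TF saturation.
        norm = 1 - b + b * len(doc) / avdl
        score += idf * (k + 1) * tf[w] / (tf[w] + k * norm)
    return score
```

A document matching the query term twice scores higher than one matching it once, but thanks to the sublinear transformation the gap shrinks as counts grow.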
Key Phrases/Concepts
- Part-of-speech tagging; syntactic analysis; semantic analysis; ambiguity
- “Bag of words” representation
- Push, pull, querying, browsing
- Probability Ranking Principle
- Relevance
- Vector Space Model
- Term Frequency (TF)
- Document Frequency (DF); Inverse Document Frequency (IDF)
- TF Transformation
- Pivoted length normalization
- Dot product
- BM25