– Scalability
– Low-quality information and spam
– Dynamics of the Web
A crawler basically fetches pages and prepares them for the indexer.
How can we implement a simple crawler?
1. Start with a set of “seed pages” in a priority queue
2. Fetch pages from the web
3. Parse fetched pages for hyperlinks; add them to the queue
4. Follow the hyperlinks in the queue
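A minimal Python sketch of these four steps, using only the standard library (the seed URLs, priority scheme, and page limit are placeholders):

```python
import heapq
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    # 1. Start with seed pages in a priority queue (lower number = crawl sooner).
    frontier = [(0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        priority, url = heapq.heappop(frontier)      # 4. follow links from the queue
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")  # 2. fetch
        except Exception:
            continue
        pages[url] = html                            # hand the page to the indexer
        parser = LinkExtractor()
        parser.feed(html)                            # 3. parse for hyperlinks
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                heapq.heappush(frontier, (priority + 1, absolute))  # add to the queue
    return pages
```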
What is focused crawling?
Targeted crawling; for example, crawling only pages about automobiles.
What is incremental crawling?
Incremental (repeated) crawling recrawls only pages that are likely to be new or updated, to minimize resource overhead.
What kind of pages should have a higher priority for recrawling in incremental crawling?
1. frequently updated pages
2. frequently accessed pages
What can we do if the inverted index doesn’t fit in any single machine?
Store data on multiple machines. Process data in parallel.
What’s the basic idea of Google File System (GFS)?
The basic idea is a distributed file system: files are split into large fixed-size chunks, each chunk is replicated on several machines, and a master node keeps track of where the chunks live, so storage scales across many machines and tolerates failures.
How does MapReduce work?
The input is split into pieces and converted into key-value pairs. The 'map' function is applied to each input pair and emits intermediate key-value pairs; the framework then collects and sorts these pairs by key and feeds all values sharing the same key into the 'reduce' function, which produces the final key-value pairs.
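A toy, single-machine simulation of that flow (word count as the classic example; a real framework distributes the map, shuffle/sort, and reduce steps across many machines):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit an intermediate (key, value) pair for every word occurrence.
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    # Combine all values that share the same key into a new pair.
    yield word, sum(counts)

def run_mapreduce(documents, map_fn, reduce_fn):
    # "Shuffle" phase: collect intermediate pairs and group them by key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    # Reduce phase: each key (with all its values) is handled independently.
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            results[out_key] = out_value
    return results

docs = {"d1": "to be or not to be", "d2": "to do is to be"}
print(run_mapreduce(docs, map_fn, reduce_fn))   # {'be': 3, 'do': 1, ...}
```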
What are the two key functions that a programmer needs to implement when programming with a MapReduce framework?
Map and Reduce
How can we use MapReduce to build an inverted index in parallel?
1. Map - for each document, count the term occurrences and emit (word, (doc id, count)) pairs.
2. Reduce - for each word, concatenate all the (doc id, count) pairs it receives into a single entry of the inverted index (its postings list).
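A self-contained sketch of those two functions (the posting format, (doc id, term count), is just one reasonable choice):

```python
from collections import Counter, defaultdict

def index_map(doc_id, text):
    # Map: for each document, emit (word, (doc_id, term_count)) pairs.
    counts = Counter(text.lower().split())
    for word, count in counts.items():
        yield word, (doc_id, count)

def index_reduce(word, postings):
    # Reduce: concatenate all postings for a word into one index entry.
    return word, sorted(postings)

docs = {"d1": "to be or not to be", "d2": "to do is to be"}

# Shuffle/sort: group the intermediate pairs by word.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for word, posting in index_map(doc_id, text):
        grouped[word].append(posting)

inverted_index = dict(index_reduce(w, p) for w, p in grouped.items())
print(inverted_index)   # {'be': [('d1', 2), ('d2', 1)], 'to': [('d1', 2), ('d2', 2)], ...}
```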
What is anchor text? Why is it useful for improving search accuracy?
Anchor text is the text that describes a hyperlink to a page. It is useful because it is essentially a description of the target page written by other authors, so when query terms match the anchor text of links pointing to a page, that is good evidence that the page is relevant.
What is a hub page? What is an authority page?
A hub page is a directory-like page that links to many other useful pages. An authority page is a page with useful content that is pointed to by many hubs. In HITS, each page gets a hub score and an authority score, and the two reinforce each other: good hubs point to good authorities, and good authorities are pointed to by good hubs.
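A minimal HITS-style iteration over a made-up link graph, showing how hub and authority scores reinforce each other:

```python
def hits(links, iterations=20):
    """links: dict mapping each page to the pages it points to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority score sums the hub scores of the pages linking to it.
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        # A page's hub score sums the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores don't blow up.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

graph = {"directory": ["a", "b", "c"], "a": ["b"], "b": [], "c": ["b"]}
hub_scores, auth_scores = hits(graph)   # "directory" gets a high hub score, "b" a high authority score
```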
What kind of web pages tend to receive high scores from PageRank?
Pages with many in-links tend to receive high scores, especially when those in-links come from pages that themselves have high PageRank.
How can we interpret PageRank from the perspective of a random surfer “walking” on the web?
The PageRank of page di is the long-run (average) probability that a random surfer, who mostly follows links on the current page and occasionally jumps to a random page, visits page di.
How exactly do you compute PageRank scores?
Iteratively (power iteration): start with equal scores for all N pages, then repeatedly update each page's score as p(dj) = alpha * (1/N) + (1 - alpha) * sum over i of M(di, dj) * p(di), where M(di, dj) is the probability of following a link from di to dj (1/outdegree of di if di links to dj, 0 otherwise) and alpha is the probability of a random jump. Repeat until the scores converge.
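A small power-iteration sketch of this computation (the link graph and damping value alpha = 0.15 are illustrative):

```python
def pagerank(links, alpha=0.15, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start with a uniform distribution
    for _ in range(iterations):
        new_rank = {p: alpha / n for p in pages}  # random-jump component
        for page, targets in links.items():
            if targets:
                share = (1 - alpha) * rank[page] / len(targets)
                for target in targets:            # follow-a-link component
                    new_rank[target] += share
            else:
                # Dangling page: spread its mass uniformly over all pages.
                for target in pages:
                    new_rank[target] += (1 - alpha) * rank[page] / n
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))   # "c" ends up with the highest score
```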
What’s the basic idea of learning to rank?
The basic idea is to fit a scoring function to training data that already has relevance judgments, so that the function learns how to combine multiple features into a ranking score.
How can logistic regression be used to combine multiple features for improving ranking accuracy of a search engine?
First estimate the betas (coefficients) by maximizing the likelihood of the training data. Once the betas are known, we can plug in the feature values Xi(Q, D) computed for a new query and a new document to generate a score for D w.r.t. Q.
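A sketch of the scoring step, with hypothetical learned coefficients (betas) and feature values Xi(Q, D) such as a retrieval score, PageRank, and an anchor-text match indicator:

```python
import math

def ranking_score(features, betas, bias):
    # P(R=1 | Q, D) = 1 / (1 + exp(-(bias + sum_i beta_i * X_i(Q, D))))
    z = bias + sum(b * x for b, x in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned coefficients and feature values for one query-document pair.
betas = [1.2, 0.8, 0.5]
features = [2.3, 0.4, 1.0]
print(ranking_score(features, betas, bias=-1.0))   # score used to rank D for Q
```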
What is content-based information filtering?
– Look at what items U likes, and then check if X is similar
Item similarity => content-based filtering
How can we use a linear utility function to evaluate a filtering system?
Example with coefficients (3, -2): utility = 3 * (# good items delivered) - 2 * (# bad items delivered). Delivering a good item gains 3, while delivering a bad item costs 2, so the system should deliver as many good articles as possible and minimize delivery of bad articles.
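In code, that (3, -2) utility is just:

```python
def linear_utility(num_good_delivered, num_bad_delivered, good_coeff=3, bad_coeff=-2):
    # Utility = 3 * (good items delivered) - 2 * (bad items delivered)
    return good_coeff * num_good_delivered + bad_coeff * num_bad_delivered

print(linear_utility(10, 4))   # 3*10 - 2*4 = 22
```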
How should we set the coefficients in such a linear utility function?
They need to be set depending on the specific application (e.g., 3 and -2): the coefficients reflect how much a correctly delivered item is worth relative to the cost of delivering a bad one.
How can we extend a retrieval system to perform content-based information filtering?
• “Reuse” retrieval techniques to score documents
• Use a score threshold for filtering decision
• Learn to improve scoring with traditional feedback
• New approaches to threshold setting and learning
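A minimal sketch of the "reuse retrieval scoring plus threshold" idea (the profile term weights and the threshold value here are made up; a real system would learn both from user feedback):

```python
def score(document_terms, user_profile):
    # Reuse a retrieval-style scoring function: dot product between the
    # document's term counts and the user profile's term weights.
    return sum(count * user_profile.get(term, 0.0)
               for term, count in document_terms.items())

def filter_stream(documents, user_profile, threshold=1.0):
    # Deliver a document only if its score clears the threshold.
    return [doc_id for doc_id, terms in documents.items()
            if score(terms, user_profile) >= threshold]

user_profile = {"automobile": 2.0, "engine": 1.5, "review": 0.5}   # learned from feedback
incoming = {
    "doc1": {"automobile": 3, "review": 1},
    "doc2": {"politics": 2, "election": 4},
}
print(filter_stream(incoming, user_profile))   # ['doc1']
```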
What is exploration-exploitation tradeoff?
Exploitation means relying on what you have already learned about the user's preferences and not deviating much from it, which means not exploring further. Exploration means trying items you are less sure about in order to learn more. The tradeoff is balancing immediate utility against learning for the future.
How does the beta-gamma threshold learning algorithm work?
It sets the filtering threshold by interpolating between the zero-utility threshold (the lowest score at which delivering would still break even, which encourages exploration) and the threshold that is empirically optimal on the training data (exploitation). The interpolation weight depends on the parameters beta and gamma and on the number of training examples seen so far: with little data the threshold stays near the zero-utility threshold, and as more data accumulates it moves toward the empirically optimal threshold.
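A minimal sketch of that interpolation, assuming the commonly presented exponential-decay form for the mixing weight alpha (the threshold and parameter values below are arbitrary):

```python
import math

def beta_gamma_threshold(theta_zero, theta_optimal, num_training_examples,
                         beta=0.1, gamma=0.05):
    # alpha decays from 1 toward beta as training data accumulates,
    # shifting the threshold from the zero-utility point (exploration)
    # toward the empirically optimal point (exploitation).
    alpha = beta + (1 - beta) * math.exp(-num_training_examples * gamma)
    return alpha * theta_zero + (1 - alpha) * theta_optimal

print(beta_gamma_threshold(theta_zero=0.2, theta_optimal=0.6, num_training_examples=0))    # 0.2
print(beta_gamma_threshold(theta_zero=0.2, theta_optimal=0.6, num_training_examples=100))  # ~0.56
```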
What is the basic idea of collaborative filtering?
- Look at who likes X, and then check if U is similar
User similarity => collaborative filtering
How does the memory-based collaborative filtering algorithm work?
It stores all users' preference information; when making a prediction for a particular user, it retrieves the most similar users (neighbors) and uses their preferences, weighted by similarity, to predict this user's preference.
A "cold start" problem happens when there is not yet a sufficiently large number of user preferences available (e.g., for a new user or a new item).
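A toy memory-based prediction, assuming cosine similarity between users and a similarity-weighted average of the neighbors' ratings (the ratings data is made up):

```python
def predict_rating(target_user, item, ratings):
    """ratings: dict of user -> {item: rating}. Memory-based prediction:
    a similarity-weighted average of the neighbors' ratings for the item."""
    def similarity(u, v):
        # Cosine similarity over the items both users have rated.
        common = set(ratings[u]) & set(ratings[v])
        if not common:
            return 0.0
        dot = sum(ratings[u][i] * ratings[v][i] for i in common)
        norm_u = sum(r * r for r in ratings[u].values()) ** 0.5
        norm_v = sum(r * r for r in ratings[v].values()) ** 0.5
        return dot / (norm_u * norm_v)

    weighted_sum, weight_total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == target_user or item not in other_ratings:
            continue
        w = similarity(target_user, other)
        weighted_sum += w * other_ratings[item]
        weight_total += abs(w)
    return weighted_sum / weight_total if weight_total else None

ratings = {
    "alice": {"m1": 5, "m2": 3},
    "bob":   {"m1": 4, "m2": 2, "m3": 4},
    "carol": {"m1": 1, "m3": 2},
}
print(predict_rating("alice", "m3", ratings))   # weighted toward bob's rating of 4
```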
Key Phrases/Concepts
- Scalability, efficiency
- Spam
- Crawler, focused crawling, incremental crawling
- Google File System (GFS)
- MapReduce
- Link analysis, anchor text
- PageRank, HITS
- Learning to rank, features, logistic regression
- Content-based filtering
- Collaborative filtering
- Beta-gamma threshold learning
- Linear utility
- User profile
- Exploration-exploitation tradeoff
- Memory-based collaborative filtering
- Cold start