On 12-Dec-2009, I attended the Yahoo Big Thinkers conference at the Leela Palace hotel in Bangalore. This was the first Yahoo conference of my IT career. The meeting was addressed by Mr. Rajeev Ratsogi, head of research and development for web data mining. He gave a wonderful insight into how to extract information from billions of heterogeneous pages on the internet and manage it so that people can search for relevant information more effectively. The talk covered techniques for content acquisition, information extraction, and integration of information that belongs to the same entity from different webpages. Here are some key points he mentioned in the meeting:

1)Deep web content, i.e. the content hidden behind forms, is 500 times the size of the surface content.
2)Approximately 30% of webpages are crawled.
3)Approximately 31% of information is rich from search results.
4)Most pages are similarly structured.
5)The web is a vast repository of human knowledge.
6)Building a knowledge base for better search and helping users find relevant information is a challenging task.
7)Extracting and merging attributes of the same entity from different pages using indexing is also challenging.
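
To make point 7 a little more concrete, here is a minimal sketch of merging attributes of the same entity from different pages. It is not the method presented in the talk; it only assumes that each page has already been reduced to a dictionary of extracted attributes, and that a normalized name is good enough as an index key. The names normalize and merge_entities, and the sample records, are made up for illustration.

```python
from collections import defaultdict


def normalize(name):
    # Hypothetical entity key: lowercase the name and collapse whitespace.
    return " ".join(name.lower().split())


def merge_entities(records):
    """Index extracted records by entity key and merge their attributes."""
    index = defaultdict(dict)
    for record in records:
        key = normalize(record["name"])
        for attr, value in record.items():
            # Assumption: keep the first value seen for each attribute.
            index[key].setdefault(attr, value)
    return dict(index)


# Made-up records, as if extracted from two different webpages.
pages = [
    {"name": "Example  Cafe", "city": "Bangalore"},
    {"name": "example cafe", "phone": "555-0100"},
]
merged = merge_entities(pages)
print(merged["example cafe"])
```

Real pages rarely agree on names this neatly, which is part of why the speaker called the problem challenging; a simple normalized key like this would miss abbreviations, misspellings and different entities that share a name.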

The following techniques are currently being used to address these issues:

1)Data Mining.
2)Machine Learning.
3)Statistical Analysis.
4)Information Retrieval.

