General Concepts of PyEsp

The big problem with web search is too much data. PyEsp will try to automate the process of browsing through search engine results. The main concept here is semantic profiling of pages, but I will start with simpler concepts.

It is all done on the client computer

PyEsp will gather and store a lot of information about the user; it will basically learn the user's preferences and domains of interest. But all this information will be stored only on the user's computer. Moreover, the user will be able to browse through the data and delete or modify whatever he/she wants.

Building the search domain

For each query, we want to build a search domain of about 100 pages (the number will be controlled by the user), and it is important that the pages be diverse. We will build this search domain by querying multiple engines with multiple queries. The queries will be based on the user's query and on terms from previous similar queries that produced "useful" pages.
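To make the idea concrete, here is a minimal sketch of one way to merge several result lists into a single diverse search domain. The function name build_search_domain, its parameters, and the round-robin strategy are my assumptions for illustration, not a decided design. Each inner list stands for the ranked URLs returned by one engine for one query.

```python
from itertools import zip_longest


def build_search_domain(result_lists, limit=100):
    """Interleave several ranked URL lists into one de-duplicated list.

    Hypothetical helper: `result_lists` holds one ranked list of URLs per
    (engine, query) pair, and `limit` is the user-controlled domain size.
    """
    domain, seen = [], set()
    # Take the first hit of every list, then the second hit of every list,
    # and so on, so that no single engine or query dominates the domain.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is None or url in seen:
                continue  # exhausted list, or a page we already have
            seen.add(url)
            domain.append(url)
            if len(domain) >= limit:
                return domain
    return domain
```

Taking hits round-robin rather than concatenating the lists is one simple way to keep the domain diverse when it is cut off at the size limit.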

Downloading the pages

Since we are downloading the pages from different web servers, we will be able to download more than one page at a time. Downloading 10 pages simultaneously works well on my system, and a search domain of 100 pages takes a reasonable time to download. PyEsp will start to analyze a page as its content arrives; it will not need to wait for the entire page to download. Pages that load faster will get profiled faster. PyEsp will cache the downloaded pages, so if a page comes up again in a search domain (within some time limit) it will not be downloaded again.
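A rough sketch of the download step follows, assuming a thread pool of 10 workers and a simple in-memory cache with a time limit. The names PageCache and download_pages are hypothetical, and fetch is a caller-supplied function that returns the text of a URL. As a simplification, this sketch hands over whole pages as they finish, while the design above calls for analyzing a page while its content is still arriving.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


class PageCache:
    """In-memory cache that forgets pages older than max_age_seconds."""

    def __init__(self, max_age_seconds=3600.0):
        self.max_age_seconds = max_age_seconds
        self._entries = {}  # url -> (time stored, page text)

    def get(self, url):
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, text = entry
        if time.monotonic() - stored_at > self.max_age_seconds:
            del self._entries[url]  # too old, download it again
            return None
        return text

    def put(self, url, text):
        self._entries[url] = (time.monotonic(), text)


def download_pages(urls, fetch, cache, max_workers=10):
    """Yield (url, text) pairs as pages become available, fastest first."""
    missing = []
    for url in urls:
        text = cache.get(url)
        if text is None:
            missing.append(url)
        else:
            yield url, text  # cached pages are available immediately
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in missing}
        for future in as_completed(futures):
            url = futures[future]
            try:
                text = future.result()
            except Exception:
                continue  # skip pages that failed to download
            cache.put(url, text)
            yield url, text
```

Because download_pages is a generator, the profiling code can consume pages in the order they finish, so pages that load faster get profiled faster.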

User Interface

The user interface should be an RIA (rich Internet application) in the user's web browser. This does not mean that PyEsp is a web application. The server side, which is the main program, will run on the same computer. This will let the user use PyEsp from the web browser, an advantage in my view.
The user will see the pages (the search domain) get sorted as the downloading/profiling process progresses, and he/she will be able to pause or stop the process at any time. The user will be able to give feedback to PyEsp by selecting an entire page or a section of a page and marking it as "useful" for one or more profilers (more on this below).

Profiling

Profiling of pages is the main purpose of this application. PyEsp will learn about the user who uses it and will create a "user-profile": this is all the knowledge we have about this user. When it tests a page (or text), it tries to match it to a subset of the user-profile (the "query-profile").

After the user enters search terms and the application builds the search domain, it starts to match the text of the pages to the query-profile. Initially there will be no stored data about the user (no user-profile and so no query-profile), so no profiling or sorting can be done. When the user finds a "good" page or a "good" paragraph, he/she should tell the application about it. The application will create a profile (or sub-profile) from this text, link it with the terms the user used to find the page, and add it to the user-profile. When the user makes a new query and the application determines that this sub-profile belongs in the query-profile (maybe by matching the new search terms to the terms linked to this sub-profile), it tries to match the pages in the new search domain with this sub-profile (and with the other sub-profiles that are in the query-profile), and sorts the pages accordingly.

Here is the sequence of events that happen when the user enters a search query:

1. PyEsp generates a list of related terms by looking at the user-profile.
2. PyEsp builds the search domain by querying one or more search engines with the search terms and the related terms.
3. PyEsp builds the query-profile by pulling all sub-profiles that relate to this query from the user-profile.
4. PyEsp applies initial profiling to each page by analyzing the short text attached to every page in the search engine results (or matching it to the query-profile). It sorts the pages according to this initial profiling.
5. PyEsp downloads the pages in the new order it created and feeds the text to the query-profile. It creates a rank for every page according to how close it is to the query-profile. It continually re-sorts the pages as their ranks change.
6. The user looks at the pages. If he/she finds a "good" text (page or paragraph), the user "tells" PyEsp about it (in the context of some profiler, more below).
7. PyEsp creates a new sub-profile from this text and adds it to the user-profile. It links this sub-profile with the search terms.
8. PyEsp joins every two "similar" sub-profiles to create one more inclusive and more influential sub-profile (more influential because a match to this sub-profile is basically a match to two sub-profiles).
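The middle of this sequence (building the query-profile and ranking the pages) can be sketched as follows. Everything here is an assumption made for illustration: sub-profiles are shown as plain dictionaries with a "terms" list, the query-profile is picked by simple overlap of search terms, closeness is a caller-supplied function that scores one text against one sub-profile, and a page's rank is the sum of its scores.

```python
def select_query_profile(user_profile, search_terms):
    """Return the sub-profiles whose linked terms overlap the search terms."""
    wanted = {term.lower() for term in search_terms}
    return [
        sub for sub in user_profile
        if wanted & {term.lower() for term in sub["terms"]}
    ]


def rank_pages(pages, query_profile, closeness):
    """Return page URLs sorted from the closest match to the weakest.

    `pages` maps URL -> text received so far, and `closeness(sub, text)`
    is a hypothetical scoring function supplied by a profiler.
    """
    def rank(url):
        # Assumption: a page that matches several sub-profiles ranks higher.
        return sum(closeness(sub, pages[url]) for sub in query_profile)

    return sorted(pages, key=rank, reverse=True)
```

Calling rank_pages again each time more text arrives gives the continual re-sorting described in the sequence. With an empty query-profile every rank is zero, so the pages keep their original order, which matches the initial state in which nothing is known about the user.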

PyEsp will be able to work with more than one "profiler" (a profiler is a class and a sub-profile is an instance of a profiler). Joining two sub-profiles will be possible only if they are instances of the same profiler. Profilers will be plug-ins: a user will be able to choose the profilers he/she works with, and other programmers or linguistics experts will be able to add new profilers to the system.
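As a sketch of how this could look in Python, the following assumes an abstract base class named Profiler, a plug-in registry named PROFILERS, and two toy profilers. All of these names and method signatures are hypothetical; the real plug-in interface is still to be designed.

```python
from abc import ABC, abstractmethod

PROFILERS = {}  # hypothetical plug-in registry: class name -> profiler class


def register(cls):
    """Class decorator that makes a profiler available to the user."""
    PROFILERS[cls.__name__] = cls
    return cls


class Profiler(ABC):
    """Base class; every instance is one sub-profile built from a text."""

    def __init__(self, text):
        self.learn(text)

    @abstractmethod
    def learn(self, text):
        """Store knowledge about the meaning of `text`."""

    @abstractmethod
    def match(self, text):
        """Return how close `text` is to this sub-profile."""

    @abstractmethod
    def merged_with(self, other):
        """Return a new sub-profile that covers both inputs."""

    def join(self, other):
        # Joining is allowed only between instances of the same profiler.
        if type(self) is not type(other):
            raise TypeError("can only join sub-profiles of the same profiler")
        return self.merged_with(other)


@register
class KeywordProfiler(Profiler):
    """Toy general profiler: remembers the set of words in the text."""

    def learn(self, text):
        self.words = set(text.lower().split())

    def match(self, text):
        if not self.words:
            return 0.0
        return len(self.words & set(text.lower().split())) / len(self.words)

    def merged_with(self, other):
        joined = type(self)("")
        joined.words = self.words | other.words
        return joined


@register
class MedicalProfiler(KeywordProfiler):
    """Stand-in for a specialized profiler; a real one would differ."""
```

The type check in join is what keeps a general sub-profile from being merged with a specialized one, even when both happen to store their knowledge in the same way.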

Some profilers should be general: they will analyze the semantic meaning of the text in a general context. But some profilers may be specific, for example a profiler that specializes in analyzing programming-related text, or a profiler that specializes in medical text, and so on.

An instance of a profiler class (a sub-profile) will store some knowledge about the meaning of some text. It will then be able to match other texts to itself and determine how "close" the two texts are (the one the instance was built from and the one it tries to match).
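For example, a very simple (and not at all semantic) measure of closeness is the cosine similarity between word counts. The sketch below is only an illustration of the interface, with a hypothetical class name; a real profiler would need much richer knowledge than word frequencies.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphanumeric words in a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


class WordCountSubProfile:
    """Toy sub-profile: the knowledge it stores is a table of word counts."""

    def __init__(self, text):
        self.counts = word_counts(text)

    def match(self, text):
        """Return the cosine similarity to `text`, between 0.0 and 1.0."""
        other = word_counts(text)
        dot = sum(self.counts[word] * other[word] for word in self.counts)
        norm = math.sqrt(sum(c * c for c in self.counts.values())) * \
            math.sqrt(sum(c * c for c in other.values()))
        return dot / norm if norm else 0.0
```

A text identical to the source text scores 1.0, and a text that shares no words with it scores 0.0, so the result can be used directly as a rank for sorting pages.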

Finally, a good profiler will have to be built upon extensive linguistic knowledge. However, our concept here is to build a framework that will work with relatively simple profilers, and then to develop stronger and stronger profilers (with a lot of help ;-).