General Concepts of PyEsp
The big problem with web search is too much data. PyEsp will try to automate the
process of browsing through search engine results. The main concept here is
semantic profiling of pages, but I will start with simpler concepts.
It is all done on the client computer
PyEsp will gather and store a lot of information about the user. It will
basically learn the user's preferences and domains of interest. But all this
information will be stored only on the user's computer. Moreover, the user will
be able to browse through the data and delete or modify whatever he/she wants.
Building the search domain
For each query, we want to build a search domain of about 100 pages (the number
will be controlled by the user), yet it is important that the pages be
diverse. We will build this search domain by querying multiple engines with
multiple queries. The queries will be based on the user's query and on terms
from previous similar queries that produced "useful" pages.
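One way to keep the search domain diverse is to interleave the result lists
instead of concatenating them. The sketch below is only an illustration under
that assumption: build_search_domain and its parameters are hypothetical names,
and the result lists are assumed to be plain lists of URLs, best first.

```python
from itertools import zip_longest


def build_search_domain(result_lists, max_pages=100):
    """Interleave several result lists into one de-duplicated domain.

    result_lists: one list of URLs per (engine, query) pair, best first.
    """
    domain = []
    seen = set()
    # Round-robin over the lists so no single engine or query dominates.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if len(domain) >= max_pages:
                return domain
            if url is None or url in seen:
                continue  # exhausted list, or a page we already have
            seen.add(url)
            domain.append(url)
    return domain


# Hypothetical results from two engines for the same user query.
engine_a = ["http://a.example/1", "http://a.example/2", "http://shared.example/"]
engine_b = ["http://shared.example/", "http://b.example/1"]
print(build_search_domain([engine_a, engine_b], max_pages=4))
```

A page returned by several engines is kept once, at the best position in which
it appeared.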
Downloading the pages
Since we are downloading the pages from different web servers, we will be able
to download more than one page at a time. Downloading 10 pages simultaneously
works well on my system, and a search domain of 100 pages takes a reasonable
time to download. PyEsp will start to analyze a page as its content arrives; it
will not need to wait for the entire page to download. Pages that load faster
will get profiled sooner. PyEsp will cache the downloaded pages, so if a page
comes up again in a search domain (within some time limit) it will not be
downloaded again.
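The following is a minimal sketch of parallel downloading with a time-limited
cache, not the actual PyEsp code. PageCache, download_domain and the fetch
callable are hypothetical names; the one-hour default age is an arbitrary
assumption. For simplicity it hands over whole pages as they finish, rather
than analyzing a page while its content is still arriving.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


class PageCache:
    """In-memory cache whose entries expire after max_age seconds."""

    def __init__(self, max_age=3600.0, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self._entries = {}  # url -> (time stored, page text)

    def get(self, url):
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, text = entry
        if self.clock() - stored_at > self.max_age:
            self._entries.pop(url, None)  # too old, force a fresh download
            return None
        return text

    def put(self, url, text):
        self._entries[url] = (self.clock(), text)


def download_domain(urls, fetch, cache, workers=10):
    """Yield (url, text) pairs in the order the pages become available.

    fetch is any callable that takes a URL and returns the page text.
    """
    def load(url):
        text = cache.get(url)
        if text is None:
            text = fetch(url)
            cache.put(url, text)
        return url, text

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(load, url) for url in urls]
        for future in as_completed(futures):
            yield future.result()


def fake_fetch(url):
    # Stand-in for a real HTTP download, so the sketch runs offline.
    return "<html>content of %s</html>" % url


cache = PageCache(max_age=600)
urls = ["http://a.example/", "http://b.example/"]
for url, text in download_domain(urls, fake_fetch, cache):
    print(url, len(text))
```

Because pages are yielded as they complete, faster pages reach the profiler
first. A real fetch function would also need to handle timeouts and failed
downloads, which this sketch lets propagate as exceptions.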
User Interface
The user interface should be a rich Internet application (RIA) running in the
user's web browser. This does not mean that PyEsp is a web application. The
server side, which is the main program, will run on the user's own computer.
This will let the user use PyEsp from the web browser, an advantage in my view.
The user will see the pages (the search domain) get sorted as the
downloading/profiling process progresses, and he/she will be able to pause or
stop the process at any time. The user will be able to give feedback to PyEsp
by selecting an entire page or a section of a page and marking it as "useful"
for one or more profilers (more on this below).
Profiling
Profiling of pages is the main purpose of this application. PyEsp will learn
about its user and will create a "user-profile", which holds all the knowledge
we have about this user. When it tests a page (or text), it tries to match it
to a subset of the user-profile (the "query-profile").
After the user enters search terms and the application builds the search
domain, it starts to match each page's text to the query-profile. Initially
there will be no stored data about the user (no user-profile and so no
query-profile), so no profiling or sorting can be done. When the user finds a
"good" page or a "good" paragraph, he/she should tell the application about
it. The application will create a profile (or sub-profile) from this text,
link it with the terms the user used to find the page, and add it to the
user-profile. When the user makes a new query and the application determines
that this sub-profile belongs in the query-profile (maybe by matching the new
search terms to the terms linked to this sub-profile), it tries to match the
pages in the new search domain with this sub-profile (and with the other
sub-profiles in the query-profile), and sorts the pages accordingly.
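As a rough illustration of the "matching the new search terms to the linked
terms" idea, here is a small sketch. It assumes the user-profile is a list of
(linked terms, sub-profile) pairs and that one shared term is enough to
select a sub-profile; select_query_profile and min_overlap are hypothetical
names, and the real selection rule may well be smarter.

```python
def select_query_profile(user_profile, search_terms, min_overlap=1):
    """Return the sub-profiles whose linked terms overlap the new query.

    user_profile: list of (linked_terms, sub_profile) pairs.
    """
    query = {term.lower() for term in search_terms}
    selected = []
    for linked_terms, sub_profile in user_profile:
        linked = {term.lower() for term in linked_terms}
        # Count the terms the new query shares with the older query.
        if len(query & linked) >= min_overlap:
            selected.append(sub_profile)
    return selected


# Sub-profiles are represented by plain strings here for brevity.
user_profile = [
    (["python", "threads"], "sub-profile A"),
    (["cooking", "pasta"], "sub-profile B"),
]
print(select_query_profile(user_profile, ["Python", "sockets"]))
# prints ['sub-profile A']
```

With an empty user-profile the function returns an empty list, which matches
the initial state in which no profiling or sorting can be done.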
Here is the sequence of events that happens when the user enters a search query:
- PyEsp generates a list of related terms by looking at the user-profile.
- PyEsp builds the search domain by querying search engine(s) with the search
  terms and related terms.
- PyEsp builds the query-profile by pulling all sub-profiles that relate to
  this query from the user-profile.
- PyEsp applies initial profiling to each page by analyzing the short text
  attached to every page in the search engine results (or matching it to the
  query-profile). It will sort the pages according to this initial profiling.
- PyEsp downloads the pages in the new order it created, and feeds the text to
  the query-profile. It creates a rank for every page according to how close
  the page is to the query-profile. It continually re-sorts the pages as their
  ranks change.
- The user looks at the pages. If he/she finds a "good" text (page or
  paragraph), the user tells PyEsp about it (in the context of some profiler -
  more below).
- PyEsp creates a new sub-profile from this text and adds it to the
  user-profile. It links this sub-profile with the search terms.
- PyEsp will join every two "similar" sub-profiles to create one more
  inclusive and more influential sub-profile (more influential because a match
  to this sub-profile is basically a match of two sub-profiles).
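The "sort, then keep re-sorting" part of the sequence above can be sketched
with a small container. This is an assumption about one possible structure,
not the PyEsp design: RankedDomain is a hypothetical name, and ranks are taken
to be plain numbers where higher means closer to the query-profile.

```python
class RankedDomain:
    """Keeps the pages of a search domain ordered by their current rank."""

    def __init__(self, initial_ranks):
        # initial_ranks: url -> rank from the search-result snippet.
        self.ranks = dict(initial_ranks)

    def update(self, url, rank):
        """Record a new rank, e.g. after more of the page has been profiled."""
        self.ranks[url] = rank

    def ordered(self):
        """Return URLs best first; ties keep their insertion order."""
        return sorted(self.ranks, key=self.ranks.get, reverse=True)


domain = RankedDomain({
    "http://a.example/": 0.2,
    "http://b.example/": 0.5,
    "http://c.example/": 0.1,
})
print(domain.ordered())  # b, a, c: this is also the download order
domain.update("http://c.example/", 0.9)
print(domain.ordered())  # c, b, a: the full text of c matched well
```

Re-sorting the whole list on every call is fine for a domain of about 100
pages.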
PyEsp will be able to work with more than one "profiler" (a profiler is a class
and a sub-profile is an instance of a profiler). Joining two sub-profiles will
be possible only if they are instances of the same profiler. Profilers will be
plug-ins: a user will be able to choose the profilers he/she works with, and
other programmers and linguistics experts will be able to add new profilers to
the system.
Some profilers should be general: they will analyze the semantic meaning of the
text in a general context. Others may be specific, for example a profiler that
specializes in programming-related text, or a profiler that specializes in
medical text, and so on.
An instance of a profiler class (a sub-profile) will store some knowledge about
the meaning of some text. It will then be able to match other texts against
itself and determine how "close" the two texts are (the one the instance was
built from and the one it tries to match).
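To make the class/instance relationship concrete, here is a deliberately
simple profiler sketch. It is not semantic at all: it assumes a sub-profile is
just a bag of word counts and measures closeness by cosine similarity. The
class name and the match and join method names are hypothetical, not a PyEsp
interface.

```python
import math
import re
from collections import Counter


def _words(text):
    # Lower-case the text and split it into alphanumeric words.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


class WordCountProfiler:
    """Toy profiler: each instance (sub-profile) is a bag of word counts."""

    def __init__(self, text=""):
        self.counts = _words(text)

    def match(self, text):
        """Return the cosine similarity (about 0 to 1) between the texts."""
        other = _words(text)
        dot = sum(n * other[w] for w, n in self.counts.items())
        norm = (math.sqrt(sum(n * n for n in self.counts.values()))
                * math.sqrt(sum(n * n for n in other.values())))
        return dot / norm if norm else 0.0

    def join(self, other):
        """Merge two sub-profiles of the same profiler into a new one."""
        if type(other) is not type(self):
            raise TypeError("can only join sub-profiles of the same profiler")
        joined = type(self)()
        joined.counts = self.counts + other.counts
        return joined


threads = WordCountProfiler("python threads and locks")
sockets = WordCountProfiler("python sockets and select")
networking = threads.join(sockets)
print(networking.match("a python tutorial about sockets and threads"))
print(networking.match("pasta recipes"))
```

The type check in join mirrors the rule that only instances of the same
profiler can be joined. A stronger profiler would replace the word counts with
real linguistic knowledge while keeping the same two operations.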
Finally, a good profiler will have to be built upon extensive linguistic
knowledge. However, our concept here is to build a framework that will work
with relatively simple profilers, and then to develop stronger and stronger
profilers (with a lot of help ;-).