Jiri Donat's weblog: UPI Defined

UPI – Unique Personal Identificator – is a new and open syntactic layer of the internet that uniquely identifies both authors and users of internet content. It can be applied to various forms of electronic communication (web pages, discussion forums, email, and even IM and VoIP calls). As a result, the traditional PageRank method will be superseded by a truly personalized approach. We will not only see search results sorted by our personal preferences, but will also be able to limit web search to “people similar to us”.

The Motivation

Let us start with several concrete examples. Should UPI be widely adopted, the following internet search queries will become possible:

Find new ideas on a certain subject that were written or read by people similar to me.
Feed me any new ideas from people whose reasoning and thinking I like.
Find a fishery expert (or any other expert I need right now) who has interests and a way of thinking similar to mine.
Find a business partner who shares my business specialization and with whom I will find it easy to communicate.
Find a customer for my product or service whom I can easily target personally.

The Method

There are many systems today that attempt to solve similar tasks. Generally speaking, we can divide these systems into the following two categories:

1) Systems that try to extract more from the context around certain keywords (e.g., ZoomInfo, which searches for names in the context of automatically selected adjectives), and
2) Systems that try to add explicit information from the user to the existing web content.

The second group can be further divided into

a) systems that collect the user’s behavior, either directly or through the user’s work with links (e.g. Google Reader, Google Personalized Search, Flork, del.icio.us, or StumbleUpon)

b) systems that try to add an additional piece of syntax to the internet (e.g. the Friend of a Friend (FOAF) project)

The UPI system falls into the 2b) category.

Acknowledgements

The UPI system was invented in a discussion (here is its full content in Czech language) that I moderated on the discussion server Lupa.cz this February. Several members of the community contributed significant pieces of the system design, so the idea I am now describing is by no means my work alone. My special thanks go to Jan Bilek, who created several important elements of the system.

How it works

First of all, the user chooses his or her unique UPI. Although the easiest technical solution would be to go through one centralized registration service, this centralized approach would very likely harm the system’s adoption. We therefore envision multiple competing services, so-called identity servers, handling the registration process. The only thing that must be defined centrally is the UPI syntax. We propose the following:

chosen_name#identity_server_URL

This syntax mirrors the familiar email syntax
chosen_name@e-mail_server_URL.
Our approach makes it easy to choose UPIs that are truly unique and yet assigned in a decentralized way. The UPI identifier is easily distinguishable from the rest of web content; in other words, it creates a new piece of web syntax that is easily understandable to both human readers and machines. It also points directly to the user’s home identity server, which helps to resolve potential conflicts if more than one UPI identity server page (the so-called reading profile, see below) is found for a particular user.
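To make the proposed syntax concrete, here is a minimal Python sketch of how a tool might split a UPI into its two parts. The names `UPI` and `parse_upi` and the example address are illustrative assumptions, not part of the proposal; the only thing taken from the text above is the `chosen_name#identity_server_URL` shape.

```python
from typing import NamedTuple


class UPI(NamedTuple):
    """The two parts of a UPI (hypothetical helper type)."""
    name: str
    identity_server: str


def parse_upi(text: str) -> UPI:
    """Split 'chosen_name#identity_server_URL' at the first '#'."""
    name, sep, server = text.partition("#")
    if not sep or not name or not server:
        raise ValueError(f"not a valid UPI: {text!r}")
    return UPI(name, server)


# Example with a made-up identity server.
upi = parse_upi("jiri#id.example.org")
print(upi.name, upi.identity_server)
```

Because the identity server is part of the identifier, a crawler that has only the UPI already knows where to look first for the user's profile.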

The Role of Identity Servers

The purpose of UPI is to uniquely identify a particular user in all of his communication activities. To collect as much information as possible, we must cover (that is, uniquely identify) both the reading and the writing activities of the user. To allow for maximum adoption, the system itself should not demand too much effort from the user. We thus propose to include all functions of the system in a simple browser plug-in that will do almost everything for the user automatically. The user will only be required to sign in to this service on the device he is going to use. The plug-in can be provided by any third party (in most cases by a search engine); this area is fully open to competition, too. In addition, the plug-in can automatically identify existing UPIs on the web pages we view and turn them into miniature clickable icons or even pictures of users; clicking on such a picture will show a context menu related to the search services of a particular search engine. It can, for example, show us the pages we have both read, the discussions in which we both participated, or even the entire history of our communication.
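As a rough illustration of how a plug-in could spot UPIs in page text, the Python sketch below uses a regular expression. The pattern is an assumption: the proposal does not say which characters a name or server may contain, so this version accepts letters, digits, dots, hyphens and underscores in the name and a dotted host name after the `#`. The function name `find_upis` and the sample text are hypothetical.

```python
import re

# Assumed character set; the proposal itself does not fix one.
UPI_PATTERN = re.compile(
    r"\b([A-Za-z0-9_.-]+)#([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)\b"
)


def find_upis(text: str) -> list[str]:
    """Return every substring of `text` that looks like name#host."""
    return [f"{m.group(1)}#{m.group(2)}" for m in UPI_PATTERN.finditer(text)]


page_text = "Posted by jiri#id.example.org, reply from jan#upi.example.net."
print(find_upis(page_text))
```

A real plug-in would also have to tell UPIs apart from ordinary URL fragments such as `page.html#section`, which this simple pattern only partly handles by requiring a dotted host after the `#`.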

Tracing Authors

The easy part of unique identification is the “active”, or authoring, part. The browser plug-in will sign everything we publish on the web. This is technically very easy: the browser contains a button that inserts our UPI into any post, article, or even email we write. However, the UPI syntax is so simple that users can also sign any document manually, much as they add an email address to their posts.

Tracing Readers

A much more difficult part of the system is tracing the reading behavior of a particular user. In an ideal world, every page would be signed with the UPI of its author and would at the same time contain the UPIs of all its readers; this highly formalized content would then be publicly available to all competing search engines. This would be an ideal form of the web!

This will of course not happen (most web pages are “read-only”), but we can achieve virtually the same thing by placing our reading history on any publicly visible page on the internet. Our plug-in will automatically add the URL of every page we visit to our “reading profile”, a web page with a specific syntax that can be located on any server we have write access to. The server that hosts this page will then be called the identity server. Over time, dedicated identity servers will certainly appear on the internet, but to use the UPI system we need nothing more than a single web page we have write access to.
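The text does not define the syntax of a reading profile, so the following Python sketch invents a deliberately simple one for illustration only: a first line carrying the UPI, followed by one line per visit with an ISO 8601 timestamp and the URL. The function names and the format itself are assumptions.

```python
from datetime import datetime, timezone


def new_profile(upi: str) -> str:
    """Start an empty profile; the 'UPI:' header line is a made-up format."""
    return f"UPI: {upi}\n"


def record_visit(profile: str, url: str, visited_at: datetime) -> str:
    """Append one visit as '<ISO timestamp> <url>'."""
    return profile + f"{visited_at.isoformat()} {url}\n"


def read_visits(profile: str) -> list[tuple[datetime, str]]:
    """Parse the visit lines back into (timestamp, url) pairs."""
    visits = []
    for line in profile.splitlines()[1:]:  # skip the header line
        stamp, url = line.split(" ", 1)
        visits.append((datetime.fromisoformat(stamp), url))
    return visits


profile = new_profile("jiri#id.example.org")
profile = record_visit(
    profile,
    "http://example.com/article",
    datetime(2007, 3, 1, 12, 0, tzinfo=timezone.utc),
)
print(profile)
```

Anything a crawler can parse reliably would do; the point is only that the profile is an ordinary public page that the plug-in appends to.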

The Role of Web Search Engines

As soon as web content is signed with the UPIs of its authors and linked to its readers through reading profiles, the main part can begin. All this information is publicly available, so competition between different search engines in processing this valuable information can start. The main outcome of this competition will be the implementation of a “people similar to me” search function. Let us underline that the UPI concept will not become a competitive advantage of any particular web search service; it will serve all of them, both general and specialized, in creating better personalized search.

How Will Search Engines Process UPI?

The implementation of the “search people similar to me” function will revolve around the family of statistical cluster analysis methods. The algorithm might look like this:

For each user, the search engine searches the web for the person’s UPI and for his reading profile. If multiple reading profiles are found, it resolves the conflict. The information found is then transformed into a multidimensional representation of the user that will serve as input for cluster analysis. This representation is to a certain extent similar to the FOAF project, but it is much richer in information and, in addition, it evolves over time and so reflects changes in the user’s behavior. The actual realization of this transformation will become an area of competition between different web search services.
After creating the representation for all UPI users, the cluster analysis starts. For each user, the search engine calculates his “distance” to all other users. The detailed realization of the cluster analysis will become a competitive field, too. As a result, we get a two-dimensional matrix of mutual relations between users. This matrix will then become a direct, personalized successor to PageRank; it will be used for every search query the particular user carries out from then on, until the next analysis is performed.
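The distance step can be sketched in a few lines of Python. This is only an illustration under strong assumptions: each user is represented simply as the set of URLs found in his reading profile, and the Jaccard distance stands in for whatever measure a real search engine would choose. The text leaves both choices open, and this sketch computes only the pairwise distances, not a full cluster analysis. All names and data are hypothetical.

```python
def jaccard_distance(a: set, b: set) -> float:
    """0.0 for identical sets, 1.0 for sets with nothing in common."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(profiles: dict) -> dict:
    """Map every pair of UPIs to the distance between their URL sets."""
    users = sorted(profiles)
    return {
        u: {v: jaccard_distance(profiles[u], profiles[v]) for v in users}
        for u in users
    }


def most_similar(matrix: dict, user: str, k: int = 3) -> list:
    """Return the k users closest to `user`, nearest first."""
    others = [(d, v) for v, d in matrix[user].items() if v != user]
    return [v for _, v in sorted(others)[:k]]


# Made-up reading profiles reduced to sets of URLs.
profiles = {
    "alice#id.example.org": {"a", "b", "c"},
    "bob#id.example.org": {"b", "c", "d"},
    "carol#id.example.net": {"x", "y"},
}
matrix = distance_matrix(profiles)
print(most_similar(matrix, "alice#id.example.org", k=1))
```

A search engine could then boost results written or read by the nearest users when ranking a query for `alice#id.example.org`.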

Advantages of Openness

Because the UPI concept itself will not serve as a competitive advantage to any particular search service, all search services will be encouraged to optimize its implementation.

Competition among search engines will revolve around refining the following areas:

processing the raw content of web pages (web crawlers may, for example, be able to identify UPIs not only on the same page but also within the same discussion thread, or analyze the frequency of communication between particular UPIs in participating e-mail or IM systems);
further processing of UPI-based information (for example, the “aging” of my reading or publishing history could be optimized for particular search scenarios: how should the weight of pages visited or created decrease over time?);
representing user information in the form that will provide the best cluster analysis results;
the cluster analysis itself, which can be modified to best serve specific search queries.
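One possible answer to the “aging” question above is exponential decay, sketched here in Python. The half-life of 30 days is an arbitrary assumption, as are the function names; the text deliberately leaves the choice of aging scheme to each search engine.

```python
def aging_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Weight of a visit that is `age_days` old; halves every half-life."""
    return 0.5 ** (age_days / half_life_days)


def weighted_profile(visits, half_life_days: float = 30.0) -> dict:
    """Sum the decayed weights per URL from (url, age_days) pairs."""
    weights = {}
    for url, age_days in visits:
        weights[url] = weights.get(url, 0.0) + aging_weight(age_days, half_life_days)
    return weights


# A page read today and again 30 days ago counts more than one read 60 days ago.
print(weighted_profile([("a", 0), ("a", 30), ("b", 60)]))
```

A shorter half-life makes the profile follow current interests more closely; a longer one favors long-term habits.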

Transferability of UPIs

There are of course many things still to be resolved. What happens if I am not satisfied with my identity server? Or if the server stops its service entirely? There should be an easy procedure that allows me to move to another identity server while keeping my existing UPI (as a UPI should be persistent over time). So I should be able to transfer my UPI to any other identity server; the original server will then be responsible for pointing to my new identity server.

What happens if the original server stops working entirely? Even this situation can be resolved. My new identity server will always display my UPI on my publicly visible reading profile. This page will be searchable by search engines, so they can find my UPI page wherever it is located (because the UPI is unique and the reading profile has a given syntax, we know the list of reading profiles for each UPI). It is the user’s responsibility (and also in his own interest) to ensure that there is just one page with his profile on the internet. If there is more than one page, the user is informed about the problem by his search engine and is then asked to resolve the ambiguity. He can, for example, blacklist a fake “reading page” provided by a malicious server. Such a blacklist can, in addition, be shared by multiple search servers.
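The conflict handling described above could look roughly like the following Python sketch. The decision order is an assumption: drop blacklisted hosts first, accept a single remaining profile, otherwise prefer the one hosted on the server named in the UPI, and if that still does not settle it, return `None` so that the user can be asked. The function name and the example hosts are hypothetical.

```python
from urllib.parse import urlparse


def resolve_profile(upi: str, candidate_urls, blacklisted_hosts=frozenset()):
    """Pick one reading profile URL for `upi`, or None if still ambiguous."""
    home_host = upi.partition("#")[2].lower()
    allowed = [
        url for url in candidate_urls
        if urlparse(url).hostname not in blacklisted_hosts
    ]
    if len(allowed) == 1:
        return allowed[0]
    # Assumed tie-breaker: trust the server named in the UPI itself.
    at_home = [url for url in allowed if urlparse(url).hostname == home_host]
    if len(at_home) == 1:
        return at_home[0]
    return None  # the user has to resolve this manually


found = ["http://new.example.net/jiri", "http://evil.example.com/jiri"]
print(resolve_profile("jiri#old.example.org", found, {"evil.example.com"}))
```

In this example the home server `old.example.org` no longer hosts a profile, so the shared blacklist is what settles the conflict.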

Motivation of Users

A nice feature of the system is that it motivates its users to use it fairly and consistently. Soon after we start to use the system, we start to benefit from improved web search. If a user decides, for example, to stop using his UPI and replace it with another one, he instantly loses all the information he built up while using his former identity. In other words, the longer and the more consistently I use my UPI, the more I benefit from it.

OK, there is one special question: what if the user wants to visit some xxx pages? He is certainly not willing to have this part of his history publicly available in his profile. But that is fine, too. The user is free to have more than one UPI if he wants to. His second, “xxx-UPI” will help him find xxx content even better than before, while his “normal” UPI will help him in his normal work. By choosing the right UPI he actually gives the system additional information. The user is of course also free to sign out of his UPI toolbar entirely when he wants to visit pages he doesn’t want to share with anybody else. In that case, he can browse the content entirely anonymously.

So it is in the user’s own interest to use the system as frequently and as consistently as possible. Only this usage pattern will give him the best search benefits.

Conclusion

The main properties of the UPI system are openness and simplicity. It extends the current internet infrastructure and its proven algorithms, so it builds upon existing and verified systems. These properties maximize the system’s chances of mass adoption.
The system is not implemented yet, but I will be happy to assist anybody who is interested in implementing it.

Labels: internet community, internet search, unique personal identificator, upi