Jiri Donat's weblog

Tuesday, April 11, 2006

The Power of LinkedIn

I’ve just joined the fast-growing Central European team of Capgemini. In my new role as Managing Consultant it will be my pleasure to develop the offerings of this global IT services and business consultancy along the lines of Service Oriented Architecture.

Elsewhere on this blog I discuss the business model of social networks. Indeed, the model is flawed, as today’s applications motivate participants to grow their “trusted” networks indefinitely (only today I received yet another invitation saying “it is always beneficial to increase the size and scope of ones network…”). So that conclusion still holds.

But of course, even if the business model is not right, that says nothing about the practical usability of these applications. Actually, I can serve as a good example myself. After being a member of the LinkedIn network for just two months, I was approached by headhunters working for Capgemini. They found my profile on LinkedIn around the same time that another big IT company found me on this network, too. Both companies then approached me directly and gave me the luxury of deciding between two good opportunities.

The lesson learned? Applications like social networks really work. Even before visionary projects like UPI happen (sorry, this is my child :-)), social networks are already turning the internet into a more structured place. By improving search in more and more specialized areas, the internet is gradually becoming a medium where you can find what you need.

So there is a certain symbolism in this for me. As of now, I have a new job. But at the same time, I have been shown that the world has changed.

Welcome to a networked world! It will be my pleasure to continue meeting you there.

Labels: internet community, internet recruitment, internet search, social networks

Sunday, April 02, 2006

UPI Defined

UPI – Unique Personal Identificator – is a new and open syntactic layer of the internet that uniquely identifies both authors and users of internet content. It can be applied to various forms of electronic communication (web pages, discussion forums, emails and even IMs and VoIP calls). As a result, the traditional PageRank method will be superseded by a truly personalized approach. We will not only see search results sorted by our personal preferences, but will even be able to limit web search to “people similar to us”.

The Motivation

Let us start with several concrete examples. Should UPI be massively adopted, the following internet search queries will become possible:

Find me new ideas on a certain subject that were written or read by people similar to me.
Feed me any new ideas from people whose reasoning and thinking I like.
Find a fishery expert (or any other expert I need right now) who has interests and a way of thinking similar to mine.
Find a business partner who shares my business specialization and who will be easy for me to communicate with.
Find a customer for my product or service whom I can easily target personally.

The Method

There are lots of systems today that attempt to solve similar tasks. Generally speaking, we can divide these systems into the following two categories:

Systems that try to dig more out of the context around certain keywords (e.g., Zoominfo, which searches for names in the context of automatically selected adjectives), and
Systems that try to add explicit information from the user to the existing web content.

The second group can be further divided into

a) systems that collect data on users’ behavior, either directly or through their work with links (e.g. Google Reader, Google Personalized Search, Flork, del.icio.us, or Stumbleupon)

b) systems that try to add an additional piece of syntax to the internet (e.g. the Friend of a Friend FOAF project)

The UPI system falls into the 2b) category.

Acknowledgements

The UPI system was invented in a discussion (here is its full content in Czech) that I moderated on the discussion server Lupa.cz this February. Several members of the community added significant pieces to the system design, so the idea I am now describing is by no means my sole work. My special thanks go to Jan Bilek, who created several important elements of the system.

How it works

First of all, the user chooses his or her unique UPI. Although the easiest technical way to implement this would be to go through one centralized registration service, such a centralized approach would very likely harm the system’s adoption. We are thus envisioning multiple competing services – so-called identity servers – to handle the registration process. The only thing that must be defined centrally is the UPI syntax. We propose the following one:

chosen_name#identity_server_URL

This syntax corresponds to the familiar email syntax
chosen_name@e-mail_server_URL.
Our approach makes it easy to choose UPIs that are truly unique and yet assigned in a decentralized way. A UPI is easy to distinguish from the rest of web content; in other words, it creates a new piece of web syntax that is easily understandable to both human readers and machines. It also points directly to the user’s home identity server, which helps to resolve potential conflicts if more than one UPI identity server page (the so-called reading profile – see below) is found for a particular user.
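
To make the proposed syntax concrete, here is a minimal sketch of how a tool might split a UPI into its two parts. The post fixes only the `chosen_name#identity_server_URL` shape; the names `UPI` and `parse_upi`, the example address, and the rule that the first `#` is the separator are my assumptions for illustration.

```python
from typing import NamedTuple


class UPI(NamedTuple):
    """The two parts of a UPI (hypothetical helper type)."""
    name: str
    identity_server: str


def parse_upi(text: str) -> UPI:
    """Split 'chosen_name#identity_server_URL' at the first '#'.

    Assumption: the chosen name itself never contains '#'.
    """
    name, separator, server = text.partition("#")
    if not separator or not name or not server:
        raise ValueError(f"not a valid UPI: {text!r}")
    return UPI(name, server)


upi = parse_upi("jiri#id.example.org")
print(upi.name, upi.identity_server)
```

Which characters are allowed in the name and in the server part is left open here, just as it is in the proposal.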

The Role of Identity Servers

The purpose of UPI is to uniquely identify a particular user in all his communication activities. To collect as much information as possible, we must cover (that is, uniquely identify) both the reading and the writing activities of the user. To allow for maximum adoption, the system itself should not demand too much effort from the user. We thus propose to include all functions of the system in a simple browser plug-in that will do almost everything for the user automatically. The user will only be required to sign in to this service on the device he is going to use. The plug-in can be provided by any third party (in most cases by a search engine) – this area is fully open to competition, too. In addition, the plug-in can automatically identify existing UPIs on web pages as we view them and turn them into miniature clickable icons or even pictures of users; clicking on such a picture will show a context menu related to the search services of a particular search engine. It can for example show us the pages we have both read, discussions in which we both participated, or even the entire history of our communication.

Tracing Authors

The easy part of unique identification is the “active”, or authoring part. The browser plug-in will sign everything we publish on the web. This is technically very easy: the browser contains a button that inserts our UPI into any post, article, or even email we write. However, the UPI syntax is so simple that users can also sign any document manually, much as they would add an email address to their posts.
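
As a rough illustration of signing and of a plug-in or crawler spotting UPIs in text, consider the sketch below. The regular expression is my own guess at what a UPI might look like in practice (a name, a `#`, then a dotted host name); the proposal does not define such a pattern, and `sign` and `find_upis` are hypothetical helpers.

```python
import re

# Assumed shape: name characters, '#', then a dotted host name such as
# id.example.org. The real UPI grammar is not specified in the proposal.
UPI_PATTERN = re.compile(r"\b[\w.-]+#(?:[\w-]+\.)+[A-Za-z]{2,}\b")


def sign(text: str, upi: str) -> str:
    """Append the author's UPI as a signature line."""
    return f"{text}\n{upi}"


def find_upis(text: str) -> list[str]:
    """Return every substring that looks like a UPI."""
    return UPI_PATTERN.findall(text)


post = sign("Nice article, thanks.", "jiri#id.example.org")
print(find_upis(post))
```

A pattern this loose would also match ordinary text that happens to contain a `#` followed by something host-like, so a real implementation would need a stricter rule.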

Tracing Readers

A much more difficult part of the system is tracing the reading behavior of a particular user. In an ideal world, every page would be signed with the UPI of its author and would at the same time contain the UPIs of all its readers – this highly formalized content would then be publicly available to all competing search engines. This would be an ideal form of the web!

This will of course not happen (most web pages are “read-only”), but we can achieve virtually the same thing by placing our reading history on any publicly visible page on the internet. Our plug-in will automatically add the URL of every page we visit to our “reading profile” – a web page with a specific syntax that can be located on any server we have write access to. The server that hosts this page will then be called the identity server. Over time, special identity servers will certainly appear on the internet, but to use the UPI system we don’t need anything more than one web page we have write access to.
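
The “specific syntax” of a reading profile is not defined yet, so the following is only a sketch of one possible plain-text layout: a marker line, an owner line, then one timestamped URL per visit. The marker `UPI-READING-PROFILE`, the `owner:` line and both function names are invented for this example.

```python
from datetime import datetime, timezone

HEADER = "UPI-READING-PROFILE"  # hypothetical marker line
OWNER_PREFIX = "owner: "        # hypothetical owner line prefix


def render_profile(upi, visits):
    """Render a profile as text; visits is a list of (datetime, url)."""
    lines = [HEADER, OWNER_PREFIX + upi]
    lines += [f"{when.isoformat()} {url}" for when, url in visits]
    return "\n".join(lines)


def parse_profile(text):
    """Parse text produced by render_profile back into (upi, visits)."""
    lines = text.splitlines()
    if len(lines) < 2 or lines[0] != HEADER or not lines[1].startswith(OWNER_PREFIX):
        raise ValueError("not a reading profile")
    upi = lines[1][len(OWNER_PREFIX):]
    visits = []
    for line in lines[2:]:
        stamp, url = line.split(" ", 1)
        visits.append((datetime.fromisoformat(stamp), url))
    return upi, visits


visits = [(datetime(2006, 4, 2, 12, 0, tzinfo=timezone.utc), "http://www.lupa.cz/")]
print(render_profile("jiri#id.example.org", visits))
```

Any search engine that knows this layout could fetch the page and rebuild the reading history with `parse_profile`.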

The Role of Web Search Engines

As soon as we have web content signed with the UPIs of its authors and related via reading profiles to its readers, the main part can begin. All this information is publicly available, so competition between different search engines in processing this valuable information may start. The main outcome of this competition will be the implementation of a “people similar to me” search function. Let us underline that the UPI concept will not become a competitive advantage of any particular web search service; it will serve all of them, both general and specialized, in creating better personalized search.

How Will Search Engines Process UPI?

The implementation of the “search people similar to me” function will revolve around the family of statistical cluster analysis methods. The algorithm might look like this:

For each user the search engine searches the web for the person’s UPI and for his reading profile. If multiple reading profiles are found, it resolves this conflict. The information found is then transformed into a multidimensional representation of the user that will serve as input for cluster analysis. This multidimensional representation is to a certain extent similar to the FOAF project, but it is much richer in information and, in addition, it evolves dynamically over time and so reflects changes in the user’s behavior. How exactly this transformation is carried out will become a competitive advantage among web search services.
After creating the representation for all UPI users, the cluster analysis starts. For each user the search engine calculates his “distance” to all other users. The detailed realization of the cluster analysis will become a field of competition, too. As a result, we get a two-dimensional matrix of mutual relations between users. This matrix will then become a direct, personalized successor to PageRank; it will be used for every search query the particular user carries out from then on, until the next analysis is performed.
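
The pairwise “distance” step can be sketched very simply if we assume, purely for illustration, that each user is represented by the set of URLs in his reading profile and that distance is the Jaccard distance between two such sets. Both choices are mine; the proposal deliberately leaves the representation and the metric to the competing search engines, and a real system would follow this step with an actual clustering method.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0.0 means identical reading sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(profiles):
    """Build the user-by-user matrix from {upi: set of URLs}."""
    users = sorted(profiles)
    return {u: {v: jaccard_distance(profiles[u], profiles[v]) for v in users}
            for u in users}


def most_similar(matrix, user, k=1):
    """Return the k users closest to the given user."""
    others = sorted((d, v) for v, d in matrix[user].items() if v != user)
    return [v for _, v in others[:k]]


profiles = {
    "alice#id.example.org": {"u1", "u2", "u3"},
    "bob#id.example.org": {"u2", "u3", "u4"},
    "carol#id.example.org": {"u8", "u9"},
}
matrix = distance_matrix(profiles)
print(most_similar(matrix, "alice#id.example.org"))
```

In this toy data alice and bob share two of four distinct pages, so their distance is 0.5, while carol shares nothing with either of them and sits at distance 1.0.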

Advantages of Openness

Because the UPI concept itself will not serve as a competitive advantage to any particular search service, all search services will be encouraged to optimize its implementation.

Competition among search engines will revolve around refining the following areas:

processing the raw content of web pages (web crawlers may for example be able to identify UPIs not only on the same page, but also within the same discussion thread, or analyze the frequency of communication between particular UPIs in participating e-mail or IM systems);
further processing of UPI-based information (for example, the “aging” of my reading or publishing history could be optimized for particular search scenarios – how should the weight of pages visited or created decrease over time?);
representation of user information in the form that will provide the best cluster analysis results;
the cluster analysis itself – it can be modified to best serve specific search queries.
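
One possible answer to the “aging” question raised in the list above is an exponential decay with a fixed half-life. This is only one candidate among many; the half-life of 30 days and the names `aged_weight` and `weighted_history` are assumptions made for the sketch.

```python
def aged_weight(age_days, half_life_days=30.0):
    """Weight of a visit that is age_days old; halves every half-life."""
    return 0.5 ** (age_days / half_life_days)


def weighted_history(visits, today, half_life_days=30.0):
    """Sum the aged weights per URL; visits is a list of (url, day_number)."""
    weights = {}
    for url, day in visits:
        weights[url] = weights.get(url, 0.0) + aged_weight(today - day, half_life_days)
    return weights


visits = [("http://a.example/", 100), ("http://a.example/", 70), ("http://b.example/", 40)]
print(weighted_history(visits, today=100))
```

A search engine tuned for news might pick a short half-life, while one tuned for finding experts might let old reading count for much longer.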

Transferability of UPIs

There are of course many things still to be resolved. What happens if I am not satisfied with my identity server? Or if the server stops its service entirely? There should be an easy procedure that lets me move to another identity server while keeping my existing UPI (as a UPI should be persistent over time). So I should be able to transfer my UPI to any other identity server; the original server will then be responsible for pointing to my new identity server.

What happens if the original server stops working entirely? Even this situation can be resolved. My new identity server will always display my UPI on my publicly visible reading profile. This page will be searchable by search engines, so they can find my UPI page wherever it is located (because the UPI is unique and the reading profile has a given syntax, we know the list of reading profiles for each UPI). It is the user’s responsibility (and also in his own interest) to ensure there is just one page with his profile on the internet. If there is more than one page, the user is informed about this problem by his search engine and is then asked to resolve the ambiguity. He can for example blacklist a fake “reading page” provided by a malicious server. Such a blacklist can in addition be shared by multiple search servers.
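
The conflict handling described here could be sketched as follows. I am assuming that the server part of a UPI is a bare host name, that blacklists hold host names, and that the engine prefers a profile hosted on the home server named in the UPI; none of these details are fixed by the proposal, and `resolve_profile` is a hypothetical helper.

```python
from urllib.parse import urlparse


def resolve_profile(upi, candidate_urls, blacklist=frozenset()):
    """Pick the reading profile to trust, or None if the user must decide."""
    home = upi.partition("#")[2]
    allowed = [u for u in candidate_urls
               if urlparse(u).hostname not in blacklist]
    if len(allowed) == 1:
        return allowed[0]
    # Several candidates remain: prefer the one on the home identity server.
    at_home = [u for u in allowed if urlparse(u).hostname == home]
    if len(at_home) == 1:
        return at_home[0]
    return None  # still ambiguous (or nothing found): ask the user


found = ["https://id.example.org/jiri", "https://evil.example.net/jiri"]
print(resolve_profile("jiri#id.example.org", found, blacklist={"evil.example.net"}))
```

Returning `None` corresponds to the case where the search engine informs the user and asks him to resolve the ambiguity himself.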

Motivation of Users

A nice feature of the system is that it motivates its users toward fair and consistent usage. Soon after we start to use the system, we start to benefit from improved web search. If a user for example decides to stop using his UPI and replace it with another one, he instantly loses all the information that he has built up while using his former identity. In other words, the longer and the more consistently I use my UPI, the more I benefit from it.

OK, there is one special question: what if the user wants to visit some xxx pages? He is certainly not willing to have this part of his history publicly available in his profile. But that is fine, too. The user is free to have more than one UPI if he wants to. His second, “xxx-UPI” will help him to find the xxx-content even better than before, while his “normal” UPI will help him in his normal work. By choosing the right UPI he actually submits additional information to the system. The user is of course also free to sign off from his UPI toolbar entirely when he wants to visit pages he doesn’t want to share with anybody else. In that case, he can browse the content entirely anonymously.

So it is the user’s own motivation to use the system as frequently as possible and in a very consistent way. Only this usage pattern will give him the best search benefits.

Conclusion

The main properties of the UPI system are openness and simplicity. It extends the current internet infrastructure and its proven algorithms, so it builds upon existing and verified systems. These properties maximize the system’s chances of mass adoption.
The system is not implemented yet, but I will be happy to assist anybody who is interested in implementing it.

Labels: internet community, internet search, unique personal identificator, upi

My FOAF Comments

The Friend of a Friend (FOAF) project is certainly worth a look. It attempts to provide some basic machinery to help us “tell the Web about the connections between the things that matter to us”. People are one special case of these “things”, so from this perspective, FOAF has similar motivation to UPI (Unique Personal Identificator).

I do however have one issue with this system. In my opinion, it is not feasible to put condensed personal information (relations to other people or activities) into one short static descriptor. It will never be exact; it stays static over time and still requires quite a lot of work from participating users. In my opinion, another approach makes better sense: uniquely identify the user and let him freely work and use the internet. As a result, enough information will be created over time. This information will then allow any (competing) web engine to create “FOAF-like” identificators on the fly, which will however evolve dynamically over time. In addition, these “dynamic FOAFs” can then be focused and optimized for a particular purpose.

I am sure that the UPI approach, which we are going to describe in the next post, can eventually fulfill the FOAF Goals, and can even strive for something more...

Labels: foaf, internet community, internet search, unique personal identificator, upi

Friday, March 24, 2006

Funny Profiles on Zoominfo

These days a lot of people are working hard on improving search on the internet. Today’s wealth of internet content is so vast that any method that would help people to differentiate quality content from the ballast (which is flooding the net everywhere) would be extremely beneficial. Well, we already have one such method – it is called PageRank. This method is based on the “universal popularity” of a particular site, expressed by the links that point to it. In other words, PageRank digs out the semantic information on popularity from the only available syntactic tool: web links. The PageRank algorithm is well proven and fine-tuned to the best possible extent. It is very hard to improve it any further.

Context digging

OK, so where can we move from this point? There are just two ways forward:

add an additional piece of syntax to the internet (which would help make the content more searchable), or
try to work better with the existing unstructured content.

Zoominfo can serve as a typical application of the second approach. It tries to dig the semantic information out of the context of keywords and automatically builds user profiles from publicly available news resources. To do this, it attempts to uniquely identify a particular person by searching for that person’s name in the context of other keywords that are automatically identified as being relevant to the person. This is a highly non-trivial thing to do, indeed!

The Reality Check

Let me share some examples with you. If we search Zoominfo for the most popular Czech singer, Karel Gott, we find eight (!) different profiles. The good news is that all are sort of related to the singer; the bad news, however, is that none is really correct, and seven of the eight don’t even mention that this person is a singer! Where is the problem? In its attempt to differentiate possible namesakes, the system actually splits information about one person into many different profiles. Of course, the balance is difficult to strike. On one hand, it is wise to suppose that if there is a lot of information about a particular person, part of it should be attributed to namesakes. On the other hand, this doesn’t always hold, particularly if the person is really popular.

From professor to journalist or landlord

However, this problem is even more general and is not limited to top celebrities. For example, prof. Vorisek, who is the Head of the Department of Information Technologies at the Prague Economic University, has 4 different profiles. Only profile No. 2 is sort of correct, but it is vastly incomplete, just quoting his name and school. We don’t even learn his position and get no idea about his other activities. In addition, some of the profiles are pretty funny. My favorite is the one that identifies Jiri as a sort of landlord of Zofin Palace. In reality, Zofin Palace is just the venue of a regular annual conference that Jiri’s department organizes.

The conclusion

I don’t think that people at Zoominfo don’t try hard. They certainly do. The problem is a more serious one: the task of processing the context of keywords exceeds the capabilities of today’s technologies, even if we limit this task to search in a particular context only (e.g., search for names and positions, as Zoominfo does). The idea itself is not bad, but it is too ambitious. Generally speaking, the complexity of this task is close to the problem of automatic text comprehension and translation. Zoominfo’s case just illustrates that we are not at this stage yet.

This is a very clear message that shouldn’t be overlooked. It is still very hard, and even counterproductive, to work automatically with unstructured information, even in very special scenarios. On the other hand, the syntax approach (PageRank) works well; the problem however is that its mechanism is already “milked to death”.

The solution?

To get better search results, we will have to add some additional syntax to the web. We should do it smartly – we cannot expect too much work from users, but at the same time we should make this web extension a clear advantage for everybody who joins.

There are already many applications that tackle the internet search problem this way – social networks can serve as a good example; thanks to their growing popularity they are in fact turning a significant part of the internet into a structured form! Another interesting example is the Friend of a Friend (FOAF) project.

We will however try to formulate a more general approach based on the Unique Personal Identificator (UPI). It is actually a nice paradox that Zoominfo (and others like it) would greatly benefit from such a system. On the other hand, if the internet had UPI, applications like Zoominfo would not be necessary at all...

Labels: internet community, internet recruitment, internet search, pagerank, social networks

Monday, March 13, 2006

What Will Supersede PageRank?

Today we live in a world ruled by PageRank. Every web page has its specific rank that says whether it is valuable to the internet community or not. There is however one problem. There is no such thing as a “universal” internet community per se. There are just people with different priorities, interests, expectations.

Although PageRank was a big success in its day (being able to distinguish between valuable content and the “mess” of the web), more and more people understand that the “majority” approach, which suits broadcast media well, is not suitable for the internet, which is by its nature an interactive medium, able to personally identify its users.

“I don’t want to only see the stories that most people are interested in, I want interesting stories.” (Dave’s Wordpress Blog)

OK, this is a reasonable expectation. But how do we move on? By replacing a “universal” PageRank with a “personalized” one?

A “personalized” PageRank

PageRank is a brilliant piece of thinking. It was able to make use of the only semantic information that is embedded in the web syntax (the links) to evaluate the quality of pages. By processing link statistics we can understand which pages are most linked to, and this in fact gives us access to the vast amount of work done by people who have already read and evaluated these pages and created links to those they considered valuable.
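
For readers who have never seen it spelled out, here is a simplified textbook-style sketch of the idea: rank flows along links, and the computation is repeated until it settles. This is not the production algorithm of any search engine; the damping factor of 0.85, the fixed iteration count and the handling of pages without outgoing links are the usual textbook assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration; links is {page: [targets]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


web = {"a": ["c"], "b": ["c"], "c": ["a"]}
for page, score in sorted(pagerank(web).items()):
    print(page, round(score, 3))
```

In this tiny web, page c collects links from both a and b and so ends up ranked highest, while b, which nobody links to, ends up lowest.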

But the links are already “milked to death” and there is nothing else in the web syntax that would give us an additional clue to the quality of web content. So any attempt to move forward with the quality of web search would require introducing some new piece of syntax to the web, or, put simply, something that would make the web content more structured. Yes, it is a tremendous task, but not impossible. And in fact, it is already happening.

Towards a more structured web

There are two possible approaches to adding more structure to the web:

Growing popularity and thus mass penetration of structured applications, like social networks.
Introducing a new piece to the web’s syntax that would be seamlessly integrated into the existing web. My candidate: the Unique Personal Identificator (UPI).

These are quite different approaches; while the first one is based on mass adoption of structured applications, the second one is based on adoption of simple additional syntax by users. Let’s start with the first one for now.

Social network as a search engine

A social network is in fact an application that consists of

a specialized web search engine coupled with
a specialized web hosting service.

This approach has a clear motivation: the specialized search engine greatly benefits from being able to work with predefined structured information. So, for example, if we assume that the name is always filled in a field called “name”, the company name in the appropriate field “company” (and is in addition related to the unique ticker symbol), the education degree and country are selected from a pre-filled list etc., we are able to provide far better and far more relevant search results for our predefined queries than any full-text based approach can. So we are just porting the good old theory from traditional database systems to the internet. Ideally, the entire web should be structured this way!
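
The difference is easy to see in a toy example. The two profiles, the field names and both search functions below are made up for illustration; they only contrast matching anywhere in the text with matching in a named field.

```python
PROFILES = [
    {"name": "Jan Novak", "company": "Baker Consulting", "country": "CZ"},
    {"name": "Anna Baker", "company": "Acme", "country": "UK"},
]


def full_text_search(records, term):
    """Match the term anywhere in any field, as a full-text engine would."""
    term = term.lower()
    return [r for r in records
            if any(term in str(value).lower() for value in r.values())]


def field_search(records, **criteria):
    """Match exact values in named fields, as a database query would."""
    return [r for r in records
            if all(r.get(field) == value for field, value in criteria.items())]


print([r["name"] for r in full_text_search(PROFILES, "baker")])
print([r["name"] for r in field_search(PROFILES, company="Baker Consulting")])
```

The full-text query for “baker” cannot tell a person named Baker from a company named Baker and returns both profiles; the field query returns only the person who works at Baker Consulting.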

Growing popularity of social networks

But now comes the interesting part. The web is in fact becoming more structured, thanks to these applications. Because search in social networks really works (well, structured search has worked in traditional databases since the 1960s, so why not here), these applications become useful and thus popular. The biggest social networks today contain tens of millions of users and put the profiles of these users on the web. Thanks to this development, a significant piece of internet content is becoming structured in a very formal, traditional “database way”. We can even say that the web is becoming a more organized place.

Wider consequences of social networks

So there are now millions of users on the web who took the time to create their personalized and structured profiles, and who keep these structured profiles updated. This is an amount of work that cannot be overlooked. In fact, it could already be compared (at least to a certain extent) to the effort that web users invested in linking their pages. This growing piece of structured web content will serve as a special (and welcome!) input to universal web search engines. It can greatly improve their search capabilities in the areas where applications like social networks force people to use “strict syntax”.

Vision

This in fact means nothing other than the introduction of new syntax rules to certain application areas of the web. It is fair to expect that there will be more and more applications like social networks over time. All these applications will have one thing in common: they will all motivate users to use the internet in a predefined, highly structured way. Whether this results in structured personal profiles, product descriptions, descriptions of calendar events, or something else, all this information will turn the internet into a more structured base of data. The amount of structured content on the internet will grow and will become a goldmine for any search engine of the future. As a result, traditional full-text web search will be complemented by more efficient tools in all areas where possible. Thanks to this development, search will certainly improve. But for a really significant improvement, we should dethrone PageRank from its role as the sole and universal expert for evaluating information relevance.

PageRank Replacement?

To do this, we should implement a shift from evaluating pages to evaluating users. This would be a true revolution in web search, allowing us to search for personally relevant information.

However, as we already said, this would require introducing a new piece of syntax to the entire web. A very difficult concept, indeed! Could we find a way to persuade users and developers to adopt this new piece of web syntax? Let us think about it next time.

Labels: internet, internet community, internet search, pagerank