History: TikiFest2013-Gatineau-NLP
Table of contents
- Where
- Who
- When
- What
- 2013-06-03 Presentation about Natural Language Processing (NLP) in Tiki
- Summary
- Use Cases
- Eliminate Duplicates
- Find similar or related to a particular interest
- Use Case: "See also"
- Use Case: "You might be interested in the following job postings|resume"
- Use Case: "You might want to meet the following people"
- Use Case: "People who bought this product also bought these"
- Use Case: "The product you just purchased is very similar to the following"
- Use Case: "People who visited this page also visit those"
- Use Case: "The following people seem to visit|like the same kinds of pages as you"
- Use Case: "The following groups|pages might want or need to meet|reference each other"
- Use Case: "You might want to supplement your keywords with those"
- Use Case: "The following Tiki sites near you might be of interest to yours"
- Organizing content
- Use Cases
- Infrastructure needed
- What to implement first
- Pictures
- Links
Where
Gatineau, Canada
Who
- Alain Désilets
- Matthieu Hermet
- Marc Laporte
When
June 2-3-4-5, 2013
What
Discussion about Natural Language Processing (NLP) possibilities in Tiki. The result was a list of Use Cases for NLP and IR in Tiki.
2013-06-03 Presentation about Natural Language Processing (NLP) in Tiki
In French, with Alain Désilets, Matthieu Hermet and Nicolas Rosenfeld (filmed by Marc Laporte).
Summary
This was a brainstorming session to see how more advanced Natural Language Processing capabilities could be used in Tiki.
Use Cases
We came up with a large number of Use Cases, which fall into the three clusters below.
Eliminate Duplicates
Large Tiki sites like *.tiki.org, or large corporate wikis, often contain pages that are near duplicates. Often a person creates a page or uploads a document without realising that someone has already created such a page or document.
These kinds of duplicates can be eliminated either at the moment they are created (ideally), or after the fact.
Use Case: "The page you just created may be a duplicate of these"
When a user saves a new page for the first couple of times, the system looks for the existing pages that are most similar to it. If some of them are too close for comfort, it signals that to the user, who can then merge the new page with an existing one if needed.
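A minimal sketch of such a check on save, assuming page text is available as plain strings. The similarity measure (cosine over raw word counts), the 0.8 threshold and the function names are illustrative assumptions, not something decided at the session.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def possible_duplicates(new_text, existing_pages, threshold=0.8):
    """Return (page_name, score) pairs at or above the threshold, best first."""
    new_vec = term_counts(new_text)
    scored = [(name, cosine(new_vec, term_counts(text))) for name, text in existing_pages.items()]
    return sorted([s for s in scored if s[1] >= threshold], key=lambda s: s[1], reverse=True)


# Hypothetical page contents, for illustration only.
pages = {
    "Blog": "how to create a blog and publish blog posts",
    "Forum": "how to moderate a forum",
}
hits = possible_duplicates("how to create a blog and publish posts", pages)
```

Here only the "Blog" page is reported, since the "Forum" page shares just a few common words with the new text. A real threshold would have to be tuned on actual site content.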
Use Case: "The document you uploaded may be a duplicate of these"
Very similar to the previous use case, except that it applies to uploaded documents.
Duplicate elimination may be even more useful for uploads, because uploaded files are more "opaque", which makes creating a duplicate more likely. By "opaque", I mean that it's harder to search through the content of uploads, or to view summaries of them. So it's more likely that you will not realize that the content you are about to upload is already there.
Use Case: "Show me the most similar pages that match query X"
This is for eliminating duplicates after the fact. Say you want to check for duplicates in the documentation about blogs. You go to doc.tiki.org and ask to see the most similar pairs of pages that contain the word "blog".
You can then inspect the pairs one by one, starting from the top, to see if any of them might need to be merged.
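The after-the-fact review could be sketched as follows: keep only the pages that contain the query word, score every pair, and list the pairs from most to least similar. Jaccard overlap of word sets is used here purely as a simple stand-in for whatever similarity measure is eventually chosen; the page names and texts are made up.

```python
import re
from itertools import combinations


def words(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_pairs(pages, query_word, top_n=10):
    """Rank pairs of pages containing query_word by similarity, best first."""
    word_sets = {name: words(text) for name, text in pages.items()}
    matching = {n: w for n, w in word_sets.items() if query_word.lower() in w}
    pairs = [(jaccard(matching[x], matching[y]), x, y) for x, y in combinations(sorted(matching), 2)]
    return sorted(pairs, reverse=True)[:top_n]


# Hypothetical documentation pages.
pages = {
    "Blog": "create a blog post",
    "Blogs": "create a new blog post",
    "Blog RSS": "rss feed for a blog",
    "Forum": "create a forum thread",
}
ranked = most_similar_pairs(pages, "blog")
```

With this data the pair ("Blog", "Blogs") comes out on top, and "Forum" never appears because it does not contain the query word. Scoring all pairs is quadratic in the number of matching pages, so a large site would need the search engine to do the heavy lifting.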
Find similar or related to a particular interest
Given that I am interested in certain things, I want the system to show me other similar things that might be of interest to me.
Use Case: "See also"
When reading a page|article|tracker|upload you get a module that says "See also". It points to a short list of the pages|articles|trackers|uploads most similar to the one you are looking at.
Use Case: "You might be interested in the following job postings|resume"
Marc mentioned a customer that allows people to post resumes or jobs. When someone enters a resume, it would be nice if they saw a list of the job postings that most closely match it.
Conversely, when an employer posts a job, it would be nice if they saw a list of the resumes that most closely match the job.
Use Case: "You might want to meet the following people"
When a new user fills in their user profile, it would be nice if they saw a list of the users with the most similar profiles.
Use Case: "People who bought this product also bought these"
When using Tiki for e-commerce, it would be nice if a product page could list the other products bought by people who bought that product.
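One simple way to get such a list is to count which products share an order with the product in question. The sketch below assumes each order is available as a set of product names; that data shape and the function name are assumptions for illustration.

```python
from collections import Counter


def also_bought(orders, product, top_n=3):
    """Products that most often share an order with `product`, most frequent first."""
    counts = Counter()
    for basket in orders:
        if product in basket:
            counts.update(item for item in set(basket) if item != product)
    return [item for item, _ in counts.most_common(top_n)]


# Hypothetical order history.
orders = [
    {"tent", "sleeping bag", "stove"},
    {"tent", "sleeping bag"},
    {"tent", "lantern"},
    {"stove", "kettle"},
]
suggestions = also_bought(orders, "tent")
```

For "tent", the sleeping bag ranks first because it appears in two of the three tent orders. The same counting works for the page-visit variants if "order" is replaced by "set of pages one user visited".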
Use Case: "The product you just purchased is very similar to the following"
Very similar to the above, except that the similarity is based on the actual product description, as opposed to the social use of the products (i.e. who purchased what).
Use Case: "People who visited this page also visit those"
Very similar to the "See also" use case, except that here, the similarity is based on Social Use of the pages (who looks at what) rather than the content of the page.
Use Case: "The following people seem to visit|like the same kinds of pages as you"
Similar to the "You might want to meet the following people" use case, except that affinity is not based on the content of the user profiles, but rather on their usage of the site (what pages they visit).
Use Case: "The following groups|pages might want or need to meet|reference each other"
In large organizations, it's not uncommon for two groups to be working on similar things without knowing about each other.
It might be nice if you could identify groups of people or pages which are very similar, but are not "aware" of each other. In the case of pages, awareness means referencing each other, whereas for groups of people, it might mean being part of a common category or users group.
Use Case: "You might want to supplement your keywords with those"
Whenever you enter keywords to describe a thing in Tiki, the system could suggest additional keywords that seem similar or related to them.
Use Case: "The following Tiki sites near you might be of interest to yours"
This is in the context of TikiConnect. Basically, when you create a new Tiki site, you can opt into TikiConnect. This means that your site will be sending some information to the Tiki "Mother" site, and the mother site will be able to find connections between your site and other similar sites.
Organizing content
Often, you want to organize content by assigning it categories, or by splitting it into mutually exclusive clusters.
Use Case: "Split conference participants into tables of 10"
Say you are using a Tiki to organize a conference. Each participant filled in a user profile. You want to set the seating arrangement at lunch, so that people with similar interests will be at the same table. A sheet would be printed out at each table, with the list of participants and their keywords. Thus, the conversations will be very interesting. "I see Bob is interested in X and Y. So am I! Which one of you is Bob?".
The event lasts 3 days, so the system would seat you with new people every day.
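A rough sketch of the seating step for a single day, assuming each profile has already been reduced to a set of keywords. The greedy strategy (seed a table with one person, then fill it with the most similar people still unseated) is just one simple option, and it does not yet handle the rotation over several days.

```python
def jaccard(a, b):
    """Keyword overlap: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def seat_tables(profiles, table_size=10):
    """Greedy grouping: seed each table with one person, then add the most similar people left."""
    unseated = dict(sorted(profiles.items()))
    tables = []
    while unseated:
        seed, seed_keywords = next(iter(unseated.items()))
        del unseated[seed]
        closest = sorted(unseated, key=lambda name: jaccard(seed_keywords, unseated[name]), reverse=True)
        table = [seed] + closest[: table_size - 1]
        for name in table[1:]:
            del unseated[name]
        tables.append(table)
    return tables


# Hypothetical participants and keywords.
profiles = {
    "Alice": {"nlp", "search"},
    "Bob": {"nlp", "search", "wiki"},
    "Carol": {"e-commerce", "payments"},
    "Dave": {"payments", "e-commerce", "shipping"},
}
tables = seat_tables(profiles, table_size=2)
```

With tables of two, Alice is seated with Bob and Carol with Dave. A proper clustering algorithm would likely give more balanced tables, since the greedy version can leave the last table with whoever is left over.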
Use Case: "Split work to do on this wiki among 15 volunteers"
Say you have 15 people who volunteer to clean up doc.tiki.org.
It would be nice to split the whole site into 15 clusters of approximately the same size, with each cluster corresponding to the interests of a particular volunteer.
Use Case: "Where should this go?"
You create a new page and you want to know which category to put it in. When you save it, the system automatically suggests the most likely categories.
Infrastructure needed
What would we need to support the above use cases?
More like this (mlt) functionality
Most of the use cases assume that, given a particular "thing" in the wiki (a page, a tracker, a user profile, etc.), we are able to find the most similar things in the site.
So we need a class that can do this. It might be nice to have it be a plugin so it could be embedded in wiki pages.
The similarity metric used by this mlt plugin may be based on the actual content of the thing, or based on how people in the community "use" that thing (i.e. visit, modify, purchase, like, etc...).
It would be nice however if this kind of social data could be codified as a kind of "pseudo-content" field associated with the "thing". That way, doing mlt based on social use would just involve telling the mlt plugin to use a particular field of the thing to compute similarity, and it wouldn't have to know that this field is not actual content, but rather meta data about social use of the thing.
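To illustrate the pseudo-content idea: if the visitors of a page are stored as a space-separated field of user tokens, the same similarity routine can be pointed at either field. The field names ("content", "visited_by") and the Jaccard scoring below are assumptions made for the sketch.

```python
def field_tokens(thing, field):
    """Split one field of a thing into a set of lowercase tokens."""
    return set(str(thing.get(field, "")).lower().split())


def more_like_this(target, things, field, top_n=3):
    """Rank the other things by token overlap (Jaccard) on the chosen field."""
    target_tokens = field_tokens(target, field)
    scored = []
    for thing in things:
        if thing is target:
            continue
        tokens = field_tokens(thing, field)
        union = target_tokens | tokens
        score = len(target_tokens & tokens) / len(union) if union else 0.0
        scored.append((score, thing["name"]))
    return [name for score, name in sorted(scored, reverse=True)[:top_n]]


# Hypothetical pages; "visited_by" is social-use data encoded as pseudo-content.
pages = [
    {"name": "Blog", "content": "publish blog posts", "visited_by": "user_3 user_7 user_9"},
    {"name": "Articles", "content": "publish news articles", "visited_by": "user_1 user_2"},
    {"name": "Trackers", "content": "build forms and databases", "visited_by": "user_3 user_7"},
]
by_content = more_like_this(pages[0], pages, field="content")
by_usage = more_like_this(pages[0], pages, field="visited_by")
```

By content, "Articles" is the closest match to "Blog"; by usage, "Trackers" is, because the same users visit both. The routine itself never needs to know which field holds real content.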
It seems that ElasticSearch already has mlt functionality. We should take advantage of that.
As of 2013-06-06, Matthieu Hermet is looking into this.
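For reference, here is roughly what a request body for the ElasticSearch "more_like_this" query looks like, built as a Python dict. The parameter names follow the current query DSL and may differ from the version available in 2013; the index name, document id and field names are hypothetical and would depend on how Tiki indexes its content.

```python
import json


def mlt_query(index, doc_id, fields, size=5):
    """Build a search body asking for documents similar to an already indexed one."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": fields,                            # fields used to compute similarity
                "like": [{"_index": index, "_id": doc_id}],  # the "thing" to compare against
                "min_term_freq": 1,                          # keep terms that occur only once (helps short texts)
                "max_query_terms": 25,                       # cap on the number of terms selected
            }
        },
    }


# Hypothetical index name, document id and field names.
body = mlt_query("tiki_main", "wiki page-HomePage", ["title", "contents"])
print(json.dumps(body, indent=2))
```

Switching from content-based to usage-based similarity would then only mean passing a different list of fields.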
More like THESE functionality
Some of the Use Cases involve classifying or clustering groups of documents. In those cases, we don't just want to find individual things that are similar to a given individual thing. Rather, we may want to find things like:
- groups of things similar to a given individual thing
- individual things similar to a given group of things
- groups of things similar to a given group of things
As far as we know, ElasticSearch does not have such a "More like THESE" functionality, but it should be possible to build on top of the "More like this" functionality to create it.
For example, you might be able to create pseudo-documents for each group and index them with ElasticSearch. The content of the pseudo-document would be the concatenation of the content of each of its members.
Note: This may be inefficient for large groups. For example, if you have a group of 1000 documents, then adding a single document would mean:
- Retrieve the content of those 1000 documents
- Concatenate them with the new document to be added
- Reindex this pseudo-document
But as a first pass, it should do the trick. For larger groups, we may have to plug into ElasticSearch at a lower level, to be able to update a group's word frequency model incrementally.
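The incremental alternative can be sketched outside of ElasticSearch as a running word-frequency model per group: adding or removing one document only touches that document's words, and the result matches what re-counting the concatenated pseudo-document would give. The class and method names are illustrative, and how this would hook into ElasticSearch itself is left open.

```python
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


class GroupModel:
    """Running word-frequency model for a group of documents."""

    def __init__(self):
        self.counts = Counter()
        self.size = 0

    def add(self, text):
        """Add one document without re-reading the other members."""
        self.counts.update(term_counts(text))
        self.size += 1

    def remove(self, text):
        """Remove one document that was previously added."""
        self.counts.subtract(term_counts(text))
        self.counts = +self.counts  # drop terms whose count fell to zero
        self.size -= 1


# Hypothetical group members.
docs = ["tiki blog setup", "blog comments", "tiki trackers"]
group = GroupModel()
for doc in docs:
    group.add(doc)

# Same counts as indexing the concatenation of all members.
same_as_concatenation = group.counts == term_counts(" ".join(docs))
```

For a group of 1000 documents, adding one more costs a single `add` call instead of retrieving and re-indexing all 1000.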
Automatic clustering algorithms
Some of the use cases assume that you can split an otherwise unorganized bag of things, into clusters.
I don't think ElasticSearch has that. However, it may be that something like Maui can do it, and be plugged on top of ElasticSearch.
Semantic augmentation of content
The ElasticSearch mlt functionality may not work that well with short texts (e.g. user profiles), because the chances of having overlapping terms between such texts are lower.
But it could be that we can augment the terms present in such short texts, with terms that are somewhat "implied". This can be done with an Explicit Semantic Analysis framework (which is also included in Maui apparently).
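The effect of augmentation can be shown with a toy example. The sketch below uses a hand-written table of related terms as a stand-in; it is not Explicit Semantic Analysis, which would derive the implied terms or concepts from a large reference corpus. The table contents and names are made up.

```python
def augment(terms, related_terms):
    """Return the original terms plus any terms the table lists as related to them."""
    expanded = set(terms)
    for term in terms:
        expanded.update(related_terms.get(term, ()))
    return expanded


# Hypothetical table of implied terms.
related_terms = {
    "nlp": {"linguistics", "text"},
    "parsing": {"linguistics", "syntax"},
}

# Two very short profiles with no term in common.
alice = augment({"nlp"}, related_terms)
bob = augment({"parsing"}, related_terms)
overlap = alice & bob
```

Before augmentation the two profiles share nothing; afterwards they overlap on "linguistics", which gives a term-based similarity measure something to work with.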
What to implement first
We should start by going for Use Cases that
- Have a high potential value
- Yet are implementable with bare-bones ElasticSearch mlt functionality
Examples include:
- Functionality for identifying possible duplicates
- "See also" Use Case, at least the one that is based on the actual content of the page, not on the social use of it.
Pictures
Links
Next TikiFest: TikiFest NLP 12
See: tv:TikiFest NLP 11