History: TikiFest2013-Gatineau-NLP

Preview of version: 10

Where

Gatineau, Canada

Who

Alain Désilets
Matthieu Hermet
Marc Laporte

When

June 2-3-4-5, 2013

What

Discussion about Natural Language Processing (NLP) possibilities in Tiki. The result was a list of Use Cases for NLP and IR in Tiki.

2013-06-03 Presentation about Natural Language Processing (NLP) in Tiki

In French, with Alain Désilets, Matthieu Hermet and Nicolas Rosenfeld (filmed by Marc Laporte).

Summary

This was a brainstorming session to see how more advanced Natural Language Processing capabilities could be used in Tiki.

Use Cases

We came up with a great number of Use Cases, which fall into the 3 clusters below.

Eliminate Duplicates

Large Tiki sites like *.tiki.org, or large corporate wikis, often contain pages that are close duplicates. Often a person creates a page or uploads a document, without realising that someone already has created such a page or document.

These kinds of duplicates can be eliminated either at the moment they are created (ideally), or after the fact.

Use Case: "The page you just created may be a duplicate of these"

The first few times a user saves a new page, the system looks for the pages that are most similar to it. If some of them are too close for comfort, it signals that to the user, who can then merge the new page with an existing one if needed.
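As a rough illustration of the check described above, here is a minimal sketch that compares a newly saved page with existing pages using cosine similarity over word counts. The `existing_pages` dictionary, the function names and the 0.8 threshold are all assumptions made for the example; a real implementation would more likely delegate the comparison to the search index.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def possible_duplicates(new_text, existing_pages, threshold=0.8):
    """Return (page_name, score) pairs at or above the threshold, best first.

    existing_pages is a hypothetical dict of page name -> page text;
    the default threshold is an arbitrary illustrative value.
    """
    new_vec = term_counts(new_text)
    scored = [(name, cosine(new_vec, term_counts(text)))
              for name, text in existing_pages.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

pages = {
    "Blog": "how to configure a blog in tiki",
    "Forum": "setting up forums and moderation",
}
hits = possible_duplicates("how to configure a blog in tiki wiki", pages)
```

Here `hits` contains only the "Blog" page, since the "Forum" page shares no words with the new text.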

Use Case: "The document you uploaded may be a duplicate of these"

Very similar to the previous use case, except that it applies to uploaded documents.

Duplicate elimination may be even more useful for uploads, because uploaded files are more "opaque", which makes creating a duplicate more likely. By "opaque", I mean that it's harder to search through the content of uploads or to view summaries of them, so you are less likely to realize that the content you are about to upload is already there.

Use Case: "Show me the most similar pages that match query X"

This is for eliminating duplicates after the fact. Say, you want to check for duplicates in the documentation about blogs. You go to doc.tiki.org, and ask to see the most similar pairs of pages that contain the word "blog".

You can then inspect the pairs one by one, starting from the top, to see if any of them might need to be merged.
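The pair ranking could be sketched as follows. This is only an illustration: it uses Jaccard similarity over word sets as a stand-in for whatever metric the search engine would provide, and the `pages` dictionary and function names are hypothetical. Note that comparing every pair is quadratic in the number of matching pages, so the query filter matters.

```python
import re
from itertools import combinations

def words(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar_pairs(pages, query, top_n=10):
    """Rank pairs of pages containing the query word, most similar first.

    pages is a hypothetical dict of page name -> page text.
    Returns a list of (score, name_a, name_b) tuples.
    """
    matching = {name: words(text) for name, text in pages.items()
                if query.lower() in words(text)}
    pairs = [(jaccard(matching[a], matching[b]), a, b)
             for a, b in combinations(sorted(matching), 2)]
    return sorted(pairs, reverse=True)[:top_n]

pages = {
    "Blog": "create a blog post",
    "Blogs": "create a blog post quickly",
    "Blog RSS": "blog rss feed",
    "Forum": "forum post",
}
ranked = most_similar_pairs(pages, "blog")
```

The first entry of `ranked` is the pair ("Blog", "Blogs"), which is the one a reviewer would inspect first.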

Find similar or related to a particular interest

Given that I am interested in certain things, I want the system to show me other similar things that might be of interest to me.

Use Case: "See also"

Use Case: "You might be interested in the following job postings|resume"

Marc mentioned a customer that allows people to post resumes or jobs. When someone enters a resume, it would be nice if they saw a list of the job postings that most closely match it.

Conversely, when an employer posts a job, it would be nice if they saw a list of the resumes that most closely match the job.

Use Case: "You might want to meet the following people"

When a new user fills in a user profile, it would be nice if they saw a list of the users with the most similar profiles.

Use Case: "People who bought this product also bought these"

When using Tiki for e-commerce, it would be nice if we could recommend products based on what other buyers of the same product purchased.

Use Case: "The product you just purchased is very similar to the following"

Very similar to the above, except that the similarity is based on the actual product description, as opposed to the social use of the products (i.e. who purchased what).

Use Case: "People who visited this page also visit those"

Very similar to the "See also" use case, except that here, the similarity is based on Social Use of the pages (who looks at what) rather than the content of the page.

Use Case: "The following people seem to visit|like the same kinds of pages as you"

Similar to the "You might want to meet the following people" use case, except that affinity is not based on the content of the user profiles, but rather on their usage of the site (what pages they visit).

Use Case: "The following groups|pages might need to meet|reference each other"

In large organizations, it's not uncommon for two groups to be working on similar things without knowing about each other.

It might be nice if you could identify groups of people or pages which are very similar, but are not "aware" of each other. In the case of pages, awareness means referencing each other, whereas for groups of people, it might mean being part of a common category or users group.

Use Case: "You might want to supplement your keywords with those"

Whenever you enter keywords to describe a thing in Tiki, the system could suggest additional keywords that seem similar or related to them.

Use Case: "The following Tiki sites near you might be of interest to yours"

This is in the context of TikiConnect. Basically, when you create a new Tiki site, you can opt into TikiConnect. This means that your site will be sending some information to the Tiki "Mother" site, and the mother site will be able to find connections between your site and other similar sites.

Organizing content

Often, you want to organize content by assigning it to categories, or by splitting it into mutually exclusive clusters.

Use Case: "Split conference participants into tables of 10"

Say you are using a Tiki to organize a conference. Each participant filled in a user profile. You want to set the seating arrangement at lunch, so that people with similar interests will be at the same table. A sheet would be printed out at each table, with a list of participants and their keywords. Thus, the conversations will be very interesting. "I see Bob is interested in X and Y. So am I! Which one of you is Bob?".

The event lasts 3 days so the system would sit you with new people every day.

Use Case: "Split work to do on this wiki among 15 volunteers"

Say you have 15 people who volunteer to clean up doc.tiki.org.

It would be nice to split the whole site into 15 clusters of approximately the same size, with each cluster corresponding to the interests of a particular volunteer.

Use Case: "Where should this go?"

You create a new page and you want to know which category to put it in. When you save it, the system automatically suggests the most likely categories.

Infrastructure needed

What would we need to support the above use cases?

More like this (mlt) functionality

Most of the use cases assume that, given a particular "thing" in the wiki (a page, a tracker, a user profile, etc.), we are able to find the most similar things in the site.

So we need a class that can do this. It might be nice to have it be a plugin so it could be embedded in wiki pages.

The similarity metric used by this mlt plugin may be based on the actual content of the thing, or based on how people in the community "use" that thing (i.e. visit, modify, purchase, like, etc...).

It would be nice, however, if this kind of social data could be codified as a kind of "pseudo-content" field associated with the "thing". That way, doing mlt based on social use would just involve telling the mlt plugin to use a particular field of the thing to compute similarity, and it wouldn't have to know that this field is not actual content, but rather metadata about social use of the thing.
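One way to picture such a pseudo-content field is sketched below: visit events are turned into a string of visitor tokens per page, so that a text-similarity engine treats shared visitors like shared words. The visit log format and the `user_` token prefix are assumptions made for the example, not an existing Tiki structure.

```python
from collections import defaultdict

def visitors_pseudo_content(visit_log):
    """Turn (user, page) visit events into a per-page pseudo-content string.

    Each distinct visitor becomes one token such as "user_alice".
    visit_log is a hypothetical iterable of (user, page) tuples.
    """
    visitors = defaultdict(set)
    for user, page in visit_log:
        visitors[page].add(user)
    return {page: " ".join(f"user_{u}" for u in sorted(users))
            for page, users in visitors.items()}

log = [("alice", "Blog"), ("bob", "Blog"), ("alice", "Forum"), ("alice", "Blog")]
fields = visitors_pseudo_content(log)
# fields == {"Blog": "user_alice user_bob", "Forum": "user_alice"}
```

The resulting strings could be indexed in their own field next to the real content, and the similarity computation pointed at that field when social similarity is wanted. Repeat visits are deliberately collapsed here; counting them would be a different design choice.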

It seems that ElasticSearch already has an mlt functionality. We should take advantage of that.

As of 2013-06-06, Matthieu Hermet is looking into this.
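For orientation, the sketch below builds the request body for a `more_like_this` query as it is written in recent Elasticsearch releases. The index name `tiki_main`, the field name `contents` and the document id are hypothetical, and the exact parameter names should be checked against the Elasticsearch version in use, since the syntax available in 2013 differed.

```python
import json

def more_like_this_body(index, doc_id, fields, size=5):
    """Build a search body asking for documents similar to an indexed one.

    The index, doc_id and fields values are supplied by the caller;
    nothing here reflects Tiki's actual index layout.
    """
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": fields,
                # Refer to an already indexed document rather than raw text.
                "like": [{"_index": index, "_id": doc_id}],
                # Low thresholds so that short documents still produce terms.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    }

body = more_like_this_body("tiki_main", "wiki-42", ["contents"])
payload = json.dumps(body)
```

The serialized `payload` would then be sent to the index's `_search` endpoint. Switching `fields` to a social pseudo-content field would give usage-based similarity with the same query shape.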

More like THESE functionality

Some of the Use Cases involve classifying or clustering groups of documents. In those cases, we don't just want to find individual things that are similar to a given individual thing. Rather, we may want to find things like:

groups of things similar to a given individual thing
individual things similar to a given group of things
groups of things similar to a given group of things

As far as we know, ElasticSearch does not have such a "More like THESE" functionality, but it should be possible to build on top of the "More like this" functionality to create it.

For example, you might be able to create pseudo-documents for each group and index them with ElasticSearch. The content of the pseudo-document would be the concatenation of the content of each of its members.

Note: This may be inefficient for large groups. For example, if you have a group of 1000 documents, then adding a single document would mean:

Retrieve the content of those 1000 documents
Concatenate them with the new document to be added
Reindex this pseudo-document

But as a first pass, it should do the trick. For larger groups, we may have to plug into ElasticSearch at a lower level, to be able to update a group's word frequency model incrementally.
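The incremental alternative mentioned above can be pictured with a small in-memory sketch: the group keeps a running word-frequency model, so adding or removing one document only touches that document's words. This is not an ElasticSearch plug-in; the `GroupModel` class and its methods are hypothetical names used only to show the idea.

```python
import re
from collections import Counter

class GroupModel:
    """Running word-frequency model for a group of documents."""

    def __init__(self):
        self.term_counts = Counter()   # aggregated counts for the whole group
        self.members = {}              # doc_id -> that document's own counts

    def add(self, doc_id, text):
        """Add (or replace) one document without rereading the others."""
        if doc_id in self.members:
            self.remove(doc_id)
        counts = Counter(re.findall(r"\w+", text.lower()))
        self.members[doc_id] = counts
        self.term_counts.update(counts)

    def remove(self, doc_id):
        """Remove one document's contribution from the group model."""
        counts = self.members.pop(doc_id)
        self.term_counts.subtract(counts)
        self.term_counts = +self.term_counts   # drop terms whose count fell to zero

group = GroupModel()
group.add("a", "blog post")
group.add("b", "blog feed")
group.remove("a")
# group.term_counts == Counter({"blog": 1, "feed": 1})
```

The trade-off is memory: per-document counts are kept so that removals are exact. With the concatenation approach, the same change would require re-reading every member of the group.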

Automatic clustering algorithms

Some of the use cases assume that you can split an otherwise unorganized bag of things, into clusters.

I don't think ElasticSearch has that. However, it may be that something like Maui can do it, and plug on top of ElasticSearch.
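To make the idea of splitting a bag of things into groups concrete, here is a deliberately simple greedy heuristic with a fixed group size. It is not what Maui or any clustering library does, and it makes no claim to produce optimal groups; the `profiles` dictionary of keyword sets is a hypothetical input.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def split_into_groups(profiles, group_size):
    """Greedy, size-capped grouping of items by keyword overlap.

    profiles is a hypothetical dict of name -> set of keywords.
    Repeatedly takes the first remaining item as a seed and fills its
    group with the remaining items most similar to that seed.
    """
    remaining = sorted(profiles)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        ranked = sorted(remaining,
                        key=lambda name: jaccard(profiles[seed], profiles[name]),
                        reverse=True)
        members = [seed] + ranked[:group_size - 1]
        remaining = [name for name in remaining if name not in members]
        groups.append(members)
    return groups

profiles = {
    "ann": {"nlp", "search"},
    "bob": {"nlp", "wiki"},
    "cat": {"cooking"},
    "dan": {"cooking", "baking"},
}
tables = split_into_groups(profiles, group_size=2)
# tables == [["ann", "bob"], ["cat", "dan"]]
```

A heuristic like this keeps group sizes equal by construction, which matters for the seating and volunteer use cases, but later groups get whatever is left over and may be less coherent.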

Semantic augmentation of content

The ElasticSearch mlt functionality may not work that well with short texts (e.g. user profiles), because the chances of having overlapping terms between such texts are lower.

But it could be that we can augment the terms present in such short texts, with terms that are somewhat "implied". This can be done with an Explicit Semantic Analysis framework (which is also included in Maui apparently).
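The effect of augmenting short texts can be shown with a toy example. This is not Explicit Semantic Analysis: a hand-written table of related terms stands in for whatever a real semantic framework would supply, and the table contents and function name are invented for the illustration.

```python
def augment_terms(terms, related_terms):
    """Return the original terms plus any terms they are said to imply.

    related_terms is a hypothetical dict of term -> set of implied terms.
    """
    augmented = set(terms)
    for term in terms:
        augmented.update(related_terms.get(term, ()))
    return augmented

related = {
    "nlp": {"linguistics", "text mining"},
    "search": {"information retrieval"},
}
profile_a = augment_terms({"nlp"}, related)
profile_b = augment_terms({"linguistics"}, related)
overlap = profile_a & profile_b
# overlap == {"linguistics"}
```

Before augmentation the two profiles share no term at all; afterwards they overlap on "linguistics", which gives a term-based similarity measure something to work with.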

What to implement first

We should start by going for Use Cases that

Have a high potential value
Yet are implementable with bare-bones ElasticSearch mlt functionality

Examples include:

Functionality for identifying possible duplicates
"See also" Use Case, at least the one that is based on the actual content of the page, not on the social use of it.

Links

Next TikiFest: TikiFest NLP 12

See: tv:TikiFest NLP 11

History

Information	Version
27 Jan 2018 03:16 GMT-0000 Torsten Fabricius	13
26 Jun 2013 22:36 GMT-0000 alain_desilets	12
26 Jun 2013 22:35 GMT-0000 alain_desilets	11
26 Jun 2013 22:34 GMT-0000 alain_desilets	10
23 Jun 2013 00:50 GMT-0000 Marc Laporte	9
06 Jun 2013 22:02 GMT-0000 Marc Laporte typo	8
06 Jun 2013 21:59 GMT-0000 alain_desilets	7
06 Jun 2013 21:15 GMT-0000 alain_desilets	6
04 Jun 2013 20:48 GMT-0000 Marc Laporte	5
04 Jun 2013 18:36 GMT-0000 Marc Laporte	4
04 Jun 2013 18:34 GMT-0000 Marc Laporte	3
04 Jun 2013 16:58 GMT-0000 Marc Laporte	2
04 Jun 2013 12:04 GMT-0000 Marc Laporte	1
