
The Wikipedia page on Latent Semantic Analysis, the mathematical model that conceptual analytics in Relativity uses, gives a good demonstration of how a matrix is formed with each word on a separate row and each document in a separate column, in order to find the distance between conceptually similar items.
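
As a rough illustration of how such a term-document matrix can be built (a minimal sketch in plain Python; the three short documents are invented for the example):

```python
# A minimal term-document matrix: one row per word, one column per
# document, with each cell holding the word's count in that document.
docs = [
    "the contract was signed by the vendor",
    "the vendor shipped the goods under the contract",
    "the witness described the accident at the intersection",
]

# One row per unique word across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})

# matrix[i][j] = number of times vocab[i] appears in docs[j]
matrix = [[doc.split().count(word) for doc in docs] for word in vocab]

for word, row in zip(vocab, matrix):
    print(f"{word:14s} {row}")
```

The columns for the first two documents share counts for terms like "contract" and "vendor", so they point in similar directions, while the third document's column does not.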


Latent Semantic Indexing assumes that its initial count of the terms in each document is too conservative, and recalculates to account for the terms related to each document. The documents are analyzed to associate those which contain similar words.



The drawbacks of Latent Semantic Indexing are that:

  1. The math may treat terms as closely associated even though they have no real relationship, since the average of a word's meaning across the data set is used. This discrepancy will be decreased where words in the document set are used consistently in a predominant context.

  2. In a bag of words model, the text is treated as an unordered collection of words, so word order is not taken into account. N-gram sequences (of adjacent words or letters) can be used to find relationships between terms (see the sketch after this list).

  3. LSI uses a probabilistic model for the sample data that does not necessarily match up with the actual sample data.
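
To illustrate the bag of words limitation noted in item 2, the following sketch shows that two sentences containing the same words in different orders look identical as bags of words, while word-level n-grams (bigrams here) recover some of the ordering; the sentences are invented for the example:

```python
from collections import Counter

# Bag of words: both sentences produce identical word counts,
# so a model that ignores word order cannot tell them apart.
a = "man bites dog".split()
b = "dog bites man".split()
print(Counter(a) == Counter(b))   # True: word order is lost

# Bigrams (two-word n-grams) retain some ordering information,
# so the two sentences are no longer indistinguishable.
def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

print(Counter(bigrams(a)) == Counter(bigrams(b)))  # False
```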

 
 

Conceptual analytics in Relativity uses latent semantic indexing. Rather than referring to a master dictionary, it uses mathematics to identify concepts in documents. The approach relies on the co-occurrence of terms in the documents on which an analytics index is built. The content of the workspace determines how documents are related to one another, and which concepts are present in those documents.


Latent Semantic Indexing is the mathematical basis of a conceptual analytics index, which is built on a set of documents. Conceptual analytics can also use a classification index, which is based on coded examples. This type of index uses Support Vector Machine learning.
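
As a rough sketch of how Support Vector Machine text classification works on coded examples (using scikit-learn; the example documents, codes, and pipeline choices are assumptions for illustration, not Relativity's actual implementation):

```python
# Illustrative only: a linear SVM trained on a handful of invented,
# pre-coded example documents, then used to predict a code for a new one.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

coded_examples = [
    "quarterly earnings report and revenue forecast",
    "invoice for services rendered last month",
    "family photos from the holiday weekend",
    "lunch plans and weekend barbecue invitation",
]
codes = ["responsive", "responsive", "not responsive", "not responsive"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(coded_examples, codes)

# Likely ['responsive'] for this toy data, since the wording overlaps
# with the first coded example.
print(model.predict(["updated revenue forecast attached"]))
```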



LSI as used in Relativity has several key characteristics:

  1. It is language agnostic. Latent Semantic Indexing will discover correlations between documents and concepts inside them no matter which language the documents are in.

  2. The training data source used for a LSI conceptual analytics index can be the same as the full set of documents to be analyzed, or be a subset of those documents.

  3. It generates a multi-dimensional concept space, a mathematical model into which the indexed documents are mapped as a spatial index. Documents which are closer together in the concept space are more conceptually similar.

  4. The similarity between two documents or two words is measured by a rank value, also referred to as a coherence score. The higher the score, the higher the degree of similarity.

  5. The coherence score is not a percentage of shared content, but a measurement of distance.

  6. Analytics indexes are always in memory.


The Latent Semantic Indexing processing technique is based on the following:

  1. The principle that documents which are conceptually similar will use similar sections of text.

  2. Use of a matrix, or chart, in which each word is listed on a separate row, and each document in a separate column.

  3. Singular value decomposition is used to decrease the number of rows while maintaining the similarity relationships between the columns.

  4. The degree of similarity between two documents is calculated by finding the cosine of the angle between the two vectors formed by two columns. A value close to 1 indicates that the documents are very similar; a value closer to 0 shows that they are more dissimilar. (These steps are walked through in the sketch below.)
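
A minimal sketch of steps 2 through 4 above, using numpy; the small term-document matrix and the choice of two retained dimensions are assumptions made for the example:

```python
import numpy as np

# Invented term-document matrix: rows are words, columns are documents.
# Documents 0 and 1 share vocabulary; document 2 is about something else.
A = np.array([
    [2, 1, 0],   # "contract"
    [1, 2, 0],   # "vendor"
    [0, 1, 0],   # "goods"
    [0, 0, 2],   # "accident"
    [0, 0, 1],   # "witness"
], dtype=float)

# Singular value decomposition; keeping only the k largest singular
# values reduces the number of rows (dimensions) while preserving the
# similarity relationships between the document columns.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]   # each column is one document

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents 0 and 1 give a cosine near 1 (similar);
# documents 0 and 2 give a cosine near 0 (dissimilar).
print(cosine(doc_vectors[:, 0], doc_vectors[:, 1]))
print(cosine(doc_vectors[:, 0], doc_vectors[:, 2]))
```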

 
 

If you are using Relativity analytics to get an overview of the contents of a data set, keep in mind that these are the basic steps which should be performed at the start:


  • Repeated content identification should be run to find disclaimers and confidentiality footers which occur frequently in the full set (illustrated conceptually in the sketch after this list).

  • The repeated content filters should be linked to an analytics index.



  • In a saved search, assemble a set of conceptually rich documents, excluding computer-generated files, to use for the index.

  • Set the index to optimize the training set in order to automatically remove conceptually irrelevant documents.

  • Set the index to automatically remove email signatures and footers.

  • Populate and build the index.

  • Perform clustering and categorization to find the kind of documents present in the full document set.
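
As a conceptual illustration of the repeated content identification step above (a minimal sketch in plain Python, not Relativity's actual feature; the example documents and the 50% threshold are assumptions):

```python
from collections import Counter

# Invented example documents; several share the same confidentiality footer.
documents = [
    "Please review the attached agreement.\n"
    "This email is confidential and intended only for the recipient.",
    "The deposition is scheduled for Friday.\n"
    "This email is confidential and intended only for the recipient.",
    "Lunch at noon?\n"
    "This email is confidential and intended only for the recipient.",
    "Here are the photos from the site visit.",
]

# Count how many documents each distinct line appears in.
line_counts = Counter()
for doc in documents:
    for line in {raw.strip() for raw in doc.splitlines() if raw.strip()}:
        line_counts[line] += 1

# Flag lines appearing in at least half the documents as repeated content
# candidates (disclaimers, confidentiality footers, and the like).
threshold = len(documents) / 2
for line, count in line_counts.items():
    if count >= threshold:
        print(f"appears in {count} of {len(documents)} documents: {line}")
```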

 
 

