
Those of us trying to keep up with the increasing use of AI systems for legal work have probably heard at some point that the Large Language Models underlying ChatGPT and other AI tools will give poorer results as more and more text is included in a single prompt. A 2024 article posted by Databricks, a data analytics company valued at $62 billion (Quinn Leng, et al., Long Context RAG Performance of LLMs (Aug. 12, 2024), https://www.databricks.com/blog/long-context-rag-performance-llms), found that OpenAI's "GPT-4-0125-preview starts to decrease after 64k tokens, and only a few models can maintain consistent long context RAG performance on all datasets." RAG stands for retrieval-augmented generation - a technique that supplies an LLM with passages from external documents as context for its answers.


This chart from the article shows how likely different AI models are to generate a correct answer when they use anywhere between 2K and 125K tokens of context.



But what's a token? In the context of AI, a token is a string of characters that an AI system uses to detect relationships with other strings of text that have also been broken into tokens. A block of text can contain more tokens than words. OpenAI's online Tokenizer will calculate the number of tokens in any text block you enter:


The token count can add up rapidly. The site https://token-calculator.net/ counts 288,185 tokens in the full text of Moby Dick.
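For a sense of scale without running a real tokenizer, a common rule of thumb is that one token corresponds to roughly four characters of English text. A minimal sketch of that heuristic (this is only an approximation - OpenAI's actual tokenizers use subword vocabularies and will produce different counts):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the common ~4-characters-per-token
    rule of thumb for English text. Real tokenizers split text into
    subword units, so actual counts will differ from this estimate."""
    return round(len(text) / chars_per_token)

# A 16-character sentence comes out to about 4 tokens.
print(estimate_tokens("Call me Ishmael."))  # 4
```

Applied to Moby Dick's roughly 1.2 million characters, the same heuristic lands in the same neighborhood as the 288,185-token count above.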


As this example shows, not all words count as a single token:


So if a study indicates that performance declines after 64,000 tokens, keep in mind how poor the results may be when working with document productions running to several million documents.
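A quick back-of-envelope calculation makes the gap concrete. Assuming, purely for illustration, that an average production document runs about 500 tokens, a 64,000-token window holds only a tiny fraction of even a modest production:

```python
# Back-of-envelope: how many average-size documents fit in a context
# window before the performance decline the Databricks study observed.
# The 500-tokens-per-document figure is an illustrative assumption.
context_window = 64_000   # tokens, per the study's finding for GPT-4-0125-preview
avg_doc_tokens = 500      # assumed average document size

docs_in_window = context_window // avg_doc_tokens
print(docs_in_window)  # 128

# Against a production of 2 million documents, that window covers
# well under a hundredth of a percent of the collection.
production_size = 2_000_000
print(f"{docs_in_window / production_size:.4%}")  # 0.0064%
```

This is why RAG systems retrieve a small set of relevant passages for each question rather than loading an entire production into the prompt.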


 
 

DISCO has integrated a proprietary AI system called Cecilia into its eDiscovery platform. I received a demonstration of it this past week, and Cecilia AI shows how artificial intelligence is transforming electronic discovery and how attorneys work with document productions.


Cecilia AI can answer questions based on an analysis of the data set loaded into DISCO, and it can further narrow down the data it considers in its answers based on a smaller subset of data. In this example you can see that it identifies the position of an executive whose name appears in the Enron email data set.



Similarly, Cecilia AI can provide definitions for unfamiliar terms with a simple right-click function:



Cecilia AI does not, however, allow a user to provide feedback instructing it to correct a mistake or hallucination so that the error is not repeated for other users who may not recognize it. So if an attorney knew that Kenneth Lay had in fact become CEO of Enron in 1984 rather than in 1985, he or she could not tell Cecilia AI to give that answer going forward.


There is an option to reset Cecilia so that it will not base its results on previous inputs made by users of the database.



It does list the documents that it uses as the basis for its answers.


You can ask Cecilia questions about individual documents, such as whether a contract has a clause addressing potential damages. DISCO provides Cecilia as a free feature on all databases, and up to 50 questions about single documents or document summaries can be asked each day without opting to have Cecilia fully enabled.


Cecilia will not answer questions about documents that are shorter than 300 characters or longer than 250,000 characters. The upper limit is surprising - a short novel like The Adventures of Huckleberry Finn runs about 455,000 characters.


DISCO's Auto Review uses Cecilia to identify relevant documents for production based on tags for which a subject matter expert simply explains, in ordinary language, the kind of documents he or she wants identified:


Auto Review provides percentages for precision, recall, and prevalence, and breaks down what fraction of document results are associated with a given tag.
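These three metrics have standard definitions in document review, and it may help to see them computed from hypothetical counts (the numbers below are invented for illustration, not DISCO's internal math):

```python
def review_metrics(true_pos: int, false_pos: int,
                   false_neg: int, true_neg: int) -> dict:
    """Standard review-quality metrics from a confusion matrix.

    precision  - of the documents the tag flagged, how many are truly relevant
    recall     - of the truly relevant documents, how many the tag caught
    prevalence - what share of the whole collection is relevant at all
    """
    total = true_pos + false_pos + false_neg + true_neg
    return {
        "precision": true_pos / (true_pos + false_pos),
        "recall": true_pos / (true_pos + false_neg),
        "prevalence": (true_pos + false_neg) / total,
    }

# Hypothetical: 100 documents tagged, 80 of them actually relevant,
# 10 relevant documents missed, in a collection of 1,000.
metrics = review_metrics(true_pos=80, false_pos=20, false_neg=10, true_neg=890)
print(metrics)  # precision 0.8, recall ~0.889, prevalence 0.09
```

Low prevalence is typical in litigation - relevant documents are usually a small fraction of the collection - which is why precision and recall are reported separately rather than as a single accuracy figure.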




Cecilia can tell a user the custodian for a specific document, and it can run searches for document-based queries made in ordinary English such as, "Show all documents between July 4, 2021 and December 25, 2023", or, "How many email messages are in this database?"


 
 

Relativity commissioned a study last year on how lawyers are using artificial intelligence. Here are some key points that I found interesting:


  1. While 38% of law firm study participants used AI software, significantly more government employees — 50% — did.

  2. AI software was most often used by legal teams for document review.

  3. Two-thirds of study participants have implemented training programs to help employees learn how to use AI.

  4. Paralegals actually use AI more often than lawyers.

  5. AI is more often used to automate low-level tasks and cut costs - twice as frequently as it is used to enhance risk compliance or legal analysis.

  6. There was more concern about the loss of confidential data than about misleading AI hallucinations.

  7. IT professionals tend to be concerned about the loss of confidential data that is input into large language models (LLMs).

  8. Law firms were twice as likely to use in-house proprietary models or software provided by vendors as they were to rely on publicly available AI software.



 
 

Sean O'Shea has more than 20 years of experience in the litigation support field with major law firms in New York and San Francisco. He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer.


© 2015 by Sean O'Shea.
