How to measure the efficiency of Large Language Models?

LLM Evaluating Metrics - Konica Minolta


In an era where artificial intelligence (AI) is revolutionizing industries, Large Language Models (LLMs) stand at the forefront of this transformation. From automating customer service to generating insightful content, LLMs are proving to be invaluable assets. As companies rush to integrate these advanced models into their operations, one question becomes vital: how can we guarantee the quality of the results they produce? Indeed, with the growing adoption of these powerful systems and the expanding range of use cases, effectively evaluating their performance becomes an increasingly evident challenge, not just a technical necessity but a strategic imperative for businesses aiming for success in the AI-driven landscape. Following our introduction to NLP for extracting tasks from documents, we will go deeper into a much-hyped topic in NLP: generative AI, and particularly LLMs, illustrating the considerations we weighed before employing these technologies.


Generative AI is a transformative field that enables machines to create new content, mimicking human creativity. These systems use existing data to create new artefacts, and recently they have expanded into nearly every area, producing different types of content such as images, videos, music, text, and more. Large Language Models (LLMs), such as GPT, are at the forefront of generative AI, offering the ability to understand and generate text that mirrors human communication. From a technical perspective, LLMs leverage neural networks, specifically recurrent or transformer architectures, to learn the intricate patterns and structures present in natural language data.

Through extensive training on large datasets, often consisting of billions of words or sentences, LLMs can capture the nuances of language, including grammar, semantics, and context. One of the defining features of LLMs is their ability to generate coherent and contextually relevant text across a wide range of tasks, such as language translation, text summarization, question answering, and natural language understanding. These models achieve state-of-the-art performance in various language-related tasks, often surpassing human-level performance in certain benchmarks.


AI's popularity has grown steadily thanks to its wide range of applications to practical business problems across various industries. Streamlining operations, increasing efficiency, automating repetitive tasks, optimizing processes, and reducing manual effort are just a few of the key applications where AI has become one of the leading technologies used. The efforts to implement AI in these fields ultimately lead to cost savings and resource optimization, core values from a business perspective. Machine Learning Operations (MLOps) practices help companies ensure the reliability, scalability, and efficiency of their AI systems. Among these methodologies, model evaluation and monitoring stand out as critical practices in ensuring consistent performance and longevity of ML systems. Indeed:

  • before deploying Machine Learning (ML) models into production, thorough validation and testing are conducted to ensure accuracy and reliability. It’s also important to address biases to ensure fairness in predictions, particularly in sensitive applications.
  • once deployed, continuous monitoring of key performance indicators (KPIs) is crucial for effectively detecting data or model drift and preventing degradation in the performance of the model over time.

While this is crucial for all ML models, it becomes even more pertinent for LLMs due to their complex nature and the potential impact of biases in their outputs. LLMs that are not precise can cause serious problems for businesses, such as legal and compliance challenges, or harm to their reputation. Additionally, developers and users often do not have ownership of training data, so they are not fully aware of what the model relies on for its prediction: this makes it more difficult to identify biases and vulnerabilities. Therefore, since leveraging LLMs can offer numerous benefits for businesses, it is essential to pair these implementations with a clear and concrete definition of meticulous evaluation and monitoring practices.


When establishing the best practice to use for implementing LLM models we need to consider several key points:

  • Stability in Production: LLMs should be able to handle real-world conditions and diverse user inputs, keeping in mind that model outputs may vary considerably, without breaking down or producing errors.
  • Security and Confidence: LLMs should be secure from malicious attacks and ensure the privacy of the users. They should also be reliable and trustworthy, producing accurate and unbiased outputs.
  • Model Transparency and Explainability: Strive for transparency in how LLMs make decisions. Implement methods that provide insights into the model’s decision-making process, making it easier for users to understand and trust the outcomes.
  • Continuous Monitoring and Evaluation: Establish processes for ongoing monitoring and evaluation of LLM performance. Regularly assess the model’s accuracy, reliability, and fairness, and make necessary adjustments to improve outcomes.

Unfortunately, unlike traditional ML tasks, there isn’t yet a set of gold standard evaluation metrics for LLMs, but various evaluation metrics contribute to assessing LLM performance. This underscores the importance of ongoing research and development to establish robust evaluation metrics tailored specifically for LLMs. Thus, currently, each organization has to define and choose its metrics and benchmarks. Moreover, human validation is still necessary to check the quality and ethics of the LLM outputs.

Defining meticulous evaluation and monitoring practices for LLMs involves establishing systematic procedures that ensure the models’ effectiveness, fairness, and reliability.

  • Establish Clear Objectives and Metrics: defining specific objectives for the application that align with business goals, and identifying the KPIs that measure success towards these objectives. Examples include accuracy, speed, and user satisfaction.
  • Continuous Testing and Validation: performing rigorous testing before deployment to validate the model’s performance against the chosen metrics and use possible validation datasets representative of the real-world scenarios to test for robustness.
  • Real-time Monitoring and Feedback Loops: implementing real-time monitoring systems that track the model’s performance over time and detect deviations from expected behaviour, while also establishing a feedback loop with users to gather insight on model outputs; user feedback is valuable for detecting areas for improvement.
  • Addressing Anomalies and Model Updates: developing procedures for handling anomalies and unexpected model behaviour promptly, and regularly updating the model to incorporate new data, address issues, and improve performance based on feedback and technological advances.
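The real-time monitoring step above can be sketched in a few lines. The snippet below is a minimal, illustrative example (not a standard MLOps API): it keeps a rolling window of per-batch quality scores and flags drift when the rolling mean drops more than a chosen tolerance below the baseline measured at deployment time; the threshold and window size are assumptions you would tune per application.

```python
from collections import deque

def make_drift_monitor(baseline_score, window=5, tolerance=0.10):
    """Return a recorder that flags drift when the rolling mean of recent
    quality scores falls more than `tolerance` below `baseline_score`.
    Names and thresholds here are illustrative, not a standard API."""
    recent = deque(maxlen=window)

    def record(score):
        recent.append(score)
        rolling_mean = sum(recent) / len(recent)
        drifted = rolling_mean < baseline_score * (1 - tolerance)
        return rolling_mean, drifted

    return record

monitor = make_drift_monitor(baseline_score=0.90)
for s in [0.91, 0.89, 0.88]:
    mean, drifted = monitor(s)   # healthy: mean stays within 10% of baseline
for s in [0.70, 0.65, 0.60, 0.58, 0.55]:
    mean, drifted = monitor(s)   # degraded batches pull the mean below 0.81
```

In production the score fed to the monitor would itself come from one of the evaluation metrics discussed below, computed on a sample of live traffic.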

In this article, we aim to delve deeply into the first step, expanding our understanding of the challenges associated with evaluating Large Language Models (LLMs). To achieve a comprehensive and clear understanding when selecting the best metrics for evaluation, it’s crucial to acknowledge a fundamental distinction in the field. We have two types of methods to compute a metric, which, depending on the task and available data, may be more suitable for your specific use case: supervised and unsupervised metrics.

Supervised metrics require a human-generated reference to be computed. These references need to be of high quality, as they serve as the benchmark for the expected output. Many supervised metrics are straightforward and quick to compute, offering clear interpretability. On the other hand, unsupervised methods do not require human evaluation. Metrics in this category often rely on pre-trained models (such as BERT) or models trained on human judgments. They are hence more complex and resource-intensive to compute, but they eliminate the need for human-generated references, which can be difficult and costly to acquire.


In the realm of supervised methods, metrics like BLEU, ROUGE (with its variations such as ROUGE-L, ROUGE-S, and ROUGE-W), and METEOR play a pivotal role, especially in tasks like translation. These metrics excel in environments where outcomes can be clearly defined and matched against human-generated references. However, their dependency on high-quality reference texts means they may fall short in evaluating more subjective or creative outputs produced by LLMs. This limitation arises because all these metrics are based on simple lexical similarity, overlooking the semantic context. Word-based metrics tend to be more indicative of the informativeness of a text rather than its quality or naturalness, which poses a significant challenge in assessing the output of LLMs. If high-quality references are available, however, an embedding-based approach, computing the similarity between the output-text embeddings and the reference embeddings, may be worth considering. ROUGE offers an embedding-based variant called ROUGE-WE. Additionally, one can utilize Word Mover’s Distance (WMD), which quantifies dissimilarity as the minimum ‘travel’ required to move the embedded words of one text onto those of the other.
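To make the lexical-overlap idea concrete, here is a minimal, self-contained sketch of a ROUGE-1-style unigram overlap between a candidate and a reference. It is an illustration of the principle only: real ROUGE implementations add stemming, tokenization rules, and multi-reference handling.

```python
from collections import Counter

def rouge1_scores(candidate, reference):
    """Unigram-overlap precision/recall/F1 in the spirit of ROUGE-1.
    Minimal sketch: whitespace tokenization, single reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Clipped counts: each candidate word counts at most as often as in the reference
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge1_scores("the model answered the question",
                        "the model answered the user question")
# Every candidate word appears in the reference (precision 1.0),
# but the reference word "user" is missed (recall 5/6).
```

Note how the scores say nothing about meaning: a paraphrase with different words would score poorly, which is exactly the semantic blind spot discussed above.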


Unsupervised metrics, in contrast to the supervised ones, generally evaluate text based on its intrinsic properties, such as coherence, fluency, and diversity, without direct comparison to a specific reference. Using pre-trained models or word/sentence embeddings to evaluate the output, they are good at capturing context. Into this category fall the perplexity measure, which gives insight into how well a language model has learned the structure of the language, and readability scores such as the Flesch Reading Ease or the Gunning Fog Index, which measure text fluency and accessibility. The state of the art (SOTA) in language-model evaluation has introduced even more sophisticated unsupervised metrics, such as BLANC, specifically designed for summarization evaluation.

BLANC offers insights into the model’s ability to fill in gaps in texts, assessing its understanding of context and content relevance. All these metrics are particularly valuable in evaluating generative tasks where diverse and innovative outputs are desirable, and there isn’t a single “correct” answer. The counterbalance is that they are heavier to compute since they may require model inference or fine-tuning. Additionally, their applicability can be limited by the languages supported by the underlying models and by their ability to coherently handle the specific context or domain of the text being evaluated. This means that while they offer a deeper dive into the quality and relevance of LLM outputs, their use may be constrained by practical considerations and the specific requirements of the task at hand.


Assessing lexical accuracy while also capturing semantic quality in LLM-generated text is the challenge we need to face: language is complex and models are creative, so choosing a tailored metric is difficult. A strategy which uses a combination of supervised and unsupervised metrics, aligned with each application’s needs and possibilities, likely offers the most effective evaluation strategy, taking into account both the diversity of tasks and their impacts on society.

The journey into understanding evaluation metrics for LLMs has only just begun. As we dive further into the world of AI-generated content, a new chapter unfolds, one that examines the practical application and significance of these metrics. From promoting fairness and minimizing bias to boosting user interaction and contentment, the search for the ideal combination of evaluation metrics underscores the continuous advancement of AI in our everyday lives.

Get in contact with our researchers Roberta Parisi and Brunino Criniti if you have any thoughts on this topic and keep an eye out for our upcoming article, where we will explore the tangible effects of LLM evaluation metrics and their influence on the development of AI-centric content.