This guide covers the evaluation of large language models (LLMs). It identifies key principles for designing reliable testing protocols, describes approaches for measuring model performance on NLP tasks such as text summarization and for assessing how prompt engineering affects that performance, and discusses methods for tracking performance changes over time, such as regressions introduced by new model or prompt versions. The goal is to help ensure that LLMs remain robust and accurate across the NLP applications in which they are deployed.
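As one concrete illustration of the kind of measurement discussed later, summarization output is commonly scored against reference summaries with overlap metrics such as ROUGE. The sketch below is a minimal example assuming the open-source `rouge-score` package; the metric variants and example strings are illustrative only and not taken from this guide.

```python
# Minimal sketch: scoring a model-generated summary against a human reference
# with ROUGE, using the `rouge-score` package (pip install rouge-score).
# The example strings and chosen metric variants are illustrative assumptions.
from rouge_score import rouge_scorer

reference = "The committee approved the budget after a two-hour debate."
candidate = "After two hours of debate, the committee approved the budget."

# ROUGE-1 measures unigram overlap; ROUGE-L measures longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.3f} "
          f"recall={score.recall:.3f} f1={score.fmeasure:.3f}")
```

Run over a held-out test set, per-example scores like these can be averaged into a single figure and recorded for each model or prompt revision, which is the basis for the tracking-over-time methods described later in the guide.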