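Detailed Explanation: KL divergence (Kullback-Leibler divergence, also called relative entropy) measures how much a probability distribution Q departs from a reference distribution P, where P typically represents the "true" distribution and Q an approximation of it. For discrete distributions it is the expected value, taken under P, of the log ratio log(P(x)/Q(x)). It is a fundamental concept in information theory and is widely used in machine learning as an optimization objective in probabilistic frameworks such as variational inference and Bayesian machine learning. It is always non-negative and equals zero only when P and Q are identical (almost everywhere). It is not symmetric: the divergence of Q from P generally differs from the divergence of P from Q, so it is not a true distance metric. These properties make it useful for comparing how well candidate models approximate a target distribution and for training models to match that distribution more closely.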
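The properties above can be checked numerically. The following is a minimal sketch, assuming discrete distributions given as equal-length lists of probabilities and the natural logarithm (so the result is in nats). The function name `kl_divergence` and the example values for `p` and `q` are illustrative choices, not part of any particular library.

```python
import math


def kl_divergence(p, q):
    """Return D_KL(P || Q) in nats for two discrete distributions.

    p and q are equal-length sequences of probabilities (assumed to
    each sum to 1; this sketch does not validate that).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) contributes 0
        if qi == 0.0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total


# Hypothetical example distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)  # approximately 0.511
kl_qp = kl_divergence(q, p)  # approximately 0.368

print(kl_pq, kl_qp)          # the two directions differ (asymmetry)
print(kl_divergence(p, p))   # 0.0 when the distributions are identical
```

Returning infinity when Q assigns zero probability to an outcome that P allows is a common convention; practical code often smooths Q or clips probabilities instead, depending on the application.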

Historical Overview: The concept of KL divergence was introduced by Solomon Kullback and Richard Leibler in their 1951 paper "On Information and Sufficiency." Initially rooted in information theory, it gained widespread popularity in the late 20th century with the rise of computational statistics and machine learning, where it became crucial for developing and refining probabilistic models.

Key Contributors: Solomon Kullback and Richard Leibler introduced the measure that now bears their names. Their work laid the groundwork for later advances in information theory and its applications in statistics, machine learning, and data science.