The word entropy could essentially be replaced with “randomness”. Generally used in the context of classification problems (in machine learning, that is), entropy tells us how random the distribution of a certain class in the dataset is.
Let’s take a binary classification problem to simplify things. We are training a model to predict whether the stock market will go up or down tomorrow. If, in our dataset, the market goes up 50% of the time and down 50% of the time, then the next-day direction of the stock market has maximal entropy (it’s purely random). Mathematically, the binary information entropy is calculated as:

H = -p x log2(p) - (1 - p) x log2(1 - p)

where p is the probability of one of the two classes (here, the probability that the market goes up) and the logarithm is base 2, so entropy is measured in bits.
So if you plug in the 50/50 probabilities, the entropy of the stock market example is 1 bit, the maximum for a binary outcome. If instead the stock market went up 100% of the time and went down 0% of the time (or vice versa), the entropy would fall to 0.
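Here is a quick sketch of that calculation in Python (the function name `binary_entropy` is just an illustrative choice), using base-2 logs as in the formula above:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary outcome where one class has probability p.
    By convention, a 0 * log2(0) term contributes nothing to the sum."""
    terms = [q * math.log2(q) for q in (p, 1 - p) if q > 0]
    return 0.0 - sum(terms)

print(binary_entropy(0.5))  # market up 50% of the time -> 1.0 (pure randomness)
print(binary_entropy(1.0))  # market up 100% of the time -> 0.0 (no randomness)
```

Skipping zero-probability terms in the list comprehension mirrors the standard convention that 0 x log2(0) counts as 0.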
What, then, is cross-entropy? In machine learning, it is a measure of the “distance” or “divergence” between the model’s predicted distribution of class outcomes and the dataset’s actual distribution of class outcomes. More specifically, for each “actual vs predicted” pair in our dataset, the per-example cross-entropy is:

CE = -y x log2(ŷ) - (1 - y) x log2(1 - ŷ)

where y is the actual label and ŷ is the predicted probability.
Here the actual value will be 1 or 0 (rather than some kind of probability), but the predicted value used in the formula is the predicted probability, not the most likely class. Otherwise, notice the similarity to the information entropy formula above. The other difference is that in cross-entropy the actual value weights the log of the predicted value, whereas in entropy a distribution weights the log of itself.
Intuitively, it means that when the dataset says “1” and we predict a probability of “1”, we add 0 to the total error sum (the math is: -1 x log2(1) - 0 x log2(0), where 0 x log2(0) is treated as 0 by convention). When we predict a probability of “0.1” but the dataset said “1”, we add 3.322 (the math is: -1 x log2(0.1) - 0 x log2(0.9)). So the farther the predicted probability is from the actual label, the higher the error will be.
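The two worked examples above can be reproduced with a short sketch (the `cross_entropy` name and the epsilon clamp are illustrative choices, not part of any particular library):

```python
import math

def cross_entropy(actual, predicted):
    """Per-example binary cross-entropy in bits.
    actual is the label (0 or 1); predicted is the model's probability of class 1."""
    eps = 1e-12  # clamp to avoid log2(0) when the prediction is exactly 0 or 1
    predicted = min(max(predicted, eps), 1 - eps)
    return -(actual * math.log2(predicted)
             + (1 - actual) * math.log2(1 - predicted))

print(cross_entropy(1, 1.0))  # perfect prediction -> ~0 error
print(cross_entropy(1, 0.1))  # confident but wrong -> ~3.322 error
```

Averaging this quantity over all examples in the dataset gives the familiar cross-entropy (log) loss used to train classifiers.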