They fail to deliver roughly 3% of the change requests they receive. They wanted to predict those failure-prone changes upfront and handle them with extra care.
They are also getting better at handling changes, so the failure percentage keeps coming down; data from even two or three years back is no longer representative. The heavily imbalanced data (only 3% failures) did not help either.
We built a model that combined text and structured data and delivered the performance the client sought. However, the main learning was something else.
The client had asked for a specific precision. They came back saying that the model was inconsistent: the precision it offered varied greatly between data sets.
Here are the results from two data sets they shared:
The recall stayed the same, whereas the precision varied greatly.
We realized that the problem was in how they were selecting the validation data. Sometimes it had many positives and sometimes very few (some seasons simply had easier change requests). Recall is computed only over the actual positives, so it stayed stable; precision mixes positives and negatives, so it moved with the base rate even though the model behaved identically.
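This effect can be shown with a small sketch. The numbers below are made up for illustration (the client's actual figures are not reproduced here): the same classifier, with a fixed recall and false-positive rate, is evaluated on two validation sets that differ only in how many positives they contain.

```python
# Hypothetical illustration: a classifier with fixed recall and
# false-positive rate, scored on validation sets with different
# proportions of positives (failed change requests).
def precision_at(recall, fpr, n_pos, n_neg):
    """Precision implied by a fixed recall and false-positive rate."""
    tp = recall * n_pos  # true positives found
    fp = fpr * n_neg     # negatives wrongly flagged
    return tp / (tp + fp)

# A season rich in failures: 300 positives out of 1000 requests.
rich = precision_at(recall=0.80, fpr=0.10, n_pos=300, n_neg=700)
# A season poor in failures: 30 positives out of 1000 requests.
sparse = precision_at(recall=0.80, fpr=0.10, n_pos=30, n_neg=970)

print(round(rich, 2))    # 240 / (240 + 70) -> about 0.77
print(round(sparse, 2))  # 24 / (24 + 97)  -> about 0.20
```

Recall is 0.80 in both runs, yet precision drops from roughly 0.77 to roughly 0.20 purely because the second set has fewer positives.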
We explained to them that the problem was with the performance metric, not with the model, and trained them to look at the entire confusion matrix rather than just precision (which is just a ratio).
Ratios are fickle. They change when the numerator changes. They change when the denominator changes. They also change when both change! So, they cannot be relied upon on their own.
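Looking at the raw counts behind the ratio is straightforward. Here is a minimal sketch, with made-up labels, of tabulating the full confusion matrix before deriving precision and recall from it:

```python
# A minimal sketch (labels are made up) of reading the whole
# confusion matrix instead of a single ratio.
from collections import Counter

def confusion(y_true, y_pred):
    """Count (actual, predicted) pairs: TP, FP, FN, TN."""
    c = Counter(zip(y_true, y_pred))
    return {
        "TP": c[(1, 1)], "FP": c[(0, 1)],
        "FN": c[(1, 0)], "TN": c[(0, 0)],
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
m = confusion(y_true, y_pred)
precision = m["TP"] / (m["TP"] + m["FP"])  # flagged items that were real
recall = m["TP"] / (m["TP"] + m["FN"])     # real items that were flagged
print(m, precision, recall)
```

The four counts make it obvious how many positives the validation set contained, which a lone precision number hides.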