I have a couple of questions regarding logging historical adaptive model data.
When I first learned of this feature, I assumed that it was intended to help the business understand how its models are performing over time. As I dig deeper into the documentation, however, I see a feature that makes me question my assumption. The documentation suggests that customers can specify the percentage of positive and negative responses that they want to record. What is the purpose of only logging a percentage of responses? If a customer wanted to capture 100% of positive and negative responses, what time-to-live would you recommend for the JSON file?
Second, regarding the JSON file repository, can this be provided by Pega Cloud (for a Cloud implementation) or should the customer provide it?
Hi Cheri
To see how models perform over time, you don’t need this feature. That view is available in adaptive models and in predictions already, and is based on storing snapshots of model data in the datamart tables at regular intervals.
The historical dataset feature is mainly meant to give customers (typically their data science teams) insight into the actual data that drives the models. At the moment the feature is only available for adaptive models. This data often includes contextual and session data that is not always available in the data lakes outside of Pega, so it can help people better understand what makes the models tick. With the data they could even create challenger models, or build group/issue level models that could be deployed alongside the regular adaptive models.
As for your time-to-live question, I can’t give a generic answer; it depends on the goal. But the amount of storage will quickly become significant: for every response to every model, the data of all the predictors is stored. That is also why we provide a sampling mechanism, so you can span a larger time frame. The separate percentages for positives and negatives make sense because the “success rates” are typically small: depending on the channel perhaps only a few percent, and for email often much lower. In the analysis you will probably be more interested in the data from the 1% that did click than in the 99% that didn’t. Of course, if you use the data to build external models, you’ll need to compensate for this sampling bias.
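To make the sampling-bias point concrete, here is a rough Python sketch of one common way to compensate: weight each recorded response by the inverse of the percentage at which its outcome was sampled. The function names and the "positive"/"negative" labels are made up for illustration and are not Pega fields or APIs; the counts in the example are invented too.

```python
def sampling_weights(pos_pct: float, neg_pct: float) -> dict:
    """Inverse-probability weight per outcome.

    pos_pct and neg_pct are the recording percentages (0-100] configured
    for positive and negative responses. Names are hypothetical.
    """
    if not (0 < pos_pct <= 100 and 0 < neg_pct <= 100):
        raise ValueError("sampling percentages must be in (0, 100]")
    return {"positive": 100.0 / pos_pct, "negative": 100.0 / neg_pct}


def estimated_success_rate(n_pos_sampled: int, n_neg_sampled: int,
                           pos_pct: float, neg_pct: float) -> float:
    """Estimate the true success rate from the sampled counts."""
    w = sampling_weights(pos_pct, neg_pct)
    pos = n_pos_sampled * w["positive"]   # estimated original positives
    neg = n_neg_sampled * w["negative"]   # estimated original negatives
    return pos / (pos + neg)


# Invented example: every positive recorded, only 10% of negatives recorded.
naive = 1000 / (1000 + 9900)              # what the sampled file suggests
corrected = estimated_success_rate(1000, 9900, pos_pct=100, neg_pct=10)
print(f"naive: {naive:.4f}, corrected: {corrected:.4f}")
```

The same per-outcome weights could be passed as sample weights when training an external model on the recorded data, so that the under-sampled negatives count for as much as they did in the original traffic.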
Regards
-Otto