CDH Community Event: Data Science for Pega CDH, Accelerating Insight Generation

In this session, Otto Perdeck, Director of Data Science, Philip Mann, Business Excellence Director, Sara Shishehchi, Senior Data Scientist, and Grigoriy Sen, Lead Decisioning Architect, explore open source tooling for Data Scientists working with Pega Customer Decision Hub. These tools help analyze Pega models, responses, and predictors using Python and R. Discover how easy it is to go beyond the out-of-the-box reports, conduct impactful analyses, and add value to the Pega project.

Playback the video recording

https://players.brightcove.net/1519050010001/default_default/index.html?videoId=6289009140001

Note that the Q&A from the session has been posted as replies below. Please continue the discussion there!

View the presentation deck

CDH Community Event - Data Science for Pega CDH - Accelerating Insight Generation.pdf (4.27 MB)

A question from one of our participants:

CatBoost clearly beat ADM, so is there any way CatBoost could be used to replace ADM? I mean how could one conceivably use the fact CatBoost won to improve ADM?

@seng1

First, and I think I indicated this in the webinar: no one on our team or the customer's team came away with the idea that CatBoost “clearly beat ADM”.

In fact, the image presented covered a single treatment-level model and was used only as an example. It was actually the other way around: after examining the performance of the ADM and CatBoost models and considering other metrics besides AUC (precision and recall, which were extremely important because we had an unbalanced dataset, as well as the time and amount of data needed to learn), we came to the conclusion that both algorithms performed very similarly.
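To illustrate why AUC alone can be misleading on an unbalanced dataset, here is a minimal sketch in plain Python. The labels and scores are made up for illustration and are not the customer data from the session; the helper names (`auc`, `classify`, `precision_recall`) are hypothetical and not part of any Pega tooling. It shows a model whose ranking is good (high AUC) while its precision and recall depend entirely on the decision threshold.

```python
def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes labels are 0/1 and both classes occur.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classify(scores, threshold):
    """Turn scores into 0/1 predictions at the given threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (0.0 when undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up unbalanced sample: 2 accepts out of 10 impressions.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.30, 0.08, 0.10, 0.05, 0.04, 0.03, 0.02, 0.02, 0.01, 0.01]

print("AUC:", auc(y_true, scores))
print("P/R at 0.50:", precision_recall(y_true, classify(scores, 0.50)))
print("P/R at 0.07:", precision_recall(y_true, classify(scores, 0.07)))
```

With low propensities, which are typical for rare positive responses, a default threshold of 0.5 finds no positives at all even though the ranking is nearly perfect. That is one reason to look at several metrics together rather than at AUC only.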

Sometimes CatBoost was better, but not always, and ADM was often able to learn faster, which was also very important for that customer and their use cases.

So, the conclusion that CatBoost was better is not correct, and I’m sorry if I gave anybody that impression. I wanted to show what kind of analysis can be done; drawing conclusions and interpreting the results is clearly the data scientist’s job. It wasn’t my goal to dive deep into the analysis of the results, since that would require much more time and an in-depth explanation of the customer data and use cases.

In the end, almost every framework claims to be best in class, but the result always depends on the data and the usage scenario. That’s why they all co-exist and are used in their own niches.

A participant asked…

How does PEGA ensure a suitable explore / exploit framework when it trains its ADM models? And consequently, did the panelists think that any ML which uses ADM data would just end up learning the “logic” behind the ADM model that supported the promotion of the offer, i.e. the label?

@seng1

I’d like to address the statement about the “logic behind ADM”. The goal was not to understand the “logic” of ADM, but rather to see what dependencies in the data the other algorithms would find and to get extra insights, because all algorithms have their own way of explaining themselves and all may be useful (e.g. we could have used linear regression to see if there are any linear dependencies in the data, but we wouldn’t use it to explain ADM).

In addition to that: I think the background here is the use case to build an (off-line) model using the “historical” data that ADM is using.

First of all, that is a potential use of this data, perhaps to provide a high benchmark of what is achievable (remember that ADM is an on-line classifier; it does not have the luxury of a historical dataset). A more typical use is to analyse the data itself: to detect drift, to check that assumptions about the data distributions really hold, or to find technical issues such as missing fields.
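As a rough illustration of such data checks, the sketch below computes the fraction of missing values in a field and a Population Stability Index (PSI) between two periods. PSI is just one common drift measure chosen here for illustration; it is not necessarily what the panelists used, and none of this is the pdstools API. The function names, bin edges, and sample values are all hypothetical.

```python
import math


def missing_fraction(values):
    """Share of records where the field is missing (None). Assumes a non-empty list."""
    return sum(v is None for v in values) / len(values)


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two numeric samples.

    `edges` are the bin boundaries; a value v falls into bin i, where i is the
    number of edges <= v. Empty bins are floored at `eps` to avoid log(0).
    Assumes both samples are non-empty and contain no missing values.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        return [max(c / len(values), eps) for c in counts]

    exp_shares, act_shares = shares(expected), shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_shares, act_shares))


# Made-up values of a hypothetical predictor (e.g. customer age) in two periods.
last_month = [22, 31, 35, 41, 48, 52, 58, 63]
this_month = [45, 51, 55, 58, 61, 64, 67, 70]
age_edges = [30, 40, 50, 60]

print("missing:", missing_fraction([34, None, 51, None]))
print("PSI same period:", psi(last_month, last_month, age_edges))
print("PSI shifted period:", psi(last_month, this_month, age_edges))
```

A PSI of zero means the binned distributions are identical; larger values indicate a larger shift. Any threshold for "too much drift" is a rule of thumb that should be validated against the actual data.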

If you used the ADM predictions as the target, you would indeed reverse-engineer the ADM logic. But the historical data also includes the actual responses from the customers, so you are not just learning ADM’s logic.

Please note that as of August 31, 2022, the open-source auxiliary data science toolkit has been renamed. The GitHub repository is now https://github.com/pegasystems/pega-datascientist-tools, and the R and Python packages are called pdstools. See the instructions on the GitHub Wiki https://github.com/pegasystems/pega-datascientist-tools/wiki for installation and use.