Hunting Criminals with Hybrid Analytics, Semi-supervised Learning & Agent Feedback

Fraud detection is a classic adversarial analytics challenge: as soon as an automated system learns to stop one scheme, fraudsters move on to attack in another way. Each scheme calls for a different set of signals (i.e. features), is relatively rare (one in millions in finance or e-commerce, for example), and may take months to investigate per case (in healthcare or tax, for example), which makes quality training data scarce.

This talk will cover, via live demo & code walk-through, the key lessons we’ve learned while building such real-world software systems over the past few years. We’ll be looking for fraud signals in public email datasets, using IPython and popular open-source libraries (scikit-learn, statsmodels, nltk, etc.) for data science and Apache Spark as the compute engine for scalable parallel processing.

We will iteratively build a machine-learned hybrid model – combining features from different data sources & algorithmic approaches, to catch diverse aspects of suspect behavior.
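As a rough illustration of what "hybrid" can mean here, the sketch below learns how much weight to give each detector's output. It is a toy under stated assumptions, not the model built in the talk: the three features (keyword hit, off-hours activity, out-of-network recipient), the training rows, and the function names are all hypothetical, and a tiny hand-written logistic regression stands in for whatever learner is actually used.

```python
import math

# Hypothetical training data: each feature vector holds the outputs of three
# separate detectors: [keyword_hit, off_hours, out_of_network].
# Labels: 1 = confirmed suspicious, 0 = benign. All values are made up.
TRAINING = [
    ([1.0, 1.0, 1.0], 1),
    ([1.0, 0.0, 1.0], 1),
    ([0.0, 1.0, 1.0], 1),
    ([0.0, 0.0, 0.0], 0),
    ([1.0, 0.0, 0.0], 0),
    ([0.0, 1.0, 0.0], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Probability-like score that a message is suspicious."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train(samples, epochs=500, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

weights, bias = train(TRAINING)
suspect_score = predict(weights, bias, [1.0, 1.0, 1.0])
benign_score = predict(weights, bias, [0.0, 0.0, 0.0])
print(f"suspect={suspect_score:.3f} benign={benign_score:.3f}")
```

The point of the design is that each detector stays independent and only contributes a number, so a new signal can be added as one more position in the feature vector without touching the others.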

We will discuss:

Natural language processing: finding keywords in relevant context within unstructured text
Statistical NLP: sentiment analysis via supervised machine learning
Time series analysis: understanding daily/weekly cycles and changes in habitual behavior
Graph analysis: finding actions outside the usual or expected network of people
Heuristic rules: finding suspect actions based on past schemes or external datasets
Topic modeling: highlighting use of keywords outside an expected context
Anomaly detection: fully unsupervised ranking of unusual behavior
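Two of the signals listed above, graph analysis and time series analysis, can be sketched on email metadata with nothing but the Python standard library. This is a minimal sketch under assumptions: the (sender, recipient, hour) records, the names, and the helper functions are hypothetical, and the hour-of-day check ignores the fact that hours wrap around at midnight.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

# Hypothetical message history: (sender, recipient, hour_of_day).
HISTORY = [
    ("alice", "bob", 9), ("alice", "bob", 10), ("alice", "carol", 11),
    ("alice", "bob", 9), ("alice", "carol", 10), ("alice", "bob", 11),
]

def build_profiles(history):
    """Collect each sender's usual contacts and usual sending hours."""
    contacts = defaultdict(Counter)
    hours = defaultdict(list)
    for sender, recipient, hour in history:
        contacts[sender][recipient] += 1
        hours[sender].append(hour)
    return contacts, hours

def out_of_network(contacts, sender, recipient):
    """Graph signal: True if the sender has never written to this recipient."""
    return contacts.get(sender, Counter())[recipient] == 0

def hour_zscore(hours, sender, hour):
    """Time-series signal: distance of the send hour from the sender's habit,
    in standard deviations. Simplification: hours are treated as linear."""
    past = hours.get(sender, [])
    if len(past) < 2:
        return 0.0  # not enough history to call anything unusual
    spread = pstdev(past) or 1.0  # avoid dividing by zero for constant habits
    return abs(hour - mean(past)) / spread

contacts, hours = build_profiles(HISTORY)
print(out_of_network(contacts, "alice", "mallory"))  # new recipient
print(hour_zscore(hours, "alice", 3))                # 3 a.m. message
```

Each function returns a single value per message, so the outputs can be used directly as features alongside the text-based signals.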

This talk assumes a basic understanding of these data science tools, so that we can focus on their applicability to this use case and on how they complement each other.

Apache Spark is used to run these models at scale – in batch mode for model training and with Spark Streaming for production use. We’ll discuss the data model, computation & feedback workflows, as well as some tools & libraries built on top of the open-source components to enable faster experimentation, optimization & productization of the models.

David Talby is Atigeo’s senior vice president of engineering. David has extensive experience in building & operating web-scale big data and data science platforms, as well as building world-class, agile, distributed teams. Previously he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe, and earlier he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in Computer Science along with two master’s degrees, in Computer Science and Business Administration.

Claudiu Branzan is a senior engineering lead at Atigeo, leading a team of data scientists & software engineers tackling complex challenges in machine learning, data mining, information retrieval & statistics. Claudiu has over ten years of real-world data science experience, across industries including finance, healthcare, legal, mobile and retail. He has co-authored multiple patents, and holds a Master’s degree in Industrial Intelligent Systems from the Polytechnic University of Timișoara.