Celonis Product Documentation

Common ML problems

Description

Machine learning is about building a model that can answer your questions. How do we get this model? What types of problems can ML solve?

Content

Types of machine learning

In basic terms, ML is the process of training a piece of software, called a model, to make useful predictions using a data set. This predictive model can then serve up predictions about previously unseen data. We use these predictions to take action in a product; for example, the system predicts that a user will like a certain video, so the system recommends that video to the user.

Often, people talk about ML as having two paradigms, supervised and unsupervised learning. However, it is more accurate to describe ML problems as falling along a spectrum of supervision between supervised and unsupervised learning. For the sake of simplicity, this course will focus on the two extremes of this spectrum.

Supervised Learning

Supervised learning is a type of ML where the model is provided with labeled training data. But what does that mean?

For example, suppose you are an amateur botanist determined to differentiate between two species of the Lilliputian plant genus (a completely made-up plant). The two species look pretty similar. Fortunately, a botanist has put together a data set of Lilliputian plants they found in the wild along with their species name.

Here's a snippet of that data set:

| Leaf Width | Leaf Length | Species |
|---|---|---|
| 2.7 | 4.9 | small-leaf |
| 3.2 | 5.5 | big-leaf |
| 2.9 | 5.1 | small-leaf |
| 3.4 | 6.8 | big-leaf |

Leaf width and leaf length are the features (the inputs, conventionally labeled X), while the species is the label. A real-life botanical data set would probably contain far more features (including descriptions of flowers, blooming times, arrangement of leaves) but still have only one label. Features are measurements or descriptions; the label is essentially the "answer." For example, the goal of the data set is to help other botanists answer the question, "Which species is this plant?"

This data set consists of only four examples. A real-life data set would likely contain vastly more examples.
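As a minimal sketch of the feature/label split, the four examples above can be written down in Python. The variable names `examples`, `X`, and `y` are illustrative choices, not part of the original material.

```python
# Each example pairs a feature vector (leaf width, leaf length) with one label.
examples = [
    ((2.7, 4.9), "small-leaf"),
    ((3.2, 5.5), "big-leaf"),
    ((2.9, 5.1), "small-leaf"),
    ((3.4, 6.8), "big-leaf"),
]

# Split the examples into features (X) and labels (y),
# the two inputs a supervised training routine typically expects.
X = [features for features, _ in examples]
y = [label for _, label in examples]

print(X[0], "->", y[0])
```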

Suppose we graph the leaf width and leaf length and then color-code the species.

Graph1.svg

In supervised machine learning, you feed the features and their corresponding labels into an algorithm in a process called training. During training, the algorithm gradually determines the relationship between features and their corresponding labels. This relationship is called the model. Often in machine learning, the model is very complex. However, suppose that this model can be represented as a line that separates big-leaf from small-leaf:

Graph2.svg

Now that a model exists, you can use that model to classify new plants that you find in the jungle. For example:

Graph3.svg

To tie it all together, supervised machine learning finds patterns between data and labels that can be expressed mathematically as functions. Given an input feature, you tell the system what the expected output label is; in that sense, you are supervising the training. The ML system learns patterns from this labeled data. In the future, the ML system will use these patterns to make predictions on data that it did not see during training.
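The train-then-predict loop described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses a nearest-centroid rule (chosen here because its decision boundary between two classes is a straight line), and the two "new plants" passed to `predict` are made-up measurements. The source does not say which algorithm produced the line in the graphs.

```python
from math import dist

# Labeled training data from the table: (leaf width, leaf length) -> species.
training_data = [
    ((2.7, 4.9), "small-leaf"),
    ((3.2, 5.5), "big-leaf"),
    ((2.9, 5.1), "small-leaf"),
    ((3.4, 6.8), "big-leaf"),
]

def train(examples):
    """Training: compute one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for (width, length), label in examples:
        sum_width, sum_length = sums.get(label, (0.0, 0.0))
        sums[label] = (sum_width + width, sum_length + length)
        counts[label] = counts.get(label, 0) + 1
    return {
        label: (sum_width / counts[label], sum_length / counts[label])
        for label, (sum_width, sum_length) in sums.items()
    }

def predict(model, features):
    """Prediction: return the label whose centroid is closest to the features."""
    return min(model, key=lambda label: dist(model[label], features))

model = train(training_data)

# Hypothetical new plants that were not part of the training data.
print(predict(model, (3.0, 5.2)))
print(predict(model, (3.5, 6.5)))
```

Here the "model" is nothing more than two centroids. Real models usually have many more parameters, but the split between a training step and a prediction step is the same.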

An exciting real-world example of supervised learning is a study from Stanford University that used a model to detect skin cancer in images. In this case, the training set contained images of skin labeled by dermatologists as having one of several diseases. The ML system found signals that indicate each disease from its training set and used those signals to make predictions on new, unlabeled images.

Unsupervised Learning

In unsupervised learning, the goal is to identify meaningful patterns in the data. To accomplish this, the machine must learn from an unlabeled data set. In other words, the model has no hints on how to categorize each piece of data and must infer its own rules for doing so.

In the following graph, all the examples are the same shape because we don't have labels to differentiate between examples of one type or another here:

Graph4.svg

Fitting a line to unlabeled points isn't helpful. We still end up with examples of the same shape on both sides of the line. Clearly, we will have to try a different approach.

Graph5.svg

Here, we have two clusters. (Note that the number of clusters is arbitrary.) What do these clusters represent? It can be difficult to say. Sometimes the model finds patterns in the data that you don't want it to learn, such as stereotypes or bias.

Graph6.svg

However, when new data arrives, we can categorize it pretty easily, assuming it fits into a known cluster. But what if your photo clustering model has never seen a pangolin before? Will the system cluster the new photo with armadillos or maybe hedgehogs? This course will talk more about the difficulties of unlabeled data and clustering later on.

Graph7.svg

Note: While it is very common, clustering is not the only type of unsupervised learning.
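To make the clustering idea concrete, here is a minimal k-means sketch in plain Python. It is one possible clustering approach, not necessarily the one behind the graphs above. The sample points, the choice of `k=2`, and the way the initial centroids are seeded are all assumptions made for illustration.

```python
from math import dist

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        for p in points
    ]

def k_means(points, k, max_iterations=100):
    # Assumption: seed the centroids with k evenly spaced input points.
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignments = None
    for _ in range(max_iterations):
        new_assignments = assign(points, centroids)
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean(members)
    return centroids, assignments

# Hypothetical unlabeled examples: no species column, only measurements.
unlabeled = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
             (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

centroids, assignments = k_means(unlabeled, k=2)
print(assignments)

# A new example is categorized by finding its nearest cluster.
print(assign([(4.9, 5.3)], centroids)[0])
```

Note that the algorithm only returns cluster indices. Deciding what each cluster means is still up to you, which is the difficulty described above.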

Reinforcement Learning

An additional branch of machine learning is reinforcement learning (RL). Reinforcement learning differs from other types of machine learning: in RL, you don't collect examples with labels. Imagine you want to teach a machine to play a very basic video game and never lose. You set up the model (often called an agent in RL) with the game, and you tell the model not to get a "game over" screen. During training, the agent receives a reward when it performs this task; the rule that determines the reward is called a reward function. With reinforcement learning, the agent can, in some games, learn to outperform humans.
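The following sketch shows what an agent, a reward function, and a training loop can look like in code. Everything in it is a hypothetical illustration: the "game" is a five-cell corridor invented for this example, and the learning rule is tabular Q-learning with purely random exploration, which is one common RL technique and not something the text above prescribes.

```python
import random

GAME_OVER, GOAL = 0, 4                 # terminal cells of a 5-cell corridor
ACTIONS = {"left": -1, "right": +1}    # each action moves the agent one cell

def reward_function(position):
    """Hypothetical reward: punish 'game over', reward reaching the goal."""
    if position == GAME_OVER:
        return -1.0
    if position == GOAL:
        return 1.0
    return 0.0

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning; the agent explores by acting at random."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(1, GOAL) for a in ACTIONS}
    for _ in range(episodes):
        state = 2  # every episode starts in the middle of the corridor
        while state not in (GAME_OVER, GOAL):
            action = rng.choice(list(ACTIONS))
            next_state = state + ACTIONS[action]
            reward = reward_function(next_state)
            if next_state in (GAME_OVER, GOAL):
                future = 0.0  # nothing more to gain after a terminal cell
            else:
                future = max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = next_state
    return q

def greedy_policy(q):
    """For each non-terminal cell, pick the action with the highest value."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(1, GOAL)}

policy = greedy_policy(train())
print(policy)
```

No labeled examples appear anywhere in this sketch. The agent generates its own data by playing, and the reward function is the only guidance it receives.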

The lack of a data requirement makes RL a tempting approach. However, designing a good reward function is difficult, and RL models are less stable and predictable than supervised approaches. Additionally, you need to provide a way for the agent to interact with the game to produce data, which means building either a physical agent that can interact with the real world or a virtual agent and a virtual world; either is a big challenge. See this blog post by Alex Irpan for an overview of the types of problems currently faced in RL. Reinforcement learning is an active field of ML research, but in this course, we'll focus on supervised solutions because they address a better-understood problem, are more stable, and result in a simpler system.

For comprehensive information on RL, check out Reinforcement Learning: An Introduction by Sutton and Barto.

Easy-to-understand examples

Supervised learning

  • You get a bunch of photos labeled with what is in them, and you train a model to recognize what is in new photos.

  • You have a bunch of molecules and information about which are drugs, and you train a model to answer whether a new molecule is also a drug.

Unsupervised learning

  • You have a bunch of photos of 6 people but no information about who is in which photo, and you want to divide this data set into 6 piles, each with the photos of one individual.

  • You have molecules, some of which are drugs and some of which are not, but you do not know which are which, and you want the algorithm to discover the drugs.

Types of ML problems in process mining

| Type of ML Problem | Description | Example |
|---|---|---|
| Classification | Pick one of N labels | Will this delivery be late? |
| Regression | Predict numerical values | What is the predicted throughput time for this case? |
| Clustering | Group similar examples | Which of my cases are similar in flow, and which clusters of process flows are there? |
| Association rule learning | Infer likely association patterns in data | If activity X happens, which activity is also likely to happen? |
| Structured output | Create complex output | Which topic is this service desk ticket about? |
| Ranking | Identify position on a scale or status | How urgent is this service desk ticket? |
| Anomaly detection | Find uncommon things | Which of my cases do not follow the desired path? |
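As one worked example from the table, the association-rule question ("If activity X happens, which activity is also likely to happen?") can be approximated by computing the confidence of a rule X → Y, that is, the share of cases containing X that also contain Y. The sketch below is a simplified illustration: the event log, the activity names, and the `confidence` helper are hypothetical, and it does not use any Celonis API.

```python
# Hypothetical event log: one list of activities per case.
cases = [
    ["Create Order", "Approve", "Ship", "Invoice"],
    ["Create Order", "Approve", "Ship"],
    ["Create Order", "Change Price", "Approve", "Ship", "Invoice"],
    ["Create Order", "Cancel"],
]

def confidence(cases, antecedent, consequent):
    """Share of cases containing `antecedent` that also contain `consequent`."""
    with_antecedent = [case for case in cases if antecedent in case]
    if not with_antecedent:
        return 0.0  # the rule cannot be evaluated; report zero confidence
    with_both = [case for case in with_antecedent if consequent in case]
    return len(with_both) / len(with_antecedent)

print(confidence(cases, "Approve", "Ship"))
print(confidence(cases, "Ship", "Invoice"))
print(confidence(cases, "Create Order", "Cancel"))
```

This measure ignores the order of activities within a case. A process-mining analysis would often also care about whether Y happens after X.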

Portions of this page are modifications based on work created and shared by Google and used according to terms described in the Creative Commons Attribution 4.0 License.