Basic Concepts of Feature Selection

Feature selection can greatly improve your machine learning models. In this series of blog posts, we will discuss all you need to know about feature selection.

  1. We start by explaining why feature selection is important – even in the age of deep learning. We then establish that feature selection is actually a very hard problem to solve. We will cover different approaches which are used today and discuss their respective problems. This concludes the first blog post.
  2. We will then explain why evolutionary algorithms are one of the best solutions. We will discuss how they work and why they perform well even for large feature subsets.
  3. In the third post, we will improve the proposed technique. We will simultaneously improve the accuracy of models and reduce their complexity. This is the holy grail of feature selection. The proposed method should become your new standard technique.
  4. We will end with a discussion about feature selection for unsupervised learning. It turns out that you can use the multi-objective approach even in this case. Many consider feature selection for clustering to be an unsolved problem. But by using the proposed technique you can get better clustering results as well!

Ok, so let’s get started by defining the basics of feature selection first.

Why should we care about Feature Selection?

There is consensus that feature engineering often has a bigger impact on the quality of a model than the model type or its parameters. And feature selection is a key part of feature engineering. On the other hand, kernel functions and hidden layers perform implicit feature space transformations. So, is feature selection still relevant in the age of SVMs and Deep Learning? The answer is: Yes, absolutely!

First, you can fool even the most complex model types. If there is enough noise overshadowing the true patterns, it will be hard to find them. In those cases, the model starts to model the noise patterns of the unnecessary features. And that means that it does not perform well at all. Or, even worse, it starts to overfit to those noise patterns and will fail on new data points. It turns out that it is even easier to fall into this trap for a high number of data dimensions. And no model type is better than others in this regard: decision trees can fall into this trap just like multi-layer neural networks. Removing noisy features can help the model to focus on the relevant patterns instead.

But there are two more advantages of feature selection. If you reduce the number of features, models are generally trained much faster. And the resulting model often is simpler which makes it easier to understand. And more importantly, simpler models tend to be more robust. This means that they will perform better on new data points.

To summarize, you should always try to make the work easier for your model. Focus on the features which carry the signal over those which are noise. You can expect more accurate models and train them faster. Finally, they are easier to understand and are more robust. Sweet.

Why is this a hard problem?

Let’s begin with an example. We have a data set with 10 attributes (features, variables, columns…) and one label (target, class…). The label column is the one we want to predict. We trained a model on this data and evaluated it. The accuracy of the model built on the complete data was 62%. Is there any subset of those 10 attributes where a trained model would be more accurate than that? This is exactly the question feature selection tries to answer.

We can depict any attribute subset of 10 attributes as a bit vector, i.e. as a vector of 10 binary numbers 0 or 1. A 0 means that the specific attribute is not used while a 1 depicts an attribute which is used in this subset. If we, for example, want to indicate that we use all 10 attributes, we would use the vector (1 1 1 1 1 1 1 1 1 1). Feature selection is the search for such a bit vector leading to a model with optimal accuracy. One possible approach for this would be to try out all the possible combinations. We could go through them one by one and evaluate how accurate a model would be using only those subsets. Let’s start with using only a single attribute then. The first bit vector looks like this:

picture1-1024x87

As you can see, we only use the first attribute and a model built only on this subset has an accuracy of 68%. That is already better than what we got for all attributes, which was 62%. But maybe we can improve even more? Let’s try using only the second attribute now:

picture2-1024x129

Still better than using all 10 attributes but not as good as only using the first. But let’s keep trying:

picture3-768x151

We keep going through all possible subsets of size 1 and collect all accuracy values. But why should we stop there? We should also try out subsets of 2 attributes now:

picture4-1024x306

Using the first two attributes immediately looked promising with 70% accuracy. We keep going through all possible subsets of all possible sizes. We collect the accuracy values of those subsets until we have tried all possible combinations:

picture5-1024x351

We now have the full picture and can use the best subset out of all these combinations. Because we tried all combinations, we also call this a brute force approach.
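This exhaustive search can be sketched in a few lines of Python. The `evaluate` function is a stand-in for whatever model validation you use (for example, a cross-validated accuracy for the subset encoded by `bits`); it is an assumption of this sketch, not a specific library API:

```python
from itertools import product

def brute_force_selection(evaluate, n_attributes):
    """Try every non-empty attribute subset, encoded as a bit vector,
    and return the best one together with its accuracy."""
    best_bits, best_acc = None, float("-inf")
    for bits in product([0, 1], repeat=n_attributes):
        if not any(bits):              # skip the empty subset
            continue
        acc = evaluate(bits)
        if acc > best_acc:
            best_bits, best_acc = bits, acc
    return best_bits, best_acc
```

For 10 attributes this loop performs exactly 1023 model evaluations, which already hints at why the approach breaks down as the number of attributes grows.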

How many combinations did we try for our data set consisting of 10 attributes? We have two options for each attribute: we can decide to either use it or not. And we can make this decision for all 10 attributes, which results in 2 × 2 × 2 × … = 2^10 = 1024 different outcomes. One of those combinations does not make any sense though, namely the one which does not use any features at all. So, this means that we would only try 2^10 − 1 = 1023 subsets. Anyway, even for a small data set with only 10 attributes this is already a lot of attribute subsets we need to try. Also keep in mind that we perform a model validation for every single one of those combinations. If we used a 10-fold cross-validation, we have already trained 10 × 1023 = 10,230 models. That is a lot. But it is still doable for fast model types on fast machines.

But what about more realistic data sets? This is where the trouble begins. If we have 100 instead of only 10 attributes in our data set, we already have 2^100 − 1 combinations. This brings the total number of combinations to 1,267,650,600,228,229,401,496,703,205,375. Even the biggest computers can no longer do this. This rules out the brute force approach for any data set consisting of more than just a handful of features.

Heuristics to the Rescue!

Going through all possible attribute subsets is not a feasible approach then. But what can we do instead? We could try to focus only on those combinations which are more likely to lead to more accurate models. We try to prune the search space and ignore feature sets which are unlikely to produce good models. However, there is of course no longer a guarantee that you will find the optimal solution. If you ignore complete areas of your solution space, you might also skip the optimal solution. But at least such heuristics are much faster than the brute force approach. And often you will end up with a good – and sometimes even the optimal – solution in much less time.

There are two widely used approaches for feature selection heuristics in machine learning. We call them forward selection and backward elimination. The heuristic behind forward selection is very simple: you first try out all subsets with only one attribute and keep the best solution. But instead of trying all possible subsets with two features next, you only try specific 2-subsets, namely those which contain the best attribute from the previous round. If none of them improves on the previous result, you stop and deliver the best result from before, i.e. the single attribute. But if the accuracy has improved, you keep the best attributes so far and try to add one more. You keep doing this until you no longer improve.

What does this mean for the runtime for our example with 10 attributes from above? We start with the 10 subsets of only one attribute which is 10 model evaluations. We then keep the best performing attribute and try the 9 possible combinations with the other attributes. This is another 9 model evaluations then. We stop if there is no improvement or keep the best 2-subset if we got a better accuracy. We now try the 8 possible 3-subsets and so on. So, instead of going brute force through all 1023 possible subsets, we only go through 10 + 9 + … + 1 = 55 subsets. And we often will stop much earlier as soon as there is no further improvement. We will see below that this is often the case. This is an impressive reduction in runtime. And the difference becomes even more obvious for the case with 100 attributes. Here we will only try at most 5,050 combinations instead of the 1,267,650,600,228,229,401,496,703,205,375 possible ones.
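The greedy loop described above can be sketched as follows; `evaluate` is again a stand-in for your model validation on a set of attribute indices, an assumption of the sketch rather than a specific library API:

```python
def forward_selection(evaluate, n_attributes):
    """Greedy forward selection: keep adding the single attribute
    which improves the accuracy the most, stop when nothing helps."""
    selected = set()
    best_acc = float("-inf")
    while len(selected) < n_attributes:
        candidate, candidate_acc = None, best_acc
        for a in range(n_attributes):
            if a in selected:
                continue
            acc = evaluate(selected | {a})
            if acc > candidate_acc:
                candidate, candidate_acc = a, acc
        if candidate is None:          # no single addition improved: stop
            break
        selected.add(candidate)
        best_acc = candidate_acc
    return selected, best_acc
```

For 10 attributes this evaluates at most 10 + 9 + … + 1 = 55 subsets, matching the count above, and usually far fewer because of the early stop.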

Things are similar with backward elimination; we just reverse the direction. We begin with the subset consisting of all attributes. Then we try to leave out one single attribute at a time. If leaving one out improves the accuracy, we keep going, removing the attribute which led to the biggest improvement. We then go through all combinations formed by leaving out one more attribute, in addition to the ones we already removed before. We continue doing this until we no longer improve. Again, for 10 attributes this means that we will have at most 1 + 10 + 9 + 8 + … + 2 = 55 combinations we need to evaluate.
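The mirror image of the previous sketch, again with `evaluate` as a hypothetical stand-in for model validation:

```python
def backward_elimination(evaluate, n_attributes):
    """Greedy backward elimination: start from all attributes and
    keep removing the one whose removal improves accuracy the most."""
    selected = set(range(n_attributes))
    best_acc = evaluate(selected)
    while len(selected) > 1:
        drop, drop_acc = None, best_acc
        for a in selected:
            acc = evaluate(selected - {a})
            if acc > drop_acc:
                drop, drop_acc = a, acc
        if drop is None:               # no removal improved: stop
            break
        selected.remove(drop)
        best_acc = drop_acc
    return selected, best_acc
```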

Are we done then? It looks like we found some heuristics which work much faster than the brute force approach. And in certain cases, these approaches will deliver a very good attribute subset. The problem is that in most cases, they unfortunately will not. For most data sets, the model accuracy values form a so-called multi-modal fitness landscape. This means that besides one global optimum there are several local optima as well. Both methods will start somewhere on this fitness landscape and will move from there. In the image below, we marked such a starting point with a red dot. From there, they will continue to add (or remove) attributes if the fitness improves. They will always climb up the nearest hill in the multi-modal fitness landscape. And if this hill is a local optimum they will get stuck in there since there is no further climbing possible. Hence, those algorithms do not even bother with looking out for higher hills. They take whatever they can easily get. Which is exactly why we call those algorithms “greedy” by the way. And when they stop improving, there is only a very small likelihood that they made it on top of the highest hill. It is much more likely that they missed the global optimum we are looking for. Which means that the delivered feature subset is often a sub-optimal result.

picture6-768x441

Slow vs. Bad. Anything better out there?

This is not good then, is it? We have one technique, the brute force approach, which would deliver the optimal result but is computationally infeasible. As we have seen, we cannot use it at all on realistic data sets. And we have two heuristics, forward selection and backward elimination, which deliver results much quicker. But unfortunately, they will run into the first local optimum they find. And that means that they most likely will not deliver the optimal result. Don’t give up though – in the next post we will discuss another heuristic which is still feasible even for larger data sets. And it often delivers much better results than forward selection and backward elimination. This heuristic makes use of evolutionary algorithms, which will be the topic of the next post.

Photo Friday: White Mountains, NH, USA – PART II

It’s time for another Photo Friday. I was traveling for business quite a bit in the past couple of weeks so I did not have the chance for longer nature walks. But I still have some images left from my last trip to the White Mountains in New Hampshire.

This is the second post in this mini-series. As last time, the goal was to take some long-exposure-time shots. I am always amazed by the beauty of this area so close to my home. And I must say that I rarely do one-day trips where I can take dozens of decent shots all from the same day and area. So, if you like those images, please also check out the first part of this mini-series.

Here are the images I have selected for today:

Why I am obsessed with Monthly Active Users

Recently, we had discussions at RapidMiner if we are too obsessed with one of our most important KPIs. This KPI is the Monthly Active Users (MAU).

I am indeed obsessed with it. So let’s dig a bit deeper and start with the definition of this KPI:

Monthly Active Users is the total number of users who started the product at least once in a given month.

Based on the definition of Active User, you can break this group down into several interesting subgroups:

  • Active User: started the product at least once in the given month, i.e. this is the complete group
  • New User: Active User who registered in that particular month
  • Resurrected User: been inactive for a while (at least for the previous month) but became an Active User again
  • Retained User: been an Active User in the past month and remained an Active User
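These definitions are easy to operationalize. The sketch below classifies a user for a given month; the integer months and the function name are assumptions for illustration, not part of any analytics product’s API:

```python
def classify_user(active_months, month, registered_month):
    """Classify a user for a given month following the definitions
    above.  `active_months` is the set of months in which the user
    started the product at least once (months as plain integers)."""
    if month not in active_months:
        return "inactive"
    if registered_month == month:
        return "new"
    if month - 1 in active_months:
        return "retained"
    return "resurrected"
```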

You can define the same KPIs for weeks or days instead of months. They then become WAU and DAU (for Weekly / Daily Active Users). In any case, the result should be a chart like the following one:

MAU

As you can see in this example, roughly a third of the MAU each month are coming from new users. And in this case the company retains almost 50%. If your data looks like this, a direct consequence of this should be to improve your user retention rate. The impact of the retention rate compounds over time. This will have a significant impact on the growth of your MAU.

My colleague Tom Wentworth created a great summary about our experiences with Product Qualified Leads (PQLs). The idea of PQL is to turn people into users and raving fans before they become customers. Check out his article, but a main learning from our change to PQL is that the user needs to be at the center of the business.  The MAU is a great and simple KPI to measure if your activities have an impact in that regard.

And here are even more reasons why most businesses should pay a lot of attention to their MAU:

  • It puts the user at the center. This should be the case for any subscription business, especially for those which are using PQLs.
  • For freemium business models, the user base often is the top of the funnel. As such it becomes the baseline of a predictable business. For example, you might find that you can convert 3% of your user base into paying customers with a time delay of 3 months.
  • It grows if the new user rate grows, which measures the impact of your marketing efforts.
  • It grows even faster if you can retain or resurrect old users. This measures the impact of product, onboarding, and education efforts.
  • It is very easy to understand.

No other one-number KPI I can think of covers all 5 aspects at the same time. This is why I am currently so obsessed with it.

Photo Friday: White Mountains, NH, USA

And another Photo Friday with some images I just took last weekend. My parents are visiting us in the US and so we went on a day trip to the White Mountains in New Hampshire. This is my favorite hiking area in New England. But this time we focused more on long-exposure-time photography.

My parents brought a new gray filter for their camera and asked me for some advice on how to use it in the field. I knew some nice rivers and waterfalls in the White Mountains and so we went there for a day and worked on some images.

I drive up to NH about 5 times each year. If you live in the region and like the outdoors, you should give it a try. I guess if you live here and like the outdoors, you already did.

Here are the images I have selected for today:

Naïve Bayes – Not so naïve after all!

Family:
Supervised learning
Modeling types:
Classification
Group:
Bayesian Learners
Input data:
Numerical, Categorical
Tags:
Fast, Probabilities
5MwI_bayes

One of my quirky videos from the “5 Minutes with Ingo” series. It explains the basic concepts of Naive Bayes in 5 minutes. And has unicorns.

Concept

Despite its simplicity, Naïve Bayes is a powerful machine learning technique.  It only requires one single data scan for building the model.  This is as fast as it can get.  And for many problems, it shows excellent results.  Learn more about this classifier below and make it part of your standard toolbox.

Let’s start with a small example.  Somebody asks you to predict if a given day is a summer day or if this is a day in the middle of winter.  Hence, the two classes of our classification problem are “summer” and “winter”.  All this person is telling you about this day is that it is raining.  You do not know anything else.  What do you think?  Is this rainy day a summer day or a winter day?

The answer obviously depends on where you are living.  People from North America or Europe will often say: “It is probably a winter day since it rains much more in winter.”

This seems to be true.  Below is the average amount of precipitation for Seattle, WA, USA – a town which is well known for having a lot of rain:

nb_seattle_precipitation

The average amount of rain per month in Seattle, WA, USA.  Seattle is known for two things: lots of rain and lots of hospitals.  Including made-up ones like the one from the TV show Grey’s Anatomy.

And this argument is exactly the basic idea of a Naïve Bayes classifier.  It generally rains more in winter than in summer.  If all I know is that the day in question is rainy, it is just more likely that this is a winter day.  Case closed.

This thought leads to the concept of conditional probabilities and the Bayes rule.  We will discuss the theory behind this a little bit later.  For now, it is sufficient to know that we can use this rule to infer the probabilities for all possible classes based on the given observations.

The basic idea is easy to grasp from the image below.  We can see that we have roughly the same number of summer and winter days.  Based on this alone, the probability that any given day would be a winter day is about 50%:

nb_rainy_day

We have roughly the same amount of winter and summer days.  So for any given day the probability to be a summer day is about 50%.  But 80% of all days in winter are rainy and only 30% of days in summer.  This knowledge reduces the probability for being a summer day.

But if you also factor in that the day in question is rainy, then this probability changes a bit.  We will see later that we can calculate a likelihood of the day being a winter day if we know it is a rainy day.  This likelihood is 80% * 50% = 40%.  While the likelihood of being a summer day when it rains is only 30% * 50% = 15%.  The likelihood that the day is in winter is much higher which leads us to the desired decision: it probably is a winter day!

Keep in mind that the outcome can also depend on a combination of events (“it rains and temperature is low”) instead of a single event (“it rains”).  In machine learning, we represent such a combination of events as attributes or features of each observation.  You could for example measure if it is raining or not, what the temperature is, and the wind speed.  Those three factors, or events, would then become the attributes in your data set.  And here is where the Naïve Bayes algorithm makes a somewhat naïve assumption (hence the name).  The algorithm treats all attributes in the same way, i.e. it assumes that all attributes are equally important.  And it also assumes that they are statistically independent of each other.  This means that knowing the value of one attribute says nothing about the value of another.  We can see already in our simple weather example that this assumption is never correct.  A windy, rainy day usually also has a lower temperature.

But – and this is the surprising thing – although this naïve assumption is almost always incorrect, this machine learning scheme works very well in practice.

Theory

We will now translate the idea described above into an algorithm. We can use it to classify new observations based on the probabilities seen in past data.  Let’s discuss a slightly more complex example which is still related to weather data.  Below is a table with 14 observations (or “examples”).  For each observation, we have some weather information, such as the outlook or the temperature.  We also know the humidity and if the day was windy or not.  This is the information we base our decisions on.  We also have another column in the data which is our target, i.e. the attribute we want to predict.  Let’s call this “play”, and this target attribute indicates if we are going on a round of golf or not.

nb_golf_data

The Golf data set.  We want to predict if somebody goes on a round of golf or not based on weather information like outlook or temperature.  Not that true golfers would care.  They always play.

Our Naïve Bayes algorithm is now very simple.  First, you go through the data set once and count how often each combination of an attribute value (like “sunny”, “overcast”, or “rainy”) with each of the possible classes (here: “yes” or “no”) occurs. This leads to a much more compact representation of the information in the data set:

nb_data_counts

We can count all combinations of the attribute values with the possible values for the target “Play”.  This can be done in a single data scan.

For example, you got 4 cases of days with a mild temperature at which you have been playing golf in the past.  And 2 cases of mild days where you did not play.  The beautiful thing is that you can generate all combination counts in the same data scan.  This makes the algorithm very efficient.

The second step of the Naïve Bayes algorithm turns those counts into probabilities.  We can simply divide each count by the number of observations in each class.  We have 9 observations with Play = Yes and 5 cases with Play = No.  Therefore, we divide the counts for the “Yes” combinations by 9 and the counts for combinations with “No” by 5:

nb_data_ratios

We turn the counts from above into ratios by dividing each count by the number of corresponding cases, i.e. 9 for “Yes” and 5 for “No”.  We also calculate the overall ratios for “Yes” and “No” which are 9/14 and 5/14.

In the same step, we can also turn the counts for “Yes” and “No” into the correct probabilities.  To do so, we divide them by the total number of observations in our data which is 14.

Those two simple steps are all you are doing for modeling in fact.  The model can now be applied whenever you want to find out what the most likely outcome is for a new combination of weather events.  This phase is also called “scoring”.
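Those two modeling steps fit into a short Python sketch. The function name and the tuple-based input format are assumptions for illustration, not a specific library API:

```python
from collections import defaultdict

def fit_naive_bayes(rows, labels):
    """Single-scan Naive Bayes modeling as described above: count the
    attribute-value / class combinations, then divide the counts to
    get the ratios.  `rows` holds tuples of categorical values."""
    class_counts = defaultdict(int)
    combo_counts = defaultdict(int)    # (attribute index, value, class) -> count
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for i, value in enumerate(row):
            combo_counts[(i, value, label)] += 1
    total = len(labels)
    priors = {c: n / total for c, n in class_counts.items()}
    ratios = {k: n / class_counts[k[2]] for k, n in combo_counts.items()}
    return priors, ratios
```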

Let’s assume that we have a new day which is sunny and cool, but has a high humidity and is windy.  We calculate the likelihood for playing golf on such a day by multiplying the ratios which correspond to the combination of those weather attributes with “yes”.  The table below highlights those ratios:

nb_data_ratios_likelihood_yes

For a new prediction, we use the ratios of the weather condition at hand for each of the potential classes “Yes” and “No”.  Above we can see the ratios in combination with “Yes”.

The likelihood for “yes” is then 2/9 * 3/9 * 3/9 * 3/9 * 9/14.  The first four factors are the conditional probabilities that the specific weather events occur on a golfing day.  The last probability, 9/14, is the overall probability to play golf independent of any weather information.

If you calculate the result of this formula, you will end up with a likelihood of 0.0053.

We can now do the same calculation for our day (sunny, cool, high, true), but this time we assume that the day might not be a good day for golfing:

nb_data_ratios_likelihood_no

Finally, we can also calculate the likelihood for “No” by multiplying up the ratios of the weather conditions for the “No” case.

What is the likelihood for “no” in this case?  We multiply the ratios which are 3/5 * 1/5 * 4/5 * 3/5 * 5/14 which results in 0.0206.

You can see that the total likelihood for “no” (0.0206) is much higher than the one for “yes” (0.0053).  Therefore, our prediction for this day would be “no”.  By the way, you can turn these likelihoods into true probabilities.  Simply divide both values by their sum.  The probability for “no” then is 0.0206 / (0.0206 + 0.0053) = 79.5% and the probability for “yes” is 0.0053 / (0.0053 + 0.0206) = 20.5%.  Which confirms that we should not do a round of golf on such a day…
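The whole scoring calculation above fits into a few lines of Python:

```python
# Ratios for the day (sunny, cool, high humidity, windy),
# read off the two tables above.
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
like_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

# Normalize the likelihoods into true probabilities.
p_no = like_no / (like_yes + like_no)

print(round(like_yes, 4))  # 0.0053
print(round(like_no, 4))   # 0.0206
print(round(p_no, 3))      # 0.795
```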

This is great.  But how did we come up with the multiplication formula above?  This is where the Bayes theorem comes into play.

This theorem states that

P(A|B) = P(B|A) · P(A) / P(B)

In our example, A means “yes, we play golf” and B means a specific weather condition like the one above.  P(A|B) is exactly the probability we are looking for: how likely is it that we play, given the observed weather?  We calculate this by multiplying the probability for the weather condition B given that we play (= P(B|A)) with the probability for playing (= P(A)). The latter one is very simple. This is just the number of yes-cases divided by the total number of observations, i.e. 9/14.

But what about P(B|A), i.e. the probability for a certain weather condition given that we play golf on such a day? This is exactly where the naïve assumption of attribute independence comes into play. If the attributes are independent of each other, we can calculate this probability by multiplying the elements:

P(B|A) = P(B1|A) · P(B2|A) · … · P(Bm|A)

where B1,…,Bm are the m attributes of our data set.  And this is exactly the multiplication we have used above.

Finally, what happens to P(B) in the denominator of the Bayes theorem?  In our example, this would be the likelihood of a certain weather condition independent of the question of playing golf.  We do not know this probability from our data above, but this is not a problem.  Since we are only interested in the prediction “yes” or “no”, we can omit it.  The reason is that the denominator would be the same for both cases.  And therefore, it would not influence the ranking of the likelihoods.  This is also the reason why we only ended up with likelihoods above and not with true probabilities.  To turn them into probabilities again, we need to divide the likelihoods by their sum.

Practical Usage

As pointed out above, Naïve Bayes should be another standard tool in your toolbox.  It only requires a single data scan which makes it one of the fastest machine learning methods available.  It often has a good accuracy even though the naïve assumption of attribute independence is often not fulfilled.

There is a downside though.  The model is not easy to understand.  You cannot see how the interaction of all attributes affects the outcome of the prediction.

Data Preparation

The algorithm is very robust and works on a variety of data types.  We have only discussed categorical data above but it also works well on numerical data.  Instead of counting, you calculate the mean values and standard deviations of the numerical values depending on the class.  You can then use Gaussian distributions to estimate the probabilities of the values given each class.  Some more sophisticated versions of the algorithm even allow you to create more complex estimations of the distributions by using so-called kernel functions.  But in both cases, the result will be a probability for the given value, just like the ones we got for nominal attributes and their counts.
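For a numerical attribute, the counted ratios are replaced by density values from the fitted Gaussian.  A minimal sketch of such a density estimate, assuming you have already computed the per-class mean and standard deviation:

```python
import math

def gaussian_density(x, mean, std):
    """Class-conditional density estimate for a numerical attribute,
    based on the per-class mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)
```

These density values then simply take the place of the counted ratios in the multiplication used for scoring.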

Alternatively, you can of course discretize your numerical column before you use Naïve Bayes.

Finally, most implementations of Naïve Bayes cannot deal with missing values.  You need to take care of this before you build the model in such a case.

Parameters to Tune

One of the most beautiful things about Naïve Bayes is that it does not require the tuning of any parameters.  If you use a fancier version of the algorithm which applies kernel functions to numerical data, you might need to define the number of kernels used or similar parameters.

Memory Usage & Runtimes

We mentioned before that the algorithm only requires a single data scan to build the model.  This makes it one of the fastest algorithms available.

Memory consumption is usually also quite moderate.  The algorithm transforms the data into a table of counts.  The size of this table depends on the number of classes you are predicting.  It also depends on the number of attributes and how many different values each can take.  While this can grow a bit, it usually is smaller than the original data table or not much bigger at least.

The runtime for scoring is also very fast.  After all, you only look up the correct ratios for all attribute values and multiply them all.  This does not cost a lot of time.

RapidMiner Processes

You can download RapidMiner here.  Then you can download the process below to build this machine learning model yourself in RapidMiner.

Please download the Zip-file and extract its content.  The result will be an .rmp file which can be loaded into RapidMiner via “File” -> “Import Process…”.

Photo Friday: Acadia National Park, Maine, USA

It’s photo Friday again.  This time I will share some very recent pictures which I took during the last weekend.  Nadja and I decided to do a trip to the Acadia National Park in Maine for some hiking.

I was in Acadia the last time exactly 4 years ago, shortly before I moved to the US.  That time, we went up there to convince ourselves that this move was a good idea.  And we indeed followed through with it shortly after.  This time, we came back as regular travelers.  And to enjoy the great hikes and all the other things Acadia offers.

It might not be the biggest or most impressive of the National Parks.  But they allow our furry friend Marla on the trails.  And it is a special place to us which we can reach in a nice 5-hour drive.

Here are the images I have selected for today:

Which is the Right Open Source Business Model for You?

This is the third and last part of a series.

Summary: When you should consider using an open source license

Here is the essence of the two previous posts:

  • Open source licenses support the creation of communities and accelerate innovation.  They do this by allowing people to change or even embed your intellectual property.
  • Your go-to-market strategy should match the decision for your license.  If communities and innovation are vital, open source is a good option.  If only fast market penetration is important, a freemium model might be better.  If none of this matters, traditional enterprise sales might be the better way to go.
  • Open-source-based strategies can work very well when you plan to bring a new business model into an established market.  Good examples for this are MySQL or Red Hat.  Here a “good enough”-product-strategy with a lower TCO can win the race.
  • Open-source-based strategies also lend themselves to platform products in a complex ecosystem.  A good example here is the Hadoop / Spark landscape.  A fast land grab is of benefit here as well.
  • Finally, open-source-based strategies also work better for very developer-centric fields (like Hadoop again, or Atlassian etc.).  Open APIs, large communities, and marketplaces are winning strategies here.  But Atlassian (see below) is also a good example where a hybrid model can work very well.

Open-Source-Based Business Models

Let me start this blog post by saying that I am a big believer in open-source-models.  If the go-to-market strategy calls for it that is. Otherwise, there are likely better options for creating a business around your software.

The last question now is how to turn the decisions you made into a working business model. For this purpose, I won’t dive into the fine differences between “free software” and “open source.” But I would like to discuss the most important business models around open-source software. The table below shows a quick overview of the most successful approaches:

| Business Model | What is open? | What do you sell? | Pros | Cons |
| --- | --- | --- | --- | --- |
| Open Source (e.g. Red Hat) | All your software | Services only, like training, support, and guarantees | Supports the original ideas of open source in the strongest way. | Very hard to create a scalable business; most companies fail. |
| Business Source (e.g. Jedox) | Older software versions | Latest version of the software, support, and services | Ultimately everyone gets access to everything. | The majority of users do not benefit from innovation; maintenance of multiple versions. |
| Open Core (e.g. MySQL, Talend, Pentaho) | The core of the software | Additional software features, support, and services | Clear feature-based differentiation; a good balance between open source concepts and commercialization. | Some features will never be available to the general community. |

A pure open source model has seldom been a successful commercial business model. Maybe the only successful example is Red Hat. This is not a surprise: Red Hat sells an operating system, and if it breaks down, everything running on the machine breaks with it. So selling support and guarantees can be enough of a value proposition in this case. But for business applications this is generally not a sustainable business model. In general, the only thing you can really sell is services and guarantees, which also do not scale as well as selling software.

The other two models, open core and business source, try to find a balance between community benefits and commercial success for the vendor. Let’s keep in mind that most vendors have developers who need to feed their families, too.

The idea behind an open core model is that you are not giving away all features for free, but enough features to build a meaningful community. The paid version of your product comes with more features which are valued enough by users to pay for them.

Business source is the least known of the three models. The idea here is that only paying customers get the latest version with all new features. Older versions, or all source code after a delay of, let's say, three years, fall under a standard open source license.

I have to admit that I liked the business source model for quite some time. But it has significant disadvantages: your community no longer gets access to the latest versions. And it can no longer contribute to the product. So, you are losing one of the biggest advantages of open source licenses: faster innovation. The time delay is also problematic from a quality assurance perspective: your paying customers are now getting the version which is least tested.

Adjacent Business Models: Freemium & The Atlassian Model

Finally, I would like to discuss two adjacent models. They are not based on open-source licenses, but share many of their characteristics. Those are the freemium model and the Atlassian model. A freemium model offers a limited version of your product for free and makes you pay for the whole thing. That sounds very much like the open core model discussed above. If building a developer community and higher levels of innovation are less important for you, a freemium model is often the better choice.

The Atlassian model was introduced by the very successful software company of the same name. It has two interesting twists: the first is that using the software with only a few users is VERY cheap (but not free!). But the pricing curve is rather steep if you add users beyond a certain threshold. This has similar dynamics to freemium models and open core. And it can work very well if the value of the software scales with the number of users. The second twist is to make your APIs very well documented and developer friendly. This allows developers to hook into the product. While this is not exactly an open-source approach, it can still create a lot of innovation and a massive developer community.

I hope this series of blog posts is helpful to figure out if an open source license is the right thing for you or not.  And then ultimately how to match your go-to-market strategy and business model to it.

Photo Friday: Norway

It’s photo Friday.  And that means that it is time for another set of pictures I took in the past.  Today I will share some images from Norway.

I went to Norway twice. The first time I was still in school and traveled across Norway for 6 weeks on my bicycle. Yes, this is correct. I rode 2,500 miles (about 4,000 km) on my bicycle through those mountains. This is as crazy as it sounds. But on the upside: this was the first big trip with Nadja, who I married later. If you survive a trip like that when you are 15 years old, there is nothing you cannot survive as a couple later in life.

We went to Norway again 6 years ago, which was about 20 years after our first trip. This time we drove a car and visited some of the places we had been to before – and many more. This country is so beautiful. We will be back, probably in 20 years to keep the pattern.

Here are the images I have selected for today:

What Artificial Intelligence and Machine Learning can do – and what not

I have written on Artificial Intelligence (AI) before.  Back then I focused on the technology side of it: what is part of an AI system and what isn’t.  But there is another question which might be even more important.  What are we DOING with AI?

Part of my job is to help investors with their due diligence. I discuss companies with them in which they might want to invest. Here is a quick observation: by now, every company pitch is full of claims about how they are using AI to solve a given business problem.

Part of me loves this, since some of those companies are onto something and should get the chance. But I also have a built-in "bullshit-meter". So, another part of me wants to cringe every time I listen to a founder making stuff up about how AI will help them. I have listened to many founders who do not know a lot about AI but sense that they can get millions of dollars of funding just by adding those fluffy keywords to their pitch. The bad news is that sooner or later it actually works. Who am I to blame them?

I have seen situations where AI or at least machine learning (ML) has an incredible impact.  But I also have seen situations where this is not the case.  What was the difference?

In most of the cases where organizations fail with AI or ML, they used those techniques in the wrong context. ML models are not very helpful if you have only one big decision to make. Analytics can still help you in such cases by giving you easier access to the data you need to make this decision, or by presenting this data in a consumable fashion. But at the end of the day, those single big decisions are often very strategic. Building a machine learning model or an AI to help you make this decision is usually not worth the effort. And often it does not yield better results than making the decision on your own.

Here is where ML and AI can help. Machine Learning and Artificial Intelligence deliver most value whenever you need to make lots of similar decisions quickly. Good examples for this are:

  • Defining the price of a product in markets with rapidly changing demands,
  • Making offers for cross-selling in an E-Commerce platform,
  • Approving a credit or not,
  • Detecting customers with a high risk for churn,
  • Stopping fraudulent transactions,
  • …among others.

You can see that a human being who had access to all relevant data could make any one of those decisions in a matter of seconds or minutes. Only they can't without AI or ML, since they would need to make this type of decision millions of times, every day. Think of sifting through a customer base of 50 million clients every day to identify those with a high churn risk. Impossible for any human being, but no problem at all for an ML model.

So, the biggest value of artificial intelligence and machine learning is not to support us with those big strategic decisions.  Machine learning delivers most value when we operationalize models and automate millions of decisions.
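To make this concrete, here is a minimal sketch of what "automating millions of small decisions" looks like in code. The model, the feature names, and the threshold are all made up for illustration; in practice the scoring function would be a trained ML model, and the loop would run over the full customer base on a schedule.

```python
# Hypothetical sketch: daily churn scoring as an automated decision.
# churn_risk() is a toy stand-in for a trained ML model; all names
# and numbers here are illustrative, not from any real system.

from dataclasses import dataclass

@dataclass
class Customer:
    id: int
    logins_last_30d: int
    support_tickets: int
    months_since_signup: int

def churn_risk(c: Customer) -> float:
    """Toy stand-in for a trained model: returns a risk score in [0, 1]."""
    score = 0.0
    if c.logins_last_30d < 3:
        score += 0.5   # inactive users are more likely to churn
    if c.support_tickets > 5:
        score += 0.3   # many support issues signal frustration
    if c.months_since_signup < 6:
        score += 0.2   # new customers churn more often
    return min(score, 1.0)

def flag_high_risk(customers, threshold=0.6):
    """The automated decision, applied to every customer, every day."""
    return [c.id for c in customers if churn_risk(c) >= threshold]

customers = [
    Customer(1, logins_last_30d=1, support_tickets=7, months_since_signup=3),
    Customer(2, logins_last_30d=20, support_tickets=0, months_since_signup=24),
]
print(flag_high_risk(customers))  # -> [1]
```

The point is not the toy scoring logic but the shape of the pipeline: a cheap per-customer decision, applied uniformly at a scale no human could match.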

The image below shows this spectrum of decisions and the times humans need to make those.  The blue boxes are situations where analytics can help, but it is not providing its full value. The orange boxes are situations where AI and ML show real value. And the interesting observation is: the more decisions you can automate, the higher this value will be (upper right end of this spectrum).

[Image: Automating decisions with ML and AI]

One of the shortest descriptions of this phenomenon comes from Andrew Ng, who is a well-known researcher in the field of AI.  Andrew described what AI can do as follows:

“If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.”

I agree with him on this characterization. And I like that he puts the emphasis on automation and operationalization of those models, because this is where the biggest value is. The only thing I disagree with is the time unit he chose: by now, it is safe to go with a minute instead of a second.

Photo Friday: Sailing in Croatia

It’s photo Friday again. On Fridays I will often publish some of the images I took in past years.

Today I would like to share some pictures from a sailing trip to Croatia in 2009. We first visited the Plitvice Lakes National Park, which is one of the most beautiful places in the world. The waterfall picture below was taken there. From there, we drove to Murter to pick up our boat. And then we sailed for three weeks among the roughly 1,200 islands of Croatia.

It truly was a great trip.  Here are the images I have selected for today: