Machine learning is a vast field that offers a multitude of mathematical techniques, and therefore a nearly endless number of ways to approach a given problem. In this post, we explain how machine learning can be seen as a toolbox of algorithms.
Presently, proposing intelligent models to solve everyday problems is accepted as standard practice. The constant appearance of new specializations of the primary, classic machine learning algorithms favors their use in different types of problems.
If we focus on “decision tree” type algorithms, we find classic ones such as CART and ID3, as well as adaptations such as C4.5, C5.0, SLIQ, or SPRINT.
Some are specially designed to work with numerical attributes and others with textual attributes, and they differ in their capacity to tolerate or detect atypical values (outliers). In this sense, the set of currently existing machine learning algorithms can be seen as a great “toolbox.” Each algorithm has one or more specific functions for tackling a particular problem, and several of them combined can solve a larger problem in successive steps. To make this more concrete, I describe below 3 hypothetical, complex scenarios that can be solved by combining different sets of algorithms from the “toolbox.”
Scenario 1: Customer segmentation for an energy distributor
Suppose that an electricity supply company has 100 million records with information on the electricity consumption of its 500,000 customers. The company knows that it has 3 broad groups of customers (individuals, SMEs, and large companies). Still, it wishes to analyze in greater detail the possible subgroups within each of these categories, in order to understand its customers better and launch more personalized promotional campaigns.
A two-step machine learning process could be applied to this problem. In the first step, a partitioning clustering algorithm such as K-means could be applied with K=3 to force a first division of the 500,000 customers into 3 groups, which would ideally correspond to individuals, SMEs, and large companies.
In the second step, a hierarchical clustering algorithm could be applied to each generated group to obtain a dendrogram. A dendrogram is a visual representation of the elements that make up a group, in which the similarity or distance between them is easily observable. By viewing and analyzing the dendrogram generated for each of the 3 initial groups, detailed and valuable information can be obtained to detect possible subgroups of customers within each category.
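The two steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: it assumes scikit-learn and SciPy are available, and that the consumption records have already been aggregated into one feature row per customer. The two features used here (mean monthly kWh and share of consumption at peak hours), the synthetic data, and the function name segment_customers are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def segment_customers(features, n_groups=3, n_subgroups=4, seed=0):
    """Return (group labels, subgroup labels, linkage trees per group)."""
    # Put features on a common scale so no single one dominates distances.
    scaled = StandardScaler().fit_transform(features)

    # Step 1: K-means with K=3 forces a first coarse division.
    kmeans = KMeans(n_clusters=n_groups, n_init=10, random_state=seed)
    groups = kmeans.fit_predict(scaled)

    # Step 2: hierarchical (Ward) clustering inside each coarse group.
    subgroups = np.zeros(len(scaled), dtype=int)  # 0 = group too small to split
    trees = {}
    for g in range(n_groups):
        mask = groups == g
        if mask.sum() < 2:
            continue
        trees[g] = linkage(scaled[mask], method="ward")
        # Cut the tree into at most n_subgroups flat clusters (labels start at 1).
        subgroups[mask] = fcluster(trees[g], t=n_subgroups, criterion="maxclust")
    return groups, subgroups, trees


# Synthetic stand-in data (assumption): [mean monthly kWh, peak-hour share].
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal([300, 0.3], [50, 0.05], size=(200, 2)),       # individual-like
    rng.normal([5000, 0.5], [800, 0.05], size=(60, 2)),      # SME-like
    rng.normal([80000, 0.7], [10000, 0.05], size=(20, 2)),   # large-company-like
])
groups, subgroups, trees = segment_customers(features)
```

Each entry of trees can be passed to scipy.cluster.hierarchy.dendrogram to draw the dendrogram for that group (plotting requires matplotlib). Note that hierarchical clustering works on pairwise distances, so for groups with hundreds of thousands of customers it would probably have to be run on a sample rather than on every customer.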
Scenario 2: Analysis for the installation of an electric charging point
Let’s imagine that the Ministry of Energy of a particular country needs to assess the most suitable areas of a city for installing free charging points for electric vehicles. To do this, the Ministry maintains an extensive list with information on the ownership of existing electric vehicles in that city, along with the geographic coordinates of each owner’s address. Based on this information, the Ministry intends to analyze and better understand which areas have a higher density of electric vehicles, in order to estimate the best locations for its free charging points.
A two-phase process based on a density-based clustering algorithm could be applied to the data to solve this problem. A suitable choice is DBSCAN, which is capable of delimiting groups of points in areas of high density (in this case, areas with many electric vehicles) and of marking isolated points as noise. When running DBSCAN, two essential parameters must be set: ε and MinPts. The first establishes the radius of the neighborhood examined around each point, while the second indicates the minimum number of points that must fall within that neighborhood for it to be considered part of a dense group.
With this in mind, DBSCAN could be executed in a first phase on the entire data set, thus obtaining a first collection of groups (city areas) with a high density of electric vehicles. Afterwards, DBSCAN could be applied again to each of the areas obtained, specifying a different ε and MinPts value for each of them. In this manner, the city could be divided into areas with sufficient granularity to identify in detail the points with the greatest concentration of electric vehicles, which are therefore the most interesting locations for a charging point.
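The two phases described above might look like the following sketch, assuming scikit-learn is available. Several things here are assumptions made only for illustration: the owner addresses are taken to be already projected to planar coordinates in kilometers, the ε and MinPts values are arbitrary, the data is synthetic, and the function name two_phase_dbscan is hypothetical. For simplicity the sketch reuses a single finer setting in the second phase rather than a separate one per area.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def two_phase_dbscan(points, coarse=(1.0, 10), fine=(0.3, 5)):
    """Return (area, sub_area) labels per point; -1 marks noise at either level.

    `coarse` and `fine` are (eps, min_pts) pairs for phase 1 and phase 2.
    """
    coarse_eps, coarse_min = coarse
    fine_eps, fine_min = fine

    # Phase 1: coarse pass over the whole city.
    areas = DBSCAN(eps=coarse_eps, min_samples=coarse_min).fit_predict(points)

    # Phase 2: finer pass inside each area found in phase 1.
    sub_areas = np.full(len(points), -1)
    for area in np.unique(areas):
        if area == -1:  # skip points that phase 1 marked as noise
            continue
        mask = areas == area
        finer = DBSCAN(eps=fine_eps, min_samples=fine_min)
        sub_areas[mask] = finer.fit_predict(points[mask])
    return areas, sub_areas


# Synthetic stand-in data (assumption): two dense districts plus scattered owners.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal([2.0, 2.0], 0.4, size=(150, 2)),
    rng.normal([8.0, 8.0], 0.4, size=(150, 2)),
    rng.uniform(0.0, 10.0, size=(50, 2)),
])
areas, sub_areas = two_phase_dbscan(points)
```

To follow the text more closely, the fine argument could be replaced by a mapping from area label to its own (ε, MinPts) pair, chosen after inspecting the phase 1 result.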
Scenario 3: Estimation of electric bill according to conditions and consumption habits
Now, let’s envision an electricity company developing a web application in which potential customers enter information about themselves, their household, and their energy consumption habits, in order to obtain an estimate of their monthly bill under the applicable rates. To develop this system, the company starts from a comprehensive database with the consumption history of its current customers. This database also has extensive personal information on each of them (age, marital status, number of children, etc.), information about their homes (area, number of rooms, etc.), and their energy consumption habits (times of greatest demand, types of electrical appliances, etc.).
Suppose, for example, that each customer is described by 50 characteristics that define their consumption and lifestyle. With this database, it would be easy to design a system in which a potential customer provides information on each of these 50 characteristics and obtains an estimate of their electricity consumption. Yet asking a potential customer to answer 50 questions to estimate their consumption does not seem like an acceptable solution.
Instead, a two-phase system based on machine learning algorithms, such as decision trees, can be developed. Decision trees are simple but powerful supervised learning algorithms that allow both the classification of elements and the regression (estimation) of values or events. In the first phase, this technique can be applied to obtain the degree of relevance of each of the 50 characteristics to energy consumption, discarding the irrelevant ones. In this way, a selection of, say, the 5 most relevant characteristics could be obtained, so that the potential customer is asked only what is essential. In the second phase, a system also based on decision trees can be developed that takes the answers given by the customer to that selection of relevant questions and calculates and displays the estimate of electricity consumption.
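As a rough sketch of this two-phase idea, assuming scikit-learn is available, a first regression tree can rank the characteristics by importance and a second tree can then be trained on the top 5 only. The synthetic data, the tree depths, and the function name select_and_fit are hypothetical and chosen only for illustration. Importances from a single tree can be unstable, so this should be read as an illustration of the idea rather than a recommended feature selection procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def select_and_fit(X, y, n_keep=5, seed=0):
    """Phase 1: rank features with a tree. Phase 2: fit a tree on the top ones."""
    # Phase 1: fit on all characteristics and read their relevance.
    ranker = DecisionTreeRegressor(max_depth=8, random_state=seed).fit(X, y)
    order = np.argsort(ranker.feature_importances_)[::-1]  # most relevant first
    top = np.sort(order[:n_keep])

    # Phase 2: the estimator only needs answers to the selected questions.
    estimator = DecisionTreeRegressor(max_depth=6, random_state=seed)
    estimator.fit(X[:, top], y)
    return top, estimator


# Synthetic stand-in data (assumption): 50 characteristics, 5 of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))
y = (100 * X[:, 3] + 60 * X[:, 10] + 40 * X[:, 17]
     + 25 * X[:, 25] + 15 * X[:, 42] + rng.normal(scale=5, size=2000))

top, estimator = select_and_fit(X, y)

# A new customer answers only the selected questions.
new_customer_answers = X[:1, top]
estimate = estimator.predict(new_customer_answers)
```

Here top holds the column indices of the 5 questions to keep, and estimate is the predicted consumption for one set of answers.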
As can be seen, it is not possible to define unequivocally the most appropriate machine learning technique for each type of problem. In most cases, the best solution is achieved by combining different algorithms in different phases. Without a doubt, new versions of machine learning algorithms will continue to appear for many years to come. But we must get used to understanding the set of available algorithms as a “toolbox” in which each one fulfills a function individually, while in combination they can solve big problems.