TEAM_NAME — BRAINLESSGENIUS
Machine learning is a type of artificial intelligence that enables computers to detect patterns and establish baseline behavior using algorithms that learn through training or observation. It can process and analyze volumes of data that would be impractical for humans to handle.
Machine learning tasks are classified into two main categories:
- Supervised learning: the machine is presented with a set of inputs and their expected outputs. When it is later given a new input, it predicts the corresponding output.
- Unsupervised learning: the machine aims to find patterns within a dataset without explicit input from a human as to what these patterns might look like.
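The supervised case can be illustrated with a minimal sketch. The nearest-neighbor rule, the toy values, and the name `predict_nearest` are assumptions chosen for illustration; the text does not prescribe any particular supervised method.

```python
import math

def predict_nearest(train_inputs, train_outputs, new_input):
    """Return the expected output of the training input closest to new_input."""
    best_index = min(
        range(len(train_inputs)),
        key=lambda i: math.dist(train_inputs[i], new_input),
    )
    return train_outputs[best_index]

# Toy training set (made-up values): inputs paired with expected outputs.
inputs = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.1), (4.8, 5.3)]
outputs = ["small", "small", "large", "large"]

# A new input the machine has not seen before.
prediction = predict_nearest(inputs, outputs, (4.9, 5.0))
print(prediction)
```

An unsupervised method would receive only `inputs`, without `outputs`, and would have to discover the two groups on its own.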
More important, within unsupervised machine learning there are several different techniques that can be used to identify patterns and ultimately yield valuable analysis. One of the key decisions a data scientist makes is which of these techniques to use, and making that choice correctly requires an understanding of the problem domain.
Clustering is the assignment of objects to homogeneous groups (called clusters) while making sure that objects in different groups are not similar. Clustering is considered an unsupervised task as it aims to describe the hidden structure of the objects.
Each object is described by a set of characteristics called features. The first step in dividing objects into clusters is to define the distance between objects. Defining an adequate distance measure is crucial for the success of the clustering process.
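To see why the choice of distance measure matters, here is a small sketch comparing two common measures. The Euclidean and Manhattan distances and the three toy points are assumptions picked for illustration; the text does not name a specific measure.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.dist(a, b)

def manhattan(a, b):
    """Sum of absolute per-feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Three toy objects, each described by two features.
origin = (0.0, 0.0)
a = (3.0, 3.0)
b = (0.0, 5.0)

# Which of a and b is "closer" to origin depends on the measure.
print("euclidean:", euclidean(origin, a), euclidean(origin, b))
print("manhattan:", manhattan(origin, a), manhattan(origin, b))
```

Under the Euclidean measure `a` is nearer to `origin`, while under the Manhattan measure `b` is nearer, so the same objects could end up grouped differently.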
There are many clustering algorithms, each with its own advantages and disadvantages. A popular clustering algorithm is k-means, which aims to identify the best k cluster centers in an iterative manner. Cluster centers serve as “representatives” of the objects associated with the cluster. The key features of k-means are also its drawbacks:
- The number of clusters (k) must be given explicitly. In some cases, the number of different groups is unknown.
- The iterative nature of k-means might lead to a suboptimal result due to convergence to a local minimum.
- The clusters are assumed to be spherical.
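The iterative procedure and the local-minimum drawback can be sketched as follows. This is a minimal illustration using plain Lloyd iterations in NumPy, not a production implementation; the helper names, the toy rectangle data, and the hand-picked starting centers are all assumptions. Here k is implied by the number of initial centers supplied.

```python
import numpy as np

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, init_centers, n_iter=100):
    """Plain Lloyd iterations; k is the number of initial centers."""
    centers = np.asarray(init_centers, dtype=float)
    for _ in range(n_iter):
        labels = assign(points, centers)
        # Move each center to the mean of its points (keep it if it has none).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = assign(points, centers)
    # Inertia: sum of squared distances from each point to its center.
    inertia = float(((points - centers[labels]) ** 2).sum())
    return centers, labels, inertia

# Four corners of a wide rectangle (toy data).
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# Same data, same k, two different starting positions.
_, good_labels, good_inertia = kmeans(points, [[0.0, 0.5], [4.0, 0.5]])
_, bad_labels, bad_inertia = kmeans(points, [[2.0, 0.0], [2.0, 1.0]])

print(good_labels, good_inertia)  # left/right split
print(bad_labels, bad_inertia)    # top/bottom split: stable but worse
```

Both runs converge, but the second one settles on a grouping with a much larger inertia, which is why practical implementations usually restart from several initializations and keep the best result.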
A different clustering algorithm is OPTICS, which is a density-based clustering algorithm. Density-based clustering, unlike centroid-based clustering, works by identifying “dense” clusters of points, allowing it to learn clusters of arbitrary shape and density. OPTICS can also identify outliers (noise) in the data by identifying scattered objects.
The OPTICS approach yields a very different grouping of data points than k-means: it classifies outliers and more accurately represents clusters that are not spherical by nature. A typical illustration is to run k-means and OPTICS on moon-shaped data.
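The moon-like comparison can be reproduced with a short sketch, assuming scikit-learn is available. The dataset generator, the sample size, the noise level, and the OPTICS settings (`min_samples=5` with DBSCAN-style extraction at `eps=0.2`) are illustrative assumptions, not values taken from the text.

```python
from sklearn.cluster import KMeans, OPTICS
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles; y holds the true moon of each point.
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# Centroid-based: k must be supplied, clusters are assumed roughly spherical.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based: no k; points in sparse regions are labeled -1 (noise).
optics = OPTICS(min_samples=5, cluster_method="dbscan", eps=0.2)
optics_labels = optics.fit_predict(X)

n_optics_clusters = len(set(optics_labels) - {-1})
kmeans_ari = adjusted_rand_score(y, kmeans_labels)
optics_ari = adjusted_rand_score(y, optics_labels)

print("OPTICS clusters found:", n_optics_clusters)
print("agreement with true moons (ARI): k-means", kmeans_ari, "OPTICS", optics_ari)
```

The adjusted Rand index is used here only as a convenient way to compare each grouping with the known moons; in a real unsupervised setting the true labels would not be available.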
In the field of machine learning, it is useful to apply a process called dimensionality reduction to high-dimensional data. The purpose of this process is to reduce the number of features under consideration, where each feature is a dimension that partly represents the objects.
Why is dimensionality reduction important? As more features are added, the data becomes very sparse and analysis suffers from the curse of dimensionality. Additionally, it is easier to process smaller data sets.
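One symptom of the curse of dimensionality can be sketched numerically: as features are added, the nearest and farthest neighbors of a point become almost equally distant, so distance-based analysis loses discriminating power. The uniform random data, the point count, and the `relative_contrast` helper below are assumptions made for illustration.

```python
import numpy as np

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query point."""
    data = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
contrast = {d: relative_contrast(500, d, rng) for d in (2, 10, 100, 1000)}

for d, c in contrast.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

With the same number of points, the contrast is large in 2 dimensions and falls well below 1 in 1000 dimensions, meaning that "near" and "far" become hard to tell apart.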
Dimensionality reduction can be executed using two different methods:
- Selecting from the existing features (feature selection)
- Extracting new features by combining the existing features (feature extraction)
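The two methods can be contrasted with a minimal NumPy sketch. The synthetic three-feature dataset, the variance-based selection rule, and the use of principal component analysis (PCA, computed via SVD) for extraction are all assumptions chosen for illustration; many other selection and extraction techniques exist.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)                         # informative feature
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=n)   # nearly redundant with x1
x3 = rng.normal(scale=0.01, size=n)             # almost constant
X = np.column_stack([x1, x2, x3])

k = 1  # target number of dimensions

# Feature selection: keep the k existing columns with the largest variance.
keep = np.sort(np.argsort(X.var(axis=0))[-k:])
X_selected = X[:, keep]

# Feature extraction: build k new features as linear combinations (PCA).
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
X_extracted = X_centered @ components[:k].T
explained = (singular_values[:k] ** 2).sum() / (singular_values ** 2).sum()

print("selected original columns:", keep)
print("weights of the extracted feature:", components[0])
print("share of variance kept by extraction:", explained)
```

Selection returns an unmodified original column, which keeps the result easy to interpret, while extraction produces a new feature that mixes `x1` and `x2` and retains almost all of the variance in a single dimension.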