10 Machine Learning: Techniques and Experimental Setup

This section outlines various machine learning techniques, categorized by their primary approach.

10.1 Supervised Learning

Supervised learning algorithms learn from labeled data, where each data point has a known outcome or class. The goal is to learn a mapping function that can predict the outcome for new, unseen data.

10.1.1 Regression

Regression techniques aim to predict continuous values. They model the relationship between independent variables (features) and a dependent variable (the value to be predicted).

Core Idea: Guessing a function’s values by modeling dependent variables as a function of independent variables.
Learning Method: Typically uses the method of least squares to minimize the difference between predicted and actual values.
Use Cases: Interpolation (predicting values within the range of existing data) and extrapolation (predicting values outside the range).

10.1.2 Classification

Classification techniques aim to assign a category or class label to an observation based on its features.

Core Idea: Learning models to classify new observations whose class is unknown.
Goal: Prediction of discrete class labels.

10.1.2.1 K-Nearest Neighbors (KNN)

KNN is a non-parametric, lazy learning algorithm.

Core Idea: Classify a new observation based on the majority class of its ‘k’ closest neighbors in the feature space.
Algorithm:
1. Collect ‘n’ classified instances (prototypes).
2. For a new observation, compute its distance to all prototype instances.
3. Identify the ‘k’ closest prototype instances (k is typically odd to avoid ties and not too large, e.g., \sqrt{n}).
4. Assign the new observation to the majority class among these ‘k’ neighbors.
Distance Metrics:
- Euclidean Distance: Most common; straight-line distance between two points.
- Manhattan Distance: Sum of the absolute differences of their Cartesian coordinates.
- Minkowski Distance: A generalization of Euclidean (p=2) and Manhattan (p=1) distances.
- Hamming Distance: For categorical variables; counts the number of positions at which the corresponding symbols are different.

10.1.2.2 Neural Networks (NN)

Neural networks are models inspired by the human brain, capable of learning complex patterns.

Model Structure: A directed acyclic graph where:
- Nodes with no incoming arcs are input attributes.
- Internal nodes represent intermediate (latent) concepts.
- Nodes with no outgoing arcs are output classes.
- Weighted arcs represent the strength (degree of influence) of one node on another.
Learning Process:
1. Set up the graph structure.
2. Reinforce the weights of arcs that lead from inputs to correct outputs, typically using backpropagation.

10.1.2.3 Bayesian Networks

Bayesian Networks model probabilistic relationships among a set of variables. Observations are typically feature vectors in a propositional setting.

Core Idea: Capture probabilistic dependencies among many variables.
Model Structure: A directed acyclic graph where:
- Nodes represent random variables.
- Weighted arcs represent conditional dependencies between variables.

10.1.2.4 Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes’ theorem with a strong (naive) independence assumption between features.

Core Idea: Calculate the likelihood of a class C given an observation O’s features using Bayes’ theorem: P(C|O) = \frac{P(O|C) P(C)}{P(O)}
Naive Assumption: Features are statistically independent. This allows calculating the likelihood of the class given features as the product of individual feature likelihoods: P(C|O) = \frac{P(C) \prod_{i=1}^n P(O_i|C)}{P(O)}
Learning Process: Estimate the likelihood P(O_i|C) of each feature given the class, and the prior probability P(C) of the class from the training data.
Classification: Compute the posterior probability P(C|O) for each class and select the class with the highest probability.
Pros: Very fast, easy to implement, often works well.
Cons: The independence assumption is often violated, potentially oversimplifying the model.

10.1.2.5 Support Vector Machines (SVM)

SVMs are powerful binary classifiers that aim to find the optimal hyperplane separating two classes in a high-dimensional space.

Core Idea: Based on the Structural Risk Minimization Principle – find a hypothesis (hyperplane) that guarantees the minimum empirical error and maximizes the margin between classes.
Representation: Instances are points in a vector space; hypotheses are hyperplanes.
Support Vectors: Data points that lie closest to the decision boundary (within the maximum margin).
Kernels: If data is not linearly separable, kernels (non-linear transformation functions) map data to a higher-dimensional space where linear separation might be possible.

10.1.2.6 Classification Rules

Classification rule systems use a set of IF-THEN rules to classify observations. Observations are logical conjunctions (propositional or first-order setting).

Model Structure: A set of implications (rules), where:
- Premises (IF part) are conditions to be checked on an observation’s features.
- Conclusion (THEN part) is the assigned class.
Usage:
1. For a given observation, check if its facts satisfy a rule’s conditions.
2. If satisfied, classify the observation according to the rule’s conclusion.
- This can be done via forward chaining (checking each rule) or backward chaining (checking rules for a specific class).

10.1.2.7 Decision Trees

Decision Trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification. Observations are feature vectors (propositional setting).

Model Structure: An N-ary tree where:
- Non-leaf nodes represent tests on attributes.
- Arcs (branches) represent possible results of these tests.
- Leaf nodes represent class labels.
Usage: Start at the root and traverse the tree by performing tests on the observation’s attributes. The class label of the reached leaf node is the prediction.
Training (Induction):
1. Each node is associated with a subset of training examples (root = all examples).
2. Children of a node partition the parent’s examples based on an attribute test.
3. The “best” attribute for a test is chosen (e.g., using entropy or information gain to maximize discrimination).
4. Leaves represent subsets of examples predominantly belonging to a single class.
Overfitting Prevention:
- Pruning: Remove tree nodes that don’t improve classification accuracy on a validation set.
- Stopping Criteria: Limit tree growth based on the number of examples in a node or tree depth.
Rule Generation: Every path from the root to a leaf in a decision tree can be converted into a classification rule.

10.2 Unsupervised Learning

Unsupervised learning algorithms work with unlabeled data, aiming to find hidden patterns, structures, or relationships within the data.

10.2.1 Clustering

Clustering groups a set of observations such that observations in the same group (cluster) are more similar to each other than to those in other groups.

Core Idea: Group observations into internally homogeneous and mutually dishomogeneous subsets based on a similarity or distance measure.
Evaluation: A quality measure is used to evaluate the resulting clustering, especially when the number of desired clusters is not predefined.
Types of Clustering:
- Distributive: Directly assigns observations to a predefined number of groups.
- Hierarchical: Creates a tree of clusters (dendrogram).
  - Partitive (Top-down): Starts with all observations in one cluster and recursively splits them.
  - Agglomerative (Bottom-up): Starts with each observation as a cluster and recursively merges them.
- Fuzzy: Allows an observation to belong to multiple clusters with varying degrees of membership.
- Conceptual: Learns an explicit model (description) for each identified cluster.
Cluster Representation:
- Centroid: The mean point of a cluster (may not be an actual observation).
- Medoid: The most centrally located observation within a cluster (always an actual observation, useful when centroids are invalid, e.g., in predicate logic).
Agglomerative Clustering Training: Merge clusters if their similarity is above (or distance below) a given threshold.
K-Means Algorithm:
1. Randomly choose ‘k’ observations as initial seeds (centroids/medoids).
2. Assign each remaining observation to the cluster of its most similar seed.
3. Recalculate seeds as the centroids/medoids of the newly formed clusters.
4. Repeat steps 2-3 until cluster assignments no longer change.
- Challenges: Requires pre-specifying ‘k’; can converge to a local minimum.

10.2.2 Frequent Itemsets Mining

Discovers sets of items (subparts of descriptions) that frequently co-occur in a dataset.

Core Idea: Identify regularities by finding itemsets whose occurrence frequency (support) exceeds a given threshold.
Use Cases: Identifying behaviors that often happen simultaneously, anomaly detection.
APRIORI Algorithm:
- Principle: If an itemset is not frequent, none of its supersets can be frequent.
1. Start: Identify frequent individual items based on a support threshold.
2. Expansion: Generate candidate itemsets by adding an item to frequent itemsets from the previous step.
  - Prune candidates that are not frequent.
  - Save frequent itemsets that cannot be expanded further.
3. Termination: Stop when no new frequent itemsets are generated.

10.2.3 Association Rules Mining

Discovers rules that imply certain co-occurrences of items in a dataset.

Core Idea: Find relevant co-occurrences of situations (items or itemsets) using a relevance threshold (e.g., confidence, lift) to discover regularities.
- Example: “If a customer buys X, they are also likely to buy Y.”

10.3 Ensemble Methods

Ensemble methods combine multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

10.3.1 Random Forest

Random Forests are an ensemble of decision trees.

Core Idea: Construct multiple decision trees, each trained on a random subset of features and/or data.
Classification: The final classification is determined by a combination (e.g., majority vote) of the classifications from individual trees.

10.3.2 Bagging, Boosting, and Stacking

These are general techniques for combining multiple models:

Bagging (Bootstrap Aggregating):
- Trains each model on a different random subset of the training data (with replacement, i.e., bootstrapping).
- Final prediction is by averaging (regression) or voting (classification).
- Effect: Reduces variance and helps prevent overfitting.
Boosting:
- Trains models sequentially; each new model focuses on correcting errors made by previous models.
- Misclassified instances are given higher weight for subsequent models.
- Final prediction is a weighted combination of individual model outputs.
- Effect: Reduces bias and variance; can risk overfitting if not tuned carefully.
Stacking (Stacked Generalization):
- Trains multiple base models on the entire training set.
- A meta-model (meta-classifier or meta-regressor) is then trained to combine the predictions of the base models.
- Effect: Leverages the strengths of different models to improve overall performance.

10.4 Other Techniques

10.4.1 Genetic Algorithms (GA)

GAs are search heuristics inspired by the process of natural selection and biological evolution.

Core Idea: Evolve a population of candidate solutions (chromosomes) towards an optimal solution.
Terminology:
- Chromosome: A potential solution, often represented as a feature vector.
- Gene: A component (bit or symbol) of a chromosome.
Process:
1. Start with an initial population of random solutions.
2. Evaluate solutions using a fitness measure.
3. Select the fittest solutions to produce the next generation.
4. Generate new solutions (offspring) via:
  - Crossover: Combines genetic material from two parents (e.g., one-point, two-point, uniform, arithmetic).
  - Mutation: Randomly alters genes in a chromosome.
5. Repeat steps 2-4 until a stopping criterion is met.
Challenges: Can be time-consuming; may not converge to the global optimum; no guarantee of finding the best solution.

10.4.2 Hidden Markov Models (HMM)

HMMs are statistical models used for modeling sequences of observable events that depend on underlying, unobservable (hidden) states. Observations are sequences of symbols (propositional settings).

Model Structure: A directed graph with:
- Nodes:
  - States (hidden, not directly observable).
  - Symbols (observable outputs).
- Arcs (Weighted with probabilities):
  - State-to-State (transitions between hidden states).
  - State-to-Symbol (emission/output probabilities from hidden states).
Use Cases: Best suited for sequential data where the current state depends on the previous one (e.g., speech recognition, bioinformatics).

10.5 Experimental Setting

10.5.2 System-Related Steps

System (Algorithm) Choice: Depends on:
- Dataset size and dimensionality.
- Instance representation (e.g., propositional, relational).
- The specific task (e.g., classification, regression, clustering).
- Presence of unbalanced classes.
- Data reliability (e.g., statistical models often handle noisy data better).
Performance Measure Choice: Select metrics relevant to the task and approach.
- Common for Numeric & Symbolic: Predictive performance.
- Numeric Approach Focus: Robustness to noise.
- Symbolic Approach Focus: Interpretability.
- Common Classification Metrics:
  - False Positive Rate (FPR) = FP / (FP + TN) = Type I error (false alarm).
  - False Negative Rate (FNR) = FN / (TP + FN) = Type II error (missed alarm).
  - Precision = TP / (TP + FP) (Proportion of positive identifications that were actually correct).
  - Recall (Sensitivity, True Positive Rate) = TP / (TP + FN) (Proportion of actual positives that were identified correctly).
  - F-measure (F1-score for \beta=1): Harmonic mean of Precision and Recall. F_\beta = (1 + \beta^2) \frac{\text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}}
Evaluation Procedure Choice: How to split data for training and testing to get reliable performance estimates.
- Goal: Obtain reliable statistics, often by running multiple evaluations.
- k-fold Cross-Validation: Dataset is divided into ‘k’ folds. The model is trained on k-1 folds and tested on the remaining fold. Repeated ‘k’ times.
- Leave-One-Out Cross-Validation (LOOCV): A special case of k-fold where k equals the number of instances.
- Holdout (Random Split):
  1. Extract a percentage (e.g., 70-80%) of instances for the training set.
  2. Learn a model on the training set.
  3. Evaluate the model on the remaining instances (test set).
- Prequential (Online) Approach:
  1. Consider the next instance.
  2. Test the current version of the model on this instance.
  3. Update (retrain) the model, especially if the instance is misclassified.
  - Suitable for incremental learning and streaming data.
Statistical Test Choice: To determine if observed differences in performance are statistically significant.
- Confidence Interval: Range of values likely to contain the true population parameter.
- Hypothesis Testing:
  - Null Hypothesis (H_0): No significant difference between two “treatments” (e.g., models).
  - Alternative Hypothesis (H_a or H_1): There is a significant difference.
- Common Tests:
  - T-test (Student’s): Parametric test for comparing means of two groups. Assumes normal distribution.
  - Kolmogorov-Smirnov Test: Non-parametric test to compare distributions. Does not require a priori hypotheses about distributions. Useful for qualitative features and often advised for k-fold cross-validation results. More robust than T-test for small samples.
  - Rank Test (e.g., Wilcoxon Signed-Rank Test): Non-parametric test for checking hypotheses about the central tendency. Advisable for data splitting cross-validation. Suitable for small data samples.
Running the Experiment: Execute the planned evaluations.
Results Analysis: Interpret the outcomes, draw conclusions, and identify areas for further work.