Supervised and unsupervised learning with Scikit-Learn

Scikit-Learn supports both supervised and unsupervised learning, which are two of the main categories of machine learning.

Supervised learning involves building a model to predict an output variable (also known as the response variable or dependent variable) from one or more input variables (also known as predictors or independent variables), using a labeled dataset. Scikit-Learn provides a wide range of supervised learning algorithms, including:

Linear regression: Used to predict a continuous output variable.
Logistic regression: Used to predict a binary or categorical output variable.
Decision trees: Used to predict a categorical or continuous output variable.
Random forests: An ensemble method that combines multiple decision trees.
Support vector machines (SVMs): Used to predict a categorical or continuous output variable.
Naive Bayes: Used to predict a categorical output variable.
Neural networks: Used to predict a categorical or continuous output variable.
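These estimators all expose the same fit() and predict() methods, so trying a different algorithm is usually a one-line change. The following is a minimal sketch of that idea, assuming the built-in Iris dataset and mostly default hyperparameters; the names classifiers and scores are illustrative and not part of Scikit-Learn.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Load features and labels, then hold out 20% of the samples for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Illustrative selection of classifiers; hyperparameters are assumptions
classifiers = {
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}

# Every estimator is trained and scored through the same interface
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name}: {scores[name]:.3f}")
```

Scores on a dataset this small depend heavily on the particular split, so they are better read as a sanity check than as a ranking of the algorithms.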

Here’s an example of how to use Scikit-Learn to build a logistic regression model for the Iris dataset.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

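# Create a logistic regression model and fit it to the training data.
# max_iter is raised above the default of 100 because the solver can
# stop before converging on unscaled features.
model = LogisticRegression(max_iter=200)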
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

In this example, we first load the Iris dataset using Scikit-Learn’s built-in load_iris() function. We then split the data into training and testing sets using the train_test_split() function, holding out 20% of the samples for testing. Next, we create a logistic regression model using the LogisticRegression class and fit it to the training data using the fit() method. Finally, we make predictions on the testing data using the predict() method and calculate the accuracy of the model, that is, the fraction of test samples classified correctly, using the accuracy_score() function.
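A single train/test split gives an accuracy estimate that depends on which samples happen to land in the test set. One common way to get a steadier estimate is k-fold cross-validation. The sketch below assumes 5 folds and max_iter=200; both values are illustrative choices rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# max_iter=200 is an assumption, chosen to give the solver room to converge
model = LogisticRegression(max_iter=200)

# 5-fold cross-validation: train on four folds, score on the held-out fold,
# and repeat so that every fold serves as the test set once
cv_scores = cross_val_score(model, X, y, cv=5)

print("Accuracy per fold:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
```

cross_val_score() clones and refits the model for each fold, so the model object passed in is left unfitted.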

Unsupervised learning, on the other hand, involves discovering patterns and relationships in an unlabeled dataset, that is, one with no output variable to predict. Scikit-Learn provides a wide range of unsupervised learning algorithms, including:

Clustering: Used to group similar data points together.
Dimensionality reduction: Used to reduce the number of input variables while preserving important information.
Anomaly detection: Used to identify unusual or anomalous data points.
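As a rough illustration of the last two categories, the sketch below applies PCA (dimensionality reduction) and an isolation forest (anomaly detection) to the Iris measurements. The choice of these two estimators, the two-component target, and the random_state value are assumptions made for the example; Scikit-Learn offers several alternatives in each category.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

# Only the features are used; the species labels are discarded
X, _ = load_iris(return_X_y=True)

# Dimensionality reduction: project the 4 measurements onto 2 components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("Reduced shape:", X_2d.shape)
print("Variance kept per component:", pca.explained_variance_ratio_)

# Anomaly detection: fit_predict() returns 1 for inliers and -1 for outliers
detector = IsolationForest(random_state=42)
flags = detector.fit_predict(X)
print("Points flagged as anomalies:", int((flags == -1).sum()))
```

Neither step sees the species labels, which is what makes both of them unsupervised.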

Here’s an example of how to use Scikit-Learn to perform k-means clustering on the Iris dataset.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Load the Iris dataset
iris = load_iris()

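# Create a k-means clustering model with 3 clusters.
# n_init is set explicitly because its default differs between
# Scikit-Learn versions.
model = KMeans(n_clusters=3, n_init=10, random_state=42)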

# Fit the model to the data
model.fit(iris.data)

# Get the cluster labels for each data point
labels = model.labels_

print(labels)

In this example, we first load the Iris dataset using Scikit-Learn’s built-in load_iris() function. We then create a k-means clustering model using the KMeans class with 3 clusters and fit the model to the data using the fit() method. Finally, we get the cluster labels for each data point using the labels_ attribute.
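The cluster numbers that k-means assigns are arbitrary, so cluster 0 does not necessarily correspond to species 0. Because Iris happens to ship with species labels, one way to check how well the clusters line up with them is the adjusted Rand index, which ignores how the clusters are numbered. This is a sketch for illustration only: in a truly unlabeled setting there would be no ground truth to compare against, and n_init=10 is an assumed setting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()

# fit_predict() fits the model and returns the cluster label of each sample
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(iris.data)

# Adjusted Rand index: 1.0 means the clusters match the species exactly,
# values near 0 mean the agreement is no better than chance
ari = adjusted_rand_score(iris.target, labels)
print("Adjusted Rand index:", ari)

# One centroid per cluster, each with one coordinate per feature
print("Centroid array shape:", model.cluster_centers_.shape)
```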
