Machine learning (ML) is a subfield of artificial intelligence (AI) focused on building computer systems that learn from data to adapt, make predictions, and find correlations without being explicitly programmed for each task.
The goal of machine learning is to process large amounts of data with algorithms and build generalized models that produce useful, easy-to-interpret outputs.
Machine learning commonly works by following the steps below:
- Gathering data from various sources
- Cleaning data to have homogeneity
- Building a model using an ML algorithm
- Gaining insights from the model's results
- Data visualization and transforming results into visual graphs
1. Gathering data from various sources
Machine learning requires a lot of data to make a production-ready model.
Data gathering for ML is done in two ways: automated and manual.
- Automated data gathering utilizes programs and scripts that scrape data from the web.
- Manual data gathering means a person collects the data by hand and prepares it in a consistent format.
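To illustrate what preparing manually gathered data in a consistent format can involve, here is a minimal sketch using only the Python standard library. The records, the field names, and the `normalize_record` helper are hypothetical and exist only for this illustration.

```python
# Hypothetical hand-entered records with inconsistent casing, spacing, and units.
raw_records = [
    {"name": " Alice ", "height": "170 cm"},
    {"name": "BOB", "height": "1.82 m"},
    {"name": "carol", "height": "165cm"},
]


def normalize_record(record):
    """Return a record with a title-cased name and the height in centimeters."""
    name = record["name"].strip().title()
    height = record["height"].strip().lower()
    if height.endswith("cm"):
        height_cm = float(height[:-2])
    elif height.endswith("m"):
        height_cm = float(height[:-1]) * 100
    else:
        raise ValueError(f"Unrecognized height format: {record['height']!r}")
    return {"name": name, "height_cm": round(height_cm, 1)}


clean_records = [normalize_record(r) for r in raw_records]
for row in clean_records:
    print(row)
```

Every record ends up with the same fields and the same unit, which is the kind of homogeneity that later steps rely on.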
Automated data gathering utilizing web scraping with Python:
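```python
import requests
from bs4 import BeautifulSoup

# Scrape data from a website
url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early on HTTP errors (4xx/5xx)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract relevant information from the website.
# 'data-container' is a placeholder class name; find() returns None if it is absent.
container = soup.find('div', class_='data-container')
if container is None:
    raise ValueError("No <div class='data-container'> found on the page")
data = container.get_text(strip=True)

# Store the gathered data
with open('data.txt', 'w', encoding='utf-8') as file:
    file.write(data)
```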
2. Cleaning data to have homogeneity
Ensuring data homogeneity is a crucial step: a model can only produce reliable results when its training data is consistent.
Data cleaning for ML is done either manually or automatically with the help of algorithms. It consists of fixing or removing incorrect, corrupted, wrongly formatted, duplicate, and incomplete data within the dataset.
Cleaning data using Python and pandas:
```python
import pandas as pd

# Read data from a CSV file
data = pd.read_csv('data.csv')

# Remove duplicates
data = data.drop_duplicates()

# Fix missing values by filling with the column mean.
# Assign the result back: chained inplace=True may not update the DataFrame in recent pandas.
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())

# Remove incorrect or corrupted data (here, non-positive values are treated as invalid)
data = data[data['column_name'] > 0]

# Save cleaned data to a new file
data.to_csv('cleaned_data.csv', index=False)
```
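The description above also mentions wrongly formatted data. As a rough sketch of how that case might be handled with pandas, the example below normalizes text values and converts a text column to numbers. The `city` and `temperature` columns and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data: stray whitespace, inconsistent casing,
# and a non-numeric entry in a numeric column.
raw = pd.DataFrame({
    "city": [" Paris", "paris ", "Berlin", "Berlin"],
    "temperature": ["21.5", "21.5", "n/a", "19"],
})

cleaned = raw.copy()
# Normalize text so that equivalent values compare equal.
cleaned["city"] = cleaned["city"].str.strip().str.title()
# Convert to numbers; unparseable entries become NaN instead of raising.
cleaned["temperature"] = pd.to_numeric(cleaned["temperature"], errors="coerce")
# Drop incomplete rows, then rows that are now exact duplicates.
cleaned = cleaned.dropna().drop_duplicates().reset_index(drop=True)

print(cleaned)
```

Normalizing before deduplicating matters here: the two Paris rows only become identical once whitespace and casing are fixed.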
3. Building a model using an ML algorithm
An ML model is a file that contains the results of a machine learning algorithm and is used to reason over dynamic input.
Conceptually, the model stores patterns learned from the training data; real-time input is matched against those patterns, and the output is produced according to the pattern that matches.
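To make the idea of a model as a file concrete, here is a minimal sketch that trains a small scikit-learn model, writes it to disk with Python's `pickle` module, and loads it back to answer new input. The tiny dataset and the `model.pkl` file name are assumptions chosen for illustration, and `pickle` is only one of several possible storage formats.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, two classes.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# Write the trained model to a file, then load it back.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model answers new input the same way as the original.
new_input = [[0.5], [2.5]]
print(restored.predict(new_input))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.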
ML models are commonly grouped by the kind of prediction they make, with the most common types being binary classification, multiclass classification, and regression.
- The binary classification model predicts a binary outcome, meaning one of two possible outcomes.
- The multiclass classification model predicts one of more than two outcomes.
- The regression model predicts numeric values.
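The three model types above can be sketched with scikit-learn as follows. The choice of the library's bundled breast cancer, iris, and diabetes datasets, and of logistic and linear regression as the algorithms, is an assumption made for illustration; many other datasets and algorithms would serve equally well.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

# Binary classification: two possible labels (0 or 1).
X_bin, y_bin = load_breast_cancer(return_X_y=True)
binary_model = LogisticRegression(max_iter=10000).fit(X_bin, y_bin)
binary_pred = binary_model.predict(X_bin[:5])

# Multiclass classification: three possible labels (0, 1, or 2).
X_multi, y_multi = load_iris(return_X_y=True)
multi_model = LogisticRegression(max_iter=1000).fit(X_multi, y_multi)
multi_pred = multi_model.predict(X_multi[:5])

# Regression: the prediction is a continuous number.
X_reg, y_reg = load_diabetes(return_X_y=True)
reg_model = LinearRegression().fit(X_reg, y_reg)
reg_pred = reg_model.predict(X_reg[:5])

print(binary_pred, multi_pred, reg_pred)
```

The classifiers return one label from a fixed set, while the regression model returns floating-point values.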
The process of building a machine learning model is called training.
Machine learning training is done with the help of algorithms and is divided into two main categories: supervised learning and unsupervised learning.
- Supervised learning (SL) is when the ML model is trained using labeled data, meaning data that has both input and output values.
- Unsupervised learning (UL) is when the ML model is trained using unlabeled data, meaning data that has no tags or known results.
Neural networks (NNs) can be used in both categories. They learn mappings between the values within the dataset, which allows them to capture correlations.
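The difference between the two training categories can be sketched on a single dataset. The example below assumes scikit-learn's bundled iris dataset, with logistic regression standing in for a supervised algorithm and k-means clustering for an unsupervised one; these choices are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the algorithm sees both the inputs X and the known labels y.
classifier = LogisticRegression(max_iter=1000).fit(X, y)
predicted_labels = classifier.predict(X)

# Unsupervised: the algorithm sees only X and groups similar rows itself.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = clusterer.labels_

print(predicted_labels[:5], cluster_ids[:5])
```

The cluster IDs are arbitrary group numbers, so they do not necessarily match the label values in `y` even when the groups themselves line up well.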
Creating a binary classification model using Python's scikit-learn library:
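```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset. The bundled breast cancer dataset is used here as a
# stand-in binary classification dataset; replace it with your own data.
X, y = load_breast_cancer(return_X_y=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create a Logistic Regression model (a higher max_iter helps the solver converge)
model = LogisticRegression(max_iter=10000)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
```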
4. Gaining insights from the model's results
Gaining insights from an ML model means understanding previously unknown patterns in the data and testing the model's ability to make predictions and draw conclusions.
These insights are needed to verify the model's validity and to determine whether any changes need to be made to the learning algorithm(s).
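One common way to check a model's validity is to evaluate it on data it was not trained on. The sketch below uses cross-validation and a confusion matrix from scikit-learn, with the bundled breast cancer dataset and logistic regression assumed as stand-ins for your own data and model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=10000)

# Cross-validation: accuracy on 5 different train/validation splits.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Mean CV accuracy:", cv_scores.mean())

# Confusion matrix: rows are true classes, columns are predicted classes.
model.fit(X_train, y_train)
matrix = confusion_matrix(y_test, model.predict(X_test))
print(matrix)
```

Scores that vary widely between folds, or a confusion matrix dominated by one kind of error, suggest that the data or the learning algorithm may need changes.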
Analyzing feature importance in a trained model with Python:
```python
import matplotlib.pyplot as plt
import numpy as np

# Get the feature importance scores from the model trained in step 3.
# coef_ has shape (1, n_features) for a binary model, so take the first row.
# Absolute values are only comparable if the features are on similar scales.
importances = np.abs(model.coef_[0])

# Sort feature importance in descending order
sorted_indices = importances.argsort()[::-1]
sorted_importances = importances[sorted_indices]

# Plot the feature importance
plt.bar(range(len(sorted_importances)), sorted_importances)
plt.xticks(range(len(sorted_importances)), sorted_indices)
plt.xlabel('Feature Index')
plt.ylabel('Importance Score')
plt.title('Feature Importance')
plt.show()
```
5. Data visualization and transforming results into visual graphs
Data visualization for an ML model consists of plotting the model's output on a graph, optionally exposed through an interactive interface.
Creating a scatter plot of predicted values with Python:
```python
import matplotlib.pyplot as plt

# Get the predicted values
y_pred = model.predict(X)

# Create a scatter plot
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Predicted Values')
plt.show()
```
The above code examples demonstrate practical implementations for each step in machine learning, from data gathering and cleaning to model building, insights, and data visualization.