Modeling

Object2Vec can be used to find semantically similar objects such as questions. BlazingText Word2Vec can only find semantically similar words.
mode is the mandatory hyperparameter for both the Word2Vec (unsupervised) and Text Classification (supervised) modes of the SageMaker BlazingText algorithm.
Incremental Training in Amazon SageMaker

Over time, you might find that a model generates inferences that are not as good as they were in the past. With incremental training, you can use the artifacts from an existing model and use an expanded dataset to train a new model. Incremental training saves both time and resources.

Use incremental training to:
- Train a new model using an expanded dataset that contains an underlying pattern that was not accounted for in the previous training and which resulted in poor model performance.
- Use the model artifacts or a portion of the model artifacts from a popular publicly available model in a training job. You don’t need to train a new model from scratch.
- Resume a training job that was stopped.
- Train several variants of a model, either with different hyperparameter settings or using different datasets.
Rekognition doesn’t support Incremental training.
Amazon Rekognition Content Moderation enables you to streamline or automate your image and video moderation workflows using machine learning. Using fully managed image and video moderation APIs, you can proactively detect inappropriate, unwanted, or offensive content containing nudity, suggestiveness, violence, and other such categories.
Only three built-in algorithms currently support incremental training: Object Detection Algorithm, Image Classification Algorithm, and Semantic Segmentation Algorithm.
BlazingText algorithm can be used in both supervised and unsupervised learning modes.
SageMaker DeepAR algorithm specializes in forecasting new product performance.
LDA: Observations are referred to as documents. The feature set is referred to as vocabulary. A feature is referred to as a word. And the resulting categories are referred to as topics.
LDA is a “bag-of-words” model, which means that the order of words does not matter.
Factorization Machines algorithm specializes in building recommendation systems.
Factorization Machines can be used to capture click patterns for a click prediction system.
Image Classification is used to classify images into classes, such as cat vs. dog. Object Detection is used to detect objects in an image. Semantic Segmentation is used for pixel-level analysis of an image; in a computer vision inspection system, for example, it can be used to detect misalignment.
Object Detection: a technique from computer vision and image processing. Its aim is to detect objects in an image.

Semantic Segmentation: a technique that detects, for each pixel, the object category it belongs to. All object categories (labels) must be known to the model.

Instance Segmentation: the same as Semantic Segmentation, but it goes a bit deeper and identifies, for each pixel, the object instance it belongs to. The main difference is that, unlike semantic segmentation, it differentiates two objects that have the same label.
feature_dim and k are the required hyperparameters for the SageMaker K-means algorithm.
When you use automatic model tuning, the linear learner internal tuning mechanism is turned off automatically. This sets the number of parallel models, num_models, to 1.
You can think of L1 as reducing the number of features in the model altogether. L2 “regulates” the feature weights (shrinks them) instead of just dropping features. Please review the concept of L1 and L2 regularization in more detail:

https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c
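One way to see the L1/L2 distinction is to look at the shrinkage step each penalty induces on a weight vector. The sketch below is illustrative only: the function names and the `strength` value are assumptions, not part of any SageMaker algorithm.

```python
def l1_shrink(weights, strength):
    """Soft-thresholding (the proximal step for an L1 penalty).

    Weights whose magnitude is below `strength` become exactly zero.
    """
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - strength, 0.0)
        shrunk.append(magnitude if w >= 0 else -magnitude)
    return shrunk


def l2_shrink(weights, strength):
    """Uniform scaling (the proximal step for the penalty 0.5 * strength * w**2).

    Every weight gets smaller, but none is driven to exactly zero.
    """
    return [w / (1.0 + strength) for w in weights]


weights = [2.0, -0.3, 0.05, -1.5]
l1_result = l1_shrink(weights, 0.5)
l2_result = l2_shrink(weights, 0.5)
print("L1:", l1_result)
print("L2:", l2_result)
```

With L1, the two small weights are removed from the model; with L2, all four weights survive at reduced magnitude.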
The residuals plot would indicate any trend of underestimation or overestimation. Ideally, residual values should be equally and randomly spaced around the horizontal axis. Both Mean Absolute Error and RMSE would only give the error magnitude.
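A small numeric sketch of the point above, using made-up values: MAE and RMSE report only how large the errors are, while the sign of the residuals shows whether the model overestimates or underestimates.

```python
import math

# Hypothetical ground truth and predictions from a model that always predicts too high.
actual = [10.0, 12.0, 15.0, 18.0, 20.0]
predicted = [11.0, 13.5, 16.0, 19.0, 22.0]

residuals = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(r) for r in residuals) / len(residuals)
rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
mean_residual = sum(residuals) / len(residuals)

print("residuals:", residuals)          # all negative: systematic overestimation
print("MAE:", mae, "RMSE:", rmse)       # magnitudes only, no direction
print("mean residual:", mean_residual)  # negative sign exposes the bias
```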
AUC (area under the ROC curve) is a widely used evaluation metric for binary classification models. This metric does not require you to set a classification threshold.

For imbalanced datasets, you are better off using another metric, PR AUC (area under the precision-recall curve), which is also used in production systems for highly imbalanced datasets where the fraction of the positive class is small, such as credit card fraud detection.
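To make the "no threshold needed" property concrete, here is a minimal sketch that computes ROC AUC directly from its ranking interpretation: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. No classification threshold is involved.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic and is only meant for small examples; library implementations sort the scores instead.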
A “vanishing gradient” results from multiplying together many small derivatives of the sigmoid activation function across multiple layers. The derivative of ReLU is 1 for positive inputs, so it avoids this problem.
Fixing the “vanishing gradient”:
- Multi-level hierarchy: break up levels into their own sub-networks trained individually
- Long short-term memory (LSTM)
- Residual Networks
  - Resnet
  - Ensemble of shorter networks
- Better choice of activation function
  - ReLU
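The effect of the activation function on gradient size can be sketched numerically. This simplified example multiplies only the activation derivatives across layers and ignores the weight matrices, which is an assumption made to keep the illustration short.

```python
import math


def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # never larger than 0.25


def relu_derivative(x):
    return 1.0 if x > 0 else 0.0


def chained_gradient(derivative, pre_activation, depth):
    """Product of the activation derivative over `depth` layers."""
    grad = 1.0
    for _ in range(depth):
        grad *= derivative(pre_activation)
    return grad


sigmoid_grad = chained_gradient(sigmoid_derivative, 0.0, 20)
relu_grad = chained_gradient(relu_derivative, 1.0, 20)
print("sigmoid, 20 layers:", sigmoid_grad)
print("ReLU, 20 layers:", relu_grad)
```

Even at the sigmoid's steepest point the gradient shrinks by a factor of four per layer, while the ReLU path passes it through unchanged for positive inputs.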
Transfer learning generally involves using an existing model, or adding additional layers on top of one.
A learning rate that is too large may overshoot the true minima, while a learning rate that is too small will slow down convergence.
Learning rate affects the speed at which the algorithm reaches (converges to) the optimal weights. The SGD algorithm makes updates to the weights of the linear model for every data example it sees. The size of these updates is controlled by the learning rate.
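A toy sketch of the learning-rate trade-off, using plain gradient descent on f(w) = w**2 (an assumed stand-in for a real loss function):

```python
def gradient_descent(learning_rate, steps=25, start=5.0):
    """Minimize f(w) = w**2, whose gradient is 2 * w. The minimum is at w = 0."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w


too_small = gradient_descent(0.001)   # barely moves away from the start
reasonable = gradient_descent(0.1)    # gets close to the minimum
too_large = gradient_descent(1.1)     # overshoots further on every step
print(too_small, reasonable, too_large)
```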
Music is fundamentally a timeseries problem, which RNNs (recurrent neural networks) are best suited for. You might see the term LSTM used as well, which is a specific kind of RNN.
RNN is good for:
- Timeseries data (predict the future based on the past; logs; where to drive the self-driving car based on past trajectories)
- Data that consists of sequences of arbitrary length
  - machine translation
  - image captions
  - Machine-generated music
Custom entity recognition extends the capability of Amazon Comprehend by enabling you to identify new entity types not supported as one of the preset generic entity types. This means that in addition to identifying entity types such as LOCATION, DATE, PERSON, and so on, you can analyze documents and extract entities like product codes or business-specific entities that fit your particular needs.
To get inferences for an entire dataset, use batch transform. With batch transform, you create a batch transform job using a trained model and the dataset, which must be stored in Amazon S3. Amazon SageMaker saves the inferences in an S3 bucket that you specify when you create the batch transform job.

You can use Amazon SageMaker Batch Transform to exclude attributes before running predictions. You can also join the prediction results with partial or entire input data attributes when using data that is in CSV, text, or JSON format. This eliminates the need for any additional pre-processing or post-processing and accelerates the overall ML process.
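The attribute exclusion and joining described above are configured through the `DataProcessing` block of a transform job request. The sketch below shows that block as a plain dictionary and then emulates locally what it would do to one CSV row. The row contents and the `predict` stand-in are hypothetical, and the emulation is not SageMaker code.

```python
import json

# DataProcessing settings for a batch transform job request.
data_processing = {
    "InputFilter": "$[1:]",     # drop column 0 (an ID) before the row reaches the model
    "JoinSource": "Input",      # append the prediction to the original input row
    "OutputFilter": "$[0,-1]",  # keep only the ID and the prediction in the output
}


def emulate_row(row, predict):
    """Local emulation of the three settings above for a single CSV row."""
    model_input = row[1:]                    # InputFilter
    joined = row + [predict(model_input)]    # JoinSource = Input
    return [joined[0], joined[-1]]           # OutputFilter


# `predict` is a hypothetical stand-in for the deployed model.
output_row = emulate_row(["id-17", "0.3", "1.2"], lambda features: "0.91")
print(json.dumps(data_processing, indent=2))
print(output_row)
```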
Two methods of deploying a model for inference:
- Amazon SageMaker Hosting Services
  - Provides a persistent HTTPS endpoint for getting predictions one at a time.
  - Suited for web applications that need sub-second latency response.
- Amazon SageMaker Batch Transform
  - Doesn’t need a persistent endpoint
  - Get inferences for an entire dataset
IP Insights algorithm supports only CSV file type as training data.
In XGBoost:
- subsample: prevents overfitting
- eta: step size shrinkage; prevents overfitting
- gamma: minimum loss reduction to create a partition; larger = more conservative
- alpha: L1 regularization; larger = more conservative
- lambda: L2 regularization; larger = more conservative
- eval_metric: optimize on AUC, error, rmse…
- scale_pos_weight: adjusts the balance of positive and negative weights; helpful for unbalanced classes; might be set to sum(negative cases) / sum(positive cases)
- max_depth: maximum depth of the tree; too high and you may overfit

Other XGBoost hyperparameters: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
XGBoost Regularization:

alpha: L1 regularization. Default 0;

lambda: L2 regularization, default 1.
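As a rough sketch, the hyperparameters above could be collected into a dictionary like the one below. The specific values are placeholders rather than recommendations, and the label list is invented to show how `scale_pos_weight` might be derived.

```python
# Hypothetical, heavily imbalanced binary labels (8 negatives, 2 positives).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
negatives = labels.count(0)
positives = labels.count(1)

hyperparameters = {
    "objective": "binary:logistic",
    "num_round": 100,          # number of boosting rounds (placeholder value)
    "eval_metric": "auc",
    "eta": 0.2,                # step size shrinkage
    "gamma": 4,                # minimum loss reduction to split
    "alpha": 0,                # L1 regularization (default 0)
    "lambda": 1,               # L2 regularization (default 1)
    "subsample": 0.8,          # row sampling per tree
    "max_depth": 5,
    "scale_pos_weight": negatives / positives,
}
print(hyperparameters["scale_pos_weight"])
```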
Boosting generally yields better accuracy.

Bagging avoids overfitting and is easier to parallelize.
Figure: bullseye (dart-throwing) demonstration.
Batch Normalization should not be treated as a regularization method, because its main purpose is to speed up training: for each mini-batch, it normalizes the layer inputs so that their values are distributed near 0, neither too large nor too small.
Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution. Documents that contain frequent occurrences of words such as “bike”, “car”, “train”, “mileage”, and “speed” are likely to share a topic on “transportation” for example. Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities. The topics from documents that NTM learns are characterized as a latent representation because the topics are inferred from the observed word distributions in the corpus. The semantics of topics are usually inferred by examining the top ranking words they contain. Because the method is unsupervised, only the number of topics, not the topics themselves, are prespecified. In addition, the topics are not guaranteed to align with how a human might naturally categorize documents.
1×1 convolutions are called a bottleneck structure in CNNs. They:
- suffer less from overfitting due to the small kernel size
- can be used for feature pooling
- can help with dimensionality reduction
CNN is called “feature-location invariant”
CNN typical usage:

Conv2D -> Maxpooling2D -> Dropout -> Flatten -> Dense -> Dropout -> Softmax
Softmax is usually used as the last layer in multi-class classification problems. It can’t produce more than one label (sigmoid can).
Logistic activation, Sigmoid, and Soft Step all refer to the same function: f(x) = sigma(x) = 1/(1 + exp(-x))
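A short sketch contrasting the two output activations. Softmax turns the logits into a single probability distribution (one label wins), whereas an independent sigmoid per logit can switch on several labels at once. The logits and the 0.5 cut-off are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged.
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]
softmax_probs = softmax(logits)                    # sums to 1: pick one class
sigmoid_probs = [sigmoid(z) for z in logits]       # independent per-label scores
multi_labels = [p > 0.5 for p in sigmoid_probs]    # several labels may be True
print(softmax_probs)
print(sigmoid_probs, multi_labels)
```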
Entropy is a measure of the uncertainty associated with a given distribution – it measures how much information is required, on average, to identify random samples from that distribution. Cross entropy can be used to define a loss function in machine learning and optimization. Cross entropy is related to log-likelihood: maximizing the likelihood is the same as minimizing the cross-entropy.
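The definitions above can be checked with a few lines of Python. The distributions are made up; natural logarithms are used, so the results are in nats.

```python
import math


def entropy(p):
    """H(p) = -sum p_i * log(p_i); terms with p_i = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p_i * log(q_i), with p the true and q the predicted distribution."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


true_dist = [1.0, 0.0, 0.0]   # one-hot label: class 0
good_pred = [0.8, 0.1, 0.1]
bad_pred = [0.2, 0.5, 0.3]

good_loss = cross_entropy(true_dist, good_pred)
bad_loss = cross_entropy(true_dist, bad_pred)
print(good_loss, bad_loss)
print(entropy([0.5, 0.5]))    # a fair coin: log(2) nats
```

With a one-hot label, the cross-entropy reduces to the negative log-likelihood of the correct class, which is why minimizing it maximizes the likelihood.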
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of the text which has been machine-translated from one natural language to another. The range of this metric is from 0.0 (a perfect translation mismatch) to 1.0 (a perfect translation match).
For stochastic gradient descent, the batch size equals 1.

For batch gradient descent, the batch size equals the size of the training set.

And, for mini-batch gradient descent, the batch size is greater than 1 but less than the size of the training set.

A small batch size can result in:

  1) **Faster updates in the model weights**

  2) Noise and oscillations in the training process, which **might be able to escape the local minima**
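The three batch-size regimes differ only in how the training set is sliced per weight update, which the following sketch counts for a hypothetical dataset of 10 samples:

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of `samples`; each slice drives one weight update."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))
sgd_updates = len(list(iterate_batches(samples, 1)))                    # stochastic
mini_batch_updates = len(list(iterate_batches(samples, 4)))             # mini-batch
full_batch_updates = len(list(iterate_batches(samples, len(samples))))  # batch
print(sgd_updates, mini_batch_updates, full_batch_updates)
```

Smaller batches mean more updates per pass over the data, each computed from fewer samples and therefore noisier.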

For missing data, Deep learning is better suited to the imputation of categorical data. Square footage is numerical, which is better served by kNN.
In classification tasks with imbalanced class distributions, we should prefer StratifiedKFold over KFold.
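The idea behind stratification can be sketched without any library: split the indices of each class separately so every fold keeps roughly the same class ratio. This hand-rolled version is an illustration only (it does not shuffle, for instance) and is not scikit-learn's implementation.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Deal each class's indices round-robin across folds to preserve the class ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


labels = [0] * 8 + [1] * 4          # imbalanced: 8 negatives, 4 positives
folds = stratified_folds(labels, 4)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(folds)
print(positives_per_fold)
```

A plain sequential split of the same labels would leave the first folds with no positive examples at all.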

SageMaker Service Overview

Data Copy from S3 to Training Instance

File Mode:

Training job copies entire dataset from S3 to training instance beforehand
Space Needed: Entire data set + Final model artifacts
slower than Pipe mode
used for incremental training

Pipe Mode:

Training job streams data from S3 to training instance
Faster start time and Better Throughput
Space Needed: Final model artifacts
You MUST use protobuf RecordIO as your training data format before you can take advantage of the Pipe mode.

Data Format in SageMaker

Training Data Format

CSV

RecordIO: data types need to be int32, float32, or float64

Algorithm-specific formats (LibSVM, JSON, Parquet)

Data needs to be stored in S3

ContentTypes for Built-in Algorithms
ContentType	Algorithm
application/x-image	Object Detection Algorithm, Semantic Segmentation
application/x-recordio	Object Detection Algorithm
application/x-recordio-protobuf	Factorization Machines, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, Sequence-to-Sequence
application/jsonlines	BlazingText, DeepAR
image/jpeg	Object Detection Algorithm, Semantic Segmentation
image/png	Object Detection Algorithm, Semantic Segmentation
text/csv	IP Insights, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, XGBoost
text/libsvm	XGBoost

Inference Format

CSV

JSON

RecordIO

SageMaker Built-in Algos

BlazingText

Unsupervised -> word2vec

called word embedding
has multiple modes:
- CBOW (continuous bag of words)
- Skip-gram
- Batch skip-gram

Supervised -> multiclass, multilabel classification

It’s useful for NLP, but it is not an NLP algo.

Data Type

Supervised mode (text classification): one sentence per line; the first word is the string __label__ followed by the label.
word2vec wants a text file with one training sentence per line.
Training and Validation Data Format for the Word2Vec Algorithm

For Word2Vec training, upload the file under the train channel. No other channels are supported. The file should contain a training sentence per line.
Training and Validation Data Format for the Text Classification Algorithm

For supervised mode, you can train with file mode or with the augmented manifest text format.
Train with File Mode

For supervised mode, the training/validation file should contain a training sentence per line along with the labels. Labels are words that are prefixed by the string __label__. Here is an example of a training/validation file:

__label__4  linux ready for prime time , intel says , despite all the linux hype , the open-source movement has yet to make a huge splash in the desktop market . that may be about to change , thanks to chipmaking giant intel corp .

__label__2  bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly as the indian skippers return to international cricket was short lived .

Note

The order of labels within the sentence doesn’t matter.

Upload the training file under the train channel, and optionally upload the validation file under the validation channel.
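A minimal sketch of producing lines in the supervised file format described above. The helper name is hypothetical, and lowercasing plus whitespace tokenization is an assumed preprocessing choice that mirrors the sample lines rather than a requirement stated here.

```python
def to_blazingtext_line(labels, sentence):
    """Build one training line: __label__<x> prefixes followed by the space-separated tokens."""
    prefix = " ".join("__label__" + str(label) for label in labels)
    tokens = " ".join(sentence.lower().split())
    return prefix + " " + tokens


examples = [
    ([4], "Linux ready for prime time , Intel says"),
    ([2], "Bowled by the slower one again"),
]
lines = [to_blazingtext_line(labels, text) for labels, text in examples]
training_file_body = "\n".join(lines) + "\n"   # one sentence per line
print(training_file_body)
```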

Train with Augmented Manifest Text Format

The supervised mode also supports the augmented manifest format, which enables you to do training in pipe mode without needing to create RecordIO files. While using the format, an S3 manifest file needs to be generated that contains the list of sentences and their corresponding labels. The manifest file format should be in JSON Lines format in which each line represents one sample. The sentences are specified using the source tag and the label can be specified using the label tag. Both source and label tags should be provided under the AttributeNames parameter value as specified in the request.

{"source":"linux ready for prime time , intel says , despite all the linux hype", "label":1}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label":2}

Multi-label training is also supported by specifying a JSON array of labels.

{"source":"linux ready for prime time , intel says , despite all the linux hype", "label": [1, 3]}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label": [2, 4, 5]}
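The manifest lines shown above can be generated with the standard `json` module. This sketch assumes the `source` and `label` attribute names from the examples, and the sample sentences are shortened.

```python
import json


def manifest_line(sentence, label):
    """One JSON Lines record; `label` may be a single value or a list (multi-label)."""
    return json.dumps({"source": sentence, "label": label})


samples = [
    ("linux ready for prime time", 1),
    ("bowled by the slower one again", [2, 4, 5]),
]
manifest_body = "\n".join(manifest_line(s, l) for s, l in samples) + "\n"

# Round-trip check: every line must parse on its own.
parsed = [json.loads(line) for line in manifest_body.splitlines()]
print(manifest_body)
```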

Hyperparameters

Word2vec:
- mode( batch_skipgram, skipgram, cbow)
- learning_rate
- window_size
- vector_dim
- negative_samples
Text Classification
- epochs
- learning_rate
- word_ngrams
- vector_dim

Instance Types

For cbow and skip gram: any single CPU or GPU works (single ml.p3.2xlarge)
For batch_skipgram, can use single or multi CPU
For text classification: C5 if less than 2GB training data. For larger datasets, use a single GPU

Object2Vec

Supervised -> Classification, Regression

Unsupervised

Data Types

data must be tokenized into integers; training data consists of pairs/sequence of tokens: sentence - sentence, labels - sentences, customer - customer, product - product, user - item
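A sketch of the integer tokenization step for a pair of sentences. The vocabulary builder is a deliberately naive assumption (whitespace tokens, IDs in order of first appearance), and the `label`/`in0`/`in1` record keys are assumed here from the Object2Vec JSON Lines input format; verify them against the AWS documentation before use.

```python
import json


def build_vocab(sentences):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sentence, vocab):
    return [vocab[token] for token in sentence.split()]


# Hypothetical sentence pairs with a similarity label.
pairs = [
    ("reset my password", "password reset steps", 1),
    ("reset my password", "store opening hours", 0),
]
vocab = build_vocab(s for left, right, _ in pairs for s in (left, right))
records = [
    {"label": label, "in0": encode(left, vocab), "in1": encode(right, vocab)}
    for left, right, label in pairs
]
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```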

Hyperparameters

Usual ones
enc1_network, enc2_network - choose hcnn, bilstm, pooled_embedding…

Instance Type

can only train on a single machine (GPU or CPU)
Inference: use GPUs (ml.p3.2xlarge). Use INFERENCE_PREFERRED_MODE to optimize for encoder embeddings rather than classification or regression

Factorization Machines

Supervised -> Classification, Regression

Dealing with sparse data

Click prediction
Item recommendations

Limited to pair-wise interactions: user -> items, for example

Can be used to predict the following, given a matrix representing some pairs of things (users & items):

classification( click or not? Purchase or not? )
Value(predicted rating)

Usually used in the context of recommender system

Data Types

recordIO-protobuf with float32
sparse data means CSV isn’t practical

Instance Type

GPU or CPU
CPU recommended
GPU only works with dense data

K-Nearest Neighbors, KNN

Supervised -> Classification, Regression

Data Types

recordIO-protobuf or CSV, first column is label

File or Pipe mode on either

Hyperparameters

K!
sample_size

Instance Type

Training on CPU or GPU
Inference:
- CPU for lower latency
- GPU for higher throughput on large batches

Linear Learner

Supervised -> Classification, Regression

Data Types

recordio-protobuf float32, csv

Processing

Training data must be normalized (so that all features are weighted the same)
Input data should be shuffled

Training

uses stochastic gradient descent
Choose an optimization algo (Adam, Adagrad, SGD,…)
Multiple models are optimized in parallel
Tune L1, L2 regularization

Validation

the best-performing model is selected

Hyperparameters

balance_multiclass_weights
- Gives each class equal importance in loss functions
learning_rate, mini_batch_size
l1
wd
- Weight decay (L2 regularization)

Instance Types

single or multiple-machine CPU or GPU

XGBoost

Supervised -> Classification, Regression

Data Types

CSV, libsvm, recordIO-protobuf, parquet

Instance Types

Use CPUs only for multiple instance training; memory-bound, not compute bound; M5 is a good choice.
Use GPUs for single-instance training; like P3; must set tree_method to gpu_hist

DeepAR

Supervised -> Timeseries Forecasting

Uses RNN’s

Allows you to train the same model over several related timeseries.

Find frequencies and seasonality.

Data Types

JSON Lines format (optionally gzip-compressed) or Parquet
Each record must contain:
- start: the starting timestamp
- target: the timeseries values
Each record can contain:
- dynamic_feat: dynamic features
- cat: categorical features
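A sketch of one training record in JSON Lines form, using the required `start` and `target` fields plus the optional categorical and dynamic features. All values are invented, and the timestamp format shown is an assumption.

```python
import json

record = {
    "start": "2023-01-01 00:00:00",      # starting timestamp of the series
    "target": [12.0, 15.0, 11.0, 14.0],  # the timeseries values
    "cat": [0],                          # optional: categorical feature(s) of this series
    "dynamic_feat": [[0, 1, 0, 0]],      # optional: one list per dynamic feature
}
line = json.dumps(record)                # one record per line in the training file
print(line)
```

In this sketch each dynamic feature has the same number of points as `target`, so every time step has a matching feature value.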

Hyperparameters

context_length: number of time points the model sees before making a prediction; can be smaller than the seasonalities, because the model looks back up to one year through lagged values anyhow.
epochs
mini_batch_size
learning_rate
num_cells

Instance Types

Can use CPU or GPU, single or multi machine
CPU only for inference

Object Detection

Supervised -> Classification

Takes in images, output all instances of objects with categories and confidence scores

Transfer learning mode/ incremental training.

Data Types

RecordIO or image format (jpg, png)
with image format, supply a JSON file for annotation data for each image

Hyperparameters

Usual ones
Optimizer ( sgd, adam, rmsprop, adadelta…)

Instance Type

Use GPU for training
Use CPU or GPU for inference

Image Classification

Supervised -> Classification

What objects are in the image

Resnet CNN under the hood.

Full training mode: network initialized with random weights

Transfer learning mode: initialized with pre-trained weights; the top fully-connected layer is initialized with random weights

Data Types

Apache MXNet RecordIO (not protobuf)
Raw jpg or png
image format requires .lst files to associate image index, class label and path to the image
Augmented Manifest Image Format enables Pipe Mode
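A sketch of generating a `.lst` file for the raw image format, with one tab-separated line per image holding the image index, the class label, and the relative path. The image paths and labels are hypothetical.

```python
def lst_line(index, class_label, relative_path):
    """One .lst entry: image index, class label index, relative image path (tab-separated)."""
    return "\t".join([str(index), str(class_label), relative_path])


# Hypothetical images with their class labels (0 = cat, 1 = dog).
images = [("dog/img_001.jpg", 1), ("cat/img_002.jpg", 0)]
lst_body = "\n".join(
    lst_line(i, label, path) for i, (path, label) in enumerate(images)
) + "\n"
print(lst_body)
```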

Hyperparameters

Usual ones
Optimizer ( weight decay, beta1, beta2, eps, gamma…)

Instance Type

Use GPU for training
Use CPU or GPU for inference

Semantic Segmentation

Supervised -> Classification

it can detect objects in an image, shape of each object along with location and pixels that are part of the object.

Useful for self-driving vehicles.

Data Types

jpg images and png annotations
jpg images accepted for inference
Label maps to describe annotations
Augmented Manifest Image Format enables Pipe Mode

Hyperparameters

Usual ones
backbone

Instance Type

Only GPU for training on a single machine only
Use CPU or GPU for inference

Seq2Seq

Supervised -> Convert a sequence of tokens to another sequence of tokens

Seq2Seq algorithm is used for text summarization – It accepts a series of tokens as input and outputs another sequence of tokens.

Data Types

RecordIO-protobuf(must be int), start with tokenized text files

Must provide training, validation and vocabulary data.

Hyperparameters

batch_size
optimizer_type(Adam,sgd, rmsprop)
learning_rate
num_layers_encoder, num_layers_decoder
Can optimize on:
- accuracy(vs provided validation dataset)
- BLEU score(compares against multiple reference translations)
- Perplexity(cross-entropy)

Instance Types

Can only use GPU instance types.
Can only use a single machine for training, but can use multiple GPUs on one machine

K-Means

Unsupervised -> clustering

Data Types

Two data channels: train is required, test optional
- train ShardedByS3Key, test FullyReplicated
recordIO-protobuf or CSV
File or Pipe on either

Hyperparameters

K!
- Plot within-cluster sum of squares as function of K
- Use elbow method
- optimize for tightness of clusters
mini_batch_size
extra_center_factor
init_method
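The elbow method mentioned above can be sketched with a tiny one-dimensional k-means. This is a plain Lloyd's-algorithm toy with a deterministic initialization chosen for reproducibility; it is not the SageMaker K-Means implementation, and the data points are invented.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns the final cluster centers."""
    ordered = sorted(points)
    # Deterministic init: spread the starting centers across the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


def wcss(points, centers):
    """Within-cluster sum of squares: squared distance to the nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)


points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]   # three obvious groups
wcss_by_k = {k: wcss(points, kmeans_1d(points, k)) for k in range(1, 6)}
for k, value in wcss_by_k.items():
    print(k, round(value, 3))
```

The printed values drop steeply up to k = 3 and flatten afterwards; that bend is the "elbow" that suggests the number of clusters.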

Instance Type

CPU recommended

LDA

Unsupervised -> Topic Modeling (Document level)

need to define how many topics you want

CPU based

Data Types

Two data channels: train is required, test optional
recordIO-protobuf or CSV
words must be tokenized into int.
- every document must contain a count for every word in the vocab in csv
Pipe mode only supported with recordIO
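A sketch of turning documents into the per-document word-count rows described above, where each row has one count for every word in the vocabulary. The tokenization (whitespace split) and the alphabetical vocabulary order are assumptions for illustration.

```python
def build_vocabulary(documents):
    """Sorted list of distinct words; a word's position is its integer ID."""
    return sorted({word for doc in documents for word in doc.split()})


def to_count_row(document, vocabulary):
    """Bag-of-words vector: one count per vocabulary word, word order ignored."""
    words = document.split()
    return [words.count(term) for term in vocabulary]


documents = ["bike car bike", "train speed car"]
vocabulary = build_vocabulary(documents)
rows = [to_count_row(doc, vocabulary) for doc in documents]
csv_lines = [",".join(str(count) for count in row) for row in rows]
print(vocabulary)
print(csv_lines)
```

Because only counts are kept, reordering the words of a document produces the same row, which is the "bag-of-words" property.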

Hyperparameters

num_topics
alpha0
- initial guess for concentration parameter
- smaller values generate sparse topic mixtures

Instance Type

Single instance CPU training

Neural Topic Model(NTM)

Unsupervised -> Topic modeling, similar to LDA

need to define how many topics you want

Data Types

Four data channels: train is required, validation, test, auxiliary optional
recordIO-protobuf or CSV
words must be tokenized into int.
- every document must contain a count for every word in the vocab in csv
- the auxiliary channel is for the vocab.
File or Pipe mode

Hyperparameters

lowering mini_batch_size and learning_rate can reduce validation loss - at expense of training time
num_topics

Instance Type

GPU or CPU

GPU recommended for training
CPU ok for inference

PCA: Principal Component Analysis

Unsupervised -> Dimensionality reduction

Data Types

recordIO-protobuf or CSV
File or Pipe mode on either

Hyperparameters

algorithm_mode
subtract_mean

Instance Type

GPU or CPU

Random Cut Forest

Unsupervised -> anomaly detection

Data is sampled randomly

shows up in Kinesis Analytics as well, it works on streaming data too.

Data Types

RecordIO-protobuf or CSV
Can use File or Pipe mode on either

Hyperparameters

num_trees: increasing reduces noise
num_samples_per_tree: should be chosen such that 1/num_samples_per_tree approximates the ratio of anomalous to normal data
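The rule of thumb for `num_samples_per_tree` can be written as a one-line calculation. The helper name, the example counts, and the `num_trees` value are hypothetical.

```python
def suggested_samples_per_tree(anomaly_count, normal_count):
    """Pick n so that 1/n is close to the ratio of anomalous to normal points."""
    ratio = anomaly_count / normal_count
    return max(1, round(1.0 / ratio))


# Hypothetical dataset: roughly 50 anomalies among 12,800 normal points.
samples_per_tree = suggested_samples_per_tree(anomaly_count=50, normal_count=12800)
hyperparameters = {
    "num_trees": 100,                          # placeholder; more trees reduce noise
    "num_samples_per_tree": samples_per_tree,
}
print(hyperparameters)
```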

Instance Type

CPU for training
Use CPU for inference (ml.c5.xlarge)

IP Insights

Unsupervised -> Detect unusual network activity

automatically generates negative samples during training by randomly pairing entities and IPs

Data Types

CSV only - entity and IP

Hyperparameters

num_entity_vectors:
- Hash size
- set to twice the number of unique entity identifiers
vector_dim:
- size of embedding vectors
- Scales model size
- too large results in overfitting
epochs, learning_rate, batch_size, …

Instance Type

CPU or GPU
GPU recommended
can use multiple GPUs
size of CPU instance depends on vector_dim and num_entity_vectors

AWS Machine Learning Specialty Cheatsheet(3)
