Tutorial: TMVA PyTorch Interface
Get started with the TMVA PyTorch Interface. Combine the power of PyTorch & ROOT!
- Introduction
- Imports
- Setup TMVA
- DataLoader
- Generate model
- Save & Store model
- Book TMVA methods
- Training
- Testing
- Evaluation
- Plots
Introduction
This tutorial walks through the latest addition to the ROOT TMVA module, the PyTorch Interface! It is specifically designed to use the power of ROOT for working with high energy physics data, while leveraging the flexibility of the popular Machine Learning framework, PyTorch.
The PyTorch interface gives HEP scientists more room to experiment with ideas and lets them customize Machine Learning models to a far greater extent than the Keras Interface.
Here, we'll build a simple classifier in Python using PyTorch and compare its performance with a few other methods on a test example ROOT dataset. The same model can also be built in C++, like the other TMVA methods, which are implemented in C++. You can follow this if you prefer a C++ implementation.
Imports
We start by importing the modules required for the tutorial:
from ROOT import TMVA, TFile, TTree, TCut
from subprocess import call
from os.path import isfile
import torch
from torch import nn
Setup TMVA
TMVA requires PyMVA to be initialized before it can use PyTorch. PyMVA is the interface for third-party MVA tools based on Python. It was created to make powerful external libraries easily accessible through direct integration into the TMVA workflow. All PyMVA methods provide the same plug-and-play mechanisms as the TMVA methods. Because the base method of PyMVA inherits from the TMVA base method, all options of internal TMVA methods apply to PyMVA methods as well.
TMVA.Tools.Instance()
TMVA.PyMethodBase.PyInitialize()
output = TFile.Open('TMVA.root', 'RECREATE')
factory = TMVA.Factory('TMVAClassification', output,
                       '!V:!Silent:Color:DrawProgressBar:'
                       'Transformations=D,G:AnalysisType=Classification')
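The factory options above form a colon-separated string of boolean flags (negated with a leading `!`) and `Key=Value` pairs. As a small convenience sketch, a hypothetical helper `build_options` (my own name, not part of ROOT) could assemble such a string from Python values:

```python
def build_options(flags=None, **settings):
    """Assemble a TMVA-style option string (hypothetical helper, not a ROOT API).

    flags maps a flag name to True/False; False is written as '!Name'.
    settings are written as 'Key=Value' pairs.
    """
    parts = []
    for name, enabled in (flags or {}).items():
        parts.append(name if enabled else f"!{name}")
    for key, value in settings.items():
        parts.append(f"{key}={value}")
    return ":".join(parts)


# Rebuild the factory option string used in this tutorial
factory_options = build_options(
    {"V": False, "Silent": False, "Color": True, "DrawProgressBar": True},
    Transformations="D,G",
    AnalysisType="Classification",
)
print(factory_options)
```

The resulting string can be passed wherever the tutorial writes the options out by hand.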
DataLoader
if not isfile('tmva_class_example.root'):
    # -L follows redirects, in case the server redirects the http URL
    call(['curl', '-L', '-O', 'http://root.cern.ch/files/tmva_class_example.root'])
data = TFile.Open('tmva_class_example.root')
signal = data.Get('TreeS')
background = data.Get('TreeB')
dataloader = TMVA.DataLoader('dataset')
for branch in signal.GetListOfBranches():
    dataloader.AddVariable(branch.GetName())
dataloader.AddSignalTree(signal, 1.0)
dataloader.AddBackgroundTree(background, 1.0)
dataloader.PrepareTrainingAndTestTree(TCut(''),
                                      'nTrain_Signal=4000:'
                                      'nTrain_Background=4000:'
                                      'SplitMode=Random:'
                                      'NormMode=NumEvents:!V')
Generate model
model = nn.Sequential()
model.add_module('linear_1', nn.Linear(in_features=4, out_features=64))
model.add_module('relu', nn.ReLU())
model.add_module('linear_2', nn.Linear(in_features=64, out_features=2))
model.add_module('softmax', nn.Softmax(dim=1))
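Before handing the model to TMVA, it can help to check that it produces what we expect. The sketch below rebuilds the same architecture and pushes a random batch through it; the batch of 8 random events is only a stand-in for real data.

```python
import torch
from torch import nn

# Same architecture as in the tutorial: 4 input variables, 2 output classes
model = nn.Sequential()
model.add_module('linear_1', nn.Linear(in_features=4, out_features=64))
model.add_module('relu', nn.ReLU())
model.add_module('linear_2', nn.Linear(in_features=64, out_features=2))
model.add_module('softmax', nn.Softmax(dim=1))

# Assumption: a random batch of 8 events stands in for real data
dummy_batch = torch.randn(8, 4)
with torch.no_grad():
    probs = model(dummy_batch)

print(probs.shape)       # one row per event, one column per class
print(probs.sum(dim=1))  # each row sums to 1 because of the softmax
```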
Define the loss function and the optimizer.
loss = torch.nn.MSELoss()
optimizer = torch.optim.SGD
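To see what this loss computes, here is a tiny worked example. It assumes the targets are one-hot rows with the same shape as the model output; the numbers are made up for illustration. It also shows why the optimizer is kept as a class: it can only be instantiated once the model parameters are available.

```python
import torch

criterion = torch.nn.MSELoss()

# Assumption: targets are one-hot rows with the same shape as the model output
output = torch.tensor([[0.8, 0.2], [0.3, 0.7]])
target = torch.tensor([[1.0, 0.0], [0.0, 1.0]])

# Mean of squared differences over all 4 entries:
# (0.04 + 0.04 + 0.09 + 0.09) / 4 = 0.065
loss_value = criterion(output, target)
print(f"{loss_value.item():.3f}")

# The optimizer is passed around as a class and only instantiated
# later, once the parameters to optimize are known
layer = torch.nn.Linear(4, 2)
trainer = torch.optim.SGD(layer.parameters(), lr=0.01)
```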
Define the train and predict functions. Note that the arguments of the train and predict functions are fixed, since the TMVA interface backend calls these functions internally. The user controls the training process and the loop inside. You can use tqdm for a nice progress bar! 😁
def train(model, train_loader, val_loader, num_epochs,
          batch_size, optimizer, criterion, save_best, scheduler):
    trainer = optimizer(model.parameters(), lr=0.01)
    schedule, schedulerSteps = scheduler
    best_val = None
    for epoch in range(num_epochs):
        # Training Loop
        # Set to train mode
        model.train()
        running_train_loss = 0.0
        running_val_loss = 0.0
        for i, (X, y) in enumerate(train_loader):
            trainer.zero_grad()
            output = model(X)
            train_loss = criterion(output, y)
            train_loss.backward()
            trainer.step()
            # print train statistics
            running_train_loss += train_loss.item()
            if i % 64 == 63:  # print every 64 mini-batches
                print(f"[Epoch {epoch+1}, {i+1}] train loss: "
                      f"{running_train_loss / 64:.3f}")
                running_train_loss = 0.0
        if schedule:
            schedule(optimizer, epoch, schedulerSteps)
        # Validation Loop
        # Set to eval mode
        model.eval()
        with torch.no_grad():
            for i, (X, y) in enumerate(val_loader):
                output = model(X)
                val_loss = criterion(output, y)
                running_val_loss += val_loss.item()
            curr_val = running_val_loss / len(val_loader)
            if save_best:
                if best_val is None:
                    best_val = curr_val
                best_val = save_best(model, curr_val, best_val)
            # print val statistics per epoch
            print(f"[Epoch {epoch+1}] val loss: {curr_val:.3f}")
            running_val_loss = 0.0
    print(f"Finished Training on {epoch+1} Epochs!")
    return model
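The `save_best` callback in the loop above is supplied by the TMVA backend, and its implementation is not shown in this post. Judging only from the call `save_best(model, curr_val, best_val)`, a callback of that shape might look roughly like the sketch below. The `save_path` argument, the stand-in model and the validation losses are all assumptions made for illustration.

```python
import os
import tempfile

import torch
from torch import nn


def save_best(model, curr_val, best_val, save_path="best_model.pt"):
    """Hypothetical callback: keep the model with the lowest validation loss."""
    if curr_val <= best_val:
        best_val = curr_val
        torch.jit.save(torch.jit.script(model), save_path)
    return best_val


# Usage with a stand-in model and made-up validation losses
model = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=1))
path = os.path.join(tempfile.mkdtemp(), "best_model.pt")

best_val = None
for curr_val in [0.30, 0.25, 0.28]:
    if best_val is None:
        best_val = curr_val
    best_val = save_best(model, curr_val, best_val, save_path=path)

print(best_val)
```

Returning the updated best value keeps the bookkeeping inside the callback, which matches how the training loop reassigns `best_val` after every epoch.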
Define the predict function. The predict method converts the input it receives from TMVA into a torch tensor. Similarly, we need to return the predictions to TMVA as a numpy array.
def predict(model, test_X, batch_size=32):
    # Set to eval mode
    model.eval()
    X = torch.Tensor(test_X)
    with torch.no_grad():
        predictions = model(X)
    return predictions.numpy()
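The predict function above accepts `batch_size` but evaluates all events in one go. If memory is a concern, a variant that actually processes the input in chunks could look like the sketch below. `predict_batched` is a hypothetical name, and the random input and small stand-in model are only there so the example runs on its own; it assumes a non-empty input.

```python
import numpy as np
import torch
from torch import nn


def predict_batched(model, test_X, batch_size=32):
    """Hypothetical variant of predict() that evaluates in chunks of batch_size."""
    model.eval()
    X = torch.Tensor(test_X)
    outputs = []
    with torch.no_grad():
        for start in range(0, len(X), batch_size):
            outputs.append(model(X[start:start + batch_size]))
    # Assumes at least one chunk was processed (non-empty input)
    return torch.cat(outputs).numpy()


# Stand-in model and random events, for illustration only
model = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=1))
test_X = np.random.rand(70, 4).astype(np.float32)
scores = predict_batched(model, test_X, batch_size=32)
print(type(scores), scores.shape)
```

The signature and the numpy return type stay the same as in the original function, so the two should be interchangeable from the point of view of the caller.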
We have now defined the components needed for downloading the data, preprocessing, data loading, building our model, the training loop, and a prediction method for evaluation. We need to share some of these components with the TMVA interface backend.
This can be done simply by defining a dictionary: load_model_custom_objects
Keys:
- "optimizer"
- "criterion"
- "train_func"
- "predict_func"
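Since these entries are looked up by name, a misspelled or missing key is easy to overlook. A small check such as the one below could catch that before booking the method. `check_custom_objects` is a hypothetical helper of my own, not part of TMVA, and the example values are stand-ins.

```python
REQUIRED_KEYS = ("optimizer", "criterion", "train_func", "predict_func")


def check_custom_objects(objects):
    """Hypothetical sanity check, not part of TMVA: list missing or non-callable entries."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in objects:
            problems.append(f"missing key: {key}")
        elif not callable(objects[key]):
            problems.append(f"not callable: {key}")
    return problems


# Stand-in values so the sketch runs on its own; in the tutorial these
# would be the optimizer class, loss object, train and predict functions
example = {"optimizer": object, "criterion": abs, "train_func": print}
print(check_custom_objects(example))
```

An empty list means all four entries are present and callable.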
# Pass optimizer, loss, train, predict function objects,
# defined earlier, as values to the dictionary
load_model_custom_objects = {"optimizer": optimizer, "criterion": loss,
                             "train_func": train, "predict_func": predict}
Save & Store model
print(model)
m = torch.jit.script(model)
torch.jit.save(m, "model.pt")
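To convince yourself that the stored file really contains the model, you can load it back with `torch.jit.load` and compare outputs. The sketch below rebuilds the architecture and writes to a temporary directory, so it does not touch the `model.pt` used by the tutorial; the random input is an assumption for illustration.

```python
import os
import tempfile

import torch
from torch import nn

# Same layer sizes as the tutorial model
model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                      nn.Linear(64, 2), nn.Softmax(dim=1))

# Save the scripted model to a temporary location
path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.jit.save(torch.jit.script(model), path)

# Load it back and compare predictions on a random batch
reloaded = torch.jit.load(path)
x = torch.randn(3, 4)
with torch.no_grad():
    same = torch.allclose(model(x), reloaded(x))
print(same)
```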
Book TMVA methods
factory.BookMethod(dataloader, TMVA.Types.kFisher, 'Fisher',
                   '!H:!V:Fisher:VarTransform=D,G')
factory.BookMethod(dataloader, TMVA.Types.kPyTorch, 'PyTorch',
                   'H:!V:VarTransform=D,G:FilenameModel=model.pt:'
                   'NumEpochs=20:BatchSize=32')
Training
factory.TrainAllMethods()
Testing
factory.TestAllMethods()
Evaluation
factory.EvaluateAllMethods()
Plots
roc = factory.GetROCCurve(dataloader)
roc.Draw()
With this I've wrapped up my project. Stay tuned for my final post about my GSoC journey at CERN.
Feel free to ask questions in the comments below or on the ROOT forum.
Until next time,
Anirudh Dagar