ExecuTorch
ExecuTorch is a framework built from the ground up by the PyTorch team (the team behind the backbone of most AI libraries you use day to day) for running AI model inference on edge devices such as mobile phones and wearables. For more info, read here: https://pytorch.org/executorch/stable/index.html
It delivers better performance by optimising models specifically for mobile hardware.
Compare this to llama.cpp (which runs GGUF models): llama.cpp is a general-purpose inference engine that runs on a variety of platforms, aimed mainly at desktops, workstations, and servers. It works on mobile as well, but ExecuTorch is written for mobile first and foremost, so it can apply deeper mobile-specific optimisations. For more info about GGUF models, read here: https://www.layla-network.ai/post/what-are-gguf-models-what-are-model-quants
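For the curious, here is a minimal sketch of how a PyTorch module gets compiled into the PTE format that ExecuTorch runs, based on the export flow in the ExecuTorch docs (exact APIs may vary between versions; the tiny Net module is just a placeholder, not one of Layla's models):

```python
import torch
from executorch.exir import to_edge

# Toy module standing in for a real model (placeholder only).
class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 2)

    def forward(self, x):
        return self.linear(x)

model = Net().eval()
example_inputs = (torch.randn(1, 8),)

# Capture the graph with torch.export, lower it to the Edge dialect,
# then serialize it to the ExecuTorch .pte format.
exported = torch.export.export(model, example_inputs)
edge = to_edge(exported)
et_program = edge.to_executorch()

with open("net.pte", "wb") as f:
    f.write(et_program.buffer)
```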
How to enable ExecuTorch support in Layla
Layla supports running ExecuTorch models.
The first step is to enable the ExecuTorch mini-app within Layla:
This downloads the libraries needed to run PTE models.
Opening the ExecuTorch app, you will see a list of recommended models that work with Layla:
You can download them and get started right away!
If you want to try different models, you can find them here: https://huggingface.co/l3utterfly
Look for the model repositories ending in "executorch":
Go to the Files and Versions tab in the repo; there you will find three options:
ExecuTorch models are pre-compiled, so their context size is fixed (unlike llama.cpp, where you can change it on the fly). Higher context sizes use more memory, so choose one that is suitable for your phone. I suggest starting with 4096 and moving up or down depending on your results.
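To get a feel for why context size matters, here is a rough back-of-the-envelope estimate of KV-cache memory. The architecture numbers are assumptions for a Llama-3-8B-style model (32 layers, 8 KV heads, head dimension 128, fp16 cache), not measurements from Layla:

```python
# Rough KV-cache size estimate (illustrative; the architecture numbers
# are assumptions for a Llama-3-8B-style model, not measured in Layla).
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # 2x for keys and values; fp16 = 2 bytes per element.
    return 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_elem

for ctx in (2048, 4096, 8192):
    print(f"{ctx:>5} tokens -> ~{kv_cache_bytes(ctx) / 2**20:.0f} MiB")
# Under these assumptions, 4096 tokens of context already reserves
# roughly 512 MiB for the cache, and doubling the context doubles it.
```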
Go to your Inference Settings to check if your models are loaded properly:
You can see the model icon has changed to include the ExecuTorch logo when a suitable PTE is loaded. Make sure to select the Llama3 prompt! This is because all ExecuTorch models are Llama 3 based (for now).
Differences between GGUF and PTE models
GGUF and PTE (ExecuTorch) models work transparently in Layla. This means all features work out of the box no matter which model you select.
However, there are a few considerations:
ExecuTorch models use more memory. This is because ExecuTorch loads the whole model into memory at once instead of using memory-mapped files. This gives more consistent performance, but at the cost of using more memory, so close your background apps and make sure your phone has enough free RAM (see the first sketch after this list).
ExecuTorch does not support context shifting. GGUF models transparently extend the context by removing information from the start of the conversation, which gives the illusion of a conversation that continues indefinitely. ExecuTorch models will instead give an error when they reach the maximum context length (see the second sketch after this list).
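To illustrate the memory point above, here is a small sketch contrasting the two loading strategies (the model.bin filename is hypothetical):

```python
import mmap

path = "model.bin"  # hypothetical weights file

# Full load (ExecuTorch-style): the entire file is read into RAM up front.
with open(path, "rb") as f:
    weights = f.read()  # resident memory grows by the full file size

# Memory-mapped (llama.cpp-style): pages are faulted in only when touched,
# and the OS can evict them again under memory pressure.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_page = mm[:4096]  # only the touched pages become resident
    mm.close()
```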
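And to illustrate what context shifting does, a simplified sketch of the idea (this is not llama.cpp's actual code):

```python
def shift_context(tokens, max_ctx, keep_prefix=4):
    """Illustrative context shift: when the token list exceeds the model's
    context window, drop the oldest tokens while keeping a small prefix
    (e.g. the system prompt). A simplified sketch of the technique only."""
    if len(tokens) <= max_ctx:
        return tokens
    overflow = len(tokens) - max_ctx
    return tokens[:keep_prefix] + tokens[keep_prefix + overflow:]
```

A GGUF model applies something like this automatically as the chat grows; a PTE model has no equivalent, which is why it errors out at the limit.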