We're excited to announce significant updates to Layla, bringing powerful new capabilities and improvements across the board. This release focuses on expanding hardware support, enhancing the user interface, and fixing several important issues to provide a more robust experience.
Important change in this version
ARM quants have been consolidated into Q4_0.
Previously, you could choose between three different ARM quants that run on mobile hardware:
- Q4_0_4_4
- Q4_0_4_8
- Q4_0_8_8
In this version, all three have been consolidated into a single Q4_0. This means your previously downloaded models will no longer work, and you should download the new quants from our official Hugging Face repository: https://huggingface.co/l3utterfly
Layla supports GPU inference for LLMs!
One of the most significant additions in this update is GPU inference support for LLMs, with compatibility for both Vulkan and OpenCL backends. GPU support offloads the inference to your phone's dedicated graphics hardware.
Note that this is not expected to give dramatically faster response times, since mobile GPUs are not as powerful as desktop ones. However, offloading inference to the GPU frees the CPU for other tasks, such as background apps or long-term memory processing, providing a more consistent experience.
OpenCL generally works better on Adreno GPUs, while Vulkan supports a wider variety of phone hardware.
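For readers curious how these backends relate to the underlying engine: Layla's quant names (Q4_0, GGUF) suggest a llama.cpp-style runtime, where the equivalent backends are selected at build time with CMake flags. The flags below are from recent upstream llama.cpp and are an assumption about Layla's internals, shown purely for orientation:

```shell
# Build with the Vulkan backend (broad device coverage)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# Or build with the OpenCL backend (tuned for Adreno GPUs)
cmake -B build -DGGML_OPENCL=ON
cmake --build build --config Release
```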
Dramatic speed-up of image generation via the NPU!
We've also introduced NPU inference support for Stable Diffusion, marking a major step forward in processing capabilities.
NPUs (neural processing units) are specialized processors designed to run neural networks (AI). Layla uses the Qualcomm AI Engine to offload Stable Diffusion models to its dedicated hardware, the HTP (Hexagon Tensor Processor):
- generates an image in ~10 seconds!
- negligible RAM usage
- low power consumption
In the Stable Diffusion mini-app, choose models tagged with "qnn", featuring the Qualcomm NPU icon at the top right corner.
IMPORTANT: Only the following chipsets are supported: Snapdragon 8 Gen 2, Snapdragon 8 Gen 3, and Snapdragon 8 Elite
Note: the actual image generation takes less than a second (~100 ms per iteration), but loading the model from disk and copying it to the NPU takes a few seconds, so in real-world usage the full process takes about 10 seconds. In Qualcomm's advertisement video, the model is pre-loaded.
Improved User Experience
We've made substantial improvements to the user interface, starting with a redesigned Lorebook UI that better handles large document collections. The model import interface has been refined for greater clarity and ease of use. We've also enhanced the Long-term Memory feature by adding timestamps to the table view, making it easier to track and manage your conversation history.
The backup process has been streamlined, now allowing direct selection of save folders. We've introduced a new Download Manager app that provides visibility and control over download tasks, including the ability to cancel stuck downloads. The chat interface has been enhanced with redesigned quick actions, featuring an always-visible copy button and a new context menu accessible via tap and hold.
Expanded Model Support
This update brings several new model options to Layla:
- Addition of Whisper Base and Whisper Base (English) models with configurable language detection
- Support for the sherpa-onnx TTS engine APK
- Automatic conversion of Q4_0 quantizations to match your current architecture
Enhanced Creation Tools
Character creation has been improved with direct TavernPNG saving to the file system. The AI-powered character image generation now utilizes the default negative prompt configured in the SD mini-app, ensuring more consistent results.
Bug Fixes and Stability Improvements
We've addressed several important issues to improve stability and reliability:
- Resolved issues with chat history imports
- Fixed Layla Cloud's handling of extensive conversation histories
- Corrected memory ingestion failures caused by single memory errors
- Improved the chat interface to prevent quick actions from overwhelming the screen
- Fixed character response styling to properly reflect chat accent colors
- Resolved issues with default character image generation fallback phrases
The full changelog can be found below.
New features:
- Layla supports GPU inference! Supports Vulkan and OpenCL backends
- Layla supports NPU inference for Stable Diffusion!
- Layla supports reasoning models (DeepSeek R1 family)!
Improvements:
- redesigned Lorebook UI to better handle large document collections
- improved UI of model import
- added timestamps to the Long-term Memory table view
- backing up data now lets you choose the folder to save to directly
- added a Download Manager app to view/cancel download tasks in case they get stuck
- added Whisper Base and Whisper Base (English) models
- added the ability to configure the language Whisper models listen in
- Q4_0 quants are now automatically converted on the fly to support your current architecture
- TavernPNGs can now be saved directly to the file system during character creation
- supports the sherpa-onnx TTS engine APK
- redesigned chat message quick actions (the copy button is now always visible; tap & hold a message to bring up a context menu with more actions)
- Create Character (AI) image generation now uses the default negative prompt configured in the SD mini-app
Bug fixes:
- fixed a bug when importing chat history
- fixed a bug in Layla Cloud when handling very long conversation histories
- fixed a bug where an error in one memory would stop ingestion of all LTM memories
- fixed a bug where too many quick actions took up the whole screen in chat
- fixed a bug where the chat accent colour was not applied to character responses
- fixed a bug in the default character image generation fallback phrase