Monday, June 29, 2026

Efficient Sequence Modeling for On-Device ML


The increasing demand for machine learning (ML) model inference on-device (for mobile devices, tablets, etc.) is driven by the rise of compute-intensive applications, the need to keep certain data on device for privacy and security reasons, and the desire to provide services when a network connection may not be available. However, on-device inference introduces a myriad of challenges, ranging from modeling to platform support requirements. These challenges relate to how different architectures are designed to optimize memory and computation while still maintaining the quality of the model. From a platform perspective, the issue is identifying operations and building on top of them in a way that generalizes well across different product use cases.

In previous research, we combined a novel technique for generating embeddings (called projection-based embeddings) with efficient architectures like QRNN (pQRNN) and proved them to be competent for a number of classification problems. Augmenting these with distillation techniques provides an additional bump in end-to-end quality. Although this is an effective approach, it is not scalable to bigger and more extensive vocabularies (i.e., all possible Unicode or word tokens that can be fed to the model). Additionally, the output from the projection operation itself doesn't contain trainable weights to take advantage of pre-training the model.

Token-free models presented in ByT5 are a good starting point for on-device modeling that can address pre-training and scalability issues without the need to increase the size of the model. This is possible because these approaches treat text inputs as a stream of bytes (each byte has a value that ranges from 0 to 255), which can reduce the vocabulary size for the embedding tables from ~30,000 to 256. Although ByT5 presents a compelling alternative for on-device modeling, going from word-level representation to byte stream representation increases the sequence lengths linearly; with an average word length of four characters and a single character having up to four bytes, the byte sequence length increases proportionally to the word length. This can lead to a significant increase in inference latency and computational costs.
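To make the trade-off concrete, here is a minimal sketch that compares word-level and byte-level lengths for one short string. The sample text and the helper name `byte_stream` are illustrative assumptions and are not part of ByT5 or SeqFlowLite.

```python
from typing import List


def byte_stream(text: str) -> List[int]:
    """Return the UTF-8 bytes of `text` as integers in the range 0..255."""
    return list(text.encode("utf-8"))


# Hypothetical input; the accented characters take 2 bytes each in UTF-8.
sample = "naïve café"
words = sample.split()
ids = byte_stream(sample)

print("word tokens:", len(words))
print("byte tokens:", len(ids))
print("all IDs fit a 256-entry table:", all(0 <= b <= 255 for b in ids))
```

The embedding table shrinks to 256 rows, but the model now has to process several times as many positions for the same text.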

We address this problem by developing and releasing three novel byte-stream sequence models for the SeqFlowLite library (ByteQRNN, ByteTransformer and ByteFunnelTransformer), all of which can be pre-trained on unsupervised data and fine-tuned for specific tasks. These models leverage recent innovations introduced by Charformer, including a fast character Transformer-based model that uses a gradient-based subword tokenization (GBST) approach to operate directly at the byte level, as well as a "soft" tokenization approach, which allows us to learn token boundaries and reduce sequence lengths. In this post, we focus on ByteQRNN and demonstrate that the performance of a pre-trained ByteQRNN model is comparable to BERT, despite being 300x smaller.

Sequence Model Architecture

We leverage pQRNN, ByT5 and Charformer along with platform optimizations, such as in-training quantization (which tracks minimum and maximum float values for model activations and weights for quantizing the inference model) that reduces model sizes by one-fourth, to develop an end-to-end model called ByteQRNN (shown below). First, we use a ByteSplitter operation to split the input string into a byte stream and feed it to a smaller embedding table that has a vocabulary size of 259 (256 + 3 additional meta tokens).
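The byte-splitting step can be sketched as follows. This is not the SeqFlowLite ByteSplitter op: the function `split_bytes` is hypothetical, and the choice of padding, beginning-of-sequence and end-of-sequence as the three meta tokens (with IDs 0 to 2) is an assumption, since the post only states that there are three.

```python
from typing import List

# Assumed meta tokens; the post does not name them or fix their IDs.
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2
NUM_META = 3
VOCAB_SIZE = 256 + NUM_META  # 259, matching the embedding table size in the text


def split_bytes(text: str, max_len: int) -> List[int]:
    """Map a string to byte IDs shifted past the meta tokens, then pad.

    Inputs longer than `max_len` are truncated, which may drop the EOS token.
    """
    ids = [BOS_ID] + [b + NUM_META for b in text.encode("utf-8")] + [EOS_ID]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


ids = split_bytes("hi", max_len=6)
print(ids)
print("vocabulary size:", VOCAB_SIZE)
```

Shifting the raw byte values by the number of meta tokens keeps the two ID ranges from colliding, so a single 259-row embedding table can serve both.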

The output from the embedding layer is fed to the GBST layer, which is equipped with in-training quantization and combines byte-level representations with the efficiency of subword tokenization while enabling end-to-end learning of latent subwords. We "soft" tokenize the byte stream sequences by enumerating and combining each subword block length with scores (computed with a quantized dense layer) at each strided token position (i.e., at token positions that are selected at regular intervals). Next, we downsample the byte stream to a manageable sequence length and feed it to the encoder layer.
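The "soft" tokenization idea can be sketched in plain Python. This is a simplified illustration rather than the SeqFlowLite GBST layer: it uses float arithmetic instead of a quantized dense layer, scores every position rather than only strided ones, and uses mean pooling for both the block candidates and the final downsampling. The function names, the scoring weights and the block sizes are all assumptions.

```python
import math
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def block_candidates(x: List[Vector], block: int) -> List[Vector]:
    """Mean-pool non-overlapping blocks, then repeat so the length matches x."""
    out: List[Vector] = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        out.extend([mean(chunk)] * len(chunk))
    return out


def softmax(scores: Vector) -> Vector:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_tokenize(x: List[Vector], score_w: Vector,
                  block_sizes: List[int], stride: int) -> List[Vector]:
    """Blend block candidates per position, then downsample by `stride`."""
    cands = [block_candidates(x, b) for b in block_sizes]
    mixed: List[Vector] = []
    for pos in range(len(x)):
        # One scalar score per block size (float stand-in for the dense layer).
        scores = [sum(w * v for w, v in zip(score_w, c[pos])) for c in cands]
        weights = softmax(scores)
        mixed.append([sum(wt * c[pos][d] for wt, c in zip(weights, cands))
                      for d in range(len(x[pos]))])
    # Mean-pool every `stride` positions to shorten the sequence.
    return [mean(mixed[i:i + stride]) for i in range(0, len(mixed), stride)]


# Hypothetical 8-position byte embedding sequence with 2 dimensions.
embeddings = [[float(i), float(i % 2)] for i in range(8)]
pooled = soft_tokenize(embeddings, score_w=[0.1, -0.1],
                       block_sizes=[1, 2, 4], stride=4)
print(len(embeddings), "->", len(pooled))
```

Because the block weights come from a softmax rather than a hard choice, the mixing is differentiable, which is what lets the token boundaries be learned end to end.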

The output from the GBST layer can be downsampled to a lower sequence length for efficient encoder computation, or it can be used by an encoder like Funnel Transformer, which pools the query length and reduces the self-attention computation, to create the ByteFunnelTransformer model. The encoder in the end-to-end model can be replaced with any other encoder layer, such as the Transformer from the SeqFlowLite library, to create a ByteTransformer model.

A diagram of a generic end-to-end sequence model using byte stream input. The ByteQRNN model uses a QRNN encoder from the SeqFlowLite library.

In addition to the input embeddings (i.e., the output from the embedding layer described above), we go a step further to build an effective sequence-to-sequence (seq2seq) model. We do so by taking ByteQRNN and adding a Transformer-based decoder model along with a quantized beam search (or tree exploration) to go with it. The quantized beam search module reduces the inference latency when generating decoder outputs by computing the most likely beams (i.e., possible output sequences) using the logarithmic sum of previous and current probabilities, and returns the resulting top beams. Here the system uses a more efficient 8-bit integer (uint8) format, compared to a typical single-precision floating-point (float32) model.
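The beam scoring described above, where each beam's score is the sum of the log-probabilities of its tokens, can be sketched as follows. This sketch uses ordinary float arithmetic and does not reproduce the quantized uint8 implementation. The `beam_search` function, the `toy_step` decoder stand-in and its probabilities are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Beam = Tuple[float, List[int]]  # (cumulative log-probability, token IDs)


def beam_search(step_fn: Callable[[List[int]], List[float]],
                beam_width: int, max_steps: int, eos_id: int) -> List[Beam]:
    """Keep the `beam_width` highest-scoring sequences at every step."""
    beams: List[Beam] = [(0.0, [])]
    for _ in range(max_steps):
        candidates: List[Beam] = []
        for score, tokens in beams:
            if tokens and tokens[-1] == eos_id:
                candidates.append((score, tokens))  # finished beams carry over
                continue
            for token_id, prob in enumerate(step_fn(tokens)):
                if prob > 0.0:
                    # Previous log-probability plus the current one.
                    candidates.append((score + math.log(prob),
                                       tokens + [token_id]))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
        if all(t and t[-1] == eos_id for _, t in beams):
            break
    return beams


def toy_step(tokens: List[int]) -> List[float]:
    """Hypothetical next-token distribution over a 3-token vocabulary."""
    if not tokens:
        return [0.6, 0.4, 0.0]
    return [0.1, 0.2, 0.7]  # token 2 acts as end-of-sequence


top_beams = beam_search(toy_step, beam_width=2, max_steps=5, eos_id=2)
for score, tokens in top_beams:
    print(tokens, round(math.exp(score), 4))
```

Adding log-probabilities instead of multiplying probabilities avoids underflow on long sequences and turns the per-step update into a single addition.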

The decoder Transformer model uses a merged attention sublayer (MAtt) to reduce the complexity of the decoder self-attention from quadratic to linear, thereby lowering the end-to-end latency. For each decoding step, MAtt uses a fixed-size cache for decoder self-attention, compared to the increasing cache size of a traditional Transformer decoder. The following figure illustrates how the beam search module interacts with the decoder layer to generate output tokens on-device using an edge device (e.g., mobile phones, tablets, etc.).

A comparison of cloud server decoding and on-device (edge device) implementation. Left: Cloud server beam search employs a Transformer-based decoder model with quadratic-time self-attention in float32, which has an increasing cache size for each decoding step. Right: The edge device implementation employs a quantized beam search module along with a fixed-size cache and a linear-time self-attention computation.
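The difference between a growing and a fixed-size decoder cache can be illustrated with a small sketch. The post does not spell out the internals of MAtt, so this assumes the self-attention over past positions is replaced by a cumulative average, which needs only a running sum and a step count. Both classes are hypothetical and are not taken from SeqFlowLite.

```python
from typing import List


class GrowingCache:
    """Stores every past decoder state, as standard self-attention requires."""

    def __init__(self) -> None:
        self.states: List[List[float]] = []

    def update(self, state: List[float]) -> int:
        self.states.append(state)
        return len(self.states)  # cached vectors grow by one per step


class FixedCache:
    """Keeps a running sum and a step count (cumulative-average stand-in)."""

    def __init__(self, dim: int) -> None:
        self.total = [0.0] * dim
        self.steps = 0

    def update(self, state: List[float]) -> List[float]:
        self.total = [t + s for t, s in zip(self.total, state)]
        self.steps += 1
        # Average of all states seen so far, computed in O(dim) per step.
        return [t / self.steps for t in self.total]


growing, fixed = GrowingCache(), FixedCache(dim=2)
for step_state in ([1.0, 1.0], [3.0, 3.0], [5.0, 5.0]):
    cached = growing.update(step_state)
    context = fixed.update(step_state)
    print("growing cache holds", cached, "vectors; fixed cache holds 1;",
          "context =", context)
```

Each step of the fixed cache costs the same regardless of how many tokens have been generated, which is where the linear overall cost comes from under this assumption.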

Evaluation

After developing ByteQRNN, we evaluate its performance on the civil_comments dataset using the area under the curve (AUC) metric and compare it to a pre-trained ByteQRNN and BERT (shown below). We demonstrate that the fine-tuned ByteQRNN improves the overall quality and brings its performance closer to the BERT models, despite being 300x smaller. Since SeqFlowLite models support in-training quantization that reduces model sizes by one-fourth, the resulting models scale well to low-compute devices. We chose multilingual data sources related to the task for pre-training both BERT and byte stream models to achieve the best possible performance.

Comparison of ByteQRNN with fine-tuned ByteQRNN and BERT on the civil_comments dataset.
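The min/max tracking behind the quantization mentioned above can be sketched as a simple affine mapping from floats to 8-bit integers. This is a generic illustration of range-based quantization and not the SeqFlowLite implementation. The functions `quantize` and `dequantize` and the sample weights are assumptions.

```python
from typing import List, Tuple


def quantize(values: List[float]) -> Tuple[List[int], float, float]:
    """Map floats to integers in 0..255 using the observed min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant input
    return [round((v - lo) / scale) for v in values], lo, scale


def dequantize(q: List[int], lo: float, scale: float) -> List[float]:
    """Recover approximate floats from the 8-bit codes."""
    return [lo + qi * scale for qi in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # hypothetical float weights
codes, lo, scale = quantize(weights)
restored = dequantize(codes, lo, scale)

max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", codes)
print("max reconstruction error:", max_error)
print("bytes as float32:", 4 * len(weights), "bytes as uint8:", len(codes))
```

Each stored value takes one byte instead of four, at the cost of a rounding error of at most half a quantization step.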

Conclusion

Following up on our previous work with pQRNN, we evaluate byte stream models for on-device use to enable pre-training and thereby improve model performance for on-device deployment. We present an evaluation for ByteQRNN with and without pre-training and demonstrate that the performance of the pre-trained ByteQRNN is comparable to BERT, despite being 300x smaller. In addition to ByteQRNN, we are also releasing ByteTransformer and ByteFunnelTransformer, two models that use different encoders, along with the merged attention decoder model and the beam search driver to run inference through the SeqFlowLite library. We hope these models will provide researchers and product developers with valuable resources for future on-device deployments.

Acknowledgements

We would like to thank Khoa Trinh, Jeongwoo Ko, Peter Young and Yicheng Fan for helping with open-sourcing and evaluating the model. Thanks to Prabhu Kaliamoorthi for all the brainstorming and ideation. Thanks to Vinh Tran, Jai Gupta and Yi Tay for their help with pre-training byte stream models. Thanks to Ruoxin Sang, Haoyu Zhang, Ce Zheng, Chuanhao Zhuge and Jieying Luo for helping with the TPU training. Many thanks to Erik Vee, Ravi Kumar and the Learn2Compress leadership for sponsoring the project and for their support and encouragement. Finally, we would like to thank Tom Small for the animated figure used in this post.

