This is a Java implementation of a GPT-3/GPT-4 tokenizer, loosely ported from tiktoken with the help of ChatGPT. ...that all 3.5-turbo models released after 0613 now have ...
A C++ Vietnamese tokenizer used in Cốc Cốc Search and Ads. It ships three interfaces: CLI tools (`tokenizer`, `vn_lang_tool`), a pure-Java Maven module (`java/`), and Cython Python bindings ...
I have implemented a parallel tokenizer in Java for my Polymorph Data Language (PDL); it can use all the CPU cores of my machine (14 cores, 20 threads). The PDL scripts are divided into blocks ...
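
To make the block-splitting idea concrete, here is a minimal sketch in Java. It is not the PDL tokenizer. The class `ParallelBlockTokenizer`, the fixed-size line blocks, and the whitespace-based `tokenizeBlock` are all hypothetical stand-ins, and the sketch assumes that no token spans a block boundary, so each block can be tokenized independently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names come from PDL.
class ParallelBlockTokenizer {

    // Assumption: a "block" is simply a run of `linesPerBlock` consecutive lines.
    static List<String> splitIntoBlocks(String script, int linesPerBlock) {
        String[] lines = script.split("\n");
        List<String> blocks = new ArrayList<>();
        for (int i = 0; i < lines.length; i += linesPerBlock) {
            int end = Math.min(i + linesPerBlock, lines.length);
            blocks.add(String.join("\n", Arrays.copyOfRange(lines, i, end)));
        }
        return blocks;
    }

    // Stand-in tokenizer: splits on whitespace. A real language tokenizer
    // would produce typed tokens and track source positions.
    static List<String> tokenizeBlock(String block) {
        List<String> tokens = new ArrayList<>();
        for (String t : block.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Tokenizes the blocks on a parallel stream. Because the source list is
    // ordered, the collected tokens keep the original block order.
    static List<String> tokenize(String script, int linesPerBlock) {
        return splitIntoBlocks(script, linesPerBlock)
                .parallelStream()
                .map(ParallelBlockTokenizer::tokenizeBlock)
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String script = "let x = 1\nlet y = 2\nprint x y";
        // prints [let, x, =, 1, let, y, =, 2, print, x, y]
        System.out.println(tokenize(script, 2));
    }
}
```

Parallel streams run on the JVM's common fork-join pool, whose default size is derived from the number of available processors, so a sketch like this can spread blocks across cores without explicit thread management. Whether that pays off depends on block size: very small blocks tend to spend more time on scheduling than on tokenizing.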