Different types of tokens used in Generative AI use cases

Token Type	Description	Use Cases	Examples
Word Tokens	Represent individual words in the text	Language modeling, text generation, machine translation	Example: "cat," "dog," "house"
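Subword Tokens	Represent word fragments such as prefixes, stems, and suffixes, typically learned by methods like BPE or WordPiece	Handling rare and out-of-vocabulary words, complex word structures, text generation	Example: "un" (as in "undo"), "happi" (from "happiness")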
Character Tokens	Represent individual characters in the text	Text generation, handwriting recognition, OCR	Example: "A," "b," "7"
Byte Tokens	Treat each byte of the encoded text (typically UTF-8) as a separate token	Text encoding, binary data analysis	Example: 0x5A (binary 01011010, the letter "Z"), 0x1A, 0xFF
Special Tokens	Reserved markers that carry structural information rather than text content	Start/end markers, padding, segment separation	Example: [CLS] (classification token at the start of the input), [SEP] (separator between sequences)
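Position Tokens	Indicate the position of each token in a sequence; most Transformers encode this as positional embeddings added to the token embeddings rather than as separate vocabulary tokens	Transformer models for positional awareness	Example: [POS_1], [POS_2], [POS_3] (illustrative labels)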
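Mask Tokens	Replace selected tokens so that the model must predict them; padding tokens, by contrast, are excluded from computation through the attention mask	Masked language modeling, information retrieval	Example: [MASK] (token to be predicted), [PAD] (padding, ignored)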
Entity Tokens	Represent recognized entities in the text	Named entity recognition (NER), information extraction	Example: [ORG], [PERSON], [LOCATION]
Image Tokens	Represent image patches or extracted visual features, often paired with caption text	Multimodal models, image-text interaction	Example: [IMG_FEATURES], [CAPTION]
Custom Tokens	Domain-specific tokens for specific tasks	Code generation, medical text analysis, specialized NLP	Example: [FUNCTION], [DIAGNOSIS], [LAW]
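
To make the first rows of the table concrete, here is a minimal Python sketch of word, subword, character, and byte tokenization, plus the wrapping of a sequence in special tokens. It is an illustration under stated assumptions, not how any particular model tokenizes: the function names, the three-entry vocabulary, and the greedy longest-match rule are all hypothetical choices made for this example. Real subword tokenizers learn their vocabulary from data.

```python
import re

def word_tokens(text):
    # Words become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def subword_tokens(word, vocab):
    # Greedy longest-match split against a toy vocabulary (an assumption
    # for illustration; real vocabularies are learned from a corpus).
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary entry matches here: give up on the whole word.
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

def char_tokens(text):
    # One token per character.
    return list(text)

def byte_tokens(text):
    # One token per UTF-8 byte, as integers in the range 0..255.
    return list(text.encode("utf-8"))

def add_special_tokens(tokens, max_len):
    # BERT-style framing: [CLS] at the start, [SEP] at the end,
    # then [PAD] up to max_len (no truncation in this sketch).
    seq = ["[CLS]"] + tokens + ["[SEP]"]
    return seq + ["[PAD]"] * (max_len - len(seq))

if __name__ == "__main__":
    toy_vocab = {"un", "happi", "ness"}  # hypothetical vocabulary
    print(word_tokens("The cat sat."))
    print(subword_tokens("unhappiness", toy_vocab))
    print(char_tokens("cat"))
    print(byte_tokens("Z"))
    print(add_special_tokens(word_tokens("The cat sat."), 8))
```

The same input yields very different sequence lengths depending on the scheme: a non-ASCII character such as "é" is one character token but two byte tokens, which is the usual trade-off between vocabulary size and sequence length.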

An Architect's vision