Silly question, but how is the vector-quantization reduction of the encoded floats to quantized ints a representation of music that is more able to be treated like text? In other words, how is a constant stream of ints more text-like than a vector of floats, speaking abstractly? Do you mean because we use ASCII and already represent language on computers as a stream of ints, and that is the natural form, rather than encoding language into tokens? - Matt Jackson
@julian_d_parker 7 months ago
@MattJackson808 Yes, I probably should have explained that distinction better. Text (at least in the context of transformer-based models like LLMs) is usually represented as a stream of integer 'tokens', which are derived from the input text using a tokenizer (usually not with a 1:1 correspondence to characters, but rather to common groups of characters). LLMs learn the categorical probability distribution over this discrete set of tokens given the previous tokens, very much like an advanced autocomplete.

You could do the same with float vectors, but it doesn't usually work as well, because you have to make assumptions about the continuous distribution, which results in a much less expressive model. There are also a bunch of nitty-gritty architectural reasons why integer tokens work well with transformers. So, ignoring the deeper philosophical aspect, the answer is basically: "Ints work better for text in practice".
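To make the float-to-int step concrete, here's a minimal sketch of vector quantization with nearest-neighbour lookup. This is not the codec discussed in the video; the codebook size (8) and vector dimension (4) are made-up toy values, and a real model learns the codebook rather than drawing it at random. The point is just that each continuous encoder vector collapses to a single integer index, giving the discrete token stream a transformer can model categorically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook: 8 "learned" entries of dimension 4
# (in a real VQ model these are trained, not random).
codebook = rng.normal(size=(8, 4))

def quantize(vectors, codebook):
    """Map each float vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance from every input vector to every entry,
    # shape (num_vectors, num_entries).
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # One integer 'token' per input vector.
    return dists.argmin(axis=1)

# A short "stream" of continuous encoder outputs (e.g. audio frames).
frames = rng.normal(size=(5, 4))
tokens = quantize(frames, codebook)
print(tokens)  # five integers in [0, 8), usable like text tokens
```

The model then predicts a categorical distribution over the 8 possible indices at each step, exactly as an LLM predicts over its token vocabulary, instead of having to assume some parametric continuous distribution over the raw float vectors.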