The proposed scheme breaks another genius property of UTF-8: it's self-synchronizing. You can always tell whether a byte is the beginning of a character just by looking at it. This is crucial not only for iterating back and forth through the string, but also for searching for substrings with a simple strstr. You can fix your scheme by filling those ones into the x's of 10xxxxxx bytes. E.g.: 11111111 10111111 10111111 10111111 10110xxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx ...
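To make the self-synchronizing point concrete, here is a minimal Python sketch. The helper names `is_char_start` and `char_start_before` are made up for illustration; the only fact relied on is that UTF-8 continuation bytes always look like 10xxxxxx.

```python
def is_char_start(byte: int) -> bool:
    """True unless the byte is a UTF-8 continuation byte (10xxxxxx)."""
    return (byte & 0xC0) != 0x80


def char_start_before(data: bytes, i: int) -> int:
    """Step back from index i to the first byte of the character containing it."""
    while i > 0 and not is_char_start(data[i]):
        i -= 1
    return i


# One 1-byte, one 2-byte, one 3-byte and one 4-byte character.
text = "a\u00e9\u20ac\U0001F600".encode("utf-8")

# Character boundaries found by looking at each byte in isolation.
starts = [i for i, b in enumerate(text) if is_char_start(b)]

# Landing in the middle of the 4-byte character, we can walk back to its start.
resync = char_start_before(text, 8)

# A plain byte search (the strstr equivalent) can only match on a boundary.
euro_at = text.find("\u20ac".encode("utf-8"))
```

Because lead bytes and continuation bytes never share a bit pattern, a byte-level match of one valid UTF-8 string inside another cannot begin or end in the middle of a character.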
@MatheusAugustoGames 3 years ago
OK, I just want to point out the genius that was the creation of UTF-8. Old computers, if they found a byte with all 8 bits set to 0, would interpret the string as finished. The UTF-8 bit pattern guarantees that will never happen accidentally.
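A quick way to convince yourself of this is a brute-force check in Python. This is only a sketch: `contains_nul` is a made-up helper, and it leans on Python's built-in UTF-8 encoder rather than a hand-written one.

```python
def contains_nul(code_point: int) -> bool:
    """True if the UTF-8 encoding of this code point contains a 0x00 byte."""
    return 0 in chr(code_point).encode("utf-8")


# Surrogates are not encodable as UTF-8, so they are skipped.
SURROGATES = range(0xD800, 0xE000)

# Every code point except U+0000 itself.
offenders = [
    cp
    for cp in range(1, 0x110000)
    if cp not in SURROGATES and contains_nul(cp)
]

# U+0000 is the only code point whose encoding is the zero byte.
nul_encoding = chr(0).encode("utf-8")
```

The list stays empty because every byte of a multi-byte UTF-8 sequence has its top bit set, so the all-zero byte can only ever mean U+0000.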
@lelouchvibritannia69yearsa78 2 years ago
The beginning of a Legendary Game Developer's journey!
@lelouchvibritannia69yearsa78 2 years ago
Ayo I got a heart from the creator! Les gooooooooo
@sarahdehart1027 5 years ago
Lol! That ending was epic! Loved it!
@halftwins 2 years ago
I see a couple of problems with this. The main one is that there's no way to tell whether a byte has just started a character or is preceded by 11111111. Maybe there's something I'm not noticing, but it seems like for it to really last forever an ending sequence of some kind would be needed(?) Anyway, the video was great and early congrats on 1k!
@Magnogen 2 years ago
That's a good point, I was half expecting him to say that if the byte started with 0, then _that_ would be the terminating byte. Something like *1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx* would then be the corresponding utf-infinity code, and ascii would be the base case of just 0xxxxxxx. Backwards compatibility and all. But hey, that's just a thought. I'm not sure how feasible it'd be in practice, as I don't tend to work with memory allocation, but I'd like to know how well it'd work/if it'd work at all.
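For anyone curious how that would look, here is a rough Python sketch of the "every byte starts with 1 except the last" idea. The function names are invented for the example, and putting the most significant 7 bits first is an assumption; the comment doesn't say which order the groups go in.

```python
def encode_vlq(n: int) -> bytes:
    """Encode n as 1xxxxxxx ... 1xxxxxxx 0xxxxxxx, most significant group first."""
    if n < 0:
        raise ValueError("only non-negative values are supported")
    groups = [n & 0x7F]  # final byte: top bit 0 marks the end
    n >>= 7
    while n:
        groups.append(0x80 | (n & 0x7F))  # earlier bytes: top bit 1
        n >>= 7
    return bytes(reversed(groups))


def decode_vlq(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one value starting at pos; return (value, position after it)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # top bit 0: this was the terminating byte
            return value, pos


ascii_a = encode_vlq(ord("A"))    # stays a single ASCII byte
two_bytes = encode_vlq(128)       # first value that needs a second byte
emoji = encode_vlq(0x1F600)
```

It does work and ASCII stays a single byte, but there are trade-offs: the terminating byte of a long character looks exactly like an ASCII byte, and 128 encodes as 0x81 0x00, so a zero byte can show up inside a character.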
@BGBTech 2 years ago
@Magnogen That scheme is actually used for encoding numbers in some file formats. One other scheme I had used in some of my formats is: 0xxxxxxx (0-127), 10xxxxxx-xxxxxxxx (0-16K), 110xxxxx-xxxxxxxx-xxxxxxxx (0-2M), ... A lot depends on what properties one wants. There are also various ways these schemes can be extended for signed numbers, to encode variable-length floating point values, ... OTOH, while UTF-8 doesn't have the most efficient representation, it does allow re-synchronizing, and in a few odd cases non-standard encodings are possible (for example, I had used "transposed UTF-8" values in string tables to encode string length prefixes), noting that it is possible to unambiguously differentiate between normally coded and transposed encodings (and in some cases, it might be preferable to have some way to encode an explicit string length, without needing to count characters until the NUL byte).
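As an illustration of the length-prefix scheme listed above (0xxxxxxx, 10xxxxxx-xxxxxxxx, 110xxxxx-xxxxxxxx-xxxxxxxx, ...), here is a hedged Python sketch. The names are hypothetical, the big-endian payload order and the cut-off at seven extra bytes are assumptions, and the actual formats mentioned may differ in these details.

```python
def encode_prefixed(n: int) -> bytes:
    """Lead byte has `extra` leading 1 bits then a 0; `extra` full bytes follow."""
    if n < 0:
        raise ValueError("only non-negative values are supported")
    for extra in range(8):
        payload_bits = 7 + 7 * extra  # 7, 14, 21, ... bits of payload
        if n < (1 << payload_bits):
            prefix = (0xFF << (8 - extra)) & 0xFF  # 0x00, 0x80, 0xC0, ...
            body = n.to_bytes(extra + 1, "big")
            return bytes([prefix | body[0]]) + body[1:]
    raise ValueError("value too large for this sketch")


def decode_prefixed(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one value starting at pos; return (value, position after it)."""
    first = data[pos]
    extra = 0
    while extra < 8 and first & (0x80 >> extra):  # count leading 1 bits
        extra += 1
    if extra == 8:
        raise ValueError("lead byte 0xFF is not used in this sketch")
    payload = first & (0x7F >> extra)  # bits after the terminating 0
    rest = data[pos + 1 : pos + 1 + extra]
    return int.from_bytes(bytes([payload]) + rest, "big"), pos + 1 + extra


one_byte = encode_prefixed(127)
two_bytes = encode_prefixed(128)
three_bytes = encode_prefixed(16384)
```

Compared with UTF-8, the following bytes carry 8 payload bits instead of 6, which is where the extra efficiency comes from, and also why a reader that lands mid-value cannot re-synchronize.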
@luca__3044 2 years ago
Can't wait to express my feelings in a 420-bit alien language!
@PC_YouTube_Channel 2 years ago
lmao amazing ending. your channel really gives off some Tom 7 vibes.
@sullivanbarnett6904 5 years ago
Thank you Jacob!
@TimJSwan 2 years ago
lol 256 bits enough? more than all the Planck lengths in the universe represented...
@bored_person 2 years ago
Patents expire after 20 years.
@robloxxer593 2 years ago
Wait, why tf are they adding four entire 1's? Two characters already had 4 combinations, and wouldn't you know when it ends from the bits that told you how long it is? What's the point of the bits at the front of the byte?
@decare696 2 years ago
it's so that a byte that's in the middle of some character can't be mistaken for a correct ascii byte by old or bad/lazy software