Expanding the UTF-8 Character Set to Infinity

4,042 views

Mashpoe

5 years ago

Comments: 16

@ybungalobill 2 years ago

The proposed scheme breaks another genius property of UTF-8: it is self-synchronizing. You can always tell whether a byte is the beginning of a character just by looking at it. This is crucial not only for iterating back and forth through a string, but also for being able to search for substrings with a simple strstr. You can fix your scheme by moving those extra ones into the x bits of the 10xxxxxx bytes. E.g.: 11111111 10111111 10111111 10111111 10110xxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx ...
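A minimal sketch of the self-synchronizing property described above, for standard UTF-8 only (not the extended scheme from the video). The helper names `is_continuation` and `char_start` are made up for illustration. Starting from any byte offset, stepping back over `10xxxxxx` bytes finds the start of the enclosing character.

```python
def is_continuation(byte: int) -> bool:
    # Continuation bytes always match the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, index: int) -> int:
    # Walk backwards until we reach a byte that is not a continuation byte.
    # In valid UTF-8 that byte is the first byte of the enclosing character.
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


# "a" takes 1 byte, "é" 2 bytes, "€" 3 bytes, "😀" 4 bytes.
text = "aé€😀".encode("utf-8")

# Offset 5 is the last byte of "€", which starts at offset 3.
euro_start = char_start(text, 5)

# A plain byte-level substring search can only match at a character start.
found_at = text.find("€".encode("utf-8"))
```

Here both `euro_start` and `found_at` come out as 3, the offset where "€" begins.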

@MatheusAugustoGames 3 years ago

OK, I just want to point out the genius that was the creation of UTF-8. Older software, on finding a byte with all 8 bits set to 0, would treat the string as finished. The UTF-8 bit patterns guarantee that this will never happen accidentally.
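A small illustrative check of the point above, using Python's built-in codecs. The helper name `has_embedded_nul` and the sample strings are assumptions chosen for the example. In UTF-8 a zero byte appears only when the text contains U+0000 itself, whereas a 16-bit encoding such as UTF-16 produces zero bytes for ordinary characters.

```python
def has_embedded_nul(text: str) -> bool:
    # True if the UTF-8 encoding of `text` contains a zero byte.
    return 0 in text.encode("utf-8")


# Sample characters that need 1, 2, 3 and 4 bytes in UTF-8.
samples = ["hello", "é", "€", "😀", "\u0100"]
utf8_clean = [not has_embedded_nul(s) for s in samples]

# Every byte of a multi-byte UTF-8 character has its top bit set,
# so none of them can be zero.
all_high = all(b >= 0x80 for b in "€".encode("utf-8"))

# Contrast: U+0100 in big-endian UTF-16 is the byte pair 0x01 0x00,
# which a NUL-terminated string routine would cut short.
utf16_has_nul = 0 in "\u0100".encode("utf-16-be")
```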

@lelouchvibritannia69yearsa78 2 years ago

The beginning of a Legendary Game Developer's journey!

@lelouchvibritannia69yearsa78 2 years ago

Ayo I got a heart from the creator! Les gooooooooo

@sarahdehart1027 5 years ago

Lol! That ending was epic! Loved it!

@halftwins 2 years ago

I see a couple problems with this, mainly for example, not having clarification on if a character has just started with a byte or is preceded by 11111111. Maybe there's something I'm not noticing, but it seems like for it to really last forever an ending sequence of some kind would be needed(?) Anyway, the video was great and early congrats on 1k!

@Magnogen 2 years ago

That's a good point, I was half expecting him to say that if the byte started with 0, then _that_ would be the terminating byte. Something like *1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx* would then be the corresponding utf-infinity code, and ascii would be the base case of just 0xxxxxxx. Backwards compatibility and all. But hey, that's just a thought. I'm not sure how feasible it'd be in practice, as I don't tend to work with memory allocation, but I'd like to know how well it'd work/if it'd work at all.
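A rough sketch of the scheme suggested above, as one possible reading of it: every byte carries 7 payload bits, most significant group first, and only the last byte has its top bit clear. The function names `encode_vlq` and `decode_vlq` are hypothetical, and this is not UTF-8 or any standardized text encoding.

```python
from typing import Tuple


def encode_vlq(value: int) -> bytes:
    # Split the value into 7-bit groups. The final byte is 0xxxxxxx,
    # all earlier bytes are 1xxxxxxx.
    if value < 0:
        raise ValueError("value must be non-negative")
    groups = [value & 0x7F]  # terminating byte, top bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # more bytes follow
        value >>= 7
    return bytes(reversed(groups))


def decode_vlq(data: bytes, offset: int = 0) -> Tuple[int, int]:
    # Returns (decoded value, offset of the next unread byte).
    value = 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, offset


ascii_a = encode_vlq(0x41)          # a single byte, identical to ASCII "A"
big = encode_vlq(2**100)            # no fixed upper limit on the value
round_trip, next_offset = decode_vlq(big)
```

One trade-off that seems worth noting: the terminating `0xxxxxxx` byte of a long character looks exactly like a standalone ASCII byte, so a byte-level search for an ASCII character could match inside a longer one.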

@BGBTech 2 years ago

@Magnogen That scheme is actually used for encoding numbers in some file formats. One other scheme I had used in some of my formats is: 0xxxxxxx (0-127), 10xxxxxx-xxxxxxxx (0-16K), 110xxxxx-xxxxxxxx-xxxxxxxx (0-2M), ... A lot depends on what properties one wants. There are also various ways these schemes can be extended for signed numbers, to encode variable-length floating point values, ... OTOH, while UTF-8 doesn't have the most efficient representation, it does allow re-synchronizing, and in a few odd cases non-standard encodings are possible (for example, I had used "transposed UTF-8" values in string tables to encode string length prefixes), noting that it is possible to unambiguously differentiate between normally coded and transposed encodings (and in some cases, it might be preferable to have some way to encode an explicit string length, without needing to count characters until the NUL byte).
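A hedged sketch of the prefix-length number encoding listed above, covering only the three forms the comment spells out (7, 14 and 21 payload bits). How the pattern continues past the "..." is not stated, so larger values are rejected here rather than guessed. The names `encode_prefix` and `decode_prefix` are made up for this example.

```python
from typing import Tuple


def encode_prefix(value: int) -> bytes:
    # The leading bits of the first byte give the total length;
    # all remaining bits in all bytes are payload.
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 1 << 7:    # 0xxxxxxx
        return bytes([value])
    if value < 1 << 14:   # 10xxxxxx xxxxxxxx
        return bytes([0x80 | (value >> 8), value & 0xFF])
    if value < 1 << 21:   # 110xxxxx xxxxxxxx xxxxxxxx
        return bytes([0xC0 | (value >> 16), (value >> 8) & 0xFF, value & 0xFF])
    raise ValueError("longer forms are not covered by this sketch")


def decode_prefix(data: bytes, offset: int = 0) -> Tuple[int, int]:
    # Returns (decoded value, offset of the next unread byte).
    first = data[offset]
    if first < 0x80:
        return first, offset + 1
    if first < 0xC0:
        return ((first & 0x3F) << 8) | data[offset + 1], offset + 2
    if first < 0xE0:
        value = ((first & 0x1F) << 16) | (data[offset + 1] << 8) | data[offset + 2]
        return value, offset + 3
    raise ValueError("longer forms are not covered by this sketch")


lengths = [len(encode_prefix(n)) for n in (127, 128, 16383, 16384, 2097151)]
```

Compared with UTF-8, the trailing bytes here use all 8 bits for payload, which is more compact but means a trailing byte cannot be told apart from a first byte, so a reader that lands mid-value cannot re-synchronize.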

@luca__3044 2 years ago

Can't wait to express my feelings in a 420-bit alien language!

@PC_YouTube_Channel 2 years ago

lmao amazing ending. your channel really gives off some Tom 7 vibes.

@sullivanbarnett6904 5 years ago

Thank you jacob!

@TimJSwan 2 years ago

lol 256 bits enough? more than all the Planck lengths in the universe represented...

@bored_person 2 years ago

Patents expire after 20 years.

@robloxxer593 2 years ago

Wait, why tf are they adding four entire 1's? Two characters already had 4 combinations, and wouldn't you know when it ends from the bits that told you how long it is? What's the point of the bits in the front of the byte?

@decare696 2 years ago

it's so that a byte that's in the middle of some character can't be mistaken for a correct ascii byte by old or bad/lazy software
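To make the answer above concrete, here is a small sketch that sorts bytes by their leading bits, the way standard UTF-8 lays them out. The function name `classify_byte` and its labels are invented for this example, and it looks only at the bit pattern, so it does not reject the few lead-byte values that real UTF-8 forbids.

```python
def classify_byte(byte: int) -> str:
    # Classify a byte purely by its leading bits.
    if byte < 0x80:
        return "ascii"         # 0xxxxxxx
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx
    if byte < 0xE0:
        return "lead-2"        # 110xxxxx
    if byte < 0xF0:
        return "lead-3"        # 1110xxxx
    if byte < 0xF8:
        return "lead-4"        # 11110xxx
    return "invalid"           # 11111xxx is unused in standard UTF-8


# "€" is the three bytes E2 82 AC; none of them falls in the ASCII range,
# so ASCII-only software cannot mistake part of it for a letter or a slash.
kinds = [classify_byte(b) for b in "a€".encode("utf-8")]
```

For this input `kinds` is `["ascii", "lead-3", "continuation", "continuation"]`.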

@robloxxer593 2 years ago

@decare696 stupid lazy old software