Codepoints java offset

6/17/2023

But by graphemes you mean a spoken-language-specific sequence that has to be decoded, right? I don't know about Swift's one.

About graphemes, I must say I don't have much idea of what they are; maybe that's because English isn't my primary language. And even if there were a mutable String type, the StringIndex is an assumption.

And for memory-efficiency reasons, StringIndex only contains 2 fields (the ones I mentioned earlier, default and utf8), and then the programmer relies on assumptions. About mutable strings, my language doesn't have these.

But in my language the assumption would be that a StringIndex is used on the same string it was extracted from, so the only check would be decoding the code point at the UTF-8 index.

Unless you're specifically implementing a Unicode Algorithm and thus in that case hopefully understand the subtleties as a prerequisite. The main difference is I'd consider making a higher-level and more opaque Text type my default type, which treats text more like a localization system would: mostly opaque, with only a limited amount of manipulation possible without dropping down an abstraction level. If designing a new stdlib from scratch, I'd crib Rust's String practically wholesale (though maybe rename it StrBuf). It just unfortunately wasn't a bet that paid off, and UCS-2 isn't a thing that formally exists anymore.
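To make the "same string" assumption concrete, here is a minimal Rust sketch of an index pair along the lines described in the thread. The struct layout and the helper names `index_of` and `char_at` are hypothetical; only the field names `default` and `utf8` come from the post. It is an illustration under those assumptions, not the poster's actual design.

```rust
/// Hypothetical index pair, loosely modeled on the post's description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct StringIndex {
    default: usize, // position counted in code points
    utf8: usize,    // position counted in UTF-8 octets
}

/// Build an index for the `cp`-th code point of `s` (linear scan).
/// Returns `None` if `s` has `cp` or fewer code points.
fn index_of(s: &str, cp: usize) -> Option<StringIndex> {
    s.char_indices()
        .nth(cp)
        .map(|(utf8, _)| StringIndex { default: cp, utf8 })
}

/// Decode the code point at `idx.utf8`; only the octet offset is used.
/// Returns `None` if the offset is past the end, at the end, or not on a
/// char boundary, which can happen when the index came from another string.
fn char_at(s: &str, idx: StringIndex) -> Option<char> {
    s.get(idx.utf8..)?.chars().next()
}
```

Note that the boundary check only catches some misuse: an index taken from one string can land on a valid boundary in a different string and silently decode an unrelated code point, which is the concern raised in the reply.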
This isn't to put down other stdlib String types, though - just as an example, Java's String class API was designed when people legitimately thought UCS-2 was going to be the final form of Unicode. Rust's String API is the best design I've personally seen which doesn't try to artificially simplify text by pretending some of the complexities don't exist. It's very unfortunate that string manipulation is so commonly used as a beginner task, because text manipulation is anything but.

But papering over UTF-8 and presenting a -like view doesn't remove that - Unicode is still a variable-length encoding at the codepoint level because of grapheme clusters. The cost is that a developer who isn't just treating indices as opaque needs to understand that String uses a variable-length encoding. You can still count/search/slice by codepoints (though doing so is extremely rarely what you want) even if it takes a bit more effort to do so. You can treat indices as opaque and get fast reindexing just fine.

Bytewise indexing offers the full benefit without any tangible costs. The main problem with any sort of opaque index is twofold: a) indexing still has to be checked anyway, since there's no guarantee that an index is used on the same string that created it (and you don't really want that requirement anyway, as doing something at the same position in multiple strings is a reasonably common example string manipulation task) and b) after a string is changed, that changes what indices are "good" as the string representation shifts to accommodate the edit. (It's just a memcpy, but it still invalidates "known good" indices.)

So if there's any part of the OP that's about the design of your language, that's not going to be touched on here.

For comparison purposes, Swift also actually has a separate index type (String.Index) for its strings, which helps manage the fact that Swift provides a relatively high-level view of strings as a sequence of graphemes as part of the language.
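As a small illustration of treating byte offsets as opaque indices, the Rust sketch below takes the offset returned by `str::find` and uses it directly for slicing, then shows that counting code points is not the same as counting user-perceived characters. The helper names `split_at_first` and `code_points` are made up for this example; the standard library's `str::split_once` already covers the first case.

```rust
/// Split `s` around the first occurrence of `needle`, using the byte
/// offset returned by `str::find` as an opaque index.
fn split_at_first<'a>(s: &'a str, needle: &str) -> Option<(&'a str, &'a str)> {
    let at = s.find(needle)?; // byte offset of the match, on a char boundary
    Some((&s[..at], &s[at + needle.len()..]))
}

/// Count code points. This is not the number of user-perceived characters:
/// "e" followed by U+0301 (combining acute accent) is one grapheme cluster
/// but two code points.
fn code_points(s: &str) -> usize {
    s.chars().count()
}
```

No code point counting is needed for the split: the offset is never interpreted, only passed back to the same string. For arbitrary offsets, `str::get` returns `None` instead of panicking when the offset is not on a char boundary.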
This forum's purpose is the discussion of development and implementation of the Rust programming language, so other languages are on-topic only insomuch as they serve as case studies in the choices others have made and the outcomes of those choices. Have you considered something similar in Rust? What if it were part of the standard library?

It can be handy when the string is short, or for prototyping purposes. This is flexible in that you can manipulate Code Points directly.

StringIndex contains these two fields for different purposes, but utf8 is always used for indexing. So in my language one will be able to do '\u, where default is the index in Code Points, and utf8 is the index in UTF-8 octets. In my language (which is not yet buildable, sorry!) I thought of the approach of indexing with either Code Point indexes (simply integers) or immutable StringIndex objects, while maintaining a UTF-8 encoding. The nearest to slicing with Code Point positions is something like s.chars().collect::>().to_string(). The str type can only be directly sliced with octet positions (s, for example).
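For reference, slicing by Code Point positions can be done in Rust without collecting into an intermediate vector, by translating each position into an octet offset with `char_indices` first. The following is a minimal sketch; `byte_offset` and `slice_code_points` are hypothetical helper names, not standard library functions, and each call is a linear scan of the string.

```rust
/// Convert a code point position into a UTF-8 byte offset.
/// Returns `None` if `cp_index` is past the end of the string.
fn byte_offset(s: &str, cp_index: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte_pos, _)| byte_pos)
        .chain(std::iter::once(s.len())) // one-past-the-end is a valid slice bound
        .nth(cp_index)
}

/// Slice `s` by code point positions `start..end` without allocating.
/// Returns `None` if the range is reversed or out of bounds.
fn slice_code_points(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    let from = byte_offset(s, start)?;
    let to = byte_offset(s, end)?;
    Some(&s[from..to])
}
```

Because both offsets come from `char_indices` (or are `s.len()`), the final `&s[from..to]` always falls on char boundaries and cannot panic.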
