Dear all,
Determining the language - Unicode
There is a string (text box) that can be filled in using one of three languages (two Unicode languages and English), and the value is stored in a CSPro database. Is there any method to determine the language, for example by getting each Unicode character and comparing it with a reference for that language, so that I can tell which language the string was written in?
Thank you
Disala
-
- Posts: 238
- Joined: November 21st, 2022, 4:41 pm
Re: Determining the language - Unicode
Hello,
The CSPro Unicode Primer focuses mainly on UTF-8 encoding of files rather than of text box contents. If the text includes a BOM (byte order mark), you can take a substring of the first characters of the string to check for it, which tells you which encoding is being used. If you are not using a BOM, I recommend having the user select the language in a field or application setting that you can check. Automatically guessing the language from the characters is a hard problem, because languages share characters that must be interpreted in context, and it would likely have to rely on resources outside of CSPro, such as machine learning APIs.
Hope this helps,
Justin
-
- Posts: 1883
- Joined: December 5th, 2011, 11:27 pm
- Location: Washington, DC
Re: Determining the language - Unicode
I understand your question differently from how Justin interpreted it. First, there is no such thing as "Unicode language." Unicode is a system for encoding characters, including English characters. The characters used by different languages are defined in different ranges, as you can see here:
https://en.wikipedia.org/wiki/List_of_U ... characters
Current CSPro functionality does not let you get the numerical value for a single character, which would let you determine what Unicode range it falls in, and thus determine, approximately, what language is being used.
However, you can use string comparisons to see if a character falls within a range. For example, that Wikipedia article shows the lowest Armenian character as "Ա" and the highest as "֏". This would check if a string uses Armenian characters:
function StringContainsArmenianCharacters(string text)
    // look at each character of the string in turn
    do numeric ctr = 1 while ctr <= length(text)
        string this_character = text[ctr:1];
        // compare against the lowest and highest Armenian characters
        if this_character >= "Ա" and this_character <= "֏" then
            exit true;
        endif;
    enddo;
    exit false;
end;
// ...
errmsg("%d", StringContainsArmenianCharacters("English text")); // false
errmsg("%d", StringContainsArmenianCharacters("Հայերեն տեքստ")); // true
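For comparison, here is a minimal sketch of the numeric-range idea described above, written in Python rather than CSPro logic, since it relies on reading a character's code point. The names `SCRIPT_RANGES`, `script_of` and `guess_script` are hypothetical. The table holds only English letters and the Armenian bounds from the example, because the thread does not name the other languages, so their ranges would need to be added. It guesses the script, which only approximates the language.

```python
from collections import Counter

# Inclusive character ranges per script. Assumption: only English letters and
# the Armenian bounds used in the example above; add other scripts as needed.
SCRIPT_RANGES = {
    "English": [("A", "Z"), ("a", "z")],
    "Armenian": [("Ա", "֏")],
}


def script_of(ch):
    """Return the script whose range contains ch, or None if no range matches."""
    code = ord(ch)  # numeric Unicode code point of the character
    for script, ranges in SCRIPT_RANGES.items():
        for low, high in ranges:
            if ord(low) <= code <= ord(high):
                return script
    return None


def guess_script(text):
    """Return the script with the most matching characters, or None."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(guess_script("English text"))
print(guess_script("Հայերեն տեքստ"))
```

Counting characters per script, instead of stopping at the first match, is one way to cope with mixed input such as digits, punctuation, or a few borrowed Latin letters.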
Re: Determining the language - Unicode
This is what I meant. Thank you all for replying.
Disala
Gregory Martin wrote: July 5th, 2024, 12:26 pm I understand your question differently from how Justin interpreted it. [...]