Supported Text Types
Out of the box, we include support for the text types described in the sections below.
Developers can add custom text types by contributing to the org.eclipse.equinox.bidi.bidiTypes
extension point.
Unless specified otherwise, we assume that the relative progression of the tokens and separators for display
should always be from left to right, while the text of each token will go LTR or
RTL depending on its content and according to the Unicode Bidirectional Algorithm.
(In the examples, "@" represents an LRM, "&" represents an RLM.)
Comma-delimited List (comma)
Pattern
[first list item] , [second list item] , . . . , [last list item]
Detailed Design
The general algorithm applies, with the
following adaptations:
- There is only one separator, the comma (,).
- This design can easily be adapted to accommodate a different separator,
such as a semicolon (;) or a tab character.
Example:
Logical order (without LRM): ABC,DE,FGH
Display (without LRM): HGF,ED,CBA
Logical order (with LRM): ABC@,DE@,FGH
Display (with LRM): CBA,ED,HGF
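The example above can be reproduced with a short Java sketch. It is only an illustration under stated assumptions, not the implementation behind the extension point: it assumes that an LRM is needed before a separator only when the nearest strong characters on both sides are RTL, and it ignores digits and supplementary characters. The names SeparatorMarks and insertMarks are hypothetical. The separators are passed as a parameter, so a semicolon or a tab can replace the comma.

```java
final class SeparatorMarks {
    private static final char LRM = '\u200E';

    /** True for strong RTL characters (directionality classes R and AL). */
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /** True for strong LTR characters (directionality class L). */
    static boolean isLtr(char c) {
        return Character.getDirectionality(c) == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    /** True if the first strong character at or after index 'from' is RTL. */
    static boolean nextStrongIsRtl(String text, int from) {
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isRtl(c)) return true;
            if (isLtr(c)) return false;
        }
        return false;
    }

    /** Inserts an LRM before each separator that sits between two RTL runs. */
    static String insertMarks(String text, String separators) {
        StringBuilder out = new StringBuilder();
        boolean lastStrongIsRtl = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (separators.indexOf(c) >= 0) {
                if (lastStrongIsRtl && nextStrongIsRtl(text, i + 1)) {
                    out.append(LRM);
                    lastStrongIsRtl = false; // the inserted mark is itself strong LTR
                }
            } else if (isRtl(c)) {
                lastStrongIsRtl = true;
            } else if (isLtr(c)) {
                lastStrongIsRtl = false;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hebrew letters stand in for the uppercase letters of the example;
        // inserted marks are printed as '@', following the convention of this document.
        String list = "\u05D0\u05D1\u05D2,\u05D3\u05D4,\u05D5\u05D6\u05D7";
        System.out.println(insertMarks(list, ",").replace(LRM, '@'));
    }
}
```

Tracking the last strong character in a single pass means that a run of several consecutive separators receives at most one mark, placed before the first of them.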
Name or Path of File or Directory
Patterns
Windows full path: [drive letter]:\ [sub-path] \ . . . \ [sub-path]
Windows relative path: [sub-path] \ . . . \ [sub-path]
Windows full file path: [drive letter]:\ [sub-path] \ . . . \ [sub-path] \ [file name] . [extension]
Windows relative file path: [sub-path] \ . . . \ [sub-path] \ [file name] . [extension]
Linux full path: / [sub-path] / . . . / [sub-path]
Linux relative path: [sub-path] / . . . / [sub-path]
Linux full file path: / [sub-path] / . . . / [sub-path] / [file name] . [extension]
Linux relative file path: [sub-path] / . . . / [sub-path] / [file name] . [extension]
Detailed Design
The general algorithm applies, with the
following adaptation:
- The separators are colon (:), backslash (\) and full stop (.) for
Windows, slash (/) and full stop (.) for Linux.
Example:
Logical order (without LRM): c:\DIR1\DIR2\MYFILE.ext
Display (without LRM): c:\ELIFYM\2RID\1RID.ext
Logical order (with LRM): c:\DIR1@\DIR2@\MYFILE.ext
Display (with LRM): c:\1RID\2RID\ELIFYM.ext
Java Code
Requirement
We can classify elements of a Java program as:
- white space
- operators
- String literals: they start with a double quote and end with a double
quote which is not escaped (not preceded by a backslash).
- comments: they start with /* and end with */ or start with // and end at
the end of the line.
- tokens: anything delimited by the previous items.
The requirement is to make the relative order of elements left-to-right,
while each element by itself will be presented according to the Unicode Bidirectional Algorithm.
Detailed Design
The general algorithm applies, with the
following adaptations:
- Each String literal or comment is considered as one token.
- The separators are all the characters used as operators and separators
in the Java language: plus (+), minus (-), asterisk (*), slash (/), percent
(%), less-than (<), greater-than (>), ampersand (&), vertical bar (|),
circumflex (^), tilde (~), left and right parentheses ( ( ) ), left and
right square brackets ([ ]), left and right curly brackets ( { } ), comma
(,), full stop (.), semicolon (;), exclamation mark (!), question mark (?),
colon (:), spaces which are not part of a String literal or a comment.
- If a String literal or a comment includes LRE or RLE characters but does
not include the proper number of matching PDF characters, the missing PDF
characters must be added at the end of the literal or comment.
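The last adaptation can be sketched as a small Java helper. This is a minimal sketch under assumptions: the names EmbeddingBalancer and closeEmbeddings are hypothetical, the missing PDF characters are appended after the last character of the token (closing quote or comment terminator included), and only LRE and RLE are counted as openers, as in the rule above.

```java
final class EmbeddingBalancer {
    private static final char LRE = '\u202A'; // LEFT-TO-RIGHT EMBEDDING
    private static final char RLE = '\u202B'; // RIGHT-TO-LEFT EMBEDDING
    private static final char PDF = '\u202C'; // POP DIRECTIONAL FORMATTING

    /**
     * Returns the token (a String literal or a comment) followed by as many
     * PDF characters as there are LRE or RLE characters left unclosed.
     */
    static String closeEmbeddings(String token) {
        int open = 0;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == LRE || c == RLE) {
                open++;
            } else if (c == PDF && open > 0) {
                open--; // a PDF with no pending opener is ignored
            }
        }
        StringBuilder out = new StringBuilder(token);
        for (int i = 0; i < open; i++) {
            out.append(PDF);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String comment = "/* " + RLE + "text */";
        // One embedding is left open, so exactly one PDF is appended: prints 1.
        System.out.println(closeEmbeddings(comment).length() - comment.length());
    }
}
```

Closing the embeddings inside each token keeps an unterminated LRE or RLE from affecting the display of the code that follows the literal or comment.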
Example:
Logical order (without LRM): A = /*B+C*/ D;
Display (without LRM): D /*C+B*/ = A;
Logical order (with LRM): A@ = /*B+C@*/ D;
Display (with LRM): A = /*C+B*/ D;
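Treating each String literal or comment as one token first requires locating them. The following Java sketch shows one way to do this; the names JavaSpanScanner and literalsAndComments are hypothetical, and the scan is deliberately simplified (for instance, it does not recognize character literals such as '"').

```java
import java.util.ArrayList;
import java.util.List;

final class JavaSpanScanner {
    /** Returns the String literals and comments found in the source, in order. */
    static List<String> literalsAndComments(String src) {
        List<String> spans = new ArrayList<>();
        int n = src.length();
        int i = 0;
        while (i < n) {
            int end;
            if (src.startsWith("/*", i)) {
                // Block comment: ends after the next "*/", or at end of input.
                int close = src.indexOf("*/", i + 2);
                end = close < 0 ? n : close + 2;
            } else if (src.startsWith("//", i)) {
                // Line comment: ends at the end of the line.
                int eol = src.indexOf('\n', i);
                end = eol < 0 ? n : eol;
            } else if (src.charAt(i) == '"') {
                // String literal: ends at the next double quote that is not escaped.
                end = i + 1;
                while (end < n && src.charAt(end) != '"') {
                    end += (src.charAt(end) == '\\') ? 2 : 1;
                }
                end = Math.min(end + 1, n);
            } else {
                i++;
                continue;
            }
            spans.add(src.substring(i, end));
            i = end;
        }
        return spans;
    }

    public static void main(String[] args) {
        // Prints the comment of the example above as a single span.
        System.out.println(literalsAndComments("A = /*B+C*/ D;"));
    }
}
```

Everything outside the returned spans can then be split on the operator and separator characters listed above.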
Regular Expression
Requirement
Preserve the relative order of the regular expression components identical to
the order in which they appear when exclusively Latin characters are used.
Detailed Design
The general algorithm applies, with the
following adaptations:
- Regular expressions consist of operators, pattern characters, and
(in most implementations of extended syntax) named identifiers.
- Since the syntax of regular expressions is not standardized, the
list of operators should be adapted to the specific implementation at hand.
- Common operators include: question mark (?), circumflex (^), dollar ($), plus
(+), minus (-), asterisk (*), vertical bar (|), tilde (~), left and right
parentheses ( ( ) ), left and right square brackets ([ ]), left and right
curly brackets ( { } ), commercial at (@), number sign (#), ampersand (&),
backslash (\).
- The separators will be the characters used as operators for regular
expressions.
- Characters which are not operators are pattern characters.
If an operator is immediately preceded by a backslash, both the
backslash and the operator must be handled as pattern characters.
- Each pattern character is a separate token, so pattern characters
will always be ordered according to the base text direction of the
expression.
- Identifiers appear in certain syntactic constructs, and are treated as
tokens. For example, the strings "digit" and "number" in the expression
"total: (?<number>[:digit:]+)\s" are identifiers, whereas "total" is just
a sequence of 5 pattern characters.
- The following constructs must be recognized as delimiting tokens
(note: this list should be adapted to the specific syntax of regular
expressions in a given environment):
(?<name>
(?'name'
(?(<name>)
(?('name')
(?(name)
(?&name)
(?P<name>
\k<name>
\k'name'
\k{name}
(?P=name)
\g{name}
\g<name>
\g'name'
(?(R&name)
[:class:]
- Comments of the form (?# . . . ) must be handled as individual tokens.
- Quoted sequences of the form \Q . . . \E must be handled as individual tokens.
- Numbers used as quantifiers (numbers of occurrences) or as group references
must be handled as individual tokens.
- If the first strong directional character in a regular expression is an
Arabic letter, the base direction of the expression must be RTL.
- If the first strong directional character in a regular expression is a
Hebrew letter or an LTR letter, the base direction of the expression must be
LTR.
- If the regular expression contains no strong directional character, its
base direction must be LTR for Hebrew users. For Arabic users, its base
direction should follow the user interface direction (RTL if mirrored, LTR otherwise).
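The three base direction rules above can be sketched in Java as follows. This is an illustration under assumptions: the names RegexBaseDirection, baseDirection, arabicUser and mirroredUi are hypothetical, "Arabic letter" is approximated by the Unicode directionality class AL and "Hebrew letter" by the class R (both classes also cover a few other scripts), and users who are neither Hebrew nor Arabic are assumed to get LTR when no strong character is present.

```java
final class RegexBaseDirection {
    enum Direction { LTR, RTL }

    /**
     * Determines the base direction of a regular expression from its first
     * strong directional character, falling back on the user's context.
     */
    static Direction baseDirection(String regex, boolean arabicUser, boolean mirroredUi) {
        for (int i = 0; i < regex.length(); ) {
            int cp = regex.codePointAt(i);
            byte d = Character.getDirectionality(cp);
            if (d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return Direction.RTL; // first strong character is Arabic-type
            }
            if (d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || d == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                return Direction.LTR; // first strong character is Hebrew-type or LTR
            }
            i += Character.charCount(cp);
        }
        // No strong character: only Arabic users with a mirrored UI get RTL.
        return (arabicUser && mirroredUi) ? Direction.RTL : Direction.LTR;
    }

    public static void main(String[] args) {
        System.out.println(baseDirection("\u0627\u0628(", false, false));
        System.out.println(baseDirection("[0-9]+", true, true));
    }
}
```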
Example (Hebrew):
Logical order (without LRM): ABC(?'DEF'GHI
Display (without LRM): IHG'FED'?(CBA
Logical order (with LRM): A@B@C@(?'DEF'@G@H@I
Display (with LRM): ABC(?'FED'GHI
Example (Arabic):
Logical order (without LRM): ABC(?'DEF'GHI
Display (without LRM): IHG'FED'?(CBA
Logical order (with LRM): ABC(?'DEF'GHI
Display (with LRM): IHG'FED'?(CBA
Compound Name with Underscores
Pattern
[first part] _
[second part] _
[third part]
Note: name parts must not include underscores.
Detailed Design
The general algorithm applies, with the
following adaptation:
- There is only one separator, the underscore (_).
Example:
Logical order (without LRM): MYPACKAGE_MYPROGRAM
Display (without LRM): MARGORPYM_EGAKCAPYM
Logical order (with LRM): MYPACKAGE@_MYPROGRAM
Display (with LRM): EGAKCAPYM_MARGORPYM
URL, URI, IRI
Patterns
http:// [domain label] . . . . . [domain label]
http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension]
http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension] # [local reference]
http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension] ? [key1] = [value1] & [key2] = [value2]
Detailed Design
The general algorithm applies, with the
following adaptations:
- The detailed syntax of URLs, URIs and IRIs is described in RFC 3986 and
RFC 3987. A rigorous analysis to identify tokens and separators is not simple.
- For most practical cases, it is sufficient to consider the following
separators: colon (:), question mark (?), number sign (#), slash (/),
commercial at (@), full stop (.), left bracket ([), right bracket (]).
Example:
Logical order (without LRM): www.DOC.MYDOMAIN.com/HEB/LESSON1.html
Display (without LRM): www.NIAMODYM.COD.com/1NOSSEL/BEH.html
Logical order (with LRM): www.DOC@.MYDOMAIN.com/HEB@/LESSON1.html
Display (with LRM): www.COD.NIAMODYM.com/BEH/1NOSSEL.html
XPath
Patterns
/ book / chapter / paragraph
/ year / month [@name = "April"]
Detailed Design
The general algorithm applies, with the
following adaptations:
- Strings:
  - Strings are started by a quotation mark which can be a
    double-quote (") or an apostrophe ('), and are closed by the same
    character.
  - Double-quotes may appear within a string limited by apostrophes and
    vice versa, and must be handled as characters internal to the string.
  - A string started on one line is not necessarily closed on the same
    line.
- Whitespace (e.g. blanks and tab characters) appearing
outside of strings constitutes a delimiter for tokens.
- Each occurrence of a string must be handled as one token.
- After isolating strings, the following characters are separators: white
space, slash (/), square brackets ( [ and ] ), less-than (<), greater-than
(>), equal sign (=), exclamation mark (!), colon (:), at sign (@), period
(.), vertical bar (|), parentheses ( ( and ) ), plus (+), minus (-),
asterisk (*).
- Some operators are words like "and", "or", "div", "mod". For our
purpose, they can be handled as tokens.
- Some operators are represented by a pair of symbols like "not equal"
(!=), "descendant-or-self" (//), "parent" (..). For our purpose, they can be
handled as 2 successive operators represented by one symbol each.
Example:
Logical order (without LRM): DEF!GHI 'A!B'=JK
Display (without LRM): KJ='B!A' IHG!FED
Logical order (with LRM): DEF@!GHI@ 'A!B'@=JK
Display (with LRM): FED!IHG 'B!A'=KJ
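The isolation of strings, followed by splitting on separators, can be sketched in Java as shown below. This is only a sketch: the names XPathSegments and split are hypothetical, and the separator set is the one listed in the design above. Each string becomes one part, each separator character becomes its own part, and everything else is accumulated into tokens.

```java
import java.util.ArrayList;
import java.util.List;

final class XPathSegments {
    // White space plus the separator characters listed in the design above.
    private static final String SEPARATORS = " \t\r\n/[]<>=!:@.|()+-*";

    /** Splits an XPath expression into strings, separators and tokens. */
    static List<String> split(String xpath) {
        List<String> parts = new ArrayList<>();
        StringBuilder token = new StringBuilder();
        int i = 0;
        while (i < xpath.length()) {
            char c = xpath.charAt(i);
            if (c == '"' || c == '\'') {
                // A string runs to the next occurrence of the same quotation mark,
                // or to the end of the input if it is never closed.
                flush(token, parts);
                int close = xpath.indexOf(c, i + 1);
                int end = close < 0 ? xpath.length() : close + 1;
                parts.add(xpath.substring(i, end));
                i = end;
            } else if (SEPARATORS.indexOf(c) >= 0) {
                flush(token, parts);
                parts.add(String.valueOf(c));
                i++;
            } else {
                token.append(c);
                i++;
            }
        }
        flush(token, parts);
        return parts;
    }

    /** Moves the pending token, if any, into the list of parts. */
    private static void flush(StringBuilder token, List<String> parts) {
        if (token.length() > 0) {
            parts.add(token.toString());
            token.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(split("DEF!GHI 'A!B'=JK"));
    }
}
```

With this approach, word operators such as "and" or "div" come out as ordinary tokens, and two-symbol operators such as "//" or "!=" come out as two successive one-character separators, as described above.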