Supported Text Types

The following text types are supported out of the box:

Comma-delimited List
Full Path - Relative Path - File Name
Java Code
Regular Expression
Compound Name with Underscores
URL, URI, IRI
XPath

Developers can add custom text types by contributing to the org.eclipse.equinox.bidi.bidiTypes extension point.

Unless specified otherwise, we assume that the relative progression of the tokens and separators for display should always be from left to right, while the text of each token is displayed LTR or RTL depending on its content, according to the Unicode Bidirectional Algorithm.

(In the examples, "@" represents an LRM, "&" represents an RLM.)
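The "Display" lines in the examples can be reproduced with a small simulator. The following is a minimal sketch, not the Unicode Bidirectional Algorithm itself: it assumes an LTR base direction, treats uppercase letters as RTL letters and lowercase letters as LTR letters (a convention inferred from the examples), and ignores mirroring and embedding levels. The names `char_kind` and `simulate_display` are hypothetical.

```python
import re

LRM_MARK = "@"  # the examples' stand-in for U+200E (LEFT-TO-RIGHT MARK)


def char_kind(ch: str) -> str:
    """Classify a character under the examples' convention (an assumption)."""
    if ch == LRM_MARK or ch.islower():
        return "L"  # strong left-to-right
    if ch.isupper():
        return "R"  # stands in for a strong right-to-left letter
    if ch.isdigit():
        return "D"  # digit
    return "N"      # neutral: separators, spaces, punctuation


def simulate_display(logical: str) -> str:
    """Approximate the LTR-base display order of an example string."""
    out = []
    i, n = 0, len(logical)
    while i < n:
        if char_kind(logical[i]) != "R":
            out.append(logical[i])
            i += 1
            continue
        # An RTL run extends over neutrals and digits up to the last RTL
        # letter or digit reachable without crossing an LTR character.
        last = j = i
        while j < n and char_kind(logical[j]) != "L":
            if char_kind(logical[j]) in ("R", "D"):
                last = j
            j += 1
        run = logical[i:last + 1][::-1]
        # Digits inside an RTL run keep their own left-to-right order.
        out.append(re.sub(r"\d+", lambda m: m.group()[::-1], run))
        i = last + 1
    # The mark is invisible on screen, so drop it from the result.
    return "".join(out).replace(LRM_MARK, "")


print(simulate_display("ABC,DE,FGH"))    # without LRM
print(simulate_display("ABC@,DE@,FGH"))  # with LRM
```

Because "@" is reserved for the mark here, the sketch cannot be applied to text that contains a literal at sign.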

Comma-delimited List (`comma`)

Pattern

[first list item] , [second list item] , . . . , [last list item]

Detailed Design

The general algorithm applies, with the following adaptations:

There is only one separator, the comma (,).
This design can easily be adapted to accommodate a different separator, such as a semicolon (;) or a tab character.

Example:

   Logical order (without LRM):   ABC,DE,FGH
   Display (without LRM):         HGF,ED,CBA
   Logical order (with LRM):      ABC@,DE@,FGH
   Display (with LRM):            CBA,ED,HGF
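The placement of the marks in this example can be sketched in a few lines. The general algorithm is not spelled out in this section, so the rule below is an assumption inferred from the examples: a mark is inserted before a separator when the nearest strong character before it is RTL and the nearest strong character or digit after it is RTL or a digit. The function names are hypothetical, and the sketch uses the examples' convention (uppercase for RTL letters, "@" for LRM) rather than real Unicode properties.

```python
def strong_kind(ch: str, lrm: str = "@"):
    """Return "R", "L" or None under the examples' convention (an assumption)."""
    if ch.isupper():
        return "R"
    if ch.islower() or ch == lrm:
        return "L"
    return None


def insert_lrm(text: str, separators: str, lrm: str = "@") -> str:
    """Insert a mark before each separator that would otherwise be reordered."""
    out = []
    last_strong = None
    for i, ch in enumerate(text):
        if ch in separators and last_strong == "R":
            # Look ahead to the next strong letter or digit.
            following = None
            for nxt in text[i + 1:]:
                if nxt.isdigit() or strong_kind(nxt, lrm) == "R":
                    following = "R"
                    break
                if strong_kind(nxt, lrm) == "L":
                    following = "L"
                    break
            if following == "R":
                out.append(lrm)
                last_strong = "L"  # the inserted mark is itself strong LTR
        kind = strong_kind(ch, lrm)
        if kind:
            last_strong = kind
        out.append(ch)
    return "".join(out)


print(insert_lrm("ABC,DE,FGH", ","))
```

Passing a different `separators` string adapts the sketch to another delimiter, for example ";" or "\t".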

Name or Path of File or Directory

Patterns

Windows full path: [drive letter]:\ [sub-path] \ . . . \ [sub-path]

Windows relative path: [sub-path] \ . . . \ [sub-path]

Windows full file path: [drive letter]:\ [sub-path] \ . . . \ [sub-path] \ [file name] . [extension]

Windows relative file path: [sub-path] \ . . . \ [sub-path] \ [file name] . [extension]

Linux full path: / [sub-path] / . . . / [sub-path]

Linux relative path: [sub-path] / . . . / [sub-path]

Linux full file path: / [sub-path] / . . . / [sub-path] / [file name] . [extension]

Linux relative file path: [sub-path] / . . . / [sub-path] / [file name] . [extension]

Detailed Design

The general algorithm applies, with the following adaptation:

The separators are colon (:), backslash (\) and full stop (.) for Windows; slash (/) and full stop (.) for Linux.

Example:

   Logical order (without LRM):   c:\DIR1\DIR2\MYFILE.ext
   Display (without LRM):         c:\ELIFYM\2RID\1RID.ext
   Logical order (with LRM):      c:\DIR1@\DIR2@\MYFILE.ext
   Display (with LRM):            c:\1RID\2RID\ELIFYM.ext
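Outside the examples' uppercase convention, real paths need actual Unicode bidi classes. The sketch below is one possible approach using Python's standard `unicodedata` module and the real LRM character (U+200E). The insertion rule (previous strong character RTL, next strong character or number RTL) is an assumption inferred from the example above, and `add_lrm_to_path` is a hypothetical name.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK


def bidi_kind(ch: str):
    """Reduce a Unicode bidi class to "R", "L", "NUM" or None (neutral)."""
    cls = unicodedata.bidirectional(ch)
    if cls in ("R", "AL"):
        return "R"
    if cls == "L":
        return "L"
    if cls in ("EN", "AN"):
        return "NUM"
    return None


def add_lrm_to_path(path: str, windows: bool = True) -> str:
    # Separator sets as listed in the text for each platform.
    separators = ":\\." if windows else "/."
    out = []
    last_strong = None
    for i, ch in enumerate(path):
        if ch in separators and last_strong == "R":
            # First non-neutral character after the separator, if any.
            following = next((k for k in map(bidi_kind, path[i + 1:]) if k), None)
            if following in ("R", "NUM"):
                out.append(LRM)
                last_strong = "L"
        kind = bidi_kind(ch)
        if kind in ("L", "R"):
            last_strong = kind
        out.append(ch)
    return "".join(out)


# Hebrew directory names; the file extension is Latin.
marked = add_lrm_to_path("c:\\\u05d0\u05d1\\\u05d2\u05d3\\\u05d4.ext")
print([hex(ord(c)) for c in marked if c == LRM])
```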

Java Code

Requirement

We can classify elements of a Java program as:

white space
operators
String literals: they start with a double quote and end with a double quote which is not escaped (not preceded by a backslash).
comments: they start with /* and end with */ or start with // and end at the end of the line.
tokens: anything delimited by the previous items.

The requirement is to make the relative order of elements left-to-right, while each element by itself will be presented according to the Unicode Bidirectional Algorithm.

Detailed Design

The general algorithm applies, with the following adaptations:

Each String literal or comment is considered as one token.
The separators are all the characters used as operators and separators in the Java language: plus (+), minus (-), asterisk (*), slash (/), percent (%), less-than (<), greater-than (>), equal sign (=), ampersand (&), vertical bar (|), circumflex (^), tilde (~), left and right parentheses ( ( ) ), left and right square brackets ([ ]), left and right curly brackets ( { } ), comma (,), full stop (.), semicolon (;), exclamation mark (!), question mark (?), colon (:), and spaces that are not part of a String literal or a comment.
If a String literal or a comment includes LRE or RLE characters but does not include the proper number of matching PDF characters, the missing PDF characters must be added at the end of the literal or comment.
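The PDF rule can be illustrated with a short sketch. It counts the LRE (U+202A) and RLE (U+202B) characters that are still open at the end of a token and appends that many PDF (U+202C) characters. Appending after the closing delimiter of the literal or comment is an assumption, since the text only says "at the end", and `balance_embeddings` is a hypothetical name.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def balance_embeddings(token: str) -> str:
    """Append the PDF characters needed to close any open LRE/RLE."""
    depth = 0
    for ch in token:
        if ch in (LRE, RLE):
            depth += 1
        elif ch == PDF and depth > 0:
            depth -= 1  # a PDF closes the most recent open embedding
    return token + PDF * depth


comment = "/* " + RLE + "ABC */"
print(len(balance_embeddings(comment)) - len(comment))  # number of PDFs added
```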

Example:

   Logical order (without LRM):   A = /*B+C*/ D;
   Display (without LRM):         D /*C+B*/ = A;
   Logical order (with LRM):      A@ = /*B+C@*/ D;
   Display (with LRM):            A = /*C+B*/ D;
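The classification of Java elements described above can be approximated with a single regular expression. This is a rough sketch, not a Java lexer: it treats each comment and each String literal as one element, uses an operator set based on the separator list in the text (the equal sign is included because the example uses it), and does not handle character literals or text blocks. `JAVA_ELEMENT` and `java_elements` are hypothetical names.

```python
import re

JAVA_ELEMENT = re.compile(r'''
    (?P<comment>/\*.*?\*/|//[^\n]*)            # block or end-of-line comment
  | (?P<string>"(?:\\.|[^"\\\n])*")            # String literal, honoring escapes
  | (?P<space>\s+)                             # white space
  | (?P<operator>[-+*/%<>=&|^~()\[\]{},.;!?:]) # one operator or separator
  | (?P<token>[^\s\-+*/%<>=&|^~()\[\]{},.;!?:"]+)  # anything else
  | (?P<other>.)                               # stray character, e.g. an unclosed quote
''', re.VERBOSE | re.DOTALL)


def java_elements(code: str):
    """Return (kind, text) pairs covering the whole input, in order."""
    return [(m.lastgroup, m.group()) for m in JAVA_ELEMENT.finditer(code)]


for kind, text in java_elements("A = /*B+C*/ D;"):
    print(kind, repr(text))
```

Because the comment alternative is tried before the operator alternative, the plus sign inside `/*B+C*/` is never seen as a separator.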

Regular Expression

Requirement

The relative order of the components of a regular expression must be the same as the order in which they appear when only Latin characters are used.

Detailed Design

The general algorithm applies, with the following adaptations:

Regular expressions consist of operators, pattern characters, and (in most implementations of extended syntax) named identifiers.
Since the syntax of regular expressions is not standardized, the list of operators should be adapted to the specific implementation at hand.
Common operators include: question mark (?), circumflex (^), dollar ($), plus (+), minus (-), asterisk (*), vertical bar (|), tilde (~), left and right parentheses ( ( ) ), left and right square brackets ([ ]), left and right curly brackets ( { } ), commercial at (@), number sign (#), ampersand (&), backslash (\).
The separators will be the characters used as operators for regular expressions.
Characters which are not operators are pattern characters. If an operator is immediately preceded by a backslash, both the backslash and the operator must be handled as pattern characters.
Each pattern character is a separate token, so pattern characters will always be ordered according to the base text direction of the expression.
Identifiers appear in certain syntactic constructs, and are treated as tokens. For example, the strings "digit" and "number" in the expression "total: (?<number>[:digit:]+)\s" are identifiers, whereas "total" is just a sequence of 5 pattern characters.
The following constructs must be recognized as delimiting tokens (note: this list should be adapted to the specific syntax of regular expressions in a given environment):
   (?<name>
   (?'name'
   (?(<name>)
   (?('name')
   (?(name)
   (?&name)
   (?P<name>
   \k<name>
   \k'name'
   \k{name}
   (?P=name)
   \g{name}
   \g<name>
   \g'name'
   (?(R&name)
   [:class:]
Comments of the form (?# . . . ) must be handled as individual tokens.
Quoted sequences of the form \Q . . . \E must be handled as individual tokens.
Numbers used as quantifiers (numbers of occurrences) or as group references must be handled as individual tokens.
If the first strong directional character in a regular expression is an Arabic letter, the base direction of the expression must be RTL.
If the first strong directional character in a regular expression is a Hebrew letter or an LTR letter, the base direction of the expression must be LTR.
If the regular expression contains no strong directional character, its base direction must be LTR for Hebrew users. For Arabic users, its base direction should follow the user interface direction (RTL if mirrored, LTR otherwise).
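The three base-direction rules above can be sketched as a small function. It relies on Python's standard `unicodedata` module and approximates "Arabic letter" by the bidi class AL and "Hebrew letter" by the class R; both classes also cover some other scripts, so this is an approximation. The function name and the `arabic_user` and `ui_mirrored` parameters are hypothetical.

```python
import unicodedata


def regex_base_direction(expr: str, arabic_user: bool = False,
                         ui_mirrored: bool = False) -> str:
    """Return "RTL" or "LTR" for a regular expression."""
    for ch in expr:
        cls = unicodedata.bidirectional(ch)
        if cls == "AL":
            return "RTL"  # first strong character is an Arabic letter
        if cls in ("L", "R"):
            return "LTR"  # first strong character is LTR or Hebrew
    # No strong character at all: Hebrew users always get LTR,
    # Arabic users follow the direction of the user interface.
    if arabic_user and ui_mirrored:
        return "RTL"
    return "LTR"


print(regex_base_direction("\u0627\u0628+"))  # starts with Arabic letters
print(regex_base_direction("\u05d0\u05d1+"))  # starts with Hebrew letters
```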

Example (Hebrew):

   Logical order (without LRM):   ABC(?'DEF'GHI
   Display (without LRM):         IHG'FED'?(CBA
   Logical order (with LRM):      A@B@C@(?'DEF'@G@H@I
   Display (with LRM):            ABC(?'FED'GHI

Example (Arabic):

   Logical order (without LRM):   ABC(?'DEF'GHI
   Display (without LRM):         IHG'FED'?(CBA
   Logical order (with LRM):      ABC(?'DEF'GHI
   Display (with LRM):            IHG'FED'?(CBA
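The token rules from the Detailed Design can be sketched as a classifier for a small subset of the syntax. The sketch below recognizes only four of the named constructs listed above, namely `(?<name>` (and `(?P<name>`), `(?'name'`, `\k<name>` and `[:class:]`, plus comments, quoted sequences and escaped characters. It does not isolate quantifier numbers or group references and does not check that opening and closing delimiters match. `REGEX_ELEMENT` and `regex_tokens` are hypothetical names.

```python
import re

REGEX_ELEMENT = re.compile(r"""
    (?P<comment>\(\?\#[^)]*\))                # (?# ... ) comment, one token
  | (?P<quoted>\\Q.*?\\E)                     # \Q ... \E sequence, one token
  | (?P<named>                                # a few named constructs
        (?P<open>\(\?P?<|\(\?'|\\k<|\[:)
        (?P<name>\w+)
        (?P<close>>|'|:\])
    )
  | (?P<escaped>\\.)                          # backslash + character: pattern characters
  | (?P<operator>[?^$+\-*|~()\[\]{}@\#&])     # single-character operator
  | (?P<char>.)                               # any other pattern character
""", re.VERBOSE | re.DOTALL)


def regex_tokens(expr: str):
    """Return (kind, text) pairs; identifiers are split from their delimiters."""
    tokens = []
    for m in REGEX_ELEMENT.finditer(expr):
        if m.group("name"):
            tokens.append(("operator", m.group("open")))
            tokens.append(("identifier", m.group("name")))
            tokens.append(("operator", m.group("close")))
        else:
            tokens.append((m.lastgroup, m.group()))
    return tokens


for kind, text in regex_tokens("ABC(?'DEF'GHI"):
    print(kind, text)
```

In the Hebrew example, this yields one token per pattern character (A, B, C, G, H, I) and a single identifier token for DEF, which is consistent with the positions of the marks shown there.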

Compound Name with Underscores

Pattern

[first part] _ [second part] _ [third part]

Note: name parts must not include underscores.

Detailed Design

The general algorithm applies, with the following adaptation:

There is only one separator, the underscore (_).

Example:

   Logical order (without LRM):   MYPACKAGE_MYPROGRAM
   Display (without LRM):         MARGORPYM_EGAKCAPYM
   Logical order (with LRM):      MYPACKAGE@_MYPROGRAM
   Display (with LRM):            EGAKCAPYM_MARGORPYM

URL, URI, IRI

Patterns

http:// [domain label] . . . . . [domain label]

http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension]

http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension] # [local reference]

http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension] ? [key1] = [value1] & [key2] = [value2]

Detailed Design

The general algorithm applies, with the following adaptations:

The detailed syntax of URLs, URIs, IRIs is described in RFC 3986 and RFC 3987. A rigorous analysis to identify tokens and separators is not simple.
For most practical cases, it is sufficient to consider the following separators: colon (:), question mark (?), number sign (#), slash (/), commercial at (@), full stop (.), left bracket ([), right bracket (]).

Example:

   Logical order (without LRM):   www.DOC.MYDOMAIN.com/HEB/LESSON1.html
   Display (without LRM):         www.NIAMODYM.COD.com/1NOSSEL/BEH.html
   Logical order (with LRM):      www.DOC@.MYDOMAIN.com/HEB@/LESSON1.html
   Display (with LRM):            www.COD.NIAMODYM.com/BEH/1NOSSEL.html
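To see what the practical separator list implies, the following sketch splits a URL into tokens and separators using exactly the characters listed in the Detailed Design. It is not an RFC 3986 parser, and `split_url` is a hypothetical name.

```python
import re

# Separators as listed in the text for most practical cases.
URL_SEPARATORS = ":?#/@.[]"

# A character class built from the list; the capturing group makes
# re.split keep the separators in the result.
_SPLITTER = re.compile("([" + re.escape(URL_SEPARATORS) + "])")


def split_url(url: str):
    """Return the tokens and separators of a URL, in order."""
    return [part for part in _SPLITTER.split(url) if part]


print(split_url("http://a.b/c?k=v#x"))
```

Note that with this list a query component such as `k=v` stays a single token, because the equal sign and the ampersand are not among the listed separators.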

XPath

Patterns

/ book / chapter / paragraph

/ year / month [@name = "April"]

Detailed Design

The general algorithm applies, with the following adaptations:

Strings
- Strings are started by a quotation mark which can be a double-quote (") or an apostrophe ('), and are closed by the same character.
- Double-quotes may appear within a string limited by apostrophes and vice versa, and must be handled as characters internal to the string.
- A string started on one line is not necessarily closed on the same line.
Whitespace (e.g. blanks and tab characters) appearing outside of strings constitutes a delimiter for tokens.
Each occurrence of a string must be handled as one token.
After isolating strings, the following characters are separators: white space, slash (/), square brackets ( [ and ] ), less-than (<), greater-than (>), equal sign (=), exclamation mark (!), colon (:), at sign (@), period (.), vertical bar (|), parentheses ( ( and ) ), plus (+), minus (-), asterisk (*).
Some operators are words like "and", "or", "div", "mod". For our purpose, they can be handled as tokens.
Some operators are represented by a pair of symbols like "not equal" (!=), "descendant-or-self" (//), "parent" (..). For our purpose, they can be handled as 2 successive operators represented by one symbol each.

Example:

   Logical order (without LRM):   DEF!GHI 'A!B'=JK
   Display (without LRM):         KJ='B!A' IHG!FED
   Logical order (with LRM):      DEF@!GHI@ 'A!B'@=JK
   Display (with LRM):            FED!IHG 'B!A'=KJ
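The order of operations described above (isolate strings first, then split on separators) can be sketched with one regular expression whose first alternative consumes whole strings. This is an illustrative sketch rather than an XPath parser. The separator class follows the list in the text, and `XPATH_ELEMENT` and `xpath_elements` are hypothetical names.

```python
import re

XPATH_ELEMENT = re.compile(r"""
    (?P<string>"[^"]*"|'[^']*')             # quoted string, may span lines
  | (?P<separator>[\s/\[\]<>=!:@.|()+\-*])  # one separator character
  | (?P<token>[^\s/\[\]<>=!:@.|()+\-*"']+)  # names, numbers, word operators
  | (?P<other>.)                            # stray character, e.g. an unclosed quote
""", re.VERBOSE | re.DOTALL)


def xpath_elements(expr: str):
    """Return (kind, text) pairs covering the whole expression, in order."""
    return [(m.lastgroup, m.group()) for m in XPATH_ELEMENT.finditer(expr)]


for kind, text in xpath_elements("DEF!GHI 'A!B'=JK"):
    print(kind, repr(text))
```

Since strings are matched before separators, the exclamation mark inside `'A!B'` stays part of the string token, and a two-symbol operator such as `!=` comes out as two successive separators.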