Every Ceylon source file is a sequence of Unicode characters. Lexical analysis of the character stream, according to the grammar specified in this chapter, results in a stream of tokens. These tokens form the input of the parser grammar defined in the later chapters of this specification. The Ceylon lexer is able to completely tokenize a character stream in a single pass.
The lexer distinguishes Unicode uppercase letter, lowercase letter, and numeric characters as defined by the Unicode standard. A LowercaseLetter is any character whose General_Category is Lowercase_Letter. An UppercaseLetter is any character whose General_Category is Uppercase_Letter, Titlecase_Letter, or Other_Letter. A Number is any character whose General_Category is Decimal_Number, Letter_Number, or Other_Number.
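The character classes above can be illustrated with a small sketch. It assumes Python's standard `unicodedata` module as the source of General_Category values; the function name `lexer_class` is hypothetical and not part of the specification.

```python
import unicodedata

def lexer_class(ch):
    """Map one character to the lexer class described above, or None."""
    category = unicodedata.category(ch)  # two-letter General_Category code
    if category == "Ll":                  # Lowercase_Letter
        return "LowercaseLetter"
    if category in ("Lu", "Lt", "Lo"):    # Uppercase, Titlecase, Other letter
        return "UppercaseLetter"
    if category in ("Nd", "Nl", "No"):    # Decimal, Letter, Other number
        return "Number"
    return None                           # not a letter or number class
```

Note that, under this rule, a character with no case distinction (General_Category Other_Letter) is classified as an UppercaseLetter.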
Whitespace characters are the ASCII SP, HT, FF, LF and CR characters.
Whitespace: " " | Tab | Formfeed | Newline | Return
Outside of a comment, string literal, or single quoted literal, whitespace acts as a token separator and is immediately discarded by the lexer. Whitespace is not used as a statement separator.
There are two kinds of comments:
a multiline comment that begins with /* and extends until */, and
an end-of-line comment that begins with // or #! and extends until a line terminator: an ASCII LF, CR or CR LF.
Both kinds of comments can be nested.
LineComment: ("//"|"#!") ~(Newline|Return)* (Return Newline | Return | Newline)?
MultilineComment: "/*" (MultilineCommmentCharacter | MultilineComment)* "*/"
MultilineCommmentCharacter: ~("/"|"*") | ("/" ~"*") => "/" | ("*" ~"/") => "*"
The following examples are legal comments:
//this comment stops at the end of the line
/* but this is a comment that spans multiple lines */
#!/usr/bin/ceylon
Comments are treated as whitespace by both the compiler and documentation compiler. Comments may act as token separators, but their content is immediately discarded by the lexer and they are not visible to the parser.
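The comment rules above can be sketched as a scanner that skips one comment and returns the position just past it. This is a minimal illustration, not the actual Ceylon lexer; the name `skip_comment` is hypothetical, and nesting of multiline comments is tracked with a simple depth counter.

```python
def skip_comment(src, start):
    """Return the index just past the comment that begins at `start`."""
    if src.startswith(("//", "#!"), start):
        i = start + 2
        # An end-of-line comment runs up to a line terminator.
        while i < len(src) and src[i] not in "\r\n":
            i += 1
        # Consume the terminator: CR LF, CR, or LF (absent at end of input).
        if src.startswith("\r\n", i):
            return i + 2
        if i < len(src):
            return i + 1
        return i
    if src.startswith("/*", start):
        depth = 1
        i = start + 2
        while i < len(src):
            if src.startswith("/*", i):    # a nested comment opens
                depth += 1
                i += 2
            elif src.startswith("*/", i):  # the innermost comment closes
                depth -= 1
                i += 2
                if depth == 0:
                    return i
            else:
                i += 1
        raise ValueError("unterminated multiline comment")
    raise ValueError("no comment starts at this position")
```

For example, given the text `/* a /* b */ c */x`, the scanner returns the index of `x`, because the inner `*/` closes only the nested comment.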
Identifiers may contain upper and lowercase letters, digits and underscores.
LowercaseChar: LowercaseLetter | "_"
UppercaseChar: UppercaseLetter
IdentifierChar: LowercaseChar | UppercaseChar | Number
All identifiers are case sensitive: Person and person are two different legal identifiers.
The Ceylon lexer distinguishes identifiers which begin with an initial uppercase character from identifiers which begin with an initial lowercase character or underscore. Additionally, an identifier may be qualified using the prefix \i or \I to disambiguate it from a reserved word or to explicitly specify whether it should be considered an uppercase or lowercase identifier.
LIdentifier: LowercaseChar IdentifierChar* | "\i" IdentifierChar+
UIdentifier: UppercaseChar IdentifierChar* | "\I" IdentifierChar+
The following examples are legal identifiers:
Person
name
personName
_id
x2
\I_id
\Iobject
\iObject
\iclass
The prefix \I or \i is not considered part of the identifier name. Therefore, \iperson is just a lowercase identifier named person and \Iperson is an uppercase identifier named person.
The following reserved words are not legal identifier names unless they appear escaped using \i or \I:
module package import alias class interface object given value assign void function of extends satisfies adapts abstracts in out return break continue throw assert dynamic if else switch case for while try catch finally then this outer super is exists nonempty
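The identifier rules above, including the \i and \I escapes and the reserved words, can be sketched as follows. This is an illustrative classifier under the stated grammar, not the real lexer; the function names are hypothetical, and Python's `unicodedata` module is assumed as the source of General_Category values.

```python
import unicodedata

# Reserved words copied from the list above.
RESERVED = set("""module package import alias class interface object given
value assign void function of extends satisfies adapts abstracts in out return
break continue throw assert dynamic if else switch case for while try catch
finally then this outer super is exists nonempty""".split())

def is_lowercase_char(c):
    return c == "_" or unicodedata.category(c) == "Ll"

def is_uppercase_char(c):
    return unicodedata.category(c) in ("Lu", "Lt", "Lo")

def is_identifier_char(c):
    return (is_lowercase_char(c) or is_uppercase_char(c)
            or unicodedata.category(c) in ("Nd", "Nl", "No"))

def classify_identifier(token):
    """Return (kind, name) where kind is "LIdentifier" or "UIdentifier"."""
    if token[:2] in ("\\i", "\\I"):
        # The escape fixes the kind and is not part of the name.
        kind = "LIdentifier" if token[1] == "i" else "UIdentifier"
        name = token[2:]
        if not name or not all(is_identifier_char(c) for c in name):
            raise ValueError("escape must be followed by identifier characters")
        return kind, name
    if not token or not all(is_identifier_char(c) for c in token):
        raise ValueError("not an identifier")
    if token in RESERVED:
        raise ValueError("reserved word must be escaped with \\i or \\I")
    if is_lowercase_char(token[0]):
        return "LIdentifier", token
    if is_uppercase_char(token[0]):
        return "UIdentifier", token
    raise ValueError("an identifier may not begin with a number")
```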
An integer literal may be expressed in decimal, hexadecimal, or binary notation:
IntegerLiteral: DecimalLiteral | HexLiteral | BinLiteral
A decimal literal has this form:
DecimalLiteral: Digits Magnitude?
Hexadecimal literals are prefixed by #:
HexLiteral: "#" HexDigits
Binary literals are prefixed by $:
BinLiteral: "$" BinDigits
A floating point literal has this form:
FloatLiteral: Digits ("." FractionalDigits (Exponent | Magnitude | FractionalMagnitude)? | FractionalMagnitude)
Decimal digits may be separated into groups of three using an underscore.
Digits: Digit+ | Digit{1..3} ("_" Digit{3})+
FractionalDigits: Digit+ | (Digit{3} "_")+ Digit{1..3}
Hexadecimal or binary digits may be separated into groups of four using an underscore. Hexadecimal digits may even be separated into groups of two.
HexDigits: HexDigit+ | HexDigit{1..4} ("_" HexDigit{4})+ | HexDigit{1..2} ("_" HexDigit{2})+
BinDigits: BinDigit+ | BinDigit{1..4} ("_" BinDigit{4})+
A digit is a decimal, hexadecimal, or binary digit.
Digit: "0".."9"
HexDigit: "0".."9" | "A".."F" | "a".."f"
BinDigit: "0"|"1"
A floating point literal may include either an exponent (for scientific notation) or a magnitude (an SI unit prefix). An integer literal may include a magnitude.
Exponent: ("E"|"e") ("+"|"-")? Digit+
Magnitude: "k" | "M" | "G" | "T" | "P"
FractionalMagnitude: "m" | "u" | "n" | "p" | "f"
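As a small illustration of the magnitude suffixes, the table below maps each one to a power of ten. The values follow the standard SI prefixes (kilo through peta, milli through femto); that the language scales literals by exactly these factors is an assumption here, since this section describes the suffixes only as SI unit prefixes. The name `power_of_ten` is hypothetical.

```python
# Exponents of ten for each suffix, following the SI prefix definitions.
MAGNITUDE_EXPONENT = {"k": 3, "M": 6, "G": 9, "T": 12, "P": 15}
FRACTIONAL_MAGNITUDE_EXPONENT = {"m": -3, "u": -6, "n": -9, "p": -12, "f": -15}

def power_of_ten(suffix):
    """Return the decimal exponent implied by a magnitude suffix."""
    if suffix in MAGNITUDE_EXPONENT:
        return MAGNITUDE_EXPONENT[suffix]
    if suffix in FRACTIONAL_MAGNITUDE_EXPONENT:
        return FRACTIONAL_MAGNITUDE_EXPONENT[suffix]
    raise ValueError("unknown magnitude suffix: " + repr(suffix))
```

Under this reading, an integer literal such as 12M would denote 12 * 10**6.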
The following examples are legal numeric literals:
69
6.9
0.999e-10
1.0E2
10000
1_000_000
12_345.678_9
1.5k
12M
2.34p
5u
$1010_0101
#D00D
#FF_FF_FF
The following are not valid numeric literals:
.33 //Error: floating point literals may not begin with a decimal point
1. //Error: floating point literals may not end with a decimal point
99E+3 //Error: floating point literals with an exponent must contain a decimal point
12_34 //Error: decimal digit groups must be of length three
#FF.00 //Error: floating point numbers may not be expressed in hexadecimal notation
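The numeric literal grammar above translates fairly directly into regular expressions. The following is a sketch using Python's `re` module, written to check the examples above rather than to serve as a reference lexer; the names `classify_numeric`, `INTEGER`, and `FLOAT` are hypothetical.

```python
import re

DIGITS = r"(?:[0-9]+|[0-9]{1,3}(?:_[0-9]{3})+)"
FRACTIONAL_DIGITS = r"(?:[0-9]+|(?:[0-9]{3}_)+[0-9]{1,3})"
HEX_DIGITS = (r"(?:[0-9A-Fa-f]+"
              r"|[0-9A-Fa-f]{1,4}(?:_[0-9A-Fa-f]{4})+"
              r"|[0-9A-Fa-f]{1,2}(?:_[0-9A-Fa-f]{2})+)")
# Groups of four binary digits, as the prose describes.
BIN_DIGITS = r"(?:[01]+|[01]{1,4}(?:_[01]{4})+)"
EXPONENT = r"(?:[Ee][+-]?[0-9]+)"
MAGNITUDE = r"[kMGTP]"
FRACTIONAL_MAGNITUDE = r"[munpf]"

# IntegerLiteral: DecimalLiteral | HexLiteral | BinLiteral
INTEGER = re.compile(rf"{DIGITS}{MAGNITUDE}?|#{HEX_DIGITS}|\${BIN_DIGITS}")
# FloatLiteral: a decimal point with fractional digits, or a fractional magnitude
FLOAT = re.compile(
    rf"{DIGITS}(?:\.{FRACTIONAL_DIGITS}"
    rf"(?:{EXPONENT}|{MAGNITUDE}|{FRACTIONAL_MAGNITUDE})?"
    rf"|{FRACTIONAL_MAGNITUDE})"
)

def classify_numeric(text):
    """Return "integer", "float", or None if `text` is not a numeric literal."""
    if INTEGER.fullmatch(text):
        return "integer"
    if FLOAT.fullmatch(text):
        return "float"
    return None
```

With these patterns, `12M` is classified as an integer literal and `5u` as a floating point literal, while each of the invalid examples above is rejected.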
A single character literal consists of a character, surrounded by single quotes.
CharacterLiteral: "'" Character "'"
Character: ~("'" | "\") | EscapeSequence
EscapeSequence: "\" ("b" | "t" | "n" | "f" | "r" | "\" | """ | "'" | "`" | "{" CharacterCode "}")
A Unicode codepoint escape is a four-digit or eight-digit hexadecimal literal surrounded by braces.
CharacterCode: "#" HexDigit{4} | HexDigit{8}
The following are legal character literals:
'A'
'#'
' '
'\n'
'\{#212B}'
TODO: should we support an escape sequence for Unicode character names \{LATIN SMALL LETTER A} like Python does?
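Decoding a character literal according to the escape rules above can be sketched as follows. The function name `decode_character_literal` is hypothetical. One assumption is labeled loudly: the grouping of the CharacterCode production is ambiguous as written, so this sketch permissively accepts an optional # followed by either four or eight hexadecimal digits.

```python
import re

# Single-character escapes listed in the EscapeSequence production.
SIMPLE_ESCAPES = {
    "b": "\b", "t": "\t", "n": "\n", "f": "\f", "r": "\r",
    "\\": "\\", '"': '"', "'": "'", "`": "`",
}
# Assumption: optional "#", then four or eight hex digits, inside braces.
CODE_ESCAPE = re.compile(r"\{#?([0-9A-Fa-f]{8}|[0-9A-Fa-f]{4})\}")

def decode_character_literal(literal):
    """Return the single character denoted by a quoted character literal."""
    if len(literal) < 3 or literal[0] != "'" or literal[-1] != "'":
        raise ValueError("not a character literal")
    body = literal[1:-1]
    if len(body) == 1 and body not in "'\\":
        return body                      # an ordinary character
    if body.startswith("\\"):
        rest = body[1:]
        if rest in SIMPLE_ESCAPES:
            return SIMPLE_ESCAPES[rest]
        match = CODE_ESCAPE.fullmatch(rest)
        if match:
            return chr(int(match.group(1), 16))
    raise ValueError("malformed character literal")
```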
A character string literal is a character sequence surrounded by double quotes.
StringLiteral: """ StringCharacter* """
StringCharacter: ~( "\" | """ | "`" ) | "`" ~"`" | EscapeSequence
A sequence of two backticks is used to delimit an interpolated expression embedded in a string template.
StringStart: """ StringCharacter* "``"
StringMid: "``" StringCharacter* "``"
StringEnd: "``" StringCharacter* """
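The way a string template breaks into StringStart, StringMid, and StringEnd tokens can be sketched by splitting on the double-backtick delimiter. This is a deliberately simplified illustration: it assumes the interpolated expressions contain no string literals or backticks of their own, which a real lexer would have to handle. The name `split_template` and the "Expression" label are hypothetical.

```python
def split_template(text):
    """Split a quoted string into template tokens and raw expression text."""
    parts = text.split("``")
    if len(parts) == 1:
        return [("StringLiteral", text)]  # no interpolation at all
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced `` delimiters")
    tokens = [("StringStart", parts[0] + "``")]
    for index, part in enumerate(parts[1:-1], start=1):
        if index % 2 == 1:
            tokens.append(("Expression", part))           # between delimiters
        else:
            tokens.append(("StringMid", "``" + part + "``"))
    tokens.append(("StringEnd", "``" + parts[-1]))
    return tokens
```

For instance, the text `"Hi ``a`` and ``b``!"` yields a StringStart, the expression `a`, a StringMid, the expression `b`, and a StringEnd.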
A verbatim string is a character sequence delimited by a sequence of three double quotes. Verbatim strings do not contain escape sequences or interpolated expressions, so every character occurring inside the verbatim string is interpreted literally.
VerbatimStringLiteral: """"" VerbatimCharacter* """""
VerbatimCharacter: ~""" | """ ~""" | """ """ ~"""
The following are legal strings:
"Hello!"
"\{00E5}ngstr\{00F6}ms"
" \t\n\f\r,;:"
"""This program prints "hello world" to the console."""
TODO: specify how initial whitespace is stripped from multiline string literals.
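The verbatim string grammar given earlier can be expressed as a single regular expression. The sketch below uses Python's `re` module and mirrors the VerbatimCharacter production literally: a non-quote character, one quote followed by a non-quote, or two quotes followed by a non-quote. The names `VERBATIM` and `is_verbatim_string` are hypothetical.

```python
import re

# Three double quotes, any number of verbatim characters, three double quotes.
VERBATIM = re.compile(r'"""(?:[^"]|"[^"]|""[^"])*"""', re.DOTALL)

def is_verbatim_string(text):
    """Return True if `text` is exactly one verbatim string literal."""
    return VERBATIM.fullmatch(text) is not None
```

Because no escape sequences are recognized, a backslash or a single embedded double quote inside the delimiters is matched as ordinary content.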
The following character sequences are operators and/or punctuation:
, ; ... { } ( ) [ ] ? . ?. *. = => + - * / % ^ ** ++ -- .. : -> ! && || ~ & | === == != < > <= >= <=> += -= /= *= %= |= &= ^= ~= ||= &&= ::
Certain symbols serve dual or multiple purposes in the grammar.
A source directory contains Ceylon source code in files with the extension .ceylon and Java source code in files with the extension .java. The module and package to which a compilation unit belongs is determined by the subdirectory in which the source file is found.
The name of the package to which a compilation unit belongs is formed by replacing every path directory separator character with a period in the relative path from the source directory to the subdirectory containing the source file. In the case of a Java source file, the subdirectory must agree with the package specified by the Java package declaration.
The name of the module to which a compilation unit belongs is determined by searching all containing directories for a module descriptor. The name of the module is formed by replacing every path directory separator character with a period in the relative path from the source directory to the subdirectory containing the module descriptor.
Thus, the structure of the source directory containing the module org.hello might be the following:
source/
    org/
        hello/
            module.ceylon    //the module descriptor
            main/
                hello.ceylon
            default/
                DefaultHello.ceylon
            personalized/
                PersonalizedHello.ceylon
The source code for multiple modules may be contained in a single source directory.
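The mapping from file paths to package and module names described above can be sketched as follows. This illustration works on path strings only: the parameter `descriptor_dirs` is a hypothetical stand-in for scanning the file system for module descriptors, and the function names are not part of any Ceylon tool.

```python
from pathlib import PurePosixPath

def package_name(source_dir, source_file):
    """Package name: the relative directory path with separators as periods."""
    relative = PurePosixPath(source_file).parent.relative_to(
        PurePosixPath(source_dir))
    return ".".join(relative.parts)

def module_name(source_dir, source_file, descriptor_dirs):
    """Module name: found by searching containing directories for a descriptor.

    `descriptor_dirs` lists the directories assumed to hold a module
    descriptor. Returns None when no containing directory has one.
    """
    root = PurePosixPath(source_dir)
    descriptors = {PurePosixPath(d) for d in descriptor_dirs}
    directory = PurePosixPath(source_file).parent
    # Walk upward from the file's directory toward the source directory.
    while directory != root and root in directory.parents:
        if directory in descriptors:
            return ".".join(directory.relative_to(root).parts)
        directory = directory.parent
    return None
```

With the layout shown above, the file source/org/hello/main/hello.ceylon belongs to package org.hello.main and, given a descriptor in source/org/hello, to module org.hello.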
Note: the tools and IDE support compilation and execution of source not contained in a well-defined module. This "default" module is not specified here, and is intended only as a convenience for experimental code.