guile/devel/translation/langtools.text

* Introduction

Version: $Id: langtools.text,v 1.5 2000-08-13 04:47:26 mdj Exp $

This is a proposal for how Guile could interface with language
translators.  It will be posted on the Guile list and revised for some
short time (days rather than weeks) before being implemented.

The document can be found in the CVS repository as
guile-core/devel/translation/langtools.text.  All Guile developers are
welcome to modify and extend it according to the ongoing discussion
using CVS.

Ideas and comments are welcome.

For clarity, the proposal is partially written as if describing an
already existing system.

MDJ 000812 <djurfeldt@nada.kth.se>

* Language names

A translator for Guile is a certain kind of Guile module, implemented
in Scheme, C, or a mixture of both.

To make things simple, the name of the language is closely related to
the name of the translator module.

Languages have long and short names.  The long form is simply the name
of the translator module: `(lang ctax)', `(lang emacs-lisp)',
`(my-modules foo-lang)' etc.

Languages with the long name `(lang IDENTIFIER)' can be referred to
with the short name IDENTIFIER, for example `emacs-lisp'.

* How to tell Guile to read code in a different language (than Scheme)

There are four methods of specifying which translator to use when
reading a file:

** Command option

The options to the guile command are parsed linearly from left to
right.  You can change the language at zero or more points using the
option

 -t, --language LANGUAGE

Example:

  guile -t emacs-lisp -l foo -l bar -t scheme -l baz

will use the emacs-lisp translator while reading "foo" and "bar", and
the default translator (scheme) for "baz".

You can use this technique in a script together with the meta switch:

#!/usr/local/bin/guile \
-t emacs-lisp -s
!#

** Commentary in file

When opening a file for reading, Guile will read the first few lines,
looking for the string "-*- LANGNAME -*-", where LANGNAME can be
either the long or short form of the name.

If found, the corresponding translator is loaded and used to read the
file.

** File extension

Guile maintains an alist mapping filename extensions to languages.
Each entry has the form:

  (REGEXP . LANGNAME)

where REGEXP is a string and LANGNAME a symbol or a list of symbols.

The alist can be accessed using `language-alist' which is exported
by the module `(core config)':

  (language-alist)			--> current alist
  (language-alist ALIST) 		sets the alist to ALIST
  (language-alist ALIST :prepend)	prepends ALIST onto the current list
  (language-alist ALIST :append)	appends ALIST after current list

The `load' command will match filenames against this alist and choose
the translator to use accordingly.

There will be a default alist for common translators.  For translators
not listed, the alist has to be extended in .guile just as Emacs users
extend auto-mode-alist in .emacs.

** Module header

You specify the language used by a module with the :language option in
the module header.  (See below under "Module configuration language".)

* Module system

This section describes how the Guile module system is adapted to use
with other languages.

** Module configuration language

*** The `(config)' module

Guile has a sophisticated module system.  We don't require each
translator implementation to implement its own syntax for modules.
That would be too much work for the implementor, and users would have
to learn the module system anew for each syntax.

Instead, the module `(config)' exports the module header form
`(define-module ...)'.

The config module also exports a number of primitives by which you can
customize the Guile library, such as `language-alist' and `load-path'.

*** Default module environment

The bindings of the config module is available in the default
interaction environment when Guile starts up.  This is because the
config module is on the module use list for the startup environment.

However, config bindings are *not* available by default in new
modules.

The default module environment provides bindings from the R5RS module
only.

*** Module headers

The module header of the current module system is the form

  (define-module NAME OPTION1 ...)

You can specify a translator using the option

  :language LANGNAME

where LANGNAME is the long or short form of language name as described
above.

The translator is being fed characters from the module file, starting
immediately after the end-parenthesis of the module header form.

NOTE: There can be only one module header per file.

It is also possible to put the module header in a separate file and
use the option

  :file FILENAME

to point out a file containing the actual code.

Example:

foo.gm:
----------------------------------------------------------------------
(define-module (foo)
  :language emacs-lisp
  :file "foo.el"
  :export (foo bar)
  )
----------------------------------------------------------------------

foo.el:
----------------------------------------------------------------------
(defun foo ()
  ...)

(defun bar ()
  ...)
----------------------------------------------------------------------

** Repl commands

Up till now, Guile has been dependent upon the available bindings in
the selected module in order to do basic operations such as moving to
a different module, enter the debugger or getting documentation.

This is not acceptable since we want be able to control Guile
consistently regardless of in which module we are, and sinc we don't
want to equip a module with bindings which don't have anything to do
with the purpose of the module.

Therefore, the repl provides a special command language on top of
whatever syntax the current module provides.  (Scheme48 and RScheme
provides similar repl command languages.)

[Jost Boekemeier has suggested the following alternative solution:
 Commands are bindings just like any other binding.  It is enough if
 some modules carry command bindings (it's in fact enough if *one*
 module has them), because from such a module you can use the command
 (in MODULE) to walk into a module not carrying command bindings, and
 then use CTRL-D to exit.

 However, this has the disadvantage of mixing the "real" bindings with
 command bindings (the module might want to use "in" for other
 purposes), that CTRL-D could cause problems since for some channels
 CTRL-D might close down the connection, and that using one type of
 command ("in") to go "into" the module and another (CTRL-D) to "exit"
 is more complex than simply "going to" a module.]

*** Repl command syntax

Normally, repl commands have the syntax

  ,COMMAND ARG1 ...

Input starting with arbitrary amount of whitespace + a comma thus
works as an escape syntax.

This syntax is probably compatible with all languages.  (Note that we
don't need to activate the lexer of the language until we've checked
if the first non-whitespace char is a comma.)

(Hypothetically, if this would become a problem, we can provide means
of disabling this behaviour of the repl and let that particular
language module take sole control of reading at the repl prompt.)

Among the commands available are

*** ,in MODULE

Select module named MODULE, that is any new expressions typed by the
user after this command will be evaluated in the evaluation
environment provided by MODULE.

*** ,in MODULE EXPR

Evaluate expression EXPR in MODULE.  EXPR has the syntax supplied by
the language used by MODULE.

*** ,use MODULE

Import all bindings exported by MODULE to the current module.

* Language modules

Since code written in any kind of language should be able to implement
most tasks, which may include reading, evaluating and writing, and
generally computing with, expressions and data originating from other
languages, we want the basic reading, evaluation and printing
operations to be independent of the language.

That is, instead of supplying separate `read', `eval' and `write'
procedures for different languages, a language module is required to
use the system procedures in the translated code.

This means that the behaviour of `read', `eval' and `write' are
context dependent.  (See further "How Guile system procedures `read',
`eval', `write' use language modules" below.)

** Language data types

Each language module should try to use the fundamental Scheme data
types as far as this is possible.

Some data types have important differences in semantics between
languages, though, and all required data types may not exist in
Guile.

In such cases, the language module must supply its own, distinct, data
types.  So, each language supported by Guile uses a certain set of
data types, with the basic Scheme data types as the intersection
between all sets.

Specifically, syntax trees representing source code expressions should
normally be a distinct data type.

** Foreign language escape syntax

Note that such data can flow freely between modules.  In order to
accomodate data with different native syntaxes, each language module
provides a foreign language escape syntax.  In Scheme, this syntax
uses the sharp comma extension specified by SRFI-10.  The read
constructor is simply the last symbol in the long language name (which
is usually the same as the short language name).

** Example 1

Characters have the syntax in Scheme and in ctax.  Lists currently
have syntax in Scheme but lack ctax syntax.  Ctax doesn't have a
datatype "enum", but we pretend it has for this example.

The following table now shows the syntax used for reading and writing
these expressions in module A using the language scheme, and module B
using the language ctax (we assume that the foreign language escape
syntax in ctax is #LANGUAGE EXPR):

	  A		   B

chars	  #\X		   'X'

lists	  (1 2 3)	   #scheme (1 2 3)

enums	  #,(ctax ENUM)	   ENUM

** Example 2

  A user is typing expressions in a ctax module which imports the
  bindings x and y from the module `(foo)':

  ctax> x = read ();
  1+2;
  1+2;
  ctax> x
  1+2;
  ctax> y = 1;
  1
  ctax> y;
  1
  ctax> ,in (guile-user)
  guile> ,use (foo)
  guile> x
  #,(ctax 1+2;)
  guile> y
  1
  guile>

The example shows that ctax uses a distinct representation for ctax
expressions, but Scheme integers for integers.

** Language module interface

A language module is an ordinary Guile module importing bindings from
other modules and exporting bindings through its public interface.

It is required to export the following variable and procedures:

*** language-environment --> ENVIRONMENT

Returns a fresh top-level ENVIRONMENT (a module) where expressions
in this language are evaluated by default.

Modules using this language will by default have this environment
on their use list.

The intention is for this procedure to provide the "run-time
environment" for the language.

*** native-read PORT --> OBJECT

Read next expression in the foreign syntax from PORT and return an
object OBJECT representing it.

It is entirely up to the language module to define what one
expression is, that is, how much to read.

In lisp-like languages, `native-read' corresponds to `read'.  Note
that in such languages, OBJECT need not be source code, but could
be data.

The representation of OBJECT is also chosen by the language
module.  It can consist of Scheme data types, data types distinct for
the language, or a mixture.

There is one requirement, however: Distinct data types must be
instances of a subclass of `language-specific-class'.

This procedure will be called during interactive use (the user
types expressions at a prompt) and when the system `read'
procedure is called at a time when a module using this language is
selected.

Some languages (for example Python) parse differently depending if
its an interactive or non-interactive session.  Guile prvides the
predicate `interactive-port?' to test for this.

*** language-specific-class

This variable contains the superclass of all non-Scheme data-types
provided by the language.

*** native-write OBJECT PORT

This procedure prints the OBJECT on PORT using the specific
language syntax.

*** write-foreign-syntax OBJECT LANGUAGE NATIVE-WRITE PORT

Write OBJECT in the foreign language escape syntax of this module.
The object is specific to language LANGUAGE and can be written using
NATIVE-WRITE.

Here's an implementation for Scheme:

(define (write-foreign-syntax object language native-write port)
  (format port "#(~A " language))
  (native-write object port)
  (display #\) port)

*** translate EXPRESSION --> SCHEMECODE

Translate an EXPRESSION into SCHEMECODE.

EXPRESSION can be anything returned by `read'.

SCHEMECODE is Scheme source code represented using ordinary Scheme
data.  It will be passed to `eval' in an environment containing
bindings in the environment returned by `language-environment'.

This procedure will be called duing interactive use and when the
system `eval

*** translate-all PORT [ALIST] --> THUNK

Translate the entire stream of characters PORT until #<eof>.
Return a THUNK which can be called repeatedly like this:

  THUNK --> SCHEMECODE

Each call will yield a new piece of scheme code.  The THUNK signals
end of translation by returning the value *end-of-translation* (which
is tested using the predicate `end-of-translation?').

The optional argument ALIST provides compilation options for the
translator:

  (debug . #t) means produce code suitable for debugging

This procedure will be called by the system `load' command and by
the module system when loading files.

The intensions are:

1. To let the language module decide when and in how large chunks
   to do the processing.  It may choose to do all processing at
   the time translate-all is called, all processing when THUNK is
   called the first time, or small pieces of processing each time
   THUNK is called, or any conceivable combination.

2. To let the language module decide in how large chunks to output
   the resulting Scheme code in order not to overload memory.

3. To enable the language module to use temporary files, and
   whole-module analysis and optimization techniques.

*** untranslate SCHEMECODE --> EXPRESSION

Attempt to do the inverse of `translate'.  An approximation is OK.  It
is also OK to return #f.  This procedure will be called from the
debugger, when generating error messages, backtraces etc.

The debugger uses the local evaluation environment to determine from
which module an expression come.  This is how the debugger can know
which `untranslate' procedure to call for a given expression.

(This is used currently to decide whether which backtrace frames to
display.  System modules use the option :no-backtrace to prevent
displaying of Guile's internals to the user.)

Note that `untranslate' can use source-properties set by `native-read'
to give hints about how to do the reverse translation.  Such hints
could for example be the filename, and line and column numbers for the
source expression, or an actual copy of the source expression.

** How Guile system procedures `read', `eval', `write' use language modules

*** read

The idea is that the `read' exported from the R5RS library will
continue work when called from other languages, and will keep its
semantics.

A call to `read' simply means "read in an expression from PORT using
the syntax associated with that port".

Each module carries information about its language.

When an input port is created for a module to be read or during
interaction with a given module, this information is copied to the
port object.

read uses this information to call `native-read' in the correct
language module.

*** eval

[To be written.]

*** write

[To be written.]

* Error handling

** Errors during translation

Errors during translation are generated as usual by calling scm-error
(from Scheme) or scm_misc_error etc (from C).  The effect of
throwing errors from within `translate-all' is the same as when they
are generated within a call to the THUNK returned from
`translate-all'.

scm-error takes a fifth argument.  This is a property list (alist)
which you can use to pass extra information to the error reporting
machinery.

Currently, the following properties are supported:

  filename  filename of file being translated
  line	    line number of errring expression
  column    column number

** Run-time errors (errors in SCHEMECODE)

This section pertains to what happens when a run-time error occurs
during evaluation of the translated code.

In order to get "foreign code" in error messages, make sure that
`untranslate' yields good output.  Note the possibility of maintaining
a table (preferably using weak references) mapping SCHEMECODE to
EXPRESSION.

Note the availability of source-properties for attaching filename,
line and column number, and other, information, such as EXPRESSION, to
SCHEMECODE.  If filename, line, and, column properties are defined,
they will be automatically used by the error reporting machinery.

* Proposed changes to Guile

** Implement the above proposal.

** Add new field `reader' and `translator' to all module objects

Make sure they are initialized when a language is specified.

** Use `untranslate' during error handling.

** Implement the use of arg 5 to scm-error

(specified in "Errors during translation")

** Implement a generic lexical analyzer with interface similar to read/rp

Mikael is working on this.  (It might take a few days, since he is
busy with his studies right now.)

** Remove scm:eval-transformer

This is replaced by new fields in each module object (environment).

`eval' will instead directly the `transformer' field in the module
passed as second arg.

Internal evaluation will, similarly, use the transformer of the module
representing the top-level of the local environment.

Note that this level of transformation is something independent of
language translation.  *This* is a hook for adding Scheme macro
packages and belong to the core language.

We also need to check the new `translator' field, potentially using
it.

** Package local environments as smobs

so that environment list structures can't leak out on the Scheme
level.  (This has already been done in SCM.)

** Introduce new fields in input ports

These carries state information such as

*** which keyword syntax to support

*** whether to be case sensitive or not

*** which lexical grammar to use

*** whether the port is used in an interactive session or not

There will be a new Guile primitive `interactive-port?' testing for this.

** Move configuration of keyword syntax and case sensitivity to the read-state

Add new fields to the module objects for these values, so that the
read-state can be initialized from them.

  *fixme* When? Why? How?

Probably as soon as the language has been determined during file loading.

Need to figure out how to set these values.


Local Variables:
mode: outline
End: