TCA2 can be run either from the jar file
or with a command line from a command/terminal window.

Running from a command/terminal window is useful
if you suspect there are bugs in the program.
Any java runtime error messages will show
in the command/terminal window.
(There might also be messages from the program itself,
usually harmless ones.)

There is a bat file supplied for Windows users.
Running the bat file will normally bring up the Command window.
The bat file is also useful if you need to know the command line.

- - - - - - - - -

Recent changes:

- - - - - - - - -

2006-10-04

(1)

In later versions, when running more than one alignment in a row
(in skip 1-1 or automatic modes), the program shows progress
by continuously updating the boxes containing
aligned elements and unaligned elements.
In the current version this feature has been removed.
Progress is only indicated in the status line bottom left,
as numbers telling how many elements have been aligned so far,
out of the total number of elements.
Not until the program is finished
(e.g, when comming across a non-1-1 alignment in skip 1-1 mode)
is the rest of the window updated.

The change makes the program faster and less memory-hungry.
Before the change the program could easily run out of memory
on long, continuous runs, i.e., when running automatically
with a limit in the order of hundreds or more.

To the program author it looks like memory (heap space)
is not garbage collected well enough.
Garbage collection is normally done automatically in java.

(2)

Even with the change described in (1) the program
can run out of memory. There are some remaining
memory leak issues that need to be investigated.
In the meantime some useful changes
have been made to the program.
Together with advice below the changes will help the user
avoid serious consequences,

The program now constantly monitors available memory
and halts gracefully with a warning.

At that point, fiddling with the interface might trigger
garbage collection. Try for instance the Unalign button
followed by the Align button, or an Up button followed by
the corresponding Down button.

While doing this watch the new memory usage indicator
in the status line bottom right. If the figure drops after
playing around a bit with the buttons, one can restart
the alignment process with the Suggest button.

It is also recommended to start up the program with more memory
than it gets by default.
E.g, the following command line will give the program
128 MB of memory at startup, with a maximum of 512 MB.
(The default maximum might be 64 MB):

  java -Xms128m -Xmx512m -jar alignment.jar

Which values are possible might depend on the size of
the memory of one's computer.

(3)

The test for empty alignable elements, introduced in the
previous version, has been improved, and should now catch
any empty element. (Except the program still stops at
the first occurrence of an empty element.
Further empty elements in the file are not found and reported.)

- - - - - - - - -

2006-09-19 - 2006-09-22 - 2006-09-25

(1)

Later versions of the program have been slow (slower than usual).
A bug caused some calculations to be re-done a lot of times
instead of earlier results being re-used.

These calculations were anchor word/proper names/etc calculations
on 1-2 and 2-1 alignments with badly matching lengths, i.e,
cases the program considers to be hopeless candidates for alignment.
So not only were the calculations done again and again,
they were also unnecessary.

Doing these calculations made the program able to show
how well elements match word-wise, even with badly matching lengths.
Some users doing manual alignment might want that,
so the calculations weren't _completely_ useless.

In the current version these calculations aren't done at all,
and all the "manual" user sees is a message saying
the lengths match very poorly - no details about anchor word match etc.

(2)

In the middle row of boxes the program shows suggested alignments.
The user may manually change the alignments by clicking the elements
shown in the boxes. E.g, if a suggested 1-1 alignment is shown
a click on one of the elements will change the suggestion to
a 1-0 alignment + a 0-1 alignment. After clicking the user will see
the colour of one of the elements has changed.

The program has always had this functionality, but there was always
the danger a user would click and change alignments by accident.
As a security measure the new version of the program has
an "are you sure" message.

(3)

In earlier versions the last line of an alignable element sometimes
would not show. This was because of errors in the way the height
(the number of lines) needed to display the text was calculated.
This error has been corrected.

If the error still occurs, try to click the problematic element.
If that makes the element display correctly what you've seen
is an instance of a new, but less serious problem,
with a component not always refreshing properly.
The program author will try to look into this new problem later.

If the element does not display properly even after being clicked
please report the error to the author. There might be one piece of
final fine-tuning to do to get the last line always to display.

(4)

The program cannot align elements that don't contain any words.
Previous versions would stall when trying to align such elements.
The current version will in most cases issue an error message.

(5)

An error occurring when skipping already aligned elements
has been corrected.

(6)

A blunder in the Dice "phrase" matching introduced in
the previous version of the program has been corrected.

(7)

The current version of the program has a Windows look and feel
because the author doesn't like the default java metal look
and open/save dialogs.

(8)

The current version
- allows the user to save settings to a file from the Settings dialog
- allows the user to load settings from a file from the Settings dialog
- always looks for a file called tca2.cfg in the folder the program is run from,
  and if present loads settings from that file
- can load settings from file by using a -cfg command line option, i.e,

  java -jar alignment.jar -cfg=settingsfilename
  
  or
  
  alignment.jar -cfg=settingsfilename
  
Setting files are (utf-8) text files, and can be prepared
and maintained manually, but it is recommended to do it
from the Settings dialog.

(9)

The current version will display an error message if a text
contains an empty alignable element, e.g, a sentence without any words.
In earlier versions the program would halt and malfunction.

(Note: The test for empty elements might not be properly implemented
to cover all cases.)

(10)

The default values for the weights of the match methods
have been changed - for no scientific reason - to
- anchor words       - 1.0
- anchor phrases     - 1.6
- proper names       - 1.3
- dice words         - 1.3
- dice phrases       - 1.6
- numbers            - 1.3
- scoring characters - 1.3

(11)

Help | Help still doesn't give any help about the program
but now at least reports the program version, making support easier.

- - - - - - - - -

2006-04-18

(1)

The "Clear all data" button now works.
Its purpose is to clear the program of the current texts
making room for a new pair of texts to align.

The button can also be used if a wrong text has been read in.

(2) Logging

Logging has been changed. In the latest versions logging was always on, 
and the same file was re-used for the log, with the old log 
being overwritten each time the program was started anew. 
In this new version logging can be turned on and off at any time 
with the "Start logging"/"Stop logging" button. 
At startup logging is off. 
The first time logging is turned on the user is presented 
with a sensible suggestion for a log file name, 
derived from the names of the input files, 
and she gets a chance to change the suggested name and folder.

(3) Dice "phrase" matching

Dice matching has been extended in an attempt to cover compound words 
that are written as separate words in the other language. 
Earlier the Dice method compared all the words in one text 
with all the words in the other text. 
Now the words in one text are also compared 
with _pairs_ of consequtive words in the other text. 
The comparison of two consequtive words A and B in one text 
with a word C (a possible compound) in the other text, 
is done as two separate comparisons: Each of A and B must match C. 
But while normal one-to-one Dice matching looks at 
the number of matching character pairs in relation to 
the lengths of _both_ the words compared, 
in Dice phrase matching only the length of A or B is considered, 
and not C, since A and B are expected to match a _part_ of C 
and not the _whole_ of C.

As in normal one-to-one Dice matching all words 
must have a certain minimum length (settable from the Settings dialog) 
to be considered at all, so A and B will never match 
when either A or B is shorter than that minimum length.

Note one source of error in the Dice phrase matching method: 
While looking for matching character pairs 
both A and B are compared with the whole of C, 
in some cases yielding false matches. 
A better implementation might try to find the best way of 
dividing up C in substrings X Y with A matching X and B matching Y, or vice versa.

(4) New word-based matching method: number

A new word-based matching method has been added, 
where the program looks for words that are numbers - 
more precisely integers written purely with digits 
(e.g, "1", "2006", but not "3.14", "10,000", "2MB", "two"). 
Negative numbers (e.g, "-2") will also be matched, 
and be considered to be different from their unsigned versions ("2"),
unless "-" is defined as a special character to be stripped from words, 
in which case the program will not "see" the sign 
(e.g, "-2" will be seen as "2").

(5) Weighted match methods

The different kinds of match, e.g, an anchor word match versus a Dice match, 
now have weights that can be set from the Settings dialog. 
The weights should be set with values that reflect 
the confidence one has in each match method. 
E.g, if proper name matches in general are assumed to 
be correct somewhat more often than anchor word matches, 
one could set their weights to e.g 1.3 and 1 respectively.

Earlier versions of the program had no weighting, 
i.e, the weights were fixed at 1.

If several methods agree that two words are related, 
and the methods have different weights, the higher weight wins.

As in earlier versions - after all the weighted methods 
(anchor word, Dice, proper name, number, and special characters) 
have had their say the total match score might be increased (by 1 or 2), 
or decreased (reduced to 1/3), according to how well the sentence lengths match. 
This reward or punishment cannot be weighted. 
Bear that in mind when setting the weights for the weightable methods. 
The weights should probably be sort of centered around the value 1, like,
to be compatible with the length adjustment. :-)

Note that there is a special weight for a Dice phrase match, 
so a Dice phrase match can be set to a different 
(presumably higher) weight than a normal one-to-one Dice match. 
Likewise for an anchor "phrase" match, 
i.e, a match where one or both anchor words is a phrase. 
(Remember - an anchor word list can contain phrases as well as single words.)

(6) Word-based match methods overhauled; better common score calculated

The word-based match methods (anchor word, Dice, proper name, number) 
have been implemented in a more consistent manner. 
Also their results are pooled before a common score is reached, 
while earlier their individual scores simply were 
added together to make the common score.

As earlier both the individual and common scores 
can be viewed in the bottom middle box in the interface, 
along with other match information.

The main idea behind scoring for word-based method are "clusters". 
A cluster is a set of words (or phrases) the method think are related,
along with the relations between the words. 
E.g, for the anchor word method, one or more anchor word entries 
might more or less agree that two words in one text 
are related to three words in the other text. 
Each such cluster gets a score, and the scores of the individual clusters 
are added together to make the total score for the current method.

Different clusters can have different scores. 
Again consider the anchor word method.

First each cluster gets a basic score, equal to one anchor word match weight. 
If the cluster contains not only single anchor _words_ 
but one or more anchor _phrases_, the basic score is instead set to 
one anchor _phrase_ match weight (assuming that phrases have 
higher weight than single words).

Next and last there is some extra score for cluster _size_. 
The cluster in the example is two by three words in size. 
The larger dimension (three) is disregarded, yielding a size of two. 
A size of two makes the cluster one word larger than a minimum cluster. 
That extra word increases the score, not by a full weight 
but a certain percentage of the weight. 
The percentage is settable from the Settings dialog.

The common score is derived by looking at _all_ the word-based 
methods simultaneously. Then some of the clusters might merge, 
and the common score will be lower than the sum of the individual scores. 
To take a simple example: Several methods might agree that 
one particular word in one text is related to 
one particular word in the other text, 
and the two words not related to any other words. 
Each method regards the the two words as a one-by-one cluster. 
When looking at all the methods at once the clusters 
merge into one single one-by-one cluster. 
The methods might disagree about the weight assigned to the relation, 
but the method with the highest weight wins.

In earlier versions the common score for the word-based methods 
was calculated by adding the scores of the individual methods. 
E.g, the Dice and proper name methods would agree that 
"Bergen" and "Bergen" are the same word, assigning a score of 1 + 1 = 2. 
When comparing "Oslo" and "Oslo", however, the score would be only 1, 
because "Oslo" fell below the minimum length required by the Dice method 
(unless set to a lower value by the user).

(7) The order of anchor word matches listed

As earlier match information is shown in the bottom middle box 
in the interface, but the order of anchor word matches is changed. 
In earlier versions the matches were listed with one line 
per anchor word entry, in the same order as the entries 
occurred in the anchor word file. Now matches are listed in _cluster_ order, 
which sometimes breaks up the old order, 
namely when several anchor word entries agree on the same words. 
The clusters as such are not shown, however (but perhaps they should).

- - - - - - - - -

2006-03-31

(1) 'Newline' format - ancestor information

Some users might want more than just alignable elements
in their 'newline' format output files,
i.e, some information about the elements' "ancestry",
e.g, which paragraphs and divisions
the aligned sentences belong to.

The output from early versions of the program
contained alignable elements and nothing more.

With some later versions of the program
the output also contained information about all the elements'
ancestor elements, with the start tags
for all ancestors prepended to the element.
(This change to the 'newline' output format
might not have been communicated to all users, though.)

The current version of the program
offers the user various options:
- No ancestor information (default)
- All ancestor information
- Deny certain elements and attributes
- Allow only certain elements and attributes.

(2) New output format - "external"

A new output format - "external" - is introduced.
The output consist of _one_ file in utf-8 xml format,
containing no text and alignable elements,
only references or pointers into the input files.
Each alignment is represented as a <link> element
with a certain attribute listing
the id's of the aligned elements.

Some details about this format are not settled yet.

The new format's output file is saved
after the 'corresp' format files
and before the 'newline' format files.

(3) Output file names and extensions

When the user saves the alignment result
the program suggests names and extensions
for the output files.
In earlier versions of the program
the user could change settings that influenced
the suggested names and extensions.
This was never properly implemented,
e.g, resulting in 'newline' format files
being suggested with an xml extension.
In the current program version the suggestions
are more or less hardcoded, with a more suitable
'txt' extension for the 'newline' files.

(4) Bugs in manual alignment operations corrected

In the interface there is a "less" button
(= button with 'arrow down' symbol) for each text.
With these buttons the user can manually drop elements
from the suggested alignment(s),
returning the elements to the 'unaligned' area
at the bottom of the interface.

The programming for these buttons had two flaws
that have been corrected - flaws that might cause
internal confusion and loss of elements.
The first flaw could show itself when the button
was pressed one time too many,
i.e, after there were no more elements to drop.
The second one might occur when the sole element
of a 1-0 or 0-1 alignment was dropped.

- - - - - - - - -

2006-03-24

(1)

The program has been extended with an "Automatic" mode,
and the limit (see below) applies to the new "Automatic" mode
as well as the "Skip 1-1 mode".

Here is a full explanation of the program's modes:
+-----------------+-------------------------------------------+
|  Mode           | What happens when the user                |
|                 | presses the Suggest button                |
+-----------------+-------------------------------------------+
| "One at a time" | The program suggests one alignment        |
|                 | and waits for user feedback               |
+-----------------+-------------------------------------------+
| "Skip 1-1"      | The program does alignments automatically |
|                 | and doesn't wait for user feedback        |
|                 | until a non-1-1 alignment is reached,     |
|                 | or until the limit (see below) is reached |
+-----------------+-------------------------------------------+
| "Automatic"     | The program does alignments automatically |
|                 | and doesn't stop for user feedback        |
|                 | unless the limit (see below) is reached   |
+-----------------+-------------------------------------------+
The user can set a maximum limit for the number of automatic alignments
the program should perform before waiting for user feedback.
The limit is relevant for the "Skip 1-1" and "Automatic" modes.

(2)

While the program runs automatically ("Skip 1-1" and "Automatic" modes)
the user can see the aligned elements disappear from the "unaligned"
boxes at the bottom of the interface
and appear in the "aligned" boxes at the top.

(3)

In the interface alignable elements show in grey or coloured cells.
The text of the elements might wrap, and the cells should have heights
accommodating an integer number of lines.
A bug that sometimes caused cells to come out with a fractional height
has been corrected.

(Regarding the cells' height there is a flaw that has _not_ been corrected.
The calculation of the cells' height might miss,
causing cells to have one line too many, or in rare circumstances,
one line too few, hiding the final part of the element's text.)

(4)

There is a new "Clear all data" button,
but its functionality has not been implemented.
Its purpose is to clear the program of the current texts
making room for a new pair of texts to align.

- - - - - - - - -

2006-02-23

The program accepts input files with any encoding, e.g, UTF-8,
and produces output files with the same encoding as the input files.
The encoding must of course be specified in the XML header of the input files.

The anchor file must be UTF-8. Other encodings will not work.
(The anchor file is still a text file, not an XML file, and contains no heading,
so there is no easy and safe way for the program to decide which encoding is used.)

- - - - - - - - -

2006-02-24

The program produces a log containing information about each alignment,
consisting of the aligned elements, and information about how well the elements match,
the latter taken from the middle bottom box of the interface.

For each element in the 'newline' format output files
a chain of parent element start tags are shown as well.

- - - - - - - - -

2006-02-28

An error in the settings dialog corrected.
The error concerned the parameters
"Relevant elements" and "Relevant ancestors of relevant elements"
(top boxes of the settings dialog).
If the user deleted tags the error caused the tags to reappear.