This directory contains the files relevant to the smaoahpa application.
- src: the source files with the lexicon smaX (i.e., smanob, smaswe, smaeng)
- Xsma: the reverted files from smaX to Xsma
Caveat: the reverted files are already frozen, i.e.,
they are ready to be extended with synonyms and the like.
The exception is the swesma dir: at the moment it is not worth
reverting these files because there are too few real translations for swe in the smaswe files.
The following is a summary of the CLT meeting:
1. Topic: nobsme handling of MWEs and stat="pref"
1.1 extract all sma-MWEs into a separate file;
- done
1.2 add an ID, as in smenob, to entries that would otherwise be duplicates with respect to lemma string and pos string;
Ex.: entry for "ungen"
- done
1.3 delete entries that got stat="pref" based only on MWE entries;
- done
1.4 according to the latest specifications by Lene, don't merge nob entries with stat="pref":
1.4.1 add the dispreferred sma-translation to each created
entry with stat="pref"
- done
1.4.2 for entries with the same nob lemma, add preferred sma-translations to each other as
acceptable answers
- done
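Step 1.4.2 can be sketched roughly as follows. This is only an illustration, not the script that was actually used: entries are modelled as plain dicts with the hypothetical keys "lemma", "pref" and "accepted", and the assignment of preferred translations in the sample data is made up for the example.

```python
from collections import defaultdict


def cross_add_preferred(entries):
    """For entries sharing a nob lemma, add each other's preferred
    sma translation as an acceptable answer (step 1.4.2)."""
    # Group the entries by their nob lemma string.
    by_lemma = defaultdict(list)
    for entry in entries:
        by_lemma[entry["lemma"]].append(entry)

    # Within each group, every entry receives the preferred
    # translations of all its parallel entries.
    for group in by_lemma.values():
        for entry in group:
            entry["accepted"] = [
                other["pref"]
                for other in group
                if other is not entry and other["pref"] != entry["pref"]
            ]
    return entries


# Illustrative data only; "lemma", "pref", "accepted" are assumed names.
entries = [
    {"lemma": "dårlig", "pref": "geerve"},
    {"lemma": "dårlig", "pref": "madtan"},
    {"lemma": "dårlig", "pref": "nåekies"},
    {"lemma": "rovdyrfritt", "pref": "aales"},
]
cross_add_preferred(entries)
```

An entry whose lemma occurs only once simply ends up with an empty list of additional acceptable answers.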
@cip: From my point of view, the merging process of the reverted
nobsma data is now finished.
Ex. 1 (only one entry with this lemma in the whole file)
<== 1. every entry has an ID 2. stat=pref flag from the smanob-data in the nob-entry
rovdyrfritt
<== structure simplification: no apps/apps, just sources
<== semantics element on the mg-level, NOT on the tg-level anymore
<== 'tg' means 'target language group' and is flagged with the language flag
aales <== because of the new meaning of 'tg' there is no need for lang-flag on the t-level; default flag for stat="pref" that can be changed manually as needed
Ex. 2 (several entries with the same lemma string): A good example of this is "dårlig"!
1. a number in initial position of the ID means that there is more
than one entry with the same lemma string in the file
2. in addition to the t-element from the reverted entry, there are
t-values of the parallel entries with the same nob lemma string
as acceptable answers for the LEKSA game; each of them carries
the information on semantic class and book, and a flag nob-stat meaning "I
am a default t in a parallel nob entry"
mådtan
as well as all t-element values that don't have a stat="pref"
flag in the smanob files, i.e., only sem-cl and book information.
nåekies
dårlig
geerve
madtan
mådtan
nåekies
dårlig
madtan
geerve
mådtan
nåekies
dårlig
nåekies
geerve
madtan
mådtan
============
VERY IMPORTANT:
============
Due to the changed format, the following places have to be adapted
accordingly:
1. for the work with XMLmind: dtd and css file
2. for the db feeding: Ryan's Python scripts
=============================================
Observations on feature merging:
O-1: books that come from different types of entries (pref
vs. non-pref) have to be marked as such
O-2: sma translations that stem from different types of entries (pref
vs. non-pref) have to be marked as such wrt. semantics
because these features will be merged
Test BEFORE unification of mg in nobsma:
data_sma>grep -h '
307
77
19
5
2
1
b. Difficult automatic processing (the rest):
b.1:
- When to unify two or more mgs stemming from a nob-translation with
stat="pref"?
- Only when their sem-classes overlap totally, or also when they
overlap partially?
- What if their sem-classes are totally different? Shall they
get separate entries with different IDs, as with the sme-oahpa data,
or not?
186
19
4
1
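The three cases asked about in b.1 (total, partial, or no overlap of sem-classes) can be told apart with plain set operations. The sketch below is only a way to make the question concrete: it assumes that "total overlap" means identical sets of sem-classes, and the class names are hypothetical placeholders, not taken from the data.

```python
def sem_overlap(classes_a, classes_b):
    """Classify the overlap between the sem-classes of two mgs.

    Assumption: "total" means both mgs carry exactly the same set of
    sem-classes; "partial" means they share at least one class.
    """
    a, b = set(classes_a), set(classes_b)
    if a == b:
        return "total"
    if a & b:
        return "partial"
    return "disjoint"


# Hypothetical sem-class names, for illustration only.
case_total = sem_overlap(["FOOD"], ["FOOD"])
case_partial = sem_overlap(["FOOD", "BODY"], ["FOOD"])
case_disjoint = sem_overlap(["FOOD"], ["BODY"])
```

Whether "partial" should lead to unification, and whether "disjoint" mgs should get separate entries with their own IDs, is exactly the decision left open above.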
b.2: same questions as in b.1, with the additional question of
which mg from the preferred ones shall get which translation
from the dispreferred ones?
54
21
10
6
6
2
2
2
1
1
1
1
1
Another question is about the interplay between the scope of semantic
classes and that of the books after reverting the smanob to nobsma.
New statistics after cleaning up the entries that are relevant only to morfa,
marked in the semantic class with an initial "m":
data_sma>grep -h '
292
186
77
53
19
19
19
10
6
5
5
4
2
2
2
2
1
1
1
1
1
1
1
2. Topic: level simplification in the dictionary from 3 to 2 levels in the meaning groups
2.1 structurally there are still three levels:
- mg: meaning groups
- tg: target language group
THIS is the difference: this group does NOT denote a slight difference
in translation wrt. some shade of meaning; it only groups
translations by target language.
Ex. from the original Cip's dream files:
láibi
brød
fladbrød
leipä
bread
vs. not grouped based on target language
láibi
brød
fladbrød
leipä
bread
The CLT-group voted unanimously FOR Cip's dream solution!
A small note wrt. this solution: all sme mgs in the smaX files will now be
part of the mgs containing nob and swe, which is very much in the spirit of
Cip's dream.
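The idea of a tg as a pure grouping by target language can be sketched like this. It is a minimal illustration, not the conversion script: translations are modelled as (language, text) pairs, and the language codes attached to the words from the láibi example are assumptions, since the language flags are not shown in this file.

```python
from collections import defaultdict


def group_by_target_language(translations):
    """Collect the translations of one mg into one group per target
    language, keeping the original order inside each group."""
    groups = defaultdict(list)
    for lang, text in translations:
        groups[lang].append(text)
    return dict(groups)


# Translations of "láibi"; the language codes are assumed.
mg = [
    ("nob", "brød"),
    ("nob", "fladbrød"),
    ("fin", "leipä"),
    ("eng", "bread"),
]
tgs = group_by_target_language(mg)
```

Each key of the result corresponds to one tg, and each value to the t-elements inside it.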
2.2 tasks:
2.2.1 unify meaning groups that have been separated ONLY
because of sme-language: this can be done ONLY if there
is a parallelism of sme- vs. non-sme-mgs
2.2.2 split (old) tg into different groups if the
semantics are different: this is possible ONLY if
there are semantic groups with ANY tg in the mg
2.2.3 group (old) tg to the same meaning group if the
semantics are the same: this is possible ONLY if
there are semantic groups with ANY tg in the mg
(see the pre-tests below)
================
Starting testing for level simplification (it is not that simple):
- excluding files: names.xml, propPl_smanob.xml
Test 1: checking the content of each e-element:
Question: How many mg-elements are there? Should they be unified
(because of lang feature sme) or left as they are (because they
represent genuinely different meanings)?
sma>grep -h '
793
181
30
18
7
5
4
2
1
1
As agreed upon with Lene, we ignore the sme-mg for this task.
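A count like the one in Test 1 can also be reproduced in Python. The sketch below is an assumption-laden illustration, not the command that produced the numbers above: it assumes a root element named r, that each tg carries its language in xml:lang, and that an "sme-mg" is an mg whose tgs are all flagged sme. The sample XML is invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Key under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample; the real files are not reproduced here.
SAMPLE = """<r>
  <e><mg><tg xml:lang="nob"><t>a</t></tg></mg></e>
  <e><mg><tg xml:lang="nob"><t>b</t></tg></mg>
     <mg><tg xml:lang="sme"><t>c</t></tg></mg></e>
  <e><mg><tg xml:lang="nob"><t>d</t></tg></mg>
     <mg><tg xml:lang="swe"><t>e</t></tg></mg></e>
</r>"""


def is_sme_only(mg):
    """True if every tg of this mg is flagged as sme (assumed criterion)."""
    langs = {tg.get(XML_LANG) for tg in mg.findall("tg")}
    return langs == {"sme"}


def mg_histogram(root, skip_sme=True):
    """Map 'number of mgs per e-element' to 'number of such e-elements'."""
    hist = Counter()
    for e in root.iter("e"):
        mgs = [mg for mg in e.findall("mg")
               if not (skip_sme and is_sme_only(mg))]
        hist[len(mgs)] += 1
    return hist


hist = mg_histogram(ET.fromstring(SAMPLE))
hist_all = mg_histogram(ET.fromstring(SAMPLE), skip_sme=False)
```

With skip_sme=True the sme-mgs are left out of the count, in line with what was agreed with Lene.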
Test 2: checking the content of each mg:
sma>grep -h '
202
90
32
6
1
The data seems to be ready for an automatic restructuring.