Infra Upgrade and GitHub
The present GiellaLT infrastructure has been serving us well in terms of
Another aspect is that the present version control system, with a self-hosted
This document is a first overview of things to work on, and list possible
Directory structure
The present structure is way too deep and hard to navigate. It also hard-codes
- merge src/morphology/ and src/phonology to a new dir fst/, and
reorganise the whole src/ dir to reflect technology rather than linguistic
groups - the linguistic split-up never was completely consistent and logical
- flatten the tools/ directory tree, so that it has only (or mostly) one
subdir level
- move all test dirs to being subdirs of the actual source code they are testing,
e.g. test/tools/spellcheckers/ should be moved to tools/spellcheckers/test/
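The test dir move in the last point is a mechanical path rewrite. A minimal sketch, assuming every test directory sits under a top-level test/ prefix that mirrors the source tree; the function name new_test_path is hypothetical and not part of the present infrastructure:

```shell
# Hypothetical helper: map an old test directory to its proposed location.
# Assumes the old path starts with "test/" and mirrors the source tree.
new_test_path() {
    old=${1%/}                  # drop a trailing slash, if any
    case $old in
        test/*) printf '%s/test/\n' "${old#test/}" ;;
        *)      printf '%s\n' "$1" ;;   # not under test/: leave unchanged
    esac
}

new_test_path test/tools/spellcheckers/   # prints tools/spellcheckers/test/
```

The actual move should be done with the version control system's own move command, so that file history follows the files.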
Isolate dependencies and common build elements
Each language source code has a number of dependencies, both on other parts of
Specifically, the following changes are ideas to follow up:
- remove the am-shared/ dir, and instead have a common/shared am-shared/
dir in giella-shared/ (or core?) and sym-link to it or check it out as a
submodule or similar within a .deps/ directory, done by the autogen.sh
script; a requirement of the solution is that it must still be compatible
with the various build systems we need to support
- do the same for the various giella-shared/ dirs
- do the same for giella-core/
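One way the autogen.sh step could look is sketched below, using the sym-link variant. It assumes that giella-core and giella-shared are already checked out somewhere on disk and that their absolute paths are passed in as arguments; the function name link_deps and the argument order are hypothetical. A submodule-based variant would replace the links with checkouts.

```shell
# Hypothetical autogen.sh fragment: expose shared build files under .deps/.
# $1 = language repo root
# $2 = absolute path to giella-core
# $3 = absolute path to giella-shared
link_deps() {
    repo=$1 core=$2 shared=$3
    for dir in "$core" "$shared"; do
        if [ ! -d "$dir" ]; then
            echo "missing dependency: $dir" >&2
            return 1
        fi
    done
    mkdir -p "$repo/.deps" || return 1
    # -n replaces an existing link instead of descending into its target
    ln -sfn "$core"   "$repo/.deps/giella-core"   || return 1
    ln -sfn "$shared" "$repo/.deps/giella-shared" || return 1
}
```

The build files would then refer only to paths below .deps/, so that the language repo does not need to know where the shared code actually lives.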
The end result of the points above should be that it should be enough to check
These changes should also remove the need to merge changes in am-shared/,
GitHub move concerns
When we move to GitHub, and thus git, there are a number of maintenance
Benefits of the move:
- vastly increased code visibility and accessibility
- easier cooperation and contribution
- automated user management/account creation
- easy CI and CD
To avoid the cost of a huge and slow, single repo for all languages, each
Procedure for moving
See https://github.com/subethaedit/UniversalDetector. Follow that recipe for
    git remote add origin git@github.com:my-user/new-repo.git
    git push origin -u master
After that we should have a set of language repos in GitHub, with the full
Update data across all repositories
One of the reasons we have been able to scale well in terms of languages (with
One promising tool to handle such chores is
GitHub multiple repo admin
By moving to GitHub, with one language = one repo, there is also a need to
- perform actions on repos whose names match a given regex (e.g. all repos
matching keyboard-*, or all repos matching *-sm?, etc)
- rename all/multiple repos
- set default branch for all/multiple repos
- add all/multiple repos to a team
- set access restrictions for all/multiple repos
- add/update/remove git submodules, including revision for said submodule
- get list of all repos matching a certain pattern, potentially with repo URL
or documentation site URL, to include in overview documentation or reports
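Most of the points above start with selecting repos by name. The example patterns in the list (keyboard-*, *-sm?) behave as shell glob patterns rather than regular expressions, so the minimal sketch below uses glob matching. The function name match_repos and the repo names are made up for illustration.

```shell
# Hypothetical helper: print the repo names that match a shell glob pattern.
# $1 = pattern, remaining arguments = repo names
match_repos() {
    pattern=$1
    shift
    for name in "$@"; do
        case $name in
            $pattern) printf '%s\n' "$name" ;;   # unquoted so the glob is honoured
        esac
    done
}

# Example with made-up repo names; prints keyboard-sme and keyboard-smj.
match_repos 'keyboard-*' keyboard-sme lang-sme keyboard-smj
```

The list of names would come from the GitHub API or a command-line client. That part is left out here because it depends on the tooling we end up choosing.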
There will certainly be more, but this at least gives a first impression of the
Private repos
We have a couple of languages which are closed-source and private for various
These are concerns we need to discuss thoroughly before we decide to switch to
After further investigation, it looks like UiT could apply for the
Teams and nested teams
We should consider whether
- Tromsø people only
- all committers
- Edmonton people only
- project people only
- etc
The only trouble is: one can still not force an email upon everyone -
Subversion compatibility
GitHub allows checkout via svn, allowing people most familiar with svn to
We also need to consider how to best ensure that people are using the correct
A third concern is that we need to ensure that the main development is done
Multiple languages in one go
The present infra groups languages in different subdirs:
- langs/ - the default/production languages
- startup-langs/ - as the name implies
- experiment-langs/ - as the name implies
- private*-langs/ - closed-source or otherwise non-public languages
In various settings it would be beneficial to be able to continue to define sets
Possible use cases for such groupings:
- defining a group of production languages = all within should be pushed to
a páhkat release repo
- language groups for projects or teams
- defining something as startup or experimental is a clear sign to
outsiders that this is not yet ready for consumption
- the core UiT people would like to check out all languages, and possibly build
them all locally
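As one possible way to keep such groupings once every language is its own repo, the sketch below uses a plain-text manifest that can be queried in both directions. The manifest format (one "group repo" pair per line), the function names and the file name in the usage line are all assumptions, not an existing convention.

```shell
# Hypothetical manifest format: one "group repo" pair per line, e.g.
#   production   lang-aaa
#   experimental lang-bbb

# repos_in_group FILE GROUP -> print all repos that belong to GROUP
repos_in_group() {
    awk -v g="$2" '$1 == g { print $2 }' "$1"
}

# groups_of_repo FILE REPO -> print all groups that REPO belongs to
groups_of_repo() {
    awk -v r="$2" '$2 == r { print $1 }' "$1"
}
```

Checking out all production languages would then be a loop over the output of repos_in_group groups.txt production, and a repo can appear in several groups at once.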
There might be better ways of achieving these goals in the Git(Hub) universe,
Also, the grouping should be bidirectional, such that a given language repo
Name collisions
Because languages are grouped in several subdirs, we have a couple of cases
In both cases one of the two descriptions is in experiment-langs/, so one
This is also a reminder that we need to update the script for setting up a new
- it must support the new git repo structure
- it should accept a full BCP-47 tag, not only ISO 639-3 language codes
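For the second point, the script needs some check of its language argument. The sketch below is a deliberately simplified test, not a full RFC 5646 parser: it accepts a language subtag followed by optional script, region and variant subtags, and nothing else. The function name is_langtag is hypothetical.

```shell
# Simplified check, NOT a full RFC 5646 parser: accepts
# language[-Script][-REGION][-variant...] and rejects everything else
# (extensions, private-use and grandfathered tags are not handled).
is_langtag() {
    printf '%s\n' "$1" | grep -Eq \
      '^[A-Za-z]{2,3}(-[A-Za-z]{4})?(-([A-Za-z]{2}|[0-9]{3}))?(-([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*$'
}

# Usage: a plain ISO 639-3 code still passes, as does a longer tag.
is_langtag sme        && echo "sme: ok"
is_langtag sr-Cyrl-RS && echo "sr-Cyrl-RS: ok"
```

A production version would rather validate against the IANA language subtag registry, since a purely syntactic check also accepts well-formed but unregistered tags.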
The second point could be fixed while we are still in svn, but the first one
Bugzilla issues
Open (or all?) issues for various languages in the

