corpus_policy

MEETING TO DECIDE A USER ACCESS POLICY FOR LINGUISTIC ANALYSIS OF OUR CORPORA

Agenda

  • target user groups
  • access models
  • policy for each user group, if different

target user groups

  • Users that don't need a unix shell
    • linguists doing research on singleton examples
    • historians and other people interested in content, not in form
  • Users that do need a unix shell
    • linguists doing research on texts as a whole
    • linguists with separate analysis tools
    • language technology developers

We have both free texts and texts which are restricted. The restricted texts must be protected by means of usernames and passwords, and require a contract.

Users that don't need a unix shell

The Oslo interface is good enough for this user category, or it will require only small modifications, e.g. links to documents containing the hits, preferably with the hits highlighted.

Users that need a unix shell

Typically, these users are linguists or language technology developers who bring their own tools, e.g. another disambiguator or a separate morphological analyser, or who more generally need command line access to the whole corpus to achieve what they want. Other scholars may also belong to this group. These users will need access to our corpus machine(s), and will invariably be required to accept our user contract.

Shell access

Users that want shell access to our corpus have to be members of the group bound. These users will have shell access to both the free and the bound corpus on our machine. Users who are not members of this group will have no shell access to any of our corpus files.

There will be two groups with access to /usr/local/share/corp, with the following access rights:

Group    Description                            Intended users
bound    Access to read the bound corpus        External linguists
corpus   Access to alter our orig. catalogues   Project workers (group as today)

External users will get their own user account, belonging to the groups myself and bound, and will be able to install their own tools and programs for corpus processing, analysis, etc. External users will not get access to the orig/ directory.
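The group rules above can be sketched as a small lookup. This is only an illustration of the policy as written in these notes, not an implementation of the actual system: the group names bound and corpus come from the table, while the function name, the area labels ("free", "bound", "orig") and the assumption that the right to alter implies the right to read are ours.

```python
# Hypothetical model of the shell access rules in this memo.
# Group names come from the memo; everything else is illustrative.

AREAS = ("free", "bound", "orig")


def shell_permissions(groups):
    """Map each corpus area to the set of actions the given groups allow."""
    groups = set(groups)
    perms = {area: set() for area in AREAS}

    if "bound" in groups:
        # Members of bound may read both the free and the bound corpus.
        perms["free"].add("read")
        perms["bound"].add("read")

    if "corpus" in groups:
        # Project workers may alter the orig catalogues.
        # Assumption: the right to alter includes the right to read.
        perms["orig"].update({"read", "alter"})

    # Users in neither group get no shell access to any corpus files.
    return perms
```

For an external user, shell_permissions(["bound"]) gives read access to the free and bound areas and nothing in orig, which matches the rule that external users do not get access to the orig/ directory.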

To let the bound group members run analyses, we need to make some minor adjustments. Like other users, they automatically have full access to the Xerox tools, and the compiled fst's are available in /opt/smi/sme/bin/sme-num.fst etc. The Xerox tools and vislcg are available in /opt/Xerox/bin. A couple of tools are missing right now, and need to be added to /opt/ by a cron job.

TODO:

  • make a group bound for our external corpus users, which:
    • gives access to read our bound texts
    • gives access to execute/run the tools in /opt
  • export to /opt (with cron) the tools that project team members find in their cvs tree (the bound users do not have a cvs tree, and therefore need these tools in /opt in order to conduct linguistic analyses).
    • ccat (and some perl scripts?)
    • other tools?
  • make shell script wrappers for the most common commands
  • write documentation for our bound users, with pointers to the ordinary documentation.
  • write user contract
  • write documentation for how to apply for a user account (where's the form, to whom do I send the form, who needs it, etc.)
  • make our own guidelines for the user application processing
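The export step in the list above could look roughly like the following. It is a minimal sketch under stated assumptions, not the script the project will use: the function name, its parameters and the default mode 0o750 are hypothetical, and the caller is expected to supply the source and destination directories.

```python
import os
import shutil


def export_tools(src_dir, dest_dir, tool_names, mode=0o750):
    """Copy the named tools from src_dir to dest_dir and make them executable.

    Returns the list of destination paths that were written.
    Tools missing from src_dir are skipped rather than treated as errors.
    """
    os.makedirs(dest_dir, exist_ok=True)
    exported = []
    for name in tool_names:
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue  # e.g. a tool that has not been built yet
        dest = os.path.join(dest_dir, name)
        shutil.copy2(src, dest)   # copy contents and timestamps
        os.chmod(dest, mode)      # owner rwx, group r-x, others none (assumed mode)
        exported.append(dest)
    return exported
```

A cron entry could run such a script periodically. Setting the group ownership of the exported files, so that members of bound can execute them, would be done separately and is not shown here.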

Web browser access

Users of only the free corpus won't need anything but a browser.

Users of the bound corpus will need a username and password for the Oslo computer (until the base is moved to Tromsø). These usernames and passwords will initially be created and administered by the Oslo people, and later by ourselves.

TODO:

  • discuss with Oslo
  • delay other tasks until we are ready to go public?
  • user management for access to bound texts

Policy for each user group

Future policy for non-shell users

Divide our texts into two parts, also for the graphical interface:

  • The free texts will be available without a password, and will require no contract
  • The bound texts will be available, graphically, with a password, and bound by our contract.
  • All interested parties may download our cvs tree and our open texts (the latter are not automatically updated today)
  • No one may download our bound texts

Future policy for shell users

  • All texts will be available only with a username/password, and bound by our contract.
  • Shell access is provided on gtlab and other linux boxes, and possibly our XServe.
  • Shell users will have read-only access to the corpus files, and access to our tools in /opt/

Future splitting of the cvs group

Altering the CVS group may be a topic for future discussion:

Today, the cvs group has access to alter and read our linguistic source code. In the future, we may split this access into alter OR read, and make it more fine-grained, according to subtree (gt, kt, st, xtdoc), or even according to language.
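One way to picture the finer-grained split is as a list of grants, each limited to a subtree and optionally to a language. The sketch below is purely illustrative and is not a proposal for how CVS itself would enforce this: the Grant class, the allowed function and the assumption that alter implies read are ours. The subtree names come from this memo, and the language codes are only examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    subtree: str                    # e.g. gt, kt, st, xtdoc (names from the memo)
    mode: str                       # "read" or "alter"
    language: Optional[str] = None  # None means all languages in the subtree


def allowed(grants, subtree, action, language=None):
    """Return True if any grant permits the action on the subtree/language."""
    for g in grants:
        if g.subtree != subtree:
            continue
        if g.language is not None and g.language != language:
            continue
        # Assumption: a grant to alter also covers reading.
        if action == "read" or g.mode == "alter":
            return True
    return False
```

For example, a user holding Grant("gt", "alter", "sme") and Grant("xtdoc", "read") could alter the sme part of gt and read xtdoc, but could not alter xtdoc or touch kt.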