28 April 2017

Written by Thibault Clérice [Twitter]

What’s up ?

We are happy to announce the release of CapiTainS Guidelines 2.0.0. This is the result of a 7-month-long run and a major change for myself as the main developer behind the CapiTainS suite. CapiTainS has been an organization pushed by both the Chair of Digital Humanities at the University of Leipzig and Tufts University through both Perseus and Perseids projects.

If you were wondering, texts following the CapiTainS guidelines currently represent the following numbers (as of April 28, 2017) with its main repositories:

Repository Texts Total (Words) Ancient Greek Latin English German French Unknown
https://github.com/OpenGreekAndLatin/First1KGreek 641 18.038.636 17.554.353   244.848 161.465    
https://github.com/OpenGreekAndLatin/csel-dev 278 6.055.446   6.055.446        
https://github.com/PerseusDL/canonical-latinLit 390 8.324.065   5.073.035 2.646.765     604.265
https://github.com/PerseusDL/canonical-greekLit 368 8.294.584 1.770.357 1912 4.062.712 459.555 1912  
https://github.com/nevenjovanovic/cts-croala 467 5.700.000   5.700.000        

Which sums up to a wonderful 46 million token corpus.

Changes and the Future of the organization

Changes for CapiTainS

The CapiTainS organization decided to go wider than just Citable Text Services and it led to a complete rework of the Python software suite that we developed. This decision was driven by the simple observation that CTS does not match everyone’s needs and probably never will. As such, we asked ourselves, “What’s the goal of CapiTainS?” The answer: the goal of CapiTainS is to offer to any one the ability to :

  • serve texts through API and User Interface without having to care about producing the API. This is why, even though you could build your own interface, Nautilus remains an excellent piece of software to serve your text according to good standards,
  • browse texts and collections from a Command Line Interface or local software, reuse text in “local” studies such as Named Entity tagging or corpus linguistic
  • help people to reuse texts produced in other contexts

The new CapiTainS is essentially upgraded to match this. Unfortunately, we did not have the time to improve the Javascript library Sparrow. So this blog post also gives us the opportunity to call for anyone willing to apply the same changes we did to MyCapytain to Sparrow. CapiTainS Sparrow is looking for a developer to help it evolve along with CapiTainS!

Future

CapiTainS will live at a smaller pace in the coming month: for one, I will be moving to a new position but not as an engineer. While I will take my role of maintainer seriously on MyCapytain, Nemo and Nautilus, there won’t be any major releases. This is both a good news and a bad news in the end: we get to have a longer release lifecycle. This also means that the future of CapiTainS as a provider of new software tiles is a bit blurry: the main software producer will most probably be Perseids, and certainly the DH Chair at the University of Leipzig will continue to do some work.

Fortunately, we wrote everything to be sustainable: you have documentation (we tried to write it well), you’ve got unit tests, you even have Puppet modules thanks to Bridget Almas and the Perseids Team! In order to help CapiTainS survive this time, we also decided to make HookTest more decentralized: you can now run HookTest with Travis on Github! There is no more need for a fancy 32-core computer to run these tests!

Then, let’s go down to business : the softwares and guidelines’ update

Guidelines 2.0.0

Thibault Clérice, Matthew Munson, & Bridget Almas. (2017). Capitains/Capitains.github.io: 2.0.0 [Data set]. Zenodo. http://doi.org/10.5281/zenodo.570516

The guidelines have been updated to take into account two changes that we felt we needed in a web 3.0 and a philological world: we added the <structured-metadata> node to every collection-level node (textgroup, work, version). This node allows the user to express a triple whose subject is the collection described! Isn’t it wonderful ?

The second and third nodes are something we had discussed for a long time and which we desperately need for one of OpenGreekAndLatin repositories. So we decided to add: cts:commentary and its child cts:about. While this is not part of the CTS Scheme originally, this allows us to produce, in the context of a CTS URN, a commentary text which is linked to the text it talks about through the cts:about link.

You can checkout the examples in the guidelines .

There is a short future for CapiTainS Guidelines: we hope to release soon enough 2.1.0 guidelines that would provide support for non-cts based collection while following the same scheme. If we can, we’ll try to finish this and implement it in related tools before December 2017. But let’s not bet our life on it, except if you are willing to code it ;)

MyCapytain 2.0.0

Thibault Clérice, Matthew Munson, & Bridget Almas. (2017). Capitains/MyCapytain: 2.0.0 . Zenodo. http://doi.org/10.5281/zenodo.569838

MyCapytain has had a quite thorough rework where the API was cleaned and the Object Oriented Programming became more sensible by reworking dependency. But the most important changes were structural:

  • The whole system of metadata is stored in an RDF Triple Store (in memory or can be stored somewhere else, because it’s based on RDFLib)
  • Where CTS Text and Passage used to be the gods of the classes, they are now merely humans: we reworked the whole system so that more kinds of texts can be treated with the same tool
  • Every object that can be exported now uses the common export system self.export(output) where output should be an item from the MyCapytain.common.constants.Mimetype. This means that you should find really quickly how to export a text to Plain/Text or TEI XML!
  • We added a new kind of class : [Resolvers](http://mycapytain.readthedocs.io/en/latest/MyCapytain.classes.html#resolvers. Resolvers are meant as tool with a really simple api (getMetadata, getTextualNode, getSiblings, getReffs) that will always return the same kind of object. This means that, whether you use a remote CTS API, a local CapiTainS repository (or any other system implemented later in MyCapytain), your code should not change !

Flask Capitains Nemo 1.0.0 and Nautilus 1.0.0

Thibault Clérice, & Bridget Almas. (2017). Capitains/Nautilus: 1.0.0 [Data set]. Zenodo. http://doi.org/10.5281/zenodo.569847

Thibault Clérice, Bridget Almas, & Matthew Munson. (2017). Capitains/flask-capitains-nemo: 1.0.0 [Data set]. Zenodo. http://doi.org/10.5281/zenodo.569863

Both Nemo and Nautilus are mostly upgraded thanks to the MyCapytain upgrade. The app changes themselves are really small but some changes made on MyCapytain were specifically targeting the web use of resources. This means that Nemo and Nautilus are now much, much faster and more stable !

Flask CapiTainS Nemo was in 1.0.0 beta for a long time, and we did not release it before moving to CapiTainS 2.0.0. Flask CapiTainS Nemo 1.0.0 had the following changes :

  • Plugin development. Nemo can now be enhanced through plugins that will be loaded easily
  • End of the CTS-dependency and friendly routes: the new system already allows for different identifiers and a deep collections system. It’s just up to you to develop the resolver!
  • Better base templates (It was about time !)
  • Better (Much better!) Cache handling

Nautilus on its end has been simplified to serve two different purposes : providing Web API services for the MyCapytain resolvers (with an eye set on providing OAI-PMH and Sparql endpoints soon) and providing a caching system for live web services. We can say proudly that this has been achieved!

HookTest and HookUI

Thibault Clérice, Matthew Munson, & Bridget Almas. (2017). Capitains/HookTest: 1.0.0 . Zenodo. http://doi.org/10.5281/zenodo.569841

Thibault Clérice. (2017). Capitains/Hook: 1.0.0 . Zenodo. http://doi.org/10.5281/zenodo.555931

These are the pieces of software that have seen the biggest changes and might significantly change how we perceive corpora production. As of 1.0.0 for both of these, HookTest and HookUI can be installed with really small web instances (heck, even the new http://ci.perseids.org is now run on a free Amazon AWS!). The reason for it? We decentralized!

HookTest can now be run easily with Travis (and the output looks like https://travis-ci.org/PerseusDL/canonical-greekLit/builds/226562703). But most of all, HookTest can now help you build releases that contain only completely compatible texts. Do that, connect your repository on Zenodo and you get an automatic DOI for every version of your corpus! Isn’t it nice?

Hook

Oh, and even better, both these pieces of software are finally tested with a coverage > 90%. The testing suite is tested!

Now ?

Now we are going to take some vacation and try to explain to people why it would be great to collaborate on this suite. We truly believe CapiTainS and its Python suite could fit the need of a lot of projects and we hope that people see that. It might need some customization, and perhaps some better documentation, but we cannot do that without you asking questions (and as we say in French : we have never eaten anybody!). We also think we lack at least one piece of software, which would be a text search engine. We have ideas but not the time right now. Maybe you do?

We hope that this software will be helpful to you. This is a major milestone, maybe the biggest since the first tool released in here. See you around!

Thibault and hopefully all other past and present collaborators : Bridget Almas, Matthew Munson, Stella Dee, Frederik Baumgardt.