<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://www.w3.org/2005/Atom">
<title>ART Project</title>
<link href="http://hdl.handle.net/2160/1973" rel="alternate"/>
<subtitle/>
<id>http://hdl.handle.net/2160/1973</id>
<updated>2013-05-18T16:56:47Z</updated>
<dc:date>2013-05-18T16:56:47Z</dc:date>
<entry>
<title>The ART Corpus</title>
<link href="http://hdl.handle.net/2160/1979" rel="alternate"/>
<author>
<name>Liakata, Maria</name>
</author>
<author>
<name>Soldatova, Larisa</name>
</author>
<id>http://hdl.handle.net/2160/1979</id>
<updated>2009-04-30T15:30:25Z</updated>
<published>2009-04-29T10:25:23Z</published>
<summary type="text">The ART Corpus
Liakata, Maria; Soldatova, Larisa
The ART corpus consists of 225 papers manually annotated with CISP labels (e.g. "Goal", "Method", "Result"). It contains &gt;1 million words in 35,040 sentences. These papers cover topics in physical chemistry and biochemistry and were provided by the Royal Society of Chemistry (RSC) Publishing.&#13;
&#13;
The Corpus was developed primarily to add value to scientific papers, through semantic markup that would make it easier for natural language processing and semantic web applications to automatically extract information pertaining to core scientific concepts. The ART corpus can also be used as a training set for machine learning algorithms, in order to automate the annotation of papers with CISP meta-data. &#13;
The corpus is available as a collection of 225 .xml files, where each file corresponds to a separate paper whose sentences have been annotated individually with core scientific concepts.
Within the JISC funded ART project (University of Wales, Aberystwyth http://www.aber.ac.uk/compsci/Research/bio/art/) we developed a tool (SAPIENT) to allow the annotation of scientific papers with core scientific concepts (e.g. 'Goal', 'Hypothesis', 'Experiment', 'Method', 'Result', 'Conclusion', 'Motivation', 'Observation'). These concepts constitute the CISP meta-data and were verified through an on-line survey addressed to researchers. The CISP meta-data were accompanied by a set of guidelines for their implementation as an annotation scheme. We worked with chemistry experts, who used the guidelines and SAPIENT to create a corpus of 225 papers manually annotated with CISP concepts.&#13;
The sustainability of and the benefits obtained from annotating papers with CISP meta-data will be investigated by the JISC funded SAPIENT Automation (SAPIENTA) project.&#13;
Source Data:  The source data consists of text in XML format, encoded in Unicode (UTF-8). The XML schema used is a variant of SciXML, which can be provided upon request. &#13;
The ART Corpus XML differs from SciXML as follows: &#13;
* An &lt;s&gt; element has been added at the same level as the &lt;S&gt;, &lt;EQN&gt; and &lt;EXAMPLE&gt;  elements. The latter elements can occur within a &lt;P&gt; element according to the SciXML schema. This &lt;s&gt; tag covers all kinds of sentences. That is, there is no distinction between sentences in the abstract (denoted as &lt;A-S&gt; in SciXML) and sentences in the main paper (denoted as &lt;S&gt; in SciXML) or sentences within equations (&lt;EQ-S&gt;) and examples (&lt;EX-S&gt;). &#13;
* The &lt;s&gt; element has an id (sid) and can include an &lt;annotationART&gt; element. &#13;
* The &lt;annotationART&gt; element has the attributes "type", "conceptID", "novelty" and "advantage". For more details please refer to the annotation guidelines.&#13;
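
As a rough illustration of the structure described above, the following Python sketch builds a tiny stand-in document and reads the annotations back. The attribute name "sid", the root element name, the placement of the sentence text inside &lt;annotationART&gt;, and all sample values are assumptions made for illustration; they should be checked against the actual corpus files and the annotation guidelines.

```python
import xml.etree.ElementTree as ET

# Build a tiny stand-in paper; the real corpus files are much richer.
# Assumptions: the root is called "PAPER" and the sentence id is stored
# in an attribute named "sid". All values below are hypothetical.
paper = ET.Element("PAPER")
para = ET.SubElement(paper, "P")
sentence = ET.SubElement(para, "s", {"sid": "1"})
annotation = ET.SubElement(
    sentence,
    "annotationART",
    {"type": "Goal", "conceptID": "Goal1", "novelty": "None", "advantage": "None"},
)
annotation.text = "We study the adsorption of water on a metal surface."


def read_annotations(root):
    """Return one (sid, type, conceptID) tuple per annotated sentence."""
    rows = []
    for s in root.iter("s"):
        ann = s.find("annotationART")
        if ann is None:
            continue  # the annotation element is optional
        rows.append((s.get("sid"), ann.get("type"), ann.get("conceptID")))
    return rows


rows = read_annotations(paper)
print(rows)
```

For a real corpus file, the same function could be applied to ET.parse(path).getroot() instead of the stand-in tree.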
&#13;
Annotation:  The goal of the annotation was to mark up core scientific concepts in research papers. Papers from the domains of chemistry and biochemistry were chosen as a proof of principle. Annotation was performed by 20 chemistry experts at PhD or postdoctoral level with excellent knowledge of English. The selected annotators were given an annotation package consisting of a set of guidelines[3] for annotating papers with CISP, the SAPIENT system[4] and its manual, as well as an example paper which had already been annotated. Most of this material is available for download from: http://www.aber.ac.uk/compsci/Research/bio/art/sapient. &#13;
The annotation guidelines are available upon request. &#13;
&#13;
Work with annotators was conducted in three phases over a period of six months.&#13;
In phase I (training phase), all 20 annotators were sent the same four papers to annotate using SAPIENT and the annotation guidelines, in order to familiarise themselves with the process. Individual annotators' results were analysed closely at this stage and were used to improve the guidelines. &#13;
In phase II (evaluation phase), the aim was to evaluate both the annotators and the guidelines. A preliminary evaluation of the experts' agreement was conducted on a sample of 41 papers (5,000 sentences), which were annotated by 16 experts divided into non-overlapping groups of 3 experts. The results show significant agreement between annotators, given the difficulty of the task (an average kappa coefficient of 0.55 per group). &#13;
The 9 experts from phase II who had the highest average inter-annotator agreement were selected for phase III. The latter constitutes the actual creation of the ART Corpus, through the annotation of 225 papers.&#13;
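
To make the kappa figure reported for phase II more concrete, here is a minimal Python sketch of Cohen's kappa for two annotators. This description does not state which kappa variant was used or how scores were averaged within a group of 3 experts, so the choice of Cohen's kappa and the sample labels below are assumptions for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sentences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of sentences given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four sentences, not taken from the corpus.
ann_1 = ["Goal", "Method", "Result", "Result"]
ann_2 = ["Goal", "Method", "Method", "Result"]
kappa = cohen_kappa(ann_1, ann_2)
print(round(kappa, 3))
```

A group-level score could then be formed by averaging the pairwise values within each group, although that averaging scheme is likewise an assumption.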
&#13;
Distribution: The ART corpus is available as a 2.2 MB tar.gz file which expands to 12 MB. It consists of 225 papers (&gt; 1 million words, 35,040 sentences). The corpus is available as a collection of 225 .xml files, where each file corresponds to a separate paper whose sentences have been annotated individually with core scientific concepts. The papers have been arranged into 9 folders, corresponding to each of the 9 annotators.  These papers can be processed individually, per folder or as a batch by any script for handling XML. &#13;
One can display papers individually by using the SAPIENT software[4], which was used for creating the original annotations. For instructions on how to use SAPIENT to display papers, please refer to SAPIENT_FAQ.txt (both can be downloaded from: http://www.aber.ac.uk/compsci/Research/bio/art/sapient).&#13;
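
Since the papers can be processed per folder or as a batch, a short Python sketch of walking the unpacked corpus may help. It assumes one level of sub-folders directly under the corpus root, each holding .xml files; the folder and file names in the demo are hypothetical placeholders, not the real ones.

```python
import tempfile
from collections import Counter
from pathlib import Path


def papers_per_folder(corpus_root):
    """Count the .xml files directly inside each sub-folder of corpus_root."""
    counts = Counter()
    for xml_file in sorted(Path(corpus_root).glob("*/*.xml")):
        counts[xml_file.parent.name] += 1
    return dict(counts)


# Demo on a throwaway directory; folder and file names are hypothetical.
layout = {"annotator_a": ["p1.xml", "p2.xml"], "annotator_b": ["p3.xml"]}
with tempfile.TemporaryDirectory() as tmp:
    for folder, names in layout.items():
        (Path(tmp) / folder).mkdir()
        for name in names:
            (Path(tmp) / folder / name).write_text("placeholder", encoding="utf-8")
    demo_counts = papers_per_folder(tmp)

print(demo_counts)
```

If the unpacked corpus matches the layout described above, calling papers_per_folder on the ART_Corpus folder should give 9 entries whose counts sum to 225.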
&#13;
For any requests/details regarding the corpus please contact Dr Maria Liakata (mal@aber.ac.uk).&#13;
&#13;
To unpack the ART Corpus&#13;
&#13;
For Linux and Mac users:&#13;
* Download the ART_Corpus.tar.gz file and save it to a local folder (e.g. /Users/myhome/art_corpus)&#13;
* Open a terminal window and navigate to the art_corpus directory (by typing "cd /Users/myhome/art_corpus")&#13;
* Type the command: gunzip ART_Corpus.tar.gz&#13;
* And then: tar -xvf ART_Corpus.tar&#13;
The last two commands can also be replaced by:&#13;
* tar -xzvf ART_Corpus.tar.gz&#13;
This should create the ART_Corpus folder and its subfolders.&#13;
&#13;
For Windows users:&#13;
* Download the ART_Corpus.tar.gz file and save it to a local folder (e.g. C:\art_corpus)&#13;
* Download the open source 7-zip utility available from http://www.7-zip.org/&#13;
* To use 7-zip to unpack the corpus, follow the simple instructions at:&#13;
http://www.simplehelp.net/2007/06/22/how-to-open-rar-arj-gz-tar-and-rpm-files-in-windows/
</summary>
<dc:date>2009-04-29T10:25:23Z</dc:date>
</entry>
<entry>
<title>ART - an ontology based article preparation tool</title>
<link href="http://hdl.handle.net/2160/1976" rel="alternate"/>
<author>
<name>Soldatova, Larisa</name>
</author>
<id>http://hdl.handle.net/2160/1976</id>
<updated>2009-04-28T01:00:15Z</updated>
<published>2007-01-01T00:00:00Z</published>
<summary type="text">ART - an ontology based article preparation tool
Soldatova, Larisa
Ontologies have been proven as a solid theoretical foundation for the development of information systems. They provide consistency and comprehensibility of underlying logic and building blocks. We will use EXPO as a core ontology for developing the ART tool. EXPO is a generic, domain-independent ontology and will ensure applicability of the ART system to various domains. We will also reuse ontological classes appropriate for scientific mark-up from the OBI (Ontology for Biomedical Investigations) [http://obi.sourceforge.net/] and FuGE (Functional Genomics Experiment) [http://fuge.sourceforge.net/index.php] projects. However, existing ontologies still do not cover the semantic representation of scientific articles well enough. The most important missing part is a representation of theoretical methods. We will need to formalize a description of theories in order to semantically represent papers containing theoretical sections.&#13;
The aim of the ART project is to develop a tool, ART, based on EXPO, a generic ontology of experiments, for the semantic representation of scientific articles. The main objectives of the project are:&#13;
* Developing an ontology based format for semantic representation of scientific papers.&#13;
* Translating scientific papers into the proposed format with an explicit semantics.&#13;
* Explicit linking of repository papers to data (where available) and metadata.&#13;
* Creation of an example intelligent digital repository.&#13;
Semantically rich and theoretically sound ontological representation will provide metadata to annotate papers stored in digital repositories. An ontology based tool ART will be developed to automate the process of translating the papers into the proposed semantic format.
Partner Institutions: University of Bath, UKOLN; University College London, Department of Chemistry; Royal Society of Chemistry, RSC Publishing.&#13;
The expected outputs of the project are as follows:&#13;
1. ART - a tool for preparing scientific articles in enriched semantic format.&#13;
2. An example digital repository of articles in enriched semantic format.&#13;
3. A user manual for the ART tool.&#13;
4. A guideline for semantic mark-up of papers.&#13;
5. Report on the minimum information required for representing papers.&#13;
6. A paper and a presentation at a workshop level describing the ART project goals.&#13;
7. A journal paper (BMC Bioinformatics or higher) describing the ART project results and applications.
</summary>
<dc:date>2007-01-01T00:00:00Z</dc:date>
</entry>
</feed>
