NIH MEETING SUMMARY - JUNE 2013

RESOURCE IDENTIFICATION AND TRACKING IN THE NEUROSCIENCE LITERATURE

EXECUTIVE SUMMARY AND ACTION ITEMS

A meeting to discuss the challenge of identifying key research resources used in the course of scientific studies published in the neuroscience literature was held at the National Institute on Drug Abuse (NIDA) in Washington DC on June 26, 2013.  The meeting was organized by the Neuroscience Information Framework (NIF;  http://neuinfo.org) and the International Neuroinformatics Coordinating Facility (INCF:  http://incf.org) with support from NIDA.  Attendees were drawn from different stakeholders, including government representatives, publishers, journal editors, informaticians, curators and commercial resource suppliers.

At the end of the session, almost all attendees indicated their interest in a pilot project to identify antibodies, model organisms and tools in a machine-processable form across neuroscience journals to improve reproducibility and the tracking of resource utilization.  One goal of this project will be to gather data on the best implementation strategy for engaging authors in providing these identifiers and for establishing a scalable process for verifying that the correct identifiers are used.  Another goal will be to provide a demonstration project to the research community that shows the benefits of machine-processable information within papers by making it easier to find research resources.

ACTION ITEMS

1.  Perform pre-pilot project (2 months - Resource Identification Group:  NIF, NITRC, INCF, Monarch, CrossRef, antibodies-online, eagle-i and other interested parties):

  • Form the Resource Identification Group:  The RIG will develop and evaluate the specific technologies and implementation.  Ensuring that other groups who are working in this area are involved will be important for the success of the project.
  • Make sure that the appropriate identifiers are available for all model organisms
  • Establish a single website with an easy to use front end for obtaining identifiers
  • Prepare instructions for authors
  • Perform usability studies with naive users (~25)
  • Present results to workshop consortium

2.  Discuss potential pilot project with publishers (meeting attendees) - 1 month 

  • Get initial commitments from publishers for proposed pilot project:  what journals, what resources
  • Discuss potential implementation per journal

3.  Prepare detailed proposal for publishers (at completion of pre-pilot project) (Resource Identification Group)

  • Include a link to a demonstration site and the results of the usability study
  • Allow flexibility in implementation
  • Launch pilot project at SFN???

4.  Continue to improve the automated pipeline and authoring/curation tools (Resource Identification Group)

  • Contact Biocreative to see if they are interested in hosting a text mining challenge

5.  Seek sponsorship for implementation and promoting the project (all)

  • antibodies-online
  • Mozilla Foundation: Open Science and Science in the Web
  • Society for Neuroscience?
  • CrossRef?

MEETING OVERVIEW

A meeting to discuss the challenge of identifying key research resources used in the course of scientific studies published in the neuroscience literature was held at the National Institute on Drug Abuse (NIDA) in Washington DC on June 26, 2013.  The meeting was organized by the Neuroscience Information Framework (NIF;  http://neuinfo.org) and the International Neuroinformatics Coordinating Facility (INCF:  http://incf.org) with support from NIDA.  Attendees were drawn from different stakeholders, including government representatives, publishers, journal editors, informaticians, curators and commercial resource suppliers.  A list of attendees is included in the appendix. 


MOTIVATION

The goal of the meeting was to come to agreement about a course of action to improve the ability of both humans and automated agents to identify key research resources - defined here as materials, data and digital tools - used in published studies.  Digital tools in this context refer to software programs, services, data sets or databases. The meeting was motivated by the experiences of the Neuroscience Information Framework, a project of the NIH Blueprint consortium tasked with cataloging these types of digital resources for neuroscience, and of other informatics projects, such as the model organism databases, that attempt to identify important research resources like the subject of a study or the reagents used within a published paper. These projects routinely encounter three problems:

1) Insufficient identifying information is included in the paper, such that the exact organism or antibody used cannot be identified.  This requires either accepting a loss of information or curator effort to interact with the author to track down the information.  In either case, this identifying information will not be included in the paper for humans to access.

2)  If the information is in the paper, it is not machine readable, that is, it cannot be parsed and recognized by a computer, either because the information is ambiguous or in a form, e.g., lots of special characters, that is difficult for a computer to handle. 

3)  If a machine readable identifier, e.g., an accession number or a stock number, is used within the paper, it is in a section of the paper, usually the materials and methods, that is behind a paywall, hampering text mining approaches for extracting this information.

The issue of research resource identification thus reflects three critical needs in biomedical science:

1)  The need for better reporting of materials and methods to promote reproducible science.  Proper resource identification is a step towards that goal.

2)  The need for a cultural shift in the way we write and structure papers, recognizing that we will interact with the literature through an automated agent, and so the conventions we adopt should be tailored towards greater machine-processability.

3)  The need for a cultural shift in the way we view the literature:  not only as a source of papers prepared for people to read but as a connected database of data, observations and claims in biomedicine that spans journals, publishers and formats.  To locate and synthesize information from the literature requires universal machine access to key entities within the paper.

Because the current practices for reporting research resources within the literature are inadequate, non-standardized and not optimized for machine-based access, it is currently very difficult to answer even a very basic question about published studies, such as “What studies used resource X?”  These types of questions are of interest to the biomedical community, which relies on the published literature to identify appropriate reagents, troubleshoot experiments and aggregate information about a particular organism or reagent to form hypotheses about mechanism and function.  Such information is also critical to funders, who would like to be able to track the impact of resource funding by generating reports on substantive usage of these resources within the biomedical literature.  It is also very useful for resource providers, both commercial and academic, who want to track the use of their resources.

Based on pilot projects and solutions from other communities, NIF, along with several database curators and informaticians, proposed that:

1)   Key research resources be identified within papers using a unique and persistent identifier, i.e., an accession number for antibodies, animals and tools, that is machine readable.

2)  These resource identifiers need to be available outside of the paywall

3)  The same format for these identifiers should be used across publishers

There is already precedent for these requirements, in that journals require accession numbers for certain types of entities, e.g., gene sequences and protein structures.

Who benefits?

As discussed above, if these practices were adopted, they would clearly benefit funders, who would be able to track the usage of research resources within the literature to measure the impact of funding.  These practices would also clearly benefit database curators, who would spend less time curating results from the literature.  Science as a whole benefits, because proper identification of the tools used to generate research findings is a cornerstone of reproducible science.  Resource providers clearly benefit, as it becomes easier to track who uses their reagents and tools.  The benefits to individual research scientists, a group that includes journal editors, would need to be made clear for these policies to be adopted.  Perceived benefits to the researcher include:

  • It will be easier to design and troubleshoot experiments, because researchers would be able to find all studies that used a particular reagent, animal or tool
  • Researchers would be able to find studies that used a particular type of reagent, e.g., a mouse monoclonal antibody, even if that information was not explicitly included in the paper, because a complete characterization of the entity is present in an external database that can be accessed at time of query
  • It will be easier to aggregate and compare results across studies, using both human effort and data mining approaches
  • Problems found in a resource, e.g., the specificity of an antibody or an error in a database or algorithm, can be easily propagated across the literature, even retrospectively.  With proper tools, readers could be alerted to any potential problem, thereby reducing the time, effort and money wasted on problematic resources and on incorrect conclusions based on the results of these studies.

These benefits can only be realized if machine-processable resource identification is carried out on a large scale across journals, in order to create a rich enough data set for data mining and resource linking across papers to be interesting to the research scientist.


BACKGROUND:  SUMMARY OF FIRST MEETING HELD AT SOCIETY FOR NEUROSCIENCE, OCT 15TH, 2012

The June 26th meeting was a follow up to a meeting held at the Society for Neuroscience’s (SFN) annual meeting involving NIF, INCF and a group of neuroscience journal editors and publishers.  At this meeting, the above proposal was presented, along with the results of a NIF pilot project to identify antibodies, transgenic animals and digital tools used in neuroscience research.  The NIF pilot project used a combination of human curation and text mining.  The results for antibodies indicated that the antibody could be identified as a particular antibody sold or produced by an individual in less than 50% of the papers.  Interpreting the reported procedures required a lot of human intervention, as there were as many styles of reporting research resources as there were papers.  The major problems with resource identification are summarized here:

Antibodies:

  • The author did not supply sufficient identifying information, e.g., a catalog number, such that an antibody could be reliably found in a vendor catalog.  Rather, general information, e.g., mouse monoclonal antibody against actin from Sigma, was provided.  As many vendors sell multiple antibodies that fit these descriptions, we could not identify the reagent used.
  • The vendor no longer sold the antibody referenced or the vendor no longer existed, so information about its properties could not be discovered.  In many cases, the same manufacturer would sell their products through multiple vendors, with no ability to cross-reference
  • The same antibody identifier, e.g., clone ID, could point to multiple antibodies
  • Methods were not referenced within the paper, but readers were referred to other papers, which then referred to other papers…

Transgenic animals:

  • Authors did not supply sufficient information to identify the exact transgenic animal used, e.g., stock number.  As with antibodies, a given reference to a transgenic from Jackson Labs could not be resolved to a particular transgenic line, but could point to more than one.
  • The notation adopted by the IMSR does not lend itself to use by automated agents or search systems, as it employs superscripts, subscripts and special characters.

Digital tools:

  • NIF’s semi-automated pipeline did fairly well at recognizing research resources listed in the NIF catalog within papers, except for those resources with names that were very common or short, e.g., R, Enzyme.
  • In trying to determine meaningful use of a resource, as opposed to mentions of the resource with no actual use within the study, NIF focused its search on the Materials and Methods section.  However, NIF had access to the materials and methods sections of only a subset of PubMed Central - the PMC Open Access Subset, which is a relatively small part of the total collection of articles in PMC - and of other open access journals.

At the SFN meeting, the attendees were polled regarding the desirability and feasibility of implementing the research resource identification proposal. No serious objections were raised about the desirability of better resource identification. However, several issues were raised about the feasibility of such a process:

  • Who would do the identification?
    • Author?  Algorithm? Curator?  Editor?  How would the information be verified once supplied?
  • Would a special tool be needed?  If so, who will pay?
    • Would it scale to 40,000 papers/month?
    • Is the information available from authors in general?
  • Difficult to implement only for neuroscience journals, as publishers have many different journals in their portfolios
  • Granularity:  Would we be able to specify the requirements at a level of granularity that would be useful but still feasible?
  • What will be the benefit to the user?  How will we show that?

Prior to convening a follow-up meeting, NIF and some of our partners agreed to develop pilot projects to address some of these issues, based on work that NIF was doing with Elsevier.  Elsevier had agreed to provide full text access to a significant number of neuroscience journals.


SUMMARY OF DISCUSSION AT JUNE 26TH MEETING

The meeting was divided into a morning session with 3 presentations and an afternoon breakout session with two working groups.  Dr. Jonathan Pollock opened the meeting with a charge to the participants that at the end of the day, we needed to have a set of action steps. 


Presentations

The morning session included presentations from:

1)  Mike Huerta, National Library of Medicine:  Discovering, Citing and Linking Data

-an overview of the planned NIH data catalog and of the BD2K project

NIH has several initiatives planned for increasing reporting of and access to data, in particular the creation of a Data Catalog, to which a researcher would upload minimal information about a data set. Each data set would receive a unique identifier that would be used to track subsequent use of the data.  These same identifiers will be used in PubMed.

2)  Maryann Martone, Neuroscience Information Framework:  Current practices in reporting neuroscience resources

Maryann presented the results of several pilot and formal projects that had provided information about some of the challenges and questions raised in the previous meeting.  The main conclusions of these projects were:

  • The issue of proper resource identification is not unique to neuroscience.  Nicole Vasilevsky and her colleagues from eagle-i (https://www.eagle-i.net/) performed a comprehensive study of resource reporting across a spectrum of journals and fields, tracking the reporting of antibodies, cell lines, model organisms, knockdown reagents and constructs.  Although the results differed across fields and type of resource, the general conclusions reached by the neuroscience pilot held:  most papers did not contain sufficient identifying information for either a human or automated algorithm to identify the resources used. The study is under review in PeerJ.
  • Vasilevsky et al. examined the reporting requirements of journals and found no correlation between proper identification and the stringency of reporting requirements
  • Although the pilot project did not address availability of the information from the author systematically, in a case study of a single laboratory at Carnegie Mellon University, Anita de Waard and colleagues found that the identifying information for reagents and animals was kept in good order by the researcher, i.e., the appropriate identifying information was available, but this information by and large did not make it into published papers. 
  • Although only an N of 1, this finding affirms the contention that authors simply do not think to put this information in a paper. In contrast, the vendor location and city are routinely supplied, because this information is requested by many journals and mentors teach their students to supply it.
  • Scalability:  NIF and Elsevier worked on a text mining project to see if a machine-learning algorithm could be used to automate the process of resource identification.  They focused on antibodies and tools registered within the NIF Antibody Registry and NIF Catalog (databases and software tools). Over 500 articles were hand annotated and then used for text mining. The algorithm was reasonably accurate at detecting antibodies and identifying them if the catalog numbers were provided (~87%), although the many different styles of reporting catalog numbers decreased the total number identified (~63%).  Identification of tools was better, approaching 100%. The algorithms are still under development, but the results were encouraging in that:
  • This project suggested that automated text mining would be helpful in verifying information supplied by the authors. 
  • This project also suggested that at some point, a “resource identification” step could be incorporated into the manuscript submission pipeline that would be able to assist authors in identifying their resources.
  • Commercial antibody providers are interested in helping to support such efforts, specifically antibodies-online, which seeks to provide more transparency in the antibody market.  NIF has interacted with antibodies-online (http://www.antibodies-online.com/), who are experts in the antibody market and are willing to help underwrite costs for the NIF Antibody Registry, an on-line database for assigning unique identifiers to antibody reagents.

3)  Geoffrey Bilder, CrossRef:  Current Solutions in Working with the Biomedical Literature

- provided an overview of CrossRef and identifier systems

Dr. Bilder was invited to this meeting because of his expertise in identifier systems through ORCID, the unique author identification system, and CrossRef, a non-profit organization funded by the publishing industry to ensure that articles can be identified on the web through the Digital Object Identifier (DOI).  CrossRef works with the entirety of the biomedical literature and has been approached by other groups to develop methods for identifying specific research resources within the literature, e.g., chemical identifiers.  These efforts did not proceed because of objections from the publishers, similar to those expressed to NIF:  publishers don’t want to do this in an ad hoc fashion for just one domain.  Dr. Bilder gave an overview of the use of and need for identifiers and addressed issues of duplication and trust.  Some key points:  the system should be as simple as possible;  the identifiers should be owned by the community of interest and not by an individual;  and bi-directional verification provides added trust, e.g., if an author supplies a catalog number for an antibody from a particular vendor and the system checks a database confirming that the vendor has an antibody with those characteristics and that catalog number, then the information receives some validation.
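To make the bi-directional verification idea concrete, below is a minimal sketch, in Python, of what such a check might look like.  The registry endpoint, its query parameters and the response fields are hypothetical placeholders chosen for illustration; they are not an existing NIF, vendor or CrossRef API.

```python
# Minimal sketch of the bi-directional verification idea described above.
# The registry URL and response fields are hypothetical placeholders, not a
# real NIF or vendor API; they only illustrate the check: an author-supplied
# (vendor, catalog number) pair is confirmed against an independent database
# before being accepted into the article metadata.

import requests

REGISTRY_URL = "https://registry.example.org/antibodies"  # hypothetical endpoint


def verify_antibody(vendor: str, catalog_number: str, claimed_target: str) -> bool:
    """Return True if the registry lists an antibody from `vendor` with
    `catalog_number` whose target matches what the author claims."""
    response = requests.get(
        REGISTRY_URL,
        params={"vendor": vendor, "catalog_number": catalog_number},
        timeout=10,
    )
    if response.status_code != 200:
        return False  # no matching record: flag for curator or author follow-up
    record = response.json()
    # Bi-directional trust: the author points at the vendor record, and the
    # record in turn confirms the stated properties of the reagent.
    return record.get("target", "").lower() == claimed_target.lower()


if __name__ == "__main__":
    # Purely illustrative values; "ACME Antibodies" and "AB-1234" are made up.
    if verify_antibody("ACME Antibodies", "AB-1234", "actin"):
        print("Identifier verified against registry")
    else:
        print("Could not verify identifier; route to curator or author")
```

In practice the lookup would be made against an authoritative registry such as the NIF Antibody Registry, and a failed check would be routed back to a curator or the author rather than silently discarded.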

The major conclusions drawn from the presentations were:

-the problems of resource identification are not unique to neuroscience, and therefore the solutions could be applied across all of biomedicine;  as is clear from Bilder’s talk, the study of Vasilevsky et al. and the experience of the model organism database curators, there is need and interest from communities beyond neuroscience.

-getting authors to supply proper identifiers will require a cultural shift;  instructions to authors without some sort of editorial or curatorial oversight will probably not be adequate, although right now the evidence for this is somewhat anecdotal

-tools can be developed that would make the process of validating identifiers, and perhaps assisting authors with annotating the correct entities in their papers, at least semi-automated.  We note that subsequent to the meeting, a paper by Kafkas et al. (2013) was published on the use of text mining to identify genomic database accession numbers.  Although the recall was not perfect, they note that “These initial results suggest that, given the volume of references found, and the low cost and high precision of the text-mining method we deploy..., it is useful to extend the scope of accession number mining beyond the “core three” data sources [Genbank, UniProt, PDB], that publishers currently mark up.”  A simplified illustration of such pattern-based mining is sketched after the reference below.

-Identifier systems exist for antibodies (NIF Antibody Registry), most model organisms (fly, zebrafish, rat, worm) and tools (NIF and NITRC);  mice are a bit of a problem, as not all of the mouse suppliers have easily accessible unique identifiers

-NIH will be addressing many of the same issues for data sets

Kafkas Ş, Kim J-H, McEntyre JR (2013) Database Citation in Full Text Biomedical Articles. PLoS ONE 8(5): e63184. doi:10.1371/journal.pone.0063184
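As a concrete illustration of the kind of pattern-based mining referenced above, the following Python sketch scans a methods section for strings that look like accession numbers.  The regular expressions are rough approximations of a few well-known identifier styles; they are not the patterns used by Kafkas et al. or by the NIF/Elsevier pipeline, and a production system would also need to disambiguate hits against the source databases.

```python
# Simplified illustration of pattern-based accession-number mining in a
# methods section.  The patterns below are rough approximations for a few
# well-known identifier styles and will both miss and over-match in practice.

import re

ACCESSION_PATTERNS = {
    # GenBank/EMBL nucleotide style, e.g., U49845 or AB026295
    "GenBank/EMBL": re.compile(r"\b[A-Z]{1,2}\d{5,6}\b"),
    # UniProt style, e.g., P12345
    "UniProt": re.compile(r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b"),
    # PDB style, e.g., 1BFM
    "PDB": re.compile(r"\b[0-9][A-Za-z0-9]{3}\b"),
}


def mine_accessions(methods_text: str) -> dict:
    """Return candidate accession numbers found in the text, grouped by the
    database they appear to belong to."""
    hits = {}
    for source, pattern in ACCESSION_PATTERNS.items():
        found = sorted(set(pattern.findall(methods_text)))
        if found:
            hits[source] = found
    return hits


if __name__ == "__main__":
    text = "Sequences were deposited as [GenBank:U49845] and the structure as [PDB:1BFM]."
    print(mine_accessions(text))
    # {'GenBank/EMBL': ['U49845'], 'PDB': ['1BFM']}
```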


DISCUSSION OF PRESENTATIONS

The presentations generated much discussion from the audience, centering again on the issues of feasibility rather than desirability.  Issues of granularity were again raised, and some questioned whether the proposed reporting guidelines would go far enough.  For example, those who deal with behavioral data might want more stringent requirements, e.g., that unique identifiers be given for key procedures like the Morris Water Maze.  The moderators countered that, while such things are identifiable through the many community ontologies under development, and groups have been working to create fully structured methods and to semantically enhance entire papers (e.g., the FEBS Journal Structured Abstract project), we needed to start with a set of entities that we agree can be reasonably identified and for which authoritative sources of identifiers exist.  Similarly, while everyone present agreed that researchers should write better and more detailed methods, the moderators made it clear that this meeting was equally focused on the machine-accessibility and processability of information, not just its presence in an article for another human with a subscription to read.

The larger discussion focused on who would do the work and when would it be done.  Several in the audience felt that adding yet another requirement for the journal staff or the reviewers would be too onerous.  Authors also might not adopt the practice if it was too difficult to find accession numbers or if it wasn’t clear what they should identify.  Geoff Bilder noted that the authors currently spend a lot of time doing things that are no longer necessary, e.g., formatting references for a particular journal style, and that perhaps if we started to eliminate some of these unnecessary steps, we could free up time for new practices required for electronic publishing, e.g., resource identification. 

Dr. Bilder  also noted that any workflow that involved a modification of the manuscript submission system, e.g., Scholar One, would not likely succeed in the short run, as these modifications are perceived to be expensive and take time.  He did say, however, that the entire process could be done with minimal modification of the current manuscript submission system.  Matt Giampaolo from Wiley noted that if the text mining tools are made available across all publishers, it would make the process easier and more widespread. 

The issue of whether resource identification and tracking would provide sufficient benefit to the research community to spur adoption was raised, with some questioning whether it would be of any benefit.  While proper identification of antibodies via catalog numbers and even lot numbers was viewed to be a “no brainer”, antibodies aren’t used by much of the neuroscience community.  Just knowing that a researcher used a particular software tool might also not be useful without additional information about version and other parameters.  These objections were countered by others, some of whom noted that projects like ADNI (Alzheimer’s Disease Neuroimaging Initiative) had recently requested special identifiers within PubMed so that they could track usage.  The well-known problems with finding antibodies - described as “a search industry” by antibodies-online - were reiterated.

There was general agreement that, although the problem of resource identification goes far deeper than just supplying catalog numbers and other identifiers, we have to start simply with something that is doable.  The hope is that, if it proves beneficial to researchers, funders, publishers and resource providers, we would expand the project to include much more structured methods.

Following the morning discussion, the workshop broke up into two groups:  1)  Feasibility of a pilot project:  what would be identified and by whom?  2)  Implementation:  what would an end-to-end system look like?


WORKING GROUPS

Feasibility Group: 

Scope: 3 types of entities should be identified as an initial pilot project:  1)  Antibodies;  2)  Tools;  3)  Model organisms

For tools, the scope should be those that are registered within the NIF Registry and not all commercial tools or instruments used.  The NIF Registry focuses on digital resources that are largely, although not exclusively, produced by the academic community.  Note that the NIF Registry links with NITRC (Neuroimaging Tools and Resource Clearinghouse;  http://nitrc.org), which has catalogued software tools and databases for neuroimaging.  For the purposes of this proposal, references to the NIF Registry will also include NITRC, as that is the authoritative source of neuroimaging tools.

Who:  The issue of whether the author should be asked to supply this information, or whether we should attempt to use semi-automated means to identify potential research resources and then go back to the author, was discussed.  One can envision a two-step process in which the authors are asked to supply the information and the article is then screened via NLP for verification.  The need to ensure that the process is not overly onerous for the author was emphasized.

When:  Should the process of resource identification be done at time of submission, during review or after acceptance?  The general feeling was that during review or after acceptance would be the time when we would likely get the most compliance.  If this process is done during review, then the reviewers would need to be alerted that they should look for this information and be able to communicate with the author that they need to supply this information.  We do not want to make this an absolute requirement for publication, as we all recognize that the authors may not possess this information and we don’t want them supplying false information in order to have the article published.  If it is after acceptance, then the onus would be on the staff or the editor to ensure compliance.

How:  If authors are going to supply these identifiers, then it needs to be easy for them to obtain them.  Dr. Martone felt that the proper identifiers were sometimes difficult to find in the model organism databases, but that NIF could help with a simple service.  NIF itself would also need to be made simpler, as it is currently difficult to know where to look.  Communication with the Mutant Mouse Resources is necessary to ensure that proper identifiers are being given to all mouse strains.

Dr. Pollock brought up the issue of having animals identified through a bar code, and perhaps of spiking reagents with a sequence or some other identifier that could be read automatically.  It is clear that novel technology solutions are now possible or on the horizon and that investments in laboratory information management need to be made.  Once the research community begins to make the shift towards a web-enabled platform for scholarly communication - one that handles all types of diverse research objects - we believe that there will be numerous opportunities to streamline the process of working with these objects.

Implementation Group

The implementation group mapped out what an end-to-end workflow might look like for a pilot project and beyond.  The minimum requirement is that we have the appropriate registries that are viewed as authoritative for the entities to be identified.

Other steps:

1)  Tagging:  The option of having an independent group like NIF do the tagging, rather than the author,  was discussed but would likely bring up privacy concerns from authors.  As with the feasibility group, one can see pros and cons to performing the resource identification at different steps in the publication process:  at time of submission, during review, after acceptance. 

2)  Verification step:  The suggestion was made that we contact Biocreative (http://biocreative.sourceforge.net), the “Critical Assessment of Information Extraction systems in Biology”, an organization that runs challenges for evaluating text mining and information extraction systems applied to the biological domain.  We could make the verification of research resources within the materials and methods section a challenge project.

3)  Where would the identifiers be?  The request was that any identifiers supplied would be available in a uniform format across publishers, would not be stripped out by PubMed,  and be available to 3rd parties outside of the paywall.  In the NIF-Elsevier pilot, identifiers are placed in the author-supplied keyword field, which is indexed by PubMed.  This solution may be unwieldy if larger numbers of antibodies are used, for example.  Alternatively, Geoff Bilder suggested that they could be stored in a single URL that points to a metadata record.  Placing the identifiers in text is something that is done already for entities like gene accession numbers, but unless the text was accessible, this would not satisfy the requirements for 3rd party accessibility.  However, with access to materials and methods, these identifiers could be extracted and placed in a location outside of a paywall.  Clearly, as indicated in Mike Huerta’s talk, the NIH Data Catalog will face similar issues. 

4)  Sustainability:  The issue of sustainability of projects like NIF was brought up, as some publishers are concerned about investing in a strategy only to have the database disappear.  Of course, no one can guarantee that any organization will exist in perpetuity.  Possible solutions are to replicate the services, e.g., the INCF and eagle-i both offered to mirror the NIF system, to provide robustness.  Geoff Bilder also noted that if the identifiers and systems are covered by a CC-0 license, then they would be available to anyone to pick up should NIF go out of business.  

Meeting Outcomes

At the end of the session, almost all attendees indicated their interest in a pilot project to identify antibodies, model organisms and tools in a machine-processable form across neuroscience journals.  One goal of this project will be to gather data on the best implementation strategy for engaging authors in providing these identifiers and for establishing a scalable process for verifying that the correct identifiers are used.  Another goal will be to provide a demonstration project to the research community that shows the benefits of machine-processable information within papers by making it easier to find research resources.

Pre-pilot: 

Considerable groundwork has been done, and the major resources (NIF Registry, Antibody Registry, NITRC, NIF Integrated Model Organism database) required for this project are largely in place.  However, before a large scale pilot project can be launched, we’ll need to do a pre-pilot.  Thus far, the work done by NIF and Monarch has not engaged the author but has relied on curators or automated agents to identify research resources.  As the author must be engaged in this process, a pre-pilot was outlined in which a small group of users is given 5-10 papers and asked to supply appropriate identifiers for antibodies, tools and animals.  We would monitor whether:

  • naive users were able to understand which entities needed to be identified
  • naive users were able to look up the appropriate identifiers
  • users got frustrated or annoyed at the process
  • the appropriate entities within papers were available through NIF (and what percentage of them were)

We didn’t discuss what would constitute success for this pre-pilot phase, but clearly we would like to see that a majority of users could successfully complete the task.   This pre-pilot could be conducted via webinar so that it did not involve a large expense. 

Pilot project: 

Once the system is in place for obtaining the appropriate identifiers, a larger scale pilot project would be launched across journals.  This project would involve asking the authors to supply the correct identifiers at some point in the publication process:  at submission, during review or after acceptance.   We will leave it up to the individual journals and publishers to decide the stage at which they send the author the request, in order to give them some flexibility and to allow us to test different strategies for acquiring this information.  Ideally, the project would run for a specified period of time, e.g., one month, during which time all articles from a particular journal would be tagged.  Again, the journals and publishers can have some flexibility in choosing the journals and the exact number of articles.  However, it is important that high impact journals participate in this project, as authors are usually highly motivated to comply with requests from high impact journals and because it would give high visibility to the project.

Authors would be notified by the editors by email that they are participating in a pilot project to make science more reproducible and to make articles easier for machines to read.  NIF will provide the appropriate instructions and a link to the website where the authors can obtain the information.  Geoff Bilder offered to work with colleagues, e.g., Steve Pettifer,  to create a nice front end for the system. 

For the initial project, the authors should insert the identifiers into their materials and methods section, as they would a gene accession number or a URL for a tool.  Some journals have author guidelines for this type of citation, and we would follow this convention.  For example, BMC Genomics states that nucleic acid sequences, protein sequences, and the atomic coordinates of macromolecular structures should be deposited in the appropriate database, and that the accession number should be provided in square brackets with the corresponding database name, e.g., [EMBL:AB026295, GenBank:U49845, PDB:1BFM] (Kafkas et al., 2013).
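To illustrate why this bracketed convention is attractive for machine processing, here is a short sketch that extracts database:accession pairs of that form from a methods section.  The function name is arbitrary, and the extension of the convention to a resource registry (the “AntibodyRegistry” prefix and the identifier shown) is an illustrative assumption, not an agreed-upon standard.

```python
# Sketch of extracting bracketed Database:Accession citations of the form
# recommended above, e.g., "[EMBL:AB026295, GenBank:U49845, PDB:1BFM]".
# The "AntibodyRegistry:AB_123456" example is hypothetical.

import re

BRACKET_GROUP = re.compile(r"\[([^\[\]]+)\]")                  # one [...] group
PAIR = re.compile(r"\s*([A-Za-z][\w-]*)\s*:\s*([\w.-]+)\s*")   # Database:Accession


def extract_resource_ids(text: str) -> list:
    """Return (database, accession) pairs cited in square brackets."""
    pairs = []
    for group in BRACKET_GROUP.findall(text):
        for item in group.split(","):
            match = PAIR.fullmatch(item)
            if match:
                pairs.append((match.group(1), match.group(2)))
    return pairs


if __name__ == "__main__":
    methods = ("Sequences were deposited [EMBL:AB026295, GenBank:U49845, PDB:1BFM]; "
               "the primary antibody is listed as [AntibodyRegistry:AB_123456].")
    print(extract_resource_ids(methods))
    # [('EMBL', 'AB026295'), ('GenBank', 'U49845'), ('PDB', '1BFM'),
    #  ('AntibodyRegistry', 'AB_123456')]
```

Because the identifiers are in a uniform, easily parsed form, the same extraction could be run by the publisher, by PubMed or by a third party outside the paywall, which is exactly the accessibility requirement discussed in the implementation section above.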

To oversee the implementation issues and ensure that the effort can extend beyond neuroscience, we will create a Resource Identification Group that includes participants in this workshop and others who have expertise and tools relevant to the pilot.  We will utilize FORCE11 (Future of Research Communications and e-Scholarship;  http://force11.org), a community platform for stakeholders interested in advancing scholarly communication through technology, to align our efforts with those underway in different areas of biomedicine, as the goal is to establish a uniform citation style.   FORCE11 is already coordinating discussions on data citation styles (http://www.force11.org/node/4381) and can provide feedback and advice about the proposed implementation.

Once the papers have been annotated with resource identifiers, we would then need access to the full text so that we could verify and extract these identifiers. For the initial pilot project, we need not determine the final solution for where the identifiers are to be stored outside of the paywall, as an outside organization like NIF, INCF or CrossRef can store them.  As per the discussion above, the pilot project may involve mirroring at all 3 sites.  After the pilot project is complete, we will follow up with a questionnaire to find out how the authors viewed the task.  We will also provide them with a link where they can view the results of the pilot, and give them a search interface so that they can find papers that used their reagent or tool.  We hope to engage the publishers to create various widgets that might provide this information through the article itself.

ACTION ITEMS

1)  Perform pre-pilot project (2 months - Resource Identification Group:  NIF, NITRC, INCF, Monarch, CrossRef, antibodies-online, eagle-i and other interested parties):

  • Form the Resource Identification Group:  The RIG will develop and evaluate the specific technologies and implementation.  Ensuring that other groups who are working in this area are involved will be important for the success of the project.
  • Make sure that the appropriate identifiers are available for all model organisms
  • Establish a single website with an easy to use front end for obtaining identifiers
  • Prepare instructions for authors
  • Perform usability studies with naive users (~25)
  • Present results to workshop consortium

2)  Discuss potential pilot project with publishers (meeting attendees) - 1 month

  • Get initial commitments from publishers for proposed pilot project:  what journals, what resources
  • Discuss potential implementation per journal

3)  Prepare detailed proposal for publishers (at completion of pre-pilot project) (Resource Identification Group)

  • Include a link to a demonstration site and the results of the usability study
  • Allow flexibility in implementation
  • Launch pilot project at SFN???

4)  Continue to improve the automated pipeline and authoring/curation tools (Resource Identification Group)

  • Contact Biocreative to see if they are interested in hosting a text mining challenge

5)  Seek sponsorship for implementation and promoting the project (all)

  • antibodies-online
  • Mozilla Foundation:  Open Science and Science in the Web
  • Society for Neuroscience?
  • CrossRef?