::TBI Core Facility::

Unit 6: Biomedical text mining and biomarker discovery

By combining our knowledge management tools, InfoMap, and a natural language agent, we are constructing a question answering system for genomic and proteomic knowledge. We will extend this system to help biologists automatically execute certain natural language scripts in their dry labs. Finally, we will use InfoMap to support accurate searches for various relationships in the biological literature.

We also provide MS-based metabolomics quantitation and identification, including quantitation of the abundance, isotope ratio, and charge state of metabolites, and their qualitative identification through peak grouping and annotation.

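Collaborating researchers:

Prof. Tzong-Han Tsai (蔡宗翰), Department of Computer Science and Information Engineering, National Central University
Prof. Hong-Jie Dai (戴鴻傑), Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology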

A. On-line services and software

(1) IDEAL-Q: an automated analysis tool for label-free quantitative proteomics

IDEAL-Q is an automated analysis tool for label-free quantitative proteomics. It accepts raw data in mzXML format and identification results in Mascot XML and ProtXML/PepXML formats. IDEAL-Q uses elution time prediction and peak alignment algorithms to quantify peptides across different LC-MS runs and to increase quantitation coverage. Furthermore, the tool applies a stringent validation step based on signal-to-noise ratio, charge state, and isotopic distribution (SCI validation) to ensure quantitation accuracy. IDEAL-Q also provides various optional normalization tools for flexible workflow design, such as the addition of fractionation strategies and multiple spiked internal standards.
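As a rough illustration of the charge-state part of such a validation, the sketch below infers a charge from the spacing of isotopic peaks, which is approximately 1.00335/z on the m/z axis. It is not IDEAL-Q's implementation: the function names, the tolerance, and the maximum charge are assumptions chosen for this example.

```python
# Approximate mass difference between 13C and 12C, in daltons.
ISOTOPE_SPACING = 1.00335


def infer_charge(mz_peaks, max_charge=6, tol=0.01):
    """Infer the charge state from the m/z values of consecutive isotopic peaks.

    Returns the charge z whose expected spacing (ISOTOPE_SPACING / z) matches
    the mean observed spacing within `tol`, or None if nothing matches.
    Hypothetical helper; thresholds are illustrative assumptions.
    """
    if len(mz_peaks) < 2:
        return None
    peaks = sorted(mz_peaks)
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_gap = sum(gaps) / len(gaps)
    for z in range(1, max_charge + 1):
        if abs(mean_gap - ISOTOPE_SPACING / z) <= tol:
            return z
    return None


def passes_charge_check(mz_peaks, identified_charge):
    """Accept a peptide only if the inferred charge agrees with the identification."""
    return infer_charge(mz_peaks) == identified_charge
```

A peak cluster spaced about 0.5017 m/z apart would be accepted for a doubly charged identification and rejected for a triply charged one.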

(2) Multi-Q: an automated data analysis tool for multiplexed protein quantitation based on the iTRAQ labeling method

The Multi-Q Web Server provides an automated data analysis tool for multiplexed protein quantitation based on the iTRAQ labeling method. Multi-Q is designed as a generic platform that can accommodate various input data formats from search engines and mass spectrometer manufacturers. Compared with the previous stand-alone version, the web server version provides many enhanced features and flexible options for quantitation. The workflow is presented as a quantitation wizard to make the tool easy to use. The web server outputs a default report of the quantitation results through a user-friendly interface. Users can also customize the output report and highlight the information that interests them. The output includes visualization of the mass spectral data so that users can conveniently validate the results.

(3) MaXIC-Q: an automated quantitation tool that uses XICs acquired from isotope labeling techniques for quantitation analysis

MaXIC-Q is an automated quantitation tool that uses extracted ion chromatograms (XICs) acquired from isotope labeling techniques for quantitation analysis. As a generic computation platform for high-throughput quantitative proteomics, MaXIC-Q offers the following features: (1) It accepts the mzXML (24) spectral format, which can be converted from the raw files of various mass spectrometers by existing tools, as well as search results from commonly used search engines, including Mascot and SEQUEST. (2) It allows user-defined isotope codes, which cover a very broad range of quantitation strategies for various in vivo and in vitro labeling techniques, and even user-developed labeling methods. To the best of our knowledge, MaXIC-Q is currently the only tool that defines stringent criteria for the validation of both XICs and mass spectra to achieve high accuracy in an unattended manner. Furthermore, MaXIC-Q provides graphical interfaces (Elution3D, an XIC viewer, and an ion mass spectrum viewer) that allow flexible, user-activated interactive modification based on simultaneous 3D visualization of m/z, elution time, and intensity.
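To make the idea of XIC-based quantitation concrete, here is a minimal sketch, assuming spectra are already available in memory as (retention time, peak list) pairs. It is not MaXIC-Q's algorithm: the function names, the ppm tolerance, and the trapezoidal area are assumptions made for this example, and none of the tool's validation criteria are applied.

```python
def extract_xic(scans, target_mz, tol_ppm=10.0):
    """Build an extracted ion chromatogram for one target m/z.

    scans: list of (retention_time, [(mz, intensity), ...]) tuples.
    Returns a list of (retention_time, summed_intensity) points, where the
    sum covers all peaks within `tol_ppm` of `target_mz`.
    """
    tol = target_mz * tol_ppm / 1e6
    xic = []
    for rt, peaks in scans:
        total = sum(inten for mz, inten in peaks if abs(mz - target_mz) <= tol)
        xic.append((rt, total))
    return xic


def xic_area(xic):
    """Integrate an XIC over retention time with the trapezoidal rule."""
    area = 0.0
    for (t0, i0), (t1, i1) in zip(xic, xic[1:]):
        area += (t1 - t0) * (i0 + i1) / 2.0
    return area


def isotope_ratio(scans, light_mz, heavy_mz, tol_ppm=10.0):
    """Heavy-to-light abundance ratio from the areas of the two XICs."""
    light = xic_area(extract_xic(scans, light_mz, tol_ppm))
    heavy = xic_area(extract_xic(scans, heavy_mz, tol_ppm))
    return heavy / light if light > 0 else None


# Invented toy data: three scans with a light (m/z 500) and heavy (m/z 504) peak.
example_scans = [
    (0.0, [(500.0, 10.0), (504.0, 20.0)]),
    (1.0, [(500.001, 30.0), (504.0, 60.0)]),
    (2.0, [(500.0, 10.0), (504.0, 20.0)]),
]
example_ratio = isotope_ratio(example_scans, 500.0, 504.0)
```

The 4 Da offset between the light and heavy forms is invented toy data; in practice the offset would follow from the user-defined isotope codes.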

(4) MAGIC: an automated tool for glycopeptide identification and glycan composition determination

MAGIC (Mass spectrometry-based Automated Glycopeptide IdentifiCation platform) is an automated tool for glycopeptide identification and glycan composition determination. MAGIC adopts a novel Trident algorithm for accurate Y1-ion detection and generates in silico peptide MS/MS spectra for database searches. MAGIC offers flexible settings for detection criteria, visualization of each spectrum with annotated peaks, and a summary report for export. The recommended formats are Mascot Distiller MGF for input MS/MS spectra and Mascot XML for identification results.

(5) LiverCancerMarkerRIF Textual Evidence Database: a liver cancer biomarker interactive curation system combining text mining and expert annotations.

LiverCancerMarkerRIF provides functions for recognizing genes, diseases, post-translational modifications, mutations, and investigative technologies; linking these terms to their corresponding databases; and extracting LiverCancerMarkerRIF sentences. A curation interface is also available for users to submit suggestions on the extracted sentences. Once a function-describing sentence is confirmed, users can submit it directly to the LiverCancerMarkerRIF database.

(6) THOD: a research tool and overview of variations in disease candidate genes

This database serves as a research tool and an overview of variations in disease candidate genes. Using text-mining techniques, we provide a list of candidate genes with Entrez Gene IDs and SNP sites (identified by rs number) so that users can access detailed information. Users can browse the database from different perspectives: textual evidence, disease-centric protein-protein interaction (PPI) network, integrated gene and SNP information, and year of publication.

(7) BWS: A web application for annotating biomedical entities and relations.

Bioinformatics tools for automatic processing of biomedical literature are invaluable for both the design and interpretation of large-scale experiments. Many information extraction (IE) systems that incorporate natural language processing (NLP) techniques have thus been developed for use in the biomedical field. A key IE task in this field is the extraction of biomedical relations, such as protein-protein and gene-disease interactions. However, most biomedical relation extraction systems ignore adverbial and prepositional phrases and words identifying location, manner, timing, and condition, which are essential for describing biomedical relations. Semantic role labeling (SRL) is an NLP technique that identifies the semantic roles of these words or phrases in sentences and expresses them as predicate-argument structures. We have constructed a biomedical SRL system called BIOSMILE that uses a maximum entropy (ME) machine-learning model to extract biomedical relations. BIOSMILE is trained on BioProp, our semi-automatically annotated biomedical proposition bank. Currently, we are focusing on 30 biomedical verbs that are frequently used or considered important for describing molecular events.
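A predicate-argument structure can be pictured as a small record per verb. The sketch below is only an illustration of that data structure, not BIOSMILE's output format: the class name, the PropBank-style role labels (ARG0, ARG1, ARGM-LOC), and the example sentence are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One predicate with its labeled arguments (hypothetical structure)."""
    predicate: str
    arguments: dict = field(default_factory=dict)  # role label -> text span

    def core_relation(self):
        """Return the (agent, predicate, patient) triple, if present."""
        return (self.arguments.get("ARG0"), self.predicate,
                self.arguments.get("ARG1"))

    def modifiers(self):
        """Return adjunct roles such as location, manner, or time."""
        return {role: text for role, text in self.arguments.items()
                if role.startswith("ARGM-")}


# Invented sentence: "IL4 activates STAT6 in T cells."
example = Proposition(
    predicate="activate",
    arguments={"ARG0": "IL4", "ARG1": "STAT6", "ARGM-LOC": "in T cells"},
)
```

Here `example.core_relation()` gives the bare relation, while `example.modifiers()` keeps the location phrase that a relation extractor ignoring prepositional phrases would lose.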

B. Customized services

(1) Information retrieval: Collects texts that are relevant to the user’s query, e.g. through Google and PubMed.

(2) Natural language processing: Provides prerequisite linguistic data for information extraction.

(a) Lexical analysis (tokenization): Before any linguistic analysis can take place, the basic tokens involved in the natural language have to be identified.

(b) Syntactic analysis: Syntactic analysis assembles sequences of tokens from a sentence syntactically into larger units, such as phrases or clauses.

(c) Semantic analysis: Semantic analysis rests on the assumption that the interpretation of a sentence can be captured in a formal representation.
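The lexical analysis step in (a) can be sketched with a small regular-expression tokenizer. This is a minimal example under assumed rules (hyphenated alphanumeric names such as gene symbols are kept as single tokens, and every other non-space character becomes its own token); it does not describe the tokenizer used in our services.

```python
import re

# Assumed rule: alphanumeric runs joined by hyphens form one token;
# any other non-whitespace character is a token by itself.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into tokens using the illustrative pattern above."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("IL-2 binds NF-kappaB (p65).")
# tokens == ['IL-2', 'binds', 'NF-kappaB', '(', 'p65', ')', '.']
```

Keeping hyphenated names intact matters in biomedical text, where splitting "IL-2" into three tokens would make later entity recognition harder.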