iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization (2021-02-28)

TL;DR

iLearnPlus is introduced as the first machine-learning platform with graphical and web-based interfaces for constructing machine-learning pipelines for analysis and prediction using nucleic acid and protein sequences. It caters to experienced bioinformaticians and, through its point-and-click interface and easy-to-follow design process, to biologists with no programming background.

Abstract

Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.
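To make the idea of a sequence-derived feature set concrete, the following is a minimal sketch of one common kind of encoding, normalized k-mer composition for nucleic acid sequences. It is an illustration only and does not reproduce iLearnPlus code or its API: the function name `kmer_composition`, its parameters, and the choice to skip windows containing symbols outside the alphabet are all assumptions made for this example.

```python
from itertools import product


def kmer_composition(sequence, k=2, alphabet="ACGT"):
    """Encode a sequence as normalized k-mer frequencies.

    Hypothetical helper for illustration; not part of iLearnPlus.
    The output vector follows the lexicographic order of all
    len(alphabet) ** k possible k-mers.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k over the sequence and count each k-mer.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # assumption: skip windows with symbols such as N
            counts[window] += 1
            total += 1
    if total == 0:
        # No valid window (e.g. sequence shorter than k): return all zeros.
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Example: a fixed-length feature vector of 16 dinucleotide frequencies.
features = kmer_composition("ACGTACGT", k=2)
```

A fixed-length vector of this kind is what lets sequences of different lengths be passed to a standard classifier; the feature sets and algorithms actually offered by the platform are described in the paper itself.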
