1. Usmani S, Shamsi JA. LSTM based stock prediction using weighted and categorized financial news. PLoS One 2023;18:e0282234. PMID: 36881605; PMCID: PMC9990937; DOI: 10.1371/journal.pone.0282234.
Abstract
A significant correlation between financial news and stock market trends has been explored extensively. However, very little research has been conducted on stock prediction models that utilize news categories weighted according to their relevance to the target stock. In this paper, we show that prediction accuracy can be enhanced by incorporating weighted news categories simultaneously into the prediction model. We suggest utilizing news categories associated with the structural hierarchy of the stock market: that is, news categories for market-, sector-, and stock-related news. In this context, a Long Short-Term Memory (LSTM) based Weighted and Categorized News stock prediction model (WCN-LSTM) is proposed. The model incorporates news categories with their learned weights simultaneously. To enhance its effectiveness, sophisticated features are integrated into WCN-LSTM, including hybrid input, lexicon-based sentiment analysis, and deep learning to impose sequential learning. Experiments have been performed on the Pakistan Stock Exchange (PSX) using different sentiment dictionaries and time steps. Accuracy and F1-score are used to evaluate the prediction model. A thorough analysis of the results shows that WCN-LSTM performs better than the baseline model, and that the sentiment lexicon HIV4 with time steps of 3 and 7 yields the best prediction accuracy. We have conducted statistical analysis to quantitatively assess our findings. A qualitative comparison of WCN-LSTM with existing prediction models is also presented to highlight its superiority and novelty over its counterparts.
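As a rough illustration of the weighted-category idea (not the authors' implementation: the sentiment series and weights below are invented, and in WCN-LSTM the category weights are learned during training rather than fixed), market-, sector-, and stock-level sentiment can be fused into windowed inputs for an LSTM like this:

```python
import numpy as np

def fuse_news_categories(market, sector, stock, weights, time_steps=3):
    """Stack three per-day sentiment series, scale each category by a
    softmax-normalized weight, and slice into overlapping windows of
    `time_steps` days suitable as LSTM input."""
    series = np.stack([market, sector, stock], axis=1)   # shape (T, 3)
    w = np.exp(weights) / np.exp(weights).sum()          # softmax weights
    weighted = series * w                                # broadcast over time
    windows = np.stack([weighted[i:i + time_steps]
                        for i in range(len(weighted) - time_steps)])
    return windows                                       # (T - steps, steps, 3)

sent = np.linspace(-1, 1, 10)                            # toy sentiment scores
X = fuse_news_categories(sent, 0.5 * sent, 0.2 * sent,
                         weights=np.array([0.2, 0.3, 0.5]))
print(X.shape)  # (7, 3, 3): 7 windows of 3 time steps x 3 news categories
```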
Affiliation(s)
- Shazia Usmani
- Systems Research Laboratory, FAST-National University of Computer and Emerging Sciences, Karachi, Pakistan
- Jawwad A. Shamsi
- Systems Research Laboratory, FAST-National University of Computer and Emerging Sciences, Karachi, Pakistan
2. Achilonu OJ, Fabian J, Bebington B, Singh E, Nimako G, Eijkemans RMJC, Musenge E. Use of Machine Learning and Statistical Algorithms to Predict Hospital Length of Stay Following Colorectal Cancer Resection: A South African Pilot Study. Front Oncol 2021;11:644045. PMID: 34660254; PMCID: PMC8518555; DOI: 10.3389/fonc.2021.644045.
Abstract
The aim of this pilot study was to develop logistic regression (LR) and support vector machine (SVM) models that differentiate low from high risk of prolonged hospital length of stay (LOS) in a South African cohort of 383 colorectal cancer patients who underwent surgical resection with curative intent. Additionally, the impact of 10-fold cross-validation (CV), Monte Carlo CV, and bootstrap internal validation methods on the performance of the two models was evaluated. The median LOS was 9 days, and prolonged LOS was defined as greater than 9 days post-operation. Preoperative factors associated with prolonged LOS were a prior history of hypertension and an Eastern Cooperative Oncology Group score between 2 and 4. Postoperative factors related to prolonged LOS were the need for a stoma as part of the surgical procedure and the development of post-surgical complications. The risk of prolonged LOS was higher in male patients and in any patient with lower preoperative hemoglobin. The highest area under the receiver operating characteristic curve (AUROC) was 0.823 (CI = 0.798–0.849) for LR and 0.821 (CI = 0.776–0.825) for SVM, with each model using the Monte Carlo CV method for internal validation. However, bootstrapping resulted in models with slightly lower variability. We found no significant difference between the models across the three internal validation methods. The LR and SVM algorithms used in this study required incorporating important features for optimal hospital LOS predictions. The factors identified in this study, especially postoperative complications, can be employed as a simple and quick test by which clinicians may flag patients at risk of prolonged LOS.
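Monte Carlo CV, the internal validation method that performed best here, differs from k-fold CV in that each repeat draws an independent random train/test split rather than partitioning the data into folds. A minimal sketch (the 70/30 split fraction and the repeat count are illustrative, not the study's exact protocol):

```python
import numpy as np

def monte_carlo_cv_splits(n_samples, n_repeats=50, test_frac=0.3, seed=0):
    """Yield (train_idx, test_idx) pairs. Each repeat is an independent
    random split, so, unlike k-fold CV, test sets may overlap across repeats."""
    rng = np.random.default_rng(seed)
    n_test = int(round(n_samples * test_frac))
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        yield perm[n_test:], perm[:n_test]

# 383 patients as in the cohort above; 5 repeats for brevity.
splits = list(monte_carlo_cv_splits(383, n_repeats=5))
train, test = splits[0]
print(len(splits), len(train), len(test))  # 5 268 115
```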
Affiliation(s)
- Okechinyere J Achilonu
- Division of Epidemiology and Biostatistics, School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
- June Fabian
- Medical Research Council/Wits University Rural Public Health and Health Transitions Research Unit (Agincourt), School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Wits Donald Gordon Medical Centre, School of Clinical Medicine, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
- Brendan Bebington
- Wits Donald Gordon Medical Centre, School of Clinical Medicine, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Department of Surgery, Faculty of Health Sciences, University of the Witwatersrand, Parktown, Johannesburg, South Africa
- Elvira Singh
- Division of Epidemiology and Biostatistics, School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; National Cancer Registry, National Health Laboratory Service, Johannesburg, South Africa
- Gideon Nimako
- Division of Epidemiology and Biostatistics, School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Industrialization, Science, Technology and Innovation Hub, African Union Development Agency (AUDA-NEPAD), Johannesburg, South Africa
- Rene M J C Eijkemans
- Julius Center for Health Sciences and Primary Care, University Medical Center, Utrecht University, Utrecht, Netherlands
- Eustasius Musenge
- Division of Epidemiology and Biostatistics, School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
3. DEM- and GIS-Based Analysis of Soil Erosion Depth Using Machine Learning. ISPRS International Journal of Geo-Information 2021. DOI: 10.3390/ijgi10070452.
Abstract
Soil erosion is a form of land degradation. It is the process of moving surface soil by the action of external forces such as wind or water; tillage also causes soil erosion. As outlined by United Nations Sustainable Development Goal (UN SDG) #15, it is a global challenge to "combat desertification, and halt and reverse land degradation and halt biodiversity loss." In order to advance this goal, we studied and modeled the soil erosion depth of a typical watershed in Taiwan using 26 morphometric factors derived from a digital elevation model (DEM) and 10 environmental factors. Feature selection was performed using the Boruta algorithm, which identified 15 factors of confirmed importance and one tentative factor. Then, machine learning models, including random forest (RF) and gradient boosting machine (GBM), were used to create prediction models validated by erosion pin measurements. The results show that GBM, coupled with the 15 confirmed factors, achieved the best result in terms of root mean square error (RMSE) and Nash–Sutcliffe efficiency (NSE). Finally, we present maps of soil erosion depth produced with the two machine learning models. The maps are useful for conservation planning and for mitigating future soil erosion.
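The two evaluation criteria named above have simple closed forms and can be computed directly. A small sketch of both, on synthetic values rather than the study's data:

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error between observations and predictions."""
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def nse(obs, pred):
    """Nash–Sutcliffe efficiency: 1 is a perfect fit; 0 means the model is
    no better than predicting the observed mean; negative is worse."""
    return float(1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2))

obs = np.array([2.0, 4.0, 6.0, 8.0])
print(rmse(obs, obs))                          # 0.0 for a perfect model
print(nse(obs, obs))                           # 1.0 for a perfect model
print(nse(obs, np.full(4, obs.mean())))        # 0.0 for the mean predictor
```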
4. Ershadi MJ, Qhanadi Taghizadeh O, Hadji Molana SM. Selection and performance estimation of Green Lean Six Sigma projects: a hybrid approach of technology readiness level, data envelopment analysis, and ANFIS. Environmental Science and Pollution Research International 2021;28:29394-29411. PMID: 33559076; DOI: 10.1007/s11356-021-12595-5.
Abstract
Nowadays, budget and schedule constraints have forced organizations to select Six Sigma projects based on pre-defined success criteria. Progressive approaches based on the green and lean paradigms are also vital for companies seeking to enhance their social and environmental performance. Green Lean Six Sigma (GLS) projects therefore play a central role in improving the performance of an organization while augmenting its sustainability. Accordingly, in this paper, past studies were reviewed, and indicators and performance evaluation criteria for GLS projects were identified. Data envelopment analysis (DEA) was employed for the appropriate selection of GLS projects. Next, the ranking and performance weight of each project were investigated, and the projects were categorized based on technology readiness level (TRL). Additionally, an adaptive neuro-fuzzy inference system (ANFIS) was applied to predict the success of the selected GLS projects. Twenty-eight inputs and 9 outputs for the first project category (TRL 9), and 28 inputs and 6 outputs for the second project category (TRL 8), were entered into the model. Statistical assessment measures, namely Nash–Sutcliffe efficiency (NSE), root mean square error (RMSE), mean absolute error (MAE), and R2, were employed to appraise the capability of the ANFIS model. The NSE and R2 values for both project categories were 1.00, supporting the efficiency of the ANFIS model for predicting the success of GLS projects. The RMSE and MAE for category 1 were 0.01 and 0.02, respectively, and for category 2 both were 0.02. These results indicate that the ANFIS model approximates the observed values closely. The results also indicated that TRL, as an important enabler of GLS projects, has a meaningful role in their performance.
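The DEA selection step can be sketched with the standard input-oriented CCR model in multiplier form: for each decision-making unit (DMU), maximize the weighted output subject to a normalized weighted input and the constraint that no DMU scores above 1 under the same weights. This is a generic sketch, not the paper's exact formulation, and the two-DMU example is invented; it uses SciPy's linear programming solver.

```python
import numpy as np
from scipy.optimize import linprog

def dea_ccr_efficiency(X, Y, o):
    """Input-oriented CCR efficiency of DMU `o`, given inputs X (n_dmu, n_in)
    and outputs Y (n_dmu, n_out): maximize u.y_o subject to v.x_o = 1 and
    u.y_j - v.x_j <= 0 for every DMU j, with u, v >= 0 (linprog's default)."""
    n, n_in = X.shape
    n_out = Y.shape[1]
    c = np.concatenate([-Y[o], np.zeros(n_in)])            # minimize -u.y_o
    A_ub = np.hstack([Y, -X])                              # u.y_j - v.x_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.zeros(n_out), X[o]])[None]   # v.x_o = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0])
    return float(-res.fun)

X = np.array([[1.0], [2.0]])   # one input per DMU (hypothetical projects)
Y = np.array([[1.0], [1.0]])   # one output per DMU
print(dea_ccr_efficiency(X, Y, 0))  # 1.0: DMU 0 is efficient
print(dea_ccr_efficiency(X, Y, 1))  # 0.5: same output from twice the input
```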
Affiliation(s)
- Mohammad Javad Ershadi
- Information Technology Department, Iranian Research Institute for Information Science and Technology (IRANDOC), Tehran, Iran
- Omid Qhanadi Taghizadeh
- Industrial Engineering Department, Science and Research Branch, Islamic Azad University, Tehran, Iran
5. Comparison of Ensemble Machine Learning Methods for Soil Erosion Pin Measurements. ISPRS International Journal of Geo-Information 2021. DOI: 10.3390/ijgi10010042.
Abstract
Although machine learning has been extensively used in various fields, it has only recently been applied to soil erosion pin modeling. To improve upon previous methods of quantifying soil erosion based on erosion pin measurements, this study explored the application of ensemble machine learning algorithms to the Shihmen Reservoir watershed in northern Taiwan. Three categories of ensemble methods were considered: (a) bagging, (b) boosting, and (c) stacking. The bagging methods in this study are bagged multivariate adaptive regression splines (bagged MARS) and random forest (RF), and the boosting methods are Cubist and gradient boosting machine (GBM). Finally, stacking is an ensemble method that uses a meta-model to combine the predictions of base models; this study used RF and GBM as the meta-models, with decision tree, linear regression, artificial neural network, and support vector machine as the base models. The dataset was sampled using stratified random sampling to achieve a 70/30 split between training and test data, and the process was repeated three times. The performance of the six ensemble methods in the three categories was analyzed based on the average of the three attempts. GBM performed the best among the ensemble models, with the lowest root mean square error (RMSE = 1.72 mm/year), the highest Nash–Sutcliffe efficiency (NSE = 0.54), and the highest index of agreement (d = 0.81). This result was confirmed by a spatial comparison of the absolute differences (errors) between model predictions and observations for GBM and RF in the study area. In summary, for the erosion pin dataset considered in this study, the bagging and boosting methods as groups performed equally well, and the stacking method ranked third.
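The stacking idea, a meta-model trained on base-model predictions, can be sketched with two toy base regressors and a linear meta-model. This is a deliberately simplified sketch on synthetic data: the paper uses RF and GBM meta-models rather than a linear one, and a faithful implementation would train the meta-model on out-of-fold base predictions to avoid leakage.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear regression with an intercept column."""
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=60)

# Base level: a linear model and a naive mean predictor.
base1 = fit_linear(X, y)
p1 = predict_linear(base1, X)
p2 = np.full(60, y.mean())

# Meta level: a second linear model learns how to combine base predictions.
meta = fit_linear(np.column_stack([p1, p2]), y)
stacked = predict_linear(meta, np.column_stack([p1, p2]))
print(float(np.sqrt(np.mean((y - stacked) ** 2))))  # in-sample RMSE, near noise level
```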
6. Evaluation of the SEdiment Delivery Distributed (SEDD) Model in the Shihmen Reservoir Watershed. Sustainability 2020. DOI: 10.3390/su12156221.
Abstract
The sediment delivery ratio (SDR) connects the weight of sediment eroded and transported from the slopes of a watershed to the weight that eventually enters the streams and rivers ending at the watershed outlet. For watershed management agencies, the estimation of annual sediment yield (SY) and sediment delivery has been a top priority because of the influence that sedimentation has on the holding capacity of reservoirs and the annual economic cost of sediment-related disasters. This study establishes the SEdiment Delivery Distributed (SEDD) model for the Shihmen Reservoir watershed using the watershed-wide SDRw and determines the geospatial distribution of the individual SDRi and SY in its sub-watersheds. Furthermore, this research considers the statistical and geospatial distribution of SDRi across two discretizations of the sub-watersheds in the study area and shows the probability density function (PDF) of the SDRi. Using the recursive method, the watershed-specific coefficient (β) of SDRi is 0.00515 for the Shihmen Reservoir watershed, and the mean SY of the entire watershed was determined to be 42.08 t/ha/year. Moreover, maps of the mean SY over 25 and 93 sub-watersheds are proposed for watershed prioritization in future research and remedial works. The outcomes of this study can improve future watershed remediation planning and sediment control through the implementation of geospatial SDRw/SDRi and the inclusion of sub-watershed prioritization in decision-making. Finally, sediment yield modeling can be further improved by increased on-site validation and by the use of aerial photogrammetry to deliver more up-to-date data that better reflect field conditions.
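In the SEDD model, each cell's delivery ratio decays exponentially with its travel time to the channel network, SDRi = exp(-β·ti), and the cell's sediment yield is its gross erosion scaled by SDRi. A sketch using the β = 0.00515 reported above; the travel times and erosion rate are invented illustrative values:

```python
import numpy as np

def sedd_sdr(travel_time, beta=0.00515):
    """SEDD cell-level sediment delivery ratio: SDR_i = exp(-beta * t_i),
    where t_i is the travel time from cell i to the nearest channel."""
    return np.exp(-beta * np.asarray(travel_time, dtype=float))

def sediment_yield(gross_erosion, travel_time, beta=0.00515):
    """Cell sediment yield = gross erosion (t/ha/year) * SDR_i."""
    return gross_erosion * sedd_sdr(travel_time, beta)

t = np.array([0.0, 50.0, 200.0])        # hypothetical travel times
print(np.round(sedd_sdr(t), 3))         # SDR = 1 at the channel, decaying upstream
print(np.round(sediment_yield(60.0, t), 1))
```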
7. Combined Generative Adversarial Network and Fuzzy C-Means Clustering for Multi-Class Voice Disorder Detection with an Imbalanced Dataset. Applied Sciences 2020. DOI: 10.3390/app10134571.
Abstract
The world has witnessed the success of artificial intelligence deployment for smart healthcare applications. Various studies have suggested that the prevalence of voice disorders in the general population is greater than 10%. An automatic diagnosis of voice disorders via machine learning algorithms is desired to reduce the cost and time needed for examination by doctors and speech-language pathologists. In this paper, a combination of a conditional generative adversarial network (CGAN) and an improved fuzzy c-means clustering (IFCM) algorithm, called CGAN-IFCM, is proposed for multi-class detection of three common types of voice disorders. The existing benchmark datasets for voice disorders, the Saarbruecken Voice Database (SVD) and the Voice ICar fEDerico II Database (VOICED), have imbalanced classes. The generative adversarial network provides synthetic data to reduce bias in the detection model, while the improved fuzzy c-means clustering considers the relationship between adjacent data points in the fuzzy membership function. To demonstrate the necessity of the CGAN and IFCM components, the algorithm is compared with and without the CGAN, and IFCM is compared against traditional fuzzy c-means clustering. The proposed CGAN-IFCM outperforms existing models in true negative rate and true positive rate by 9.9–12.9% and 9.1–44.8%, respectively.
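For reference, plain fuzzy c-means, the baseline that IFCM extends by additionally weighting neighboring points in the membership function, can be sketched as follows. This is the textbook algorithm on synthetic two-cluster data, not the paper's IFCM:

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=50, seed=0):
    """Textbook fuzzy c-means. Alternates between (1) computing cluster
    centers as membership-weighted means and (2) updating memberships from
    inverse distances; each point's memberships sum to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m                                           # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)) *
                   np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
    return centers, U

X = np.vstack([np.zeros((10, 2)), 5.0 * np.ones((10, 2))])   # two tight clusters
centers, U = fuzzy_c_means(X)
print(np.round(np.sort(centers[:, 0]), 1))  # centers near 0 and 5
```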