<?xml version="1.0" encoding="UTF-8"?>
<record>
  <title>Unsupervised Semantic Analysis of Python FAQ Corpora: Topic Modeling, Similarity Detection, and Structural Optimization</title>
  <journal>Progress in Computing Applications</journal>
  <author>Maleerat Maliyaem</author>
  <volume>15</volume>
  <issue>1</issue>
  <year>2026</year>
  <doi>https://doi.org/10.6025/pca/2026/15/1/18-29</doi>
  <url>https://www.dline.info/pca/fulltext/v15n1/pcav15n1_2.pdf</url>
  <abstract>This study demonstrates how unsupervised machine learning techniques can audit and optimize technical
knowledge repositories through systematic semantic analysis of a curated Python FAQ corpus comprising
163 question-answer pairs. Addressing the challenge that over 80% of organizational data exists in
unstructured text form, we implement a transparent, eight-stage natural language processing pipeline
encompassing preprocessing, TF-IDF feature extraction, cosine similarity analysis, Latent Dirichlet Allocation
(LDA) topic modeling, Principal Component Analysis (PCA) dimensionality reduction, and network-based
co-occurrence analysis. Our methodology emphasizes contextualized preprocessing decisions, responding
to contemporary gaps in methodological transparency within organizational text mining research. Results
reveal five coherent thematic clusters corresponding to core Python programming concepts: language
fundamentals, data structures, syntax and control flow, function operations, and advanced mechanisms.
Cosine similarity analysis identifies non-trivial semantic overlap among FAQ entries, highlighting actionable
opportunities for content consolidation to reduce redundancy and improve retrieval efficiency. Network
analysis establishes &quot;python,&quot; &quot;function,&quot; &quot;list,&quot; and &quot;dictionary&quot; as high-centrality conceptual anchors
within the corpus topology. These findings translate into practical recommendations for technical
documentation management, including evidence-based FAQ merging, hierarchically organized navigation
schemas, and gap identification for underrepresented subject areas. By bridging unstructured text data with
structured organizational intelligence, this reproducible framework supports educational chatbot
development, curriculum design, and knowledge base optimization. The study underscores that rigorous,
contextually justified preprocessing combined with multi-perspective unsupervised analytics enables
researchers and practitioners to unlock significant value from complex text corpora while ensuring
methodological transparency and analytical validity in organizational research applications.</abstract>
</record>
