Enhancing LLM prompts with Keyphrase Trees and hierarchical topic extraction

Authors: Amr Hegazy, Mohamed Abdelkarim, Reem El Adawi, and Niranjan Sitapure

At Siemens EDA, we are focused on integrating different types of AI (such as Machine Learning, Reinforcement Learning, Generative, and Agentic AI) across the entire EDA workflow, and we are also dedicated to developing new AI approaches and algorithms relevant to EDA. As part of that effort, this article summarizes one recently developed approach to a common challenge: extracting meaningful insights from large volumes of complex documents, a task akin to finding needles in a haystack.

Enter Keyphrase Trees, a new method developed at Siemens EDA that revolutionizes how Large Language Models (LLMs) understand and extract key concepts from text. Building on advanced hierarchical clustering techniques, our enhanced Keyphrase Trees method takes natural language processing to the next level, demonstrating impressive results across multiple benchmark datasets and providing significant improvements in accuracy and domain independence.

In the context of Electronic Design Automation (EDA), the capabilities of Keyphrase Trees are particularly powerful, directly addressing the challenge of information overload. Consider the massive volumes of technical documentation generated in modern chip design. Instead of engineers manually sifting through datasheets and design manuals, Keyphrase Trees can automatically generate a hierarchical topic map. This allows an engineer to instantly pinpoint critical information, for example, tracing a specific power management feature across multiple documents.

When it comes to writing new specifications, the system can analyze existing requirement documents and legacy designs to extract core functionalities and constraints, ensuring nothing critical is overlooked. For verification and QA testing, Keyphrase Trees can process complex verification reports to hierarchically cluster failures and warnings, guiding engineers directly to the root cause of a bug. This accelerates debugging by uncovering hidden relationships between seemingly unrelated test results and the original design specifications, ultimately improving productivity and enabling a more intelligent, scalable approach to managing the complexity of semiconductor design.

Our team (Hegazy, Abdelkarim, and El Adawi from Siemens Digital Industries Software) presented our novel integration of hierarchical clustering techniques with LLM-driven keyphrase extraction at the Cognitive Models and Artificial Intelligence Conference. Our Keyphrase Trees approach provides structured, hierarchical input to LLMs, thereby enhancing both accuracy and consistency in keyphrase identification across diverse document types and domains. With access to IEEE publications, you can read the full paper here: “Keyphrase Trees: Enhancing LLM Prompts with Hierarchical Topic Extraction.”

The challenge: When AI struggles with structure

Traditional keyphrase extraction methods often approach text analysis like a word-counting machine rather than an intelligent reader. They might identify individual important words or phrases, but struggle to understand relationships between concepts or the overall narrative structure of documents. This limitation becomes particularly problematic when dealing with complex, multi-topic documents where context and semantic relationships are crucial for accurate understanding.

LLMs have made tremendous strides in understanding human language, but they face their own challenges. These systems can be highly sensitive to how information is presented to them: the quality of the prompt and the organization of the input can dramatically affect their performance. It is like having a brilliant conversation partner who can provide amazing insights, but only if you know exactly how to ask the right questions.

Keyphrase Trees: Building bridges between structure and understanding

Our research team recognized that the key to better keyphrase extraction lies in combining hierarchical thinking with AI-powered language understanding. Keyphrase Trees represent a fundamental shift in how we approach this challenge, creating a structured framework that enhances the natural capabilities of LLMs.

The foundation of our approach rests on a powerful insight: documents have natural hierarchical structures that mirror how humans organize and process information. By identifying and leveraging these structures, we can guide AI systems to focus their attention more effectively, leading to more accurate and contextually relevant keyphrase extraction.

Our methodology begins by breaking documents into manageable chunks, then analyzing them using advanced embedding techniques that capture semantic meaning and relationships. We apply hierarchical clustering to organize these text chunks into a tree-like structure, which serves as a roadmap for the LLM, helping it understand which sections are most closely related and how different concepts build upon one another.
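The pipeline described above (chunk, embed, cluster into a tree) can be illustrated with a minimal, standard-library-only sketch. It is not the implementation from the paper. Everything in it is an assumption made for illustration: the sentence-level chunking, the bag-of-words vectors standing in for a learned embedding model, the average-linkage merge rule, and all function names (chunk_text, embed, build_tree).

```python
import math
import re
from collections import Counter

def chunk_text(text):
    # Assumption: one chunk per sentence. A real system might use
    # fixed-size or overlapping windows instead.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def embed(chunk):
    # Stand-in embedding: bag-of-words counts. A learned embedding
    # model would replace this in practice.
    return Counter(re.findall(r"[a-z]+", chunk.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def leaves(node):
    # A node is either a chunk index (int) or a (left, right) tuple.
    if isinstance(node, int):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def build_tree(vectors):
    # Agglomerative clustering with average linkage: repeatedly merge
    # the two clusters whose chunks are most similar on average.
    if not vectors:
        raise ValueError("need at least one chunk")
    nodes = list(range(len(vectors)))

    def similarity(x, y):
        pairs = [(i, j) for i in leaves(x) for j in leaves(y)]
        return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(nodes) > 1:
        a, b = max(
            ((p, q) for p in range(len(nodes)) for q in range(p + 1, len(nodes))),
            key=lambda pq: similarity(nodes[pq[0]], nodes[pq[1]]),
        )
        merged = (nodes[a], nodes[b])
        nodes = [n for k, n in enumerate(nodes) if k not in (a, b)] + [merged]
    return nodes[0]

text = (
    "The power domain controls voltage scaling. "
    "Voltage scaling lowers power in the domain. "
    "The test bench reports timing failures. "
    "Timing failures appear in the test report."
)
chunks = chunk_text(text)
tree = build_tree([embed(c) for c in chunks])
print(tree)  # ((0, 1), (2, 3)): power chunks pair up, test chunks pair up
```

In this toy input, the two power-related sentences merge first and the two test-related sentences merge next, so the leaves of the resulting tree are the original chunks and the internal nodes group them by similarity.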

Figure 1: Comparison of scores on different benchmarks between our Keyphrase Trees approach and other state-of-the-art approaches.

Real-world impact and performance

The practical implications of Keyphrase Trees extend far beyond academic research. In our comprehensive evaluation using three widely recognized benchmark datasets—Inspec, SemEval 2010 and DUC 2001—our approach consistently achieved state-of-the-art results as shown in Figure 1, demonstrating effectiveness across different domains and document types. Figure 2 illustrates an example tree generated from a document in the Inspec benchmark, demonstrating how the leaf nodes contain the extracted chunks from the document and how they are hierarchically organized based on semantic similarity.

Figure 2: Example Tree Generated from one document from the Inspec benchmark. Note that the leaf nodes contain the chunks extracted from the document.

What makes these results particularly exciting is the consistency of performance across diverse content areas. Traditional keyphrase extraction methods often struggle when moving from one domain to another, requiring significant adjustments. Keyphrase Trees, however, maintain their effectiveness whether analyzing scientific papers, news articles, or technical documentation, making them truly domain independent.

The hierarchical structure offers several key advantages that directly translate into improved real-world performance. It helps LLMs maintain focus on relevant sections, reduces the likelihood of generating irrelevant keyphrases and enables the system to capture both local details and global themes within the same document.
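One way to picture how a tree can carry both local details and global themes into a prompt is to render it as an indented outline, with groups as internal nodes and chunks as leaves. The sketch below is a hypothetical illustration, not the prompt format used in the paper. The tree encoding (nested tuples of chunk indices), the function names render_outline and build_prompt, and the prompt wording are all assumptions.

```python
def render_outline(node, chunks, depth=0):
    # A node is either a chunk index (int) or a tuple of child nodes.
    indent = "  " * depth
    if isinstance(node, int):
        return [f"{indent}- {chunks[node]}"]
    lines = [f"{indent}+ group"]
    for child in node:
        lines += render_outline(child, chunks, depth + 1)
    return lines

def build_prompt(tree, chunks):
    # Hypothetical instruction text; a real prompt would be tuned
    # for the target LLM and task.
    outline = "\n".join(render_outline(tree, chunks))
    return (
        "The document below is organized as a tree of related chunks.\n"
        "Extract the keyphrases that best describe it.\n\n" + outline
    )

chunks = [
    "The power domain controls voltage scaling.",
    "Voltage scaling lowers power in the domain.",
    "The test bench reports timing failures.",
]
tree = ((0, 1), 2)  # chunks 0 and 1 are grouped; chunk 2 stands alone
print(build_prompt(tree, chunks))
```

Related chunks appear next to each other under a shared group, so the model sees the grouping through indentation rather than having to infer it from a flat block of text.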

Looking to the future

The success of Keyphrase Trees opens exciting possibilities for the future of natural language processing. As we continue to refine this approach, we envision applications extending into document summarization, content organization and intelligent information retrieval.

Our ongoing research focuses on optimizing clustering algorithms, exploring different embedding models and investigating real-time applications. The domain-independent nature of our approach suggests potential for analyzing any form of structured information where hierarchical relationships play a significant role.

With access to IEEE publications, you can read the full paper: Keyphrase Trees: Enhancing LLM Prompts with Hierarchical Topic Extraction.

At Siemens EDA, we continually conduct research and product development on various advanced AI algorithms tailored to different parts of the EDA workflow. These efforts all contribute to the new Siemens EDA AI system, which is a comprehensive, generative, and agentic AI platform designed to enable faster time-to-market and increase productivity for chip designers. To learn more about some of the different AI initiatives, please see here.
