
Blue Skies Ahead: An AI Case Study on LLM Use for a Graph Theory Related Application

A case study showing how LLMs accelerate graph theory development, reducing coding time from hours to minutes and boosting research productivity.

By Gerald Rigdon · May 12, 2025 · Analysis

Research activities require the exploration of ideas, which may involve significant software development, experimentation, and testing effort. The ability of Large Language Models (LLMs) to generate executable software has been demonstrated in various use cases. This study highlights the positive outcomes of a use case involving a graph theory application. Through a series of successful prompts, functional software was produced within minutes, a task that would otherwise have taken several hours or days to complete. Rather than focusing solely on cost savings from reduced engineering effort, this study aims to highlight the substantial opportunity benefits provided by this technology, especially since research typically involves evaluating multiple ideas before selecting the optimal solution.

Introduction

When considering great systems-thinking quotes, one from Donella H. Meadows stands out: “Remember, always, that everything you know, and everything everyone knows, is only a model. Get your model out there where it can be viewed. Invite others to challenge your assumptions and add their own.”

In the context of software engineering, one way to wrap your head around a complex code base is to create a static model of it and analyze its behavior accordingly. The classic book Safeware (Leveson, 1995) states: “Static analysis evaluates the software without executing it. Instead, it examines a representation of the software. In some ways, static analysis is more complete than dynamic analysis, since general conclusions can be drawn and not just conclusions limited to the particular test cases that were selected. On the other hand, static analysis necessarily is limited to evaluating a representation of a behavior rather than examining the behavior itself.”

A paper written several years ago (Rigdon, 2010) discusses the role of static analysis in medical device software development, providing insight into the process at Boston Scientific Corporation concerning the use of design constraints in software engineering activities. These design constraints, distinct from requirements, help ensure that detailed design and implementation remain within domain-specific boundaries set and enforced by static analysis activities. A concrete example comes from the following story, which also presents an opportunity to, as Meadows exhorted, “Get your model out there.”

A few weeks ago, a Boston Scientific colleague identified a concern emerging from the development of a new device platform and proposed creating another design constraint. The issue involved creating a new firmware utility function interface and ensuring developers used this new interface in specific contexts instead of the existing interface that had been in use for years. Integrating a new function into an existing code base generally conjures up images in the minds of software engineers, who immediately want to understand how this new piece fits into the larger whole. As stated so well in (Hopstock, 2022), “Being able to model the inter-procedural control flow as a call graph is one of the most important building blocks when analyzing programs. Many of the more advanced analyses depend on this information being available.”

Conveniently, as discussed in (Rigdon, Doshi, Zheng, 2010), the Boston Scientific implantable firmware development environment includes call graph data for static analysis, produced by a customized Static Analysis Tool (SA Tool). But this new design constraint required leveraging that graph data differently, necessitating the development of software to experiment with and determine its viability for this new application. It was estimated that, although the result was not likely to be a lot of code, it would still take three to four days of Python development and debugging given the esoteric nature of the task. Consequently, the task was assigned to an available developer. The developer's completed, working code then served as a baseline for this case study.

Pacemakers and defibrillators, namely, safety-critical Class III implantable devices, are custom machines that are typically programmable. This includes a set of parameters programmed at the time of manufacture, later at the time of device implant, and in post-operative clinical settings. These telemetry interactions are one example of activities that result in firmware utility function use. Returning to the colleague's proposal for another design constraint managing finer control of device firmware utility function use, the first order of business was to find all uses of the existing utility function, namely, utl_ParamUpdate(), and then:

  • Decide which cases should be updated to use the newly proposed firmware utility function
  • Implement the new utility function and refactor the existing firmware to make use of it
  • Enforce, by means of a design constraint, that future design and implementation use the proper utility function 

Interestingly, this software engineering problem highlights an opportunity for testing the viability of another type of model, a Large Language Model (LLM).

Identifying Existing Uses

Figure 1 is a snippet of a much larger call graph in this code base that shows all callers of the utl_ParamUpdate() firmware utility function under discussion. For each caller, encased in the bright pink rectangles, there is an exploded view available as shown in Figure 2 and Figure 3. These subsequent views could likewise contain the same bright pink visuals leading to further exploded views until arriving at one or more root nodes or graph origin nodes.


Figure 1: Snippet of the call graph showing the callers of utl_ParamUpdate()

Figure 2: Exploded view of the call graph

Figure 3: Exploded view of the call graph

It was expected that the utl_ParamUpdate() firmware utility function would be called in the following cases, identifiable by the graph root nodes:

  • During initialization of the firmware and following a reset
  • When commanded by an authenticated external device connected through telemetry
  • Dynamically, during firmware execution including monitoring and therapy delivery, etc.

Text searches such as grep are not very useful for discovering these call paths, because the goal is to find the root nodes, which, as stated, allows each path to be classified into the bulleted cases above. Granted, on a simple graph such searches can succeed. However, discovery becomes much more difficult when the call graph is both deep and wide or, as shown in Figure 3 above, contains multiple root node paths. Hence, a manual process would be more susceptible to yielding inaccurate results as graph complexity grows. It is a problem in search of an automated solution, especially since the call graphs themselves are auto-generated.

Hand-Crafted Code Creation

As stated earlier, this project commenced with an engineer writing software from some general requirements, using the call graph output CSV file generated by the SA Tool as input. Figure 4 is a terse example of a call graph file: a list of all file and function caller/callee pairs.

Figure 4: Example of a call graph file listing file and function caller/callee pairs


Each row in this graph CSV file represents two connected nodes in the call graph. In a directed graph, these connections are known as edges; each edge represents the relationship between the caller (source) node that invokes the callee (target) node. With a list of these caller/callee pairs, one has all the information necessary to build out a visual call graph. Further, with open-source tools like Graphviz, this information can be converted to the DOT graph description language and ultimately rendered in formats such as JPEG for displaying visual graphs.
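
To make this concrete, the following Python sketch converts caller/callee pairs into DOT text for Graphviz. It is not the SA Tool's own exporter; the CSV header names ("caller_function", "callee_function") and the file names are assumptions chosen for illustration.

import csv

# Minimal sketch: turn caller/callee rows from a call graph CSV into
# Graphviz DOT text. The header names are assumptions about the CSV layout.
def csv_to_dot(csv_path, dot_path):
    lines = ["digraph callgraph {"]
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            caller = row["caller_function"]
            callee = row["callee_function"]
            lines.append(f'    "{caller}" -> "{callee}";')
    lines.append("}")
    with open(dot_path, "w") as f:
        f.write("\n".join(lines) + "\n")

# The DOT file can then be rendered with Graphviz, for example:
#   dot -Tjpeg callgraph.dot -o callgraph.jpg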

This case study was based on a project whose graph CSV file consisted of over 3,500 rows. The primary software requirements are conveyed as follows:

Req1: The software shall use the graph CSV as an input file and produce a list of all caller root node functions and subsequent node edges that result in all possible unique directed graph paths that end in a specified callee function.

Req2: Each path in Req1 shall be output as a separate row in a destination CSV file where the edges between each node are represented visually with “->” characters.
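
To make Req1 and Req2 concrete, here is a minimal Python sketch of the required behavior. It is neither the hand-crafted script nor either LLM-generated script discussed below, and it assumes a simple two-column caller,callee layout without the file-name columns present in the real SA Tool output.

import csv
from collections import defaultdict

def unique_paths_to(csv_path, target, out_path):
    # Build a reverse adjacency map from the caller/callee pairs.
    callers_of = defaultdict(set)   # callee -> set of callers
    with open(csv_path, newline="") as f:
        for caller, callee in csv.reader(f):
            callers_of[callee].add(caller)

    # A root node is a function that is never called by another function.
    def is_root(fn):
        return not callers_of[fn]

    paths = []

    # Walk backwards from the target toward the roots, depth-first,
    # guarding against cycles with the 'seen' set (Req1).
    def walk(fn, suffix, seen):
        if is_root(fn):
            paths.append([fn] + suffix)
            return
        for caller in sorted(callers_of[fn]):
            if caller not in seen:
                walk(caller, [fn] + suffix, seen | {caller})

    walk(target, [], {target})

    # Each unique path becomes one row, nodes joined with "->" (Req2).
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for p in paths:
            writer.writerow(["->".join(p)])

# Hypothetical usage:
#   unique_paths_to("calls.csv", "utl_ParamUpdate", "paths.csv")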

Although the final product was not a large Python script, as had been anticipated, it still took three days to develop. This is not surprising, given that most developers fluent in Python are not well versed in writing and debugging graph code unless that is their primary job focus. So the initial development effort estimates were close. The developed Python code yielded an output file containing 106 unique paths. An excerpt capturing five of these paths is shown below in Figure 5.

Figure 5: Excerpt of five paths from the output of the hand-crafted script


Code Creation With OpenAI ChatGPT-4o

Now, with working source code and its generated output in hand, the question was whether an LLM could write a Python script that produces the same output. The study used a custom AI web application known as Blue Sky (a tribute to Jeff Lynne and the Electric Light Orchestra, and nothing to do with the social media application of the same name). Blue Sky offers several features; the one chosen was the basic chat with document selection, configured for connectivity to OpenAI GPT-4o. As shown in Figure 6, the graph CSV named “TestProject.bsci.calls.csv”, the file produced by the SA Tool for the project of interest, was uploaded.

Figure 6: The uploaded graph CSV “TestProject.bsci.calls.csv”


To produce the correct Python script, the GPT-4o LLM was provided with the prompts shown in Figure 7 (for brevity, the intermediate LLM responses are omitted). A close examination reveals some liberties were taken with the wording of the prompts compared to the previously described requirements, Req1 and Req2. Only three prompts were needed, taking less than fifteen minutes to generate the desired results. After each prompt, the recommended script was copied into a file and executed, and its output was compared against the output produced by the hand-crafted code until a perfect match was found.

Figure 7: Prompts provided to the GPT-4o LLM


The GPT-4o-generated Python script shown in Figure 8, although different from the hand-crafted script, was accepted as final when its output matched the output from the hand-crafted script. Figure 9 is an obfuscated, side-by-side difference report intended to show that the files are identical (no markups) rather than the specific content of the output.


Figure 8: The GPT-4o-generated Python script



Figure 9: Side-by-side difference report showing identical output files


Code Creation With Anthropic Haiku – Attempt 1

Unfortunately, the first attempt with the Anthropic Haiku LLM did not go very well. The strategy was to lead off with the same initial prompt given to OpenAI ChatGPT-4o and then respond accordingly to the LLM recommendations. Figure 10 captures all the prompts sent to Haiku (for brevity, the intermediate LLM responses are omitted) until it was decided to give up. Knowing when to quit and start over is not straightforward. When working with LLMs, results sometimes deteriorate, and it becomes apparent that things are beyond recovery. In these cases, it is best to clear the chat memory and get a fresh start.


Figure 10: All the prompts sent to Haiku in Attempt 1



Code Creation With Anthropic Haiku – Attempt 2

For the second attempt, a different prompt more aligned with the wording from Req1 was used. Figure 11 captures all the prompts leading up to success in this round.


Figure 11: The prompts leading up to code creation success with Haiku in Attempt 2



Figure 12 shows the Python script that generated output identical to that captured in Figure 9. Notably, the script in Figure 12 is quite different from the GPT-4o-generated script shown in Figure 8: it is smaller and takes advantage of the pandas and NetworkX Python packages.

Figure 12: The Haiku-generated Python script
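
The script itself is not reproduced here, but for illustration, a pandas/NetworkX solution to the same requirements might look roughly like the sketch below. It is not the code in Figure 12, and the “caller” and “callee” column names are assumptions about the CSV header.

import pandas as pd
import networkx as nx

def unique_paths_nx(csv_path, target, out_path):
    # Load the caller/callee pairs and build a directed graph.
    df = pd.read_csv(csv_path)
    g = nx.from_pandas_edgelist(df, source="caller", target="callee",
                                create_using=nx.DiGraph())

    # Root nodes have no incoming edges, i.e., they are never called.
    roots = [node for node, degree in g.in_degree() if degree == 0]

    # Enumerate every simple (cycle-free) path from each root to the target.
    rows = []
    for root in roots:
        for path in nx.all_simple_paths(g, source=root, target=target):
            rows.append("->".join(path))

    # One path per row in the destination CSV.
    pd.DataFrame(rows).to_csv(out_path, index=False, header=False)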



More on Prompting

Given that the ChatGPT-4o experiment captured in Figure 7 required three prompts, one might intuitively conclude that combining all three into a single prompt would work and be the most efficient method. In practice, the intuition proved true: the consolidated single prompt shown in Figure 13 produced a script that generated the correct output file.


Figure 13: The consolidated single prompt provided to ChatGPT-4o



However, given the Anthropic Haiku Attempt 1 failure after fourteen unsuccessful prompt responses shown in Figure 10, one might consider alternative prompt strategies altogether. A well-known research paper (Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou, 2022) draws a comparison between Standard Prompting and Chain-of-Thought Prompting. Instead of a conventional, open-ended approach to prompting that may intuitively feel “more conversational,” this research suggests that using a series of intermediate reasoning steps, likened to chains of logic, yields better LLM responses to queries involving complex reasoning. Figure 14 highlights the additional intermediate reasoning steps.


Figure 14: Chain-of-Thought prompt with additional intermediate reasoning steps



After a brief amount of time spent with the Chain-of-Thought prompting strategy in further experiments, it became clear that more learning is necessary. Notably, it was observed that over-engineering a prompt can result in poor responses; exploring this further is out of the scope of this study.

Testing Considerations

Testing is a critical component whether the script is manually written or generated by an LLM. When developing a script, it is essential to verify its functionality through testing. Typically, this involves using a reduced test code base, for example, code with known call graph data and expected results. The test design might include something like ten to twenty call paths, which are representative and sufficient to build confidence that the script will work correctly with a larger code base containing hundreds or thousands of call paths. Therefore, as a script is developed, it needs to be evaluated against the test code base. This testing approach is equally valid and necessary when an LLM is used to generate scripts from prompts. In this study, because the input file was a real project file rather than a test file, the output of the manually crafted script, containing the 106 unique paths specified by Req1 and Req2 for the actual project code base, served as the expected result against which the LLM-generated scripts' output was evaluated.
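
As a sketch of that evaluation step (the file names here are hypothetical), the check can be as simple as confirming that a generated script's output contains exactly the same set of paths as the baseline; sorting first makes the comparison insensitive to the order in which different scripts happen to emit the paths.

def outputs_match(expected_path, actual_path):
    # Compare two path-list files as sorted sets of rows, ignoring
    # ordering differences and blank lines.
    def load(path):
        with open(path) as f:
            return sorted(line.strip() for line in f if line.strip())
    return load(expected_path) == load(actual_path)

if __name__ == "__main__":
    # Hypothetical file names: the baseline from the hand-crafted script
    # versus the output of an LLM-generated script.
    ok = outputs_match("baseline_paths.csv", "llm_generated_paths.csv")
    print("MATCH" if ok else "MISMATCH")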

Conclusion

This case study compared a traditional software engineering approach (including web searches of Stack Overflow and similar forums, where one typically encounters partial solutions requiring programmer will and brute force to solve the problem) with an approach using LLMs. The comparison revealed that what took days could be accomplished in a few hours, even minutes. The findings are summarized in Figure 15:

Figure 15: Summary of the case study findings



As previously alluded to, the Chain-of-Thought experiments did not work so well. A couple of hours of additional work went into this excursion, where the threshold for declaring failure was set at the number of prompts it took to reach success in the standard prompt case. The big takeaway is that even if one spent an entire day experimenting with various prompts, for this case study it still would have reduced the overall engineering effort by two-thirds. And the upside is that time spent in prompt experimentation builds skills that will prove beneficial when confronted with the next problem in need of a solution.

Personal Thoughts

During a recent conference call on the use of LLMs, I was introduced to a book on how people confront change (Johnson, 2002). After discovering it was an Amazon best seller with millions of copies sold, I wondered how it had escaped my attention for so long. To summarize, the book is a quick read about mice and “Littlepeople” (creatures the size of mice that act like people) in a maze, and how they react when their respective cheese stores unexpectedly move to an unknown location. The book allows for multiple interpretations of its message, making it a malleable guide to self-reflection on how we react and adapt to ever-changing environments. Change may not always be for the better, but it sometimes is, and it is inevitable nevertheless. Echoing Meadows again, this story is another great example of an explanatory model.

As a product of the 1980s software engineering generation, I recall my hesitation when I was first introduced to the web search interface. After all, my bookcases were full of books that I had relied on for years to get my work done. However, I soon discovered the power of the internet, and there was no looking back. Admittedly, while I still hang on to many of these old texts, I rarely use them in my day-to-day work. In a metaphorical sense, web-only search has become my books of the past, and the LLM prompt is now the latest technology. Changing habits is not easy, however, especially when going straight to web search has become second nature. Therefore, I am actively retraining myself to make the LLM prompt my primary go-to option; other search activities, like my old books, are still there if I need them.

References

Rigdon, G. (2010, July). Static Analysis Considerations for Medical Device Firmware. Embedded Systems Conference Proceedings.

Rigdon, G., Doshi, H., Zheng, X. (2010, July). Static Analysis Considerations for Stack Usage. Embedded Systems Conference Proceedings.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Johnson, S. (1998, 2002). Who Moved My Cheese? G. P. Putnam’s Sons.

Hopstock, S. (2022, November). Call Graphs: The Bread and Butter of Program Analysis. guardsquare.com. https://www.guardsquare.com/blog/call-graphs-the-bread-and-butter-of-program-analysis

Leveson, N. (1995). Safeware. Addison-Wesley.

