Use of Large Language Model Embeddings to Predict Research Topic Suitability Based on Organizational Capabilities

March 2024, Greg Bacon and Vineetha Menon, IEEE SoutheastCon 2024

Abstract

We performed a pilot study on the use of large language model technology to help researchers in industry and academia identify prospective opportunities to pursue for funding or grant awards, especially those that they might otherwise overlook due to reading volume, time pressure, and non-obvious connections. Our goal is to help researchers offload some of the burden to technology. As a use case, we query a recent Department of Defense (DoD) Small Business Innovation Research (SBIR) solicitation with natural language inputs in the form of real-world marketing documents and abstract areas of relevance. We experiment with clustering algorithms to determine which best use embeddings to predict solicitation topics that human team members would recommend for proposal. Investigation into this nascent yet practical application of technology will move toward human-centric automation and personalization of results through human reinforcement learning.

Contribution

Our LLM-based embedding clustering approach performed at a level of expertise similar to that of undergraduate research interns: some hits, some misses, and some thought-provoking selections. Multiclass spectral clustering stood alone in its performance on clustering topics with the capabilities statement. For clusters focused on relevance queries, formulating performance metrics was challenging. Acting as experts, we used subjective “eye tests” to judge whether the LLM embedding framework was identifying the expected clusters for each topic. Even so, the clusters and intersections appeared mostly reasonable, albeit with some surprises, such as the physiological topics that matched cybersecurity.

We introduced the heuristic threshold

$$ t_k = \frac{\max d_k - \min d_k}{\sigma_{d_k}} $$

where $d_k$ are the Euclidean distances of embeddings from cluster $k$’s centroid and $\sigma_{d_k}$ is their standard deviation.
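
As a rough illustration of how the heuristic threshold could be computed for a single cluster, the following is a minimal sketch rather than the authors' implementation. The function name `heuristic_threshold` is hypothetical, the centroid is assumed to be the arithmetic mean of the cluster's embeddings, and the population standard deviation is assumed because the definition above does not say whether the population or sample form is intended.

```python
import math
import statistics


def heuristic_threshold(embeddings):
    """Return (max d_k - min d_k) / sigma_{d_k} for one cluster.

    `embeddings` is a list of equal-length numeric vectors belonging to
    the same cluster. Assumptions (not stated in the source): the centroid
    is the per-dimension mean, and sigma is the population standard
    deviation of the distances.
    """
    n = len(embeddings)
    dim = len(embeddings[0])

    # Centroid as the per-dimension mean of the cluster's embeddings.
    centroid = [sum(vec[i] for vec in embeddings) / n for i in range(dim)]

    # Euclidean distance of each embedding from the centroid (the d_k values).
    distances = [math.dist(vec, centroid) for vec in embeddings]

    sigma = statistics.pstdev(distances)
    if sigma == 0.0:
        # All embeddings are equidistant from the centroid, so the ratio
        # is undefined.
        raise ValueError("standard deviation of distances is zero")

    return (max(distances) - min(distances)) / sigma


# Toy 2-D "embeddings": distances from the centroid (0, 0) are 1, 1, 3, 3.
example_cluster = [[1.0, 0.0], [-1.0, 0.0], [0.0, 3.0], [0.0, -3.0]]
print(heuristic_threshold(example_cluster))  # 2.0
```

In the toy cluster, the distances span a range of 2 and have a standard deviation of 1, so the threshold is 2. Because the range is divided by the standard deviation, the value is unitless and does not change if all embeddings in the cluster are rescaled by the same factor.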