<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>FINA2350 Student Blog 2026 - Final Report</title><link href="https://buehlmaier.github.io/FINA2350-student-blog-2026-01/" rel="alternate"/><link href="https://buehlmaier.github.io/FINA2350-student-blog-2026-01/feeds/final-report.atom.xml" rel="self"/><id>https://buehlmaier.github.io/FINA2350-student-blog-2026-01/</id><updated>2026-05-03T00:00:00+08:00</updated><entry><title>Progression reflection on our NLP model by Group Project ICE</title><link href="https://buehlmaier.github.io/FINA2350-student-blog-2026-01/progression-reflection-on-our-nlp-model-by-group-project-ice.html" rel="alternate"/><published>2026-05-03T00:00:00+08:00</published><updated>2026-05-03T00:00:00+08:00</updated><author><name>FINA2350 Students 2026</name></author><id>tag:buehlmaier.github.io,2026-05-03:/FINA2350-student-blog-2026-01/progression-reflection-on-our-nlp-model-by-group-project-ice.html</id><summary type="html">&lt;p&gt;This post details the latest updates from Group Project ICE.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Financial data cleaning&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Following the plan from our first blog post, we addressed the missing-information issues in the S&amp;amp;P 500 financial data in two ways. First, when the primary label such as cost …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This post details the latest updates from Group Project ICE.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Financial data cleaning&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Following the plan from our first blog post, we addressed the missing-information issues in the S&amp;amp;P 500 financial data in two ways. First, when a primary label such as cost of revenue is missing, the pipeline looks for established alternative field names, such as cost of goods sold or cost of sales. Second, it derives missing values from accounting identities, for example computing gross profit as revenue minus cost of revenue. Together, these two changes resolved the field-name inconsistencies in our data collection and suggested that our initial plan had worked, or so we thought.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;gross&lt;/span&gt;\&lt;span class="n"&gt;_profit&lt;/span&gt; \&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;\&lt;span class="n"&gt;_income&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Gross Profit&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Gross Income&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gross&lt;/span&gt;\&lt;span class="n"&gt;_profit&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;cost&lt;/span&gt;\&lt;span class="n"&gt;_of&lt;/span&gt;\&lt;span class="n"&gt;_revenue&lt;/span&gt; \&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;\&lt;span class="n"&gt;_income&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                    &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="s2"&gt;&amp;quot;Cost Of Revenue&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Cost of Revenue&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="s2"&gt;&amp;quot;Reconciled Cost Of Revenue&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Cost Of Goods Sold&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="s2"&gt;&amp;quot;Cost Of Sales&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Reconciled Cost Of Goods Sold&amp;quot;&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;\&lt;span class="n"&gt;_of&lt;/span&gt;\&lt;span class="n"&gt;_revenue&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                    &lt;span class="n"&gt;gross&lt;/span&gt;\&lt;span class="n"&gt;_profit&lt;/span&gt; \&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; \&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;\&lt;span class="n"&gt;_of&lt;/span&gt;\&lt;span class="n"&gt;_revenue&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
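&lt;p&gt;The &lt;code&gt;g_income&lt;/code&gt; helper itself is not shown in the excerpt above. As a rough illustration only, a lookup of this kind could be sketched as follows. The function name &lt;code&gt;first_available&lt;/code&gt;, the plain-dictionary input, the &lt;code&gt;"Total Revenue"&lt;/code&gt; label, and the numbers are all assumptions made for this sketch, not our actual implementation.&lt;/p&gt;

```python
import math
from typing import Mapping, Optional


def first_available(fields: Mapping[str, Optional[float]], *names: str) -> Optional[float]:
    """Return the first non-missing value among the candidate field names.

    Hypothetical stand-in for a helper like g_income; None and NaN count as missing.
    """
    for name in names:
        value = fields.get(name)
        if value is not None and not math.isnan(value):
            return value
    return None


# Made-up income statement values for one company and one fiscal year.
col = {"Total Revenue": 1000.0, "Cost Of Goods Sold": 600.0, "Gross Profit": float("nan")}

revenue = first_available(col, "Total Revenue", "Revenue")
gross_profit = first_available(col, "Gross Profit", "Gross Income")
if gross_profit is None:
    # Fall back to alternative labels, then to the accounting identity.
    cost_of_revenue = first_available(
        col, "Cost Of Revenue", "Cost Of Goods Sold", "Cost Of Sales"
    )
    if revenue is not None and cost_of_revenue is not None:
        gross_profit = revenue - cost_of_revenue

print(gross_profit)
```

&lt;p&gt;Treating NaN as missing matters in a sketch like this because a field can be present in the source data but empty, which should still trigger the fallback.&lt;/p&gt;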

&lt;p&gt;Although these changes greatly improved the data, many fields were still missing after cleaning, which confused us at first. We eventually realized that the gaps were concentrated in financial sector firms and traced them to structural differences. A typical industrial company buys inputs and sells outputs while funding its operations with borrowings. Financial firms operate differently: their cash flows are driven by flows of loans and deposits rather than by capital expenditure cycles, in which a company invests in assets to increase production and generate cash flow. As a result, they do not report items such as free cash flow and capital expenditure. Because our screening criteria depend heavily on free cash flow, financial sector companies were breaking our pipeline structurally: the information never existed in the first place, so the problem went beyond the field-name inconsistency we had previously assumed. We therefore decided that the most appropriate resolution was to exclude them from the model’s candidate pool entirely, using the code below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sector&lt;/span&gt; \&lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Financials&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;tickers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
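&lt;p&gt;One way to spot this kind of sector-specific gap is to measure how often the required fields are empty in each sector before deciding what to exclude. The sketch below is illustrative only: the records, tickers, field names, and the &lt;code&gt;missing_share_by_sector&lt;/code&gt; function are made up for this example and are not taken from our dataset.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical records; tickers, sectors, and values are invented for illustration.
records = [
    {"ticker": "AAA", "sector": "Industrials", "free_cash_flow": 120.0, "capex": -30.0},
    {"ticker": "BBB", "sector": "Financials", "free_cash_flow": None, "capex": None},
    {"ticker": "CCC", "sector": "Financials", "free_cash_flow": None, "capex": -5.0},
    {"ticker": "DDD", "sector": "Health Care", "free_cash_flow": 80.0, "capex": -12.0},
]
required_fields = ["free_cash_flow", "capex"]


def missing_share_by_sector(rows, fields):
    """Return, per sector, the share of required field slots that are empty."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        for field in fields:
            total[row["sector"]] += 1
            if row.get(field) is None:
                missing[row["sector"]] += 1
    return {sector: missing[sector] / total[sector] for sector in total}


shares = missing_share_by_sector(records, required_fields)
print(shares)

# Same partition as the exclusion rule: keep non-financials, set financials aside.
tickers = [r["ticker"] for r in records if r["sector"] != "Financials"]
excluded = [r["ticker"] for r in records if r["sector"] == "Financials"]
print(tickers, excluded)
```

&lt;p&gt;A missing share that is high in one sector and close to zero elsewhere points to a structural reporting difference rather than a field-name problem.&lt;/p&gt;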

&lt;p&gt;This decision is further supported by the general infeasibility of financial sector firms as LBO candidates. They are already heavily leveraged by nature, their assets are predominantly financial instruments such as loans rather than operating assets that generate predictable cash flow, and regulatory capital requirements severely constrain post-acquisition restructuring. Financial sector firms were therefore excluded from the screening for both technical and financial reasons. In hindsight, we believe this was the right call: data completeness improved significantly, leaving only occasional missing fields whose impact is negligible.&lt;/p&gt;
&lt;p&gt;What we learnt from this experience went beyond technical coding issues. We originally assumed that data collection would be relatively simple and would take only a few lines of code. Instead, we learned that real-world financial data is inconsistent and does not always conform to a single structure, because businesses operate in diverse ways. Data standardization is therefore difficult for reasons that go beyond coding.&lt;/p&gt;
&lt;p&gt;The three figures below show data completeness before and after these changes, with empty fields in red.&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="https://buehlmaier.github.io/FINA2350-student-blog-2026-01/images/Group-ProjectICE_02_figure1.png"&gt;
&lt;strong&gt;Figure 1 -&lt;/strong&gt; Data completeness, before (I)&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="https://buehlmaier.github.io/FINA2350-student-blog-2026-01/images/Group-ProjectICE_02_figure2.png"&gt;
&lt;strong&gt;Figure 2 -&lt;/strong&gt; Data completeness, before (II)&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="https://buehlmaier.github.io/FINA2350-student-blog-2026-01/images/Group-ProjectICE_02_figure3.png"&gt;
&lt;strong&gt;Figure 3 -&lt;/strong&gt; Data completeness, after&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Merging the NLP and Financial Layers&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Another major update was combining the NLP output with the financial dataset. This was necessary because the project was never meant to be only a text model or only a financial ratio screen. The goal was to identify companies that show signs of strategic stagnation in their filings, while also having financial characteristics that make them worth reviewing as possible LBO candidates.&lt;/p&gt;
&lt;p&gt;At this stage, the project had two separate layers. The first was the NLP layer, which contained company-level stagnation metrics extracted from filings, including innovation decay, strategic decay, and topic rigidity. These variables were designed to capture whether a company’s language was becoming less dynamic over time. The second was the financial layer, which contained structured financial metrics such as interest coverage, debt-to-EBITDA, and free cash flow. These variables were not the main signal of the project, but they were necessary because an LBO candidate must be financially feasible. A company can look interesting from a language perspective, but if it cannot support debt or generate cash flow, it is not very useful as an LBO candidate.&lt;/p&gt;
&lt;p&gt;The problem was that these two datasets were created separately. The NLP file was organized around filing-level or company-level textual outputs, while the financial file was organized around accounting variables and financial years. Before producing a final ranking, we had to make both layers compatible. &lt;/p&gt;
&lt;p&gt;The most important step was standardizing ticker symbols. If one dataset has &lt;code&gt;"aapl "&lt;/code&gt; and the other has &lt;code&gt;"AAPL"&lt;/code&gt;, Python treats them as different values. This kind of small formatting issue can cause companies to disappear during the merge. So before combining the datasets, we stripped whitespace and converted all tickers to uppercase.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Standardize ticker symbols in both datasets&lt;/span&gt;
&lt;span class="n"&gt;nlp_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ticker&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nlp_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ticker&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;fin_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ticker&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fin_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ticker&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
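&lt;p&gt;To make the effect concrete, the small sketch below compares how many tickers match before and after normalization. It uses plain Python lists instead of DataFrames, and the ticker lists and the &lt;code&gt;normalize_ticker&lt;/code&gt; helper are assumptions made for illustration.&lt;/p&gt;

```python
def normalize_ticker(raw: str) -> str:
    """Strip surrounding whitespace and uppercase a ticker symbol."""
    return raw.strip().upper()


# Made-up ticker lists standing in for the NLP and financial datasets.
nlp_tickers = ["aapl ", "MSFT", " ko"]
fin_tickers = ["AAPL", "MSFT", "KO"]

# Without normalization only the exact string matches survive.
raw_overlap = set(nlp_tickers).intersection(fin_tickers)

# After normalization every company is matched.
clean_overlap = {normalize_ticker(t) for t in nlp_tickers}.intersection(
    normalize_ticker(t) for t in fin_tickers
)

print(sorted(raw_overlap))
print(sorted(clean_overlap))
```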

&lt;p&gt;We also had to make sure the financial dataset had only one row per company. Since financial data can contain multiple years for the same ticker, we kept the latest available financial year for each company. This allowed the final ranking to compare one NLP stagnation score with one financial feasibility score per company.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Keep the latest financial year for each company&lt;/span&gt;
&lt;span class="n"&gt;fin_latest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;fin_df&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ticker&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;year&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ticker&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;as_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
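&lt;p&gt;An equivalent way to keep the latest year, together with a check that each ticker really appears only once, might look like the sketch below. The sample rows and the &lt;code&gt;free_cash_flow&lt;/code&gt; column name are made up for illustration, and &lt;code&gt;drop_duplicates&lt;/code&gt; is shown as an alternative to the &lt;code&gt;groupby&lt;/code&gt; approach above, not as the code we ran.&lt;/p&gt;

```python
import pandas as pd

# Made-up financial rows with several years per ticker.
fin_df = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB", "CCC"],
    "year": [2023, 2024, 2022, 2023, 2024],
    "free_cash_flow": [10.0, 12.0, 5.0, 7.0, 3.0],
})

# Sort so the latest year comes last within each ticker, then keep that last row.
fin_latest = (
    fin_df
    .sort_values(["ticker", "year"])
    .drop_duplicates(subset="ticker", keep="last")
    .reset_index(drop=True)
)

# Guard against duplicates slipping through before any merge.
assert fin_latest["ticker"].is_unique
print(fin_latest)
```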

&lt;p&gt;After that, we merged the NLP layer with the latest financial layer using ticker as the common identifier.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Merge NLP stagnation metrics with financial feasibility metrics&lt;/span&gt;
&lt;span class="n"&gt;merged_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nlp_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;fin_latest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ticker&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;inner&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;NLP companies:&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nlp_df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Financial companies:&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fin_latest&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Merged companies:&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;merged_df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
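&lt;p&gt;To see which companies an inner merge drops, and from which side, one option is to run an outer merge with the &lt;code&gt;indicator&lt;/code&gt; flag first and inspect the result. This is a sketch under assumptions: the example frames, tickers, and metric columns are invented, and we did not necessarily audit our own merge this way.&lt;/p&gt;

```python
import pandas as pd

# Made-up frames; tickers and metric values are for illustration only.
nlp_df = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC"], "topic_rigidity": [0.8, 0.4, 0.6]})
fin_latest = pd.DataFrame({"ticker": ["BBB", "CCC", "DDD"], "debt_to_ebitda": [2.1, 3.5, 1.2]})

# The outer merge keeps every ticker; the _merge column records where each row came from.
# validate="one_to_one" raises an error if either side has duplicate tickers.
audit = nlp_df.merge(
    fin_latest,
    on="ticker",
    how="outer",
    indicator=True,
    validate="one_to_one",
)

nlp_only = sorted(audit.loc[audit["_merge"] == "left_only", "ticker"])
fin_only = sorted(audit.loc[audit["_merge"] == "right_only", "ticker"])

# Rows present on both sides are exactly what an inner merge would return.
merged_df = audit[audit["_merge"] == "both"].drop(columns="_merge")

print("Only in NLP data:", nlp_only)
print("Only in financial data:", fin_only)
print("Merged companies:", len(merged_df))
```

&lt;p&gt;Listing the unmatched tickers on each side makes it easier to tell a genuine data gap from a formatting mismatch that normalization should have caught.&lt;/p&gt;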

&lt;p&gt;We used an inner merge because we only wanted companies that had both usable NLP data and usable financial data. This reduced the final universe, but it made the results cleaner. In our larger run, the NLP file had more than 300 companies, but after merging with the clean financial dataset, the final ranking universe shrank to 119 companies. We initially saw this as a failure, because we assumed that a good model should evaluate as many companies as possible. However, the excluded companies were dropped because they lacked the NLP or financial data needed to evaluate them properly, and including them in the final ranking would only have produced numbers with no practical meaning. So rather than a failure, the smaller universe represents a clean overlap between the textual and financial layers.&lt;/p&gt;</content><category term="Final Report"/><category term="Group Project ICE"/></entry></feed>