Technical debt is the implied cost of rework caused by choosing an easier, quicker solution now instead of a better approach that would take longer. While it can speed up initial delivery, this "debt" slows down future development, increases costs, and lowers software quality over time, similar to financial debt. It can be incurred intentionally for short-term gains or unintentionally through poor practices, and includes issues like inadequate code, design, or documentation.

Causes of technical debt:

Fowler's thoughts:

Thoughts

High-level categories

Metrics

Tracking technical debt with metrics: measuring debt regularly is essential for maintaining the health and sustainability of a software project.


Tech Debt Index (TDI)

Measurement: Metric (Goal will depend on how the metric is calculated)

A composite metric that combines various indicators of technical debt into a single index.

Mechanism:

Combine the chosen technical debt metrics (like code complexity, bug count, duplication, and test coverage) into a single weighted index; one possible calculation is sketched below.

Benefits:

Risks:

The construction of the index can be subjective, depending on which metrics are included.
May oversimplify complex aspects of technical debt.
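
To make the mechanism concrete, here is a minimal sketch of one way to build such an index. Every metric name, normalisation bound, and weight below is an assumption chosen for illustration, not a standard; tune them to your own context.

```python
# Hypothetical weighted Tech Debt Index: names, bounds, and weights are illustrative.

RAW_METRICS = {
    "avg_cyclomatic_complexity": 14.0,  # higher is worse
    "open_defect_count": 42,            # higher is worse
    "duplication_pct": 9.5,             # higher is worse
    "test_coverage_pct": 61.0,          # higher is better
}

# (worst, best) bounds used to scale each metric to 0..1, where 1 = most debt.
BOUNDS = {
    "avg_cyclomatic_complexity": (30.0, 1.0),
    "open_defect_count": (200, 0),
    "duplication_pct": (25.0, 0.0),
    "test_coverage_pct": (0.0, 100.0),
}

WEIGHTS = {
    "avg_cyclomatic_complexity": 0.3,
    "open_defect_count": 0.3,
    "duplication_pct": 0.2,
    "test_coverage_pct": 0.2,
}

def normalise(value, worst, best):
    """Scale a raw value to 0..1, where 1 means maximum debt."""
    span = worst - best
    score = (value - best) / span if span else 0.0
    return max(0.0, min(1.0, score))

def tech_debt_index(raw):
    """Weighted sum of normalised metrics, 0 (healthy) to 100 (heavy debt)."""
    total = sum(WEIGHTS[k] * normalise(raw[k], *BOUNDS[k]) for k in raw)
    return round(100 * total / sum(WEIGHTS.values()), 1)

print(tech_debt_index(RAW_METRICS))  # 35.1 for the sample numbers above
```

Normalising each metric before weighting keeps any single large raw number (such as defect count) from dominating the index, but the subjectivity risk noted above still applies to the choice of bounds and weights.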


Tech Debt Ratio (TDR)

Goal: Low Measurement: percentage

TDR is a measure of the cost to fix the technical debt relative to the size of the codebase. Use TDR to assess and communicate the overall health of the codebase to both technical and non-technical stakeholders.

Mechanism:

  1. Identifying components. Decide which parts of the application are to be evaluated for technical debt.
  2. Code analysis. Use static code analysis tools suitable for the programming languages and frameworks used in the fintech app (e.g., ESLint for JavaScript, SonarQube for a range of languages).
  3. Calculating remediation cost. Estimate the time required to address the code issues identified. Consider the complexity of the fintech environment, which might require specialized skills. Assume an average cost per developer hour specific to fintech development expertise (e.g., $80/hour). Example: If it takes 15 hours to refactor the complex code and update libraries, the cost is 15 hours * $80/hour = $1,200.
  4. Calculating development cost. Estimate the total cost of developing the app or its specific modules, including design, coding, testing, deployment, and any other expenses. Example: The development cost of the investment portfolio manager might be $30,000.
  5. Calculating Technical Debt Ratio. Technical Debt Ratio = (Remediation Cost / Development Cost) * 100 (see the sketch after this list).
    Calculation: For the investment portfolio manager, the ratio is ($1,200 / $30,000) * 100 = 4%.
  6. Interpretation and action. Interpretation: A 4% technical debt ratio indicates a moderate level of debt. Action: Prioritize fixing critical issues, especially those related to security, data integrity, and regulatory compliance.
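
A minimal sketch of steps 3–5 above, using the same illustrative figures from the example (15 hours, $80/hour, $30,000 development cost):

```python
# Technical Debt Ratio from the worked example above; the hour estimate,
# hourly rate, and development cost are the illustrative figures, not real data.

def technical_debt_ratio(remediation_cost, development_cost):
    """TDR = (remediation cost / development cost) * 100, as a percentage."""
    return remediation_cost / development_cost * 100

remediation_hours = 15      # estimated effort to refactor and update libraries
hourly_rate = 80            # assumed cost per developer hour
development_cost = 30_000   # estimated cost of the investment portfolio manager

remediation_cost = remediation_hours * hourly_rate  # $1,200
print(f"TDR: {technical_debt_ratio(remediation_cost, development_cost):.1f}%")  # 4.0%
```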

Benefits:

Limitations:


Defect count

Goal: Low Measurement: Instances of defects within a system

This metric tracks the number of open bugs. High numbers can indicate underlying quality issues contributing to technical debt.

Mechanism:

Use a defect tracker or another system of record (even a spreadsheet) to count open bugs.

Benefits:

Risks:


Defect age

Goal: Low Measurement: Time per defect

Bugs that have been open for a long time can indicate that technical debt is being neglected. This can lead to a deteriorating codebase and user experience.

Mechanism:

Use a defect tracker or another system of record (even a spreadsheet); each defect report must carry a creation timestamp so its age can be computed.
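
A minimal sketch, assuming open defects have been exported with their creation timestamps (the field names and dates are made up):

```python
from datetime import datetime, timezone

# Hypothetical export from a defect tracker: id plus the "opened" timestamp.
open_defects = [
    {"id": "BUG-101", "opened": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"id": "BUG-117", "opened": datetime(2024, 3, 22, tzinfo=timezone.utc)},
]

now = datetime.now(timezone.utc)
ages_days = [(now - d["opened"]).days for d in open_defects]

print(f"average open-defect age: {sum(ages_days) / len(ages_days):.1f} days")
print(f"oldest open defect: {max(ages_days)} days")
```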

Benefits:

Risks:


Defect growth rate

Goal: Low, preferably negative Measurement: instance delta (number of new bugs reported versus the number of bugs that have been resolved or closed over a specific period)

Mechanism:

Track the number of new bugs reported and the number of bugs closed in your issue tracking or project management software over regular intervals (e.g., weekly, monthly).
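
A minimal sketch, assuming weekly reported/closed counts have already been pulled from the tracker (the numbers are invented):

```python
# Hypothetical weekly counts exported from an issue tracker.
weekly_counts = [
    {"week": "2024-W10", "reported": 12, "closed": 9},
    {"week": "2024-W11", "reported": 8,  "closed": 11},
    {"week": "2024-W12", "reported": 15, "closed": 10},
]

for w in weekly_counts:
    delta = w["reported"] - w["closed"]          # positive = backlog growing
    trend = "growing" if delta > 0 else "shrinking or flat"
    print(f"{w['week']}: net {delta:+d} open defects ({trend})")
```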

Benefits:

Risks:


Defect ratio, Defect density

Goal: Low Measurement: number of defects in a software system relative to its size

Helps in tracking the quality of the software over time, indicating whether technical debt is causing an increase in defects. Can guide decisions about where to focus development efforts, particularly in identifying areas of the codebase that may be contributing disproportionately to the defect count. Assists in prioritizing technical debt reduction strategies, especially when correlated with other metrics like code complexity or code churn.

Applied to whole systems, this is essentially a bug count. Applied to individual files or components, it highlights (heatmaps) the 20% of the code where 80% of the defects live.

Benefits:

Limitations:

Mechanism:
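
One way to compute it, sketched with invented per-file defect counts and line counts (in practice these come from the tracker and a line-counting tool):

```python
# Hypothetical per-file defect counts and sizes; density = defects per KLOC.
defects_per_file = {"payments/ledger.py": 14, "ui/forms.py": 3, "api/rates.py": 9}
loc_per_file = {"payments/ledger.py": 2_400, "ui/forms.py": 5_100, "api/rates.py": 800}

density = {
    path: defects_per_file[path] / loc_per_file[path] * 1000
    for path in defects_per_file
}

# Highest density first: a simple "heatmap" of where defects cluster.
for path, d in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{path}: {d:.1f} defects / KLOC")
```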


Cycle Time

Goal: Low Measurement: Time

Measures the time it takes for a unit of work to be completed from start to finish. Unlike throughput, which measures the amount of work completed in a given period, cycle time measures how long each item takes. This metric is useful for understanding how long it takes to complete a specific task and for identifying bottlenecks or inefficiencies in the development process.


Lead time for changes, Time to Market (TTM)

Goal: Low Measurement: Time

Lead time for changes is the time it takes to go from code committed to code successfully running in production. TTM is typically customer-facing and operates at a larger scale than a single feature: it also accounts for the time it takes to plan, prioritize, and start work on a unit of work, not just the time to complete it. Both metrics measure the speed and agility of your software delivery process and help you understand how long it takes to deliver new features or improvements to customers.

Mechanism:

For TTM, measure the duration from when a feature is planned to when it is available to users; for lead time for changes, measure from commit to successful deployment in production.

Benefits:

Risks:


Code Churn

Goal: Low Measurement: percentage of a developer's own code that is edited (added, modified, deleted) shortly after being written.

Code churn can be indicative of indecision, lack of clarity, or misunderstanding – all of which contribute to technical debt. High code churn, especially late in a development cycle, can be a red flag that code may be less reliable or harder to maintain. High churn may indicate unstable or problematic areas in the codebase. Helps identify features or modules that are frequently changed and might be accumulating debt.

Mechanism:
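
One possible mechanism, sketched below, is to mine recent git history; the 30-day window and top-10 cutoff are arbitrary choices, and this approximates churn as change volume per file rather than the strict "rewritten shortly after being written" definition.

```python
import subprocess
from collections import defaultdict

# Lines added + deleted per file over the last 30 days, from `git log --numstat`.
log = subprocess.run(
    ["git", "log", "--since=30 days ago", "--numstat", "--pretty=format:"],
    capture_output=True, text=True, check=True,
).stdout

churn = defaultdict(int)
for line in log.splitlines():
    parts = line.split("\t")
    if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
        added, deleted, path = int(parts[0]), int(parts[1]), parts[2]
        churn[path] += added + deleted

# Files with the most recent change volume are likely churn hotspots.
for path, lines in sorted(churn.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{path}: {lines} lines changed")
```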

Benefits:

Limitations:


Development Throughput, Velocity

Goal: High Measurement: Instances (stories, features, etc) within time

Throughput measures how much work your team is able to complete in a specific period of time. Tracking it shows progress and how well your processes are working, and gives you a pulse on the health of the development process; by keeping an eye on throughput, you can make changes to improve the efficiency of your team's work.

Mechanism:

  1. Define the unit of work. The first step is to define what constitutes a unit of work in your development process. For example, you may decide that a unit of work is a completed user story, a fixed bug, or a feature that meets acceptance criteria. Whatever you choose, it's important that the definition of a unit of work is consistent and clear for everyone on the team.
  2. Track progress. Next, track how many units of work are being completed in a given time period. For example, you might track the number of completed user stories over the course of a sprint, or the number of fixed bugs over the course of a week.
  3. Calculate the throughput. Finally, calculate the throughput by dividing the number of units of work completed by the amount of time it took to complete them. For example, if your team completed 20 user stories in a two-week sprint, the throughput is 20 user stories / 2 weeks. Important: you can't derive a weekly throughput by simply dividing that number by 2; to get weekly figures, count the items completed in each week separately (the sketch after the example below does exactly that).

Example: a development team of 5 people completed a total of 20 user stories in 4 weeks. The team's throughput would be 20 user stories / 4 weeks.
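
A minimal sketch that counts each week as a separate entity, assuming completed items have been exported with completion dates (the items and dates are made up):

```python
from collections import Counter
from datetime import date

# Hypothetical completed work items with their completion dates.
completed = [
    {"id": "STORY-1", "done": date(2024, 4, 1)},
    {"id": "STORY-2", "done": date(2024, 4, 3)},
    {"id": "STORY-3", "done": date(2024, 4, 9)},
    {"id": "STORY-4", "done": date(2024, 4, 11)},
    {"id": "STORY-5", "done": date(2024, 4, 12)},
]

# Count each ISO week separately rather than dividing a sprint total.
per_week = Counter(item["done"].isocalendar()[:2] for item in completed)
for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} items completed")
```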

Benefits:

Risks:


Pull Request Size

Goal: Low Measurement: bytes, lines of code, # of files

Pull request size refers to the extent of code changes introduced within a single pull request, typically measured by the number of files modified and lines of code added or removed.

Measuring pull request size helps streamline code reviews, accelerate feedback cycles, and improve code quality by breaking down changes into manageable units and facilitating efficient collaboration among team members.


Code Duplication

Goal: Low Measurement: percentage

Code duplication often leads to maintenance challenges. If a bug is found in a piece of duplicated code, it needs to be fixed in all instances. This increases the likelihood of missed defects and future issues.

Mechanism:

Analyze the codebase using static code analysis tools to identify and quantify duplicated blocks of code.
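
A naive sketch of the idea behind such tools: hash every window of N consecutive non-blank lines and report any window that appears more than once. Real analyzers are far more sophisticated, and the source directory and window size here are arbitrary assumptions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

WINDOW = 6  # minimum block size, in lines, to count as duplication

def block_hashes(path):
    """Yield (hash, location) for every WINDOW-line block in a file."""
    lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
    for i in range(len(lines) - WINDOW + 1):
        digest = hashlib.sha1("\n".join(lines[i:i + WINDOW]).encode()).hexdigest()
        yield digest, (str(path), i + 1)

seen = defaultdict(list)
for source in Path("src").rglob("*.py"):   # assumed source tree
    for digest, location in block_hashes(source):
        seen[digest].append(location)

duplicates = {h: locs for h, locs in seen.items() if len(locs) > 1}
print(f"{len(duplicates)} duplicated {WINDOW}-line blocks found")
```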

Benefits:

Risks:


Cyclomatic Complexity

Goal: Low Measurement: instances of linearly independent paths through a unit of code

High cyclomatic complexity indicates code that may be harder to test thoroughly, harder to read and understand, and more prone to defects, increasing the likelihood of introducing errors during future changes. Use this metric to identify complex code that may be harder to maintain, and analyze it at the method, class, and module level to pinpoint the areas that would benefit most from refactoring.

Mechanism:

Use static analysis tools to calculate the number of linearly independent paths through the code.
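
As a concrete example, the radon package is one static analysis option for Python code; the file path and threshold below are illustrative assumptions.

```python
from pathlib import Path

from radon.complexity import cc_visit  # pip install radon

THRESHOLD = 10  # a common "refactor candidate" cutoff, not a hard rule

source = Path("payments/ledger.py").read_text()  # hypothetical module
for block in cc_visit(source):  # one entry per function/method/class
    if block.complexity > THRESHOLD:
        print(f"{block.name} (line {block.lineno}): complexity {block.complexity}")
```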

Benefits:

Risks:


Test Coverage

Goal: High Measurement: percentage over lines of code and/or files

Test coverage measures the extent to which the code, including any modified code, is exercised by automated tests. Higher coverage gives more confidence in changes and reduces the risk of regressions, and coverage analysis helps ensure code quality and stability. While high test coverage is not a guarantee of code quality, inadequate coverage can mean the codebase is not well protected against regressions, making it riskier to address technical debt through refactoring.

Mechanism:

Use testing tools and frameworks to calculate the percentage of code executed during testing.
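
A minimal sketch using coverage.py's API; in practice most teams use the `coverage run` / `coverage report` CLI or the pytest-cov plugin, and the package and test-directory names here are assumptions.

```python
import coverage
import pytest

cov = coverage.Coverage(source=["payments"])  # hypothetical package to measure
cov.start()
pytest.main(["tests/"])      # run the suite while coverage is recording
cov.stop()
cov.save()

total_pct = cov.report()     # prints a per-file table and returns the total %
print(f"total coverage: {total_pct:.1f}%")
```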

Benefits:

Risks:


Deployment Frequency

Goal: High Measurement: Instances within time

Deployment frequency is the number of times code is deployed to production in a given period of time. This metric is important for measuring the speed and agility of a software delivery process.

For example, if your team is able to deploy code to production frequently, it can lead to a faster feedback loop and quicker response to issues or changes. This can lead to a better product and increased user satisfaction.

A high deployment frequency indicates that teams are able to deliver changes quickly, responding to the needs of the business.


Change Failure Rate

Goal: Low Measurement: percentage of changes (e.g., deployments or code commits) that fail and require immediate remediation, such as a hotfix or rollback

Mechanism:

Track the success and failure of each deployment or code change, and compute the percentage that required immediate remediation.
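
A minimal sketch, assuming deployment records have been tagged with whether they needed remediation (the field names are hypothetical):

```python
# Hypothetical deployment log; "failed" means a hotfix or rollback was needed.
deployments = [
    {"id": "deploy-201", "failed": False},
    {"id": "deploy-202", "failed": True},
    {"id": "deploy-203", "failed": False},
    {"id": "deploy-204", "failed": False},
]

failures = sum(1 for d in deployments if d["failed"])
print(f"change failure rate: {failures / len(deployments) * 100:.1f}%")  # 25.0%
```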

Benefits:

Risks:


Number of Failed CI/CD Events

Goal: Low Measurement: instances (number of times continuous integration (CI) or continuous deployment (CD) processes fail.)

Mechanism:

Benefits:

Risks:


Mean time to recovery (MTTR)

Goal: Low Measurement: Time

MTTR is the average time it takes to restore a system after a failure. This metric is important for measuring the reliability and availability of a system.

For example, let's say your company has a web application that experiences a failure. The MTTR is the time it takes for your team to identify the failure and fix the issue. A low MTTR means your team is able to quickly detect and resolve issues, ensuring that your application is available to users.

A low MTTR indicates a system that is reliable and easy to recover from failures.
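
A minimal sketch, assuming incident records with failure and recovery timestamps exported from an incident tool (the data is invented); MTBF and MTTD below can be computed the same way from the corresponding timestamps.

```python
from datetime import datetime

# Hypothetical incident records.
incidents = [
    {"failed_at": datetime(2024, 5, 2, 9, 15), "recovered_at": datetime(2024, 5, 2, 10, 5)},
    {"failed_at": datetime(2024, 5, 9, 14, 0), "recovered_at": datetime(2024, 5, 9, 14, 40)},
]

minutes = [(i["recovered_at"] - i["failed_at"]).total_seconds() / 60 for i in incidents]
print(f"MTTR: {sum(minutes) / len(minutes):.0f} minutes")  # 45 minutes
```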


Mean time between failures (MTBF)

Goal: High Measurement: Time

MTBF is the average time between failures in a system. This metric is important for measuring the reliability and availability of a system.

For example, if your web application has a high MTBF, it means that your users will experience fewer issues and downtime due to system failures. This can lead to increased user satisfaction and a better overall experience with your product.

A high MTBF indicates a system that is reliable and has fewer failures.


Mean time to detect (MTTD)

Goal: Low Measurement: Time

MTTD is the average time it takes to detect a failure in a system. This metric is important for measuring the effectiveness of monitoring and alerting systems.

For example, if you have a monitoring system set up to detect issues with your web application, the MTTD is the time it takes for your team to receive an alert after a failure occurs. A low MTTD means your team can quickly identify and address issues, minimizing the impact on users.

A low MTTD indicates that failures are detected quickly, allowing teams to respond and recover more quickly.


User value and impact

Goal: High Measurement: opinion/human analysis

Prioritizing by user value involves considering factors like user feedback, customer requests, and business goals. By prioritizing value-driven changes, your team ensures that its efforts align with the needs and expectations of stakeholders.


Peer review

Goal: High Measurement: opinion/human analysis

Instead of relying solely on metrics, emphasize the feedback and insights provided during peer code reviews. Human judgment and expertise play a crucial role in evaluating code changes. Peer reviews provide an opportunity for knowledge sharing, identifying potential issues, and enhancing overall code quality.


Table of Tech Debt ideas

Metric Name,Description,Category,Primary Source,Probability of Existing,Ease of Collection,Importance (Why track this?)
1. Cyclomatic Complexity,A quantitative measure of the number of linearly independent paths through a program's source code.,Engineering,SonarQube,High,Easy (Out of box),High complexity increases defect probability and makes testing/refactoring exponentially harder.
2. Cognitive Complexity,"A measure of how difficult a unit of code is to intuitively understand (unlike Cyclomatic, which measures structural logic).",Engineering,SonarQube,Medium,Easy (Out of box),"Directly correlates to ""time to read"" and onboarding costs for new developers."
3. Code Coverage,The percentage of code lines executed by automated unit/integration tests.,Engineering,SonarQube / JaCoCo,High,Easy (CI Integration),"Low coverage implies ""Blind Spots"" where regressions can occur without detection."
4. Code Duplication %,The percentage of code blocks that appear identical or nearly identical in multiple places.,Engineering,SonarQube,High,Easy (Out of box),Violates DRY (Don't Repeat Yourself); a bug fix in one spot must be manually replicated.
5. SQALE Rating (Debt Ratio),An aggregated rating (A-E) based on the estimated time required to fix all maintainability issues.,Engineering,SonarQube,Medium,Easy (Config needed),"Provides a high-level ""Credit Score"" for codebases that executives can easily understand."
6. Code Churn,"The volume of lines added, modified, or deleted over a specific time period.",Engineering,GitHub Ent.,High,Medium (Git mining),"High churn in legacy files often indicates ""Hotspots"" of instability or unclear requirements."
7. TODO/FIXME Count,The count of code comments explicitly flagged as temporary workarounds or deferred maintenance.,Engineering,SonarQube / Grep,High,Easy (Regex search),"A literal ""IOU"" list developers have left in the code; indicates rushed features."
8. Deployment Frequency,How often an organization successfully releases to production.,Delivery (DORA),Jenkins / GitHub,Medium,Medium (Log parsing),"Low frequency often implies brittle pipelines, manual gates, or fear of breaking production."
9. Lead Time for Changes,The amount of time it takes a commit to get into production.,Delivery (DORA),Jira / GitHub,Low,Hard (Correlating tools),"Long lead times indicate pipeline bottlenecks, slow code reviews, or manual testing debt."
10. Change Failure Rate,The percentage of deployments causing a failure in production.,Delivery (DORA),ServiceNow / Jira,Medium,Hard (Manual tagging),"High failure rates indicate ""Quality Debt""—lack of automated testing or reliable staging."
11. Mean Time to Recovery,How long it takes to restore service after a failure.,Delivery (DORA),PagerDuty,High,Medium (Incident tools),"High MTTR suggests ""Observability Debt""—it takes too long to diagnose and patch issues."
12. Flaky Test Ratio,The percentage of automated tests that fail and pass without code changes.,Delivery,CI Tool,Low,Hard (Log analysis),"Destroys trust in the CI/CD pipeline, causing developers to ignore red builds."
13. Build Duration,The time it takes for a full CI/CD pipeline to run.,Delivery,CI Tool,High,Easy (CI Logs),"Slow builds break flow state and discourage frequent commits, leading to larger riskier merges."
14. Library Freshness,A measurement of how many versions behind the current stable release your dependencies are.,Cyber / Platform,Snyk / Dependabot,Medium,Easy (SaaS Tools),Outdated libraries prevent using new features and eventually become security liabilities.
15. Critical/High CVEs,Count of known security vulnerabilities in code or dependencies.,Cyber,Snyk / Veracode,High,Easy (Scanners),"Immediate risk of breach; represents ""Security Debt"" that must be paid immediately."
16. Secret Leaks Count,"Number of hardcoded secrets (API keys, passwords) detected in the codebase.",Cyber,TruffleHog,Low,Medium (Scanners),"Indicates poor security practices and requires immediate, expensive rotation of credentials."
17. IAM Over-provisioning,Percentage of permissions granted vs. permissions actually used by services.,Cyber,AWS IAM Analyzer,Low,Hard (Cloud audit),"""Least Privilege"" debt; increases blast radius if a service is compromised."
18. Infrastructure Drift,The delta between the defined IaC (Terraform) and the actual state of the cloud environment.,Infrastructure,Terraform / Driftctl,Low,Medium (State diffs),"Indicates ""ClickOps"" (manual changes) are happening, making disaster recovery unreliable."
19. Cloud Asset Utilization,"Measure of provisioned CPU/RAM vs. actual usage (e.g., idle instances).",Infrastructure,AWS Cost Explorer,High,Easy (Cloud Native),Financial debt; paying for resources that aren't delivering value due to poor optimization.
20. Legacy OS Instances,Count of servers running on End-of-Life (EOL) Operating Systems.,Infrastructure,CMDB / Console,Medium,Medium (Inventory),High operational risk; EOL systems cannot be patched against new exploits.
21. IaC Coverage %,Percentage of cloud infrastructure managed via code vs. manual console creation.,Infrastructure,Terraform,Low,Hard (Manual Audit),Low coverage means environments cannot be easily replicated or restored.
22. Cyclic Dependencies,"Number of cycles between packages/modules (A depends on B, B depends on A).",Architecture,SonarQube,Low,Medium (Static Analysis),Makes code tightly coupled; you cannot change one module without breaking the other.
23. God Class Count,"Classes that exceed a high threshold of Lines of Code (e.g., >2000 LOC) or methods.",Architecture,SonarQube,High,Easy (Static Analysis),"These classes know too much, are hard to test, and usually become the bottleneck for changes."
24. API Contract Breaking,Frequency of backward-incompatible changes to internal/external APIs.,Architecture,Pact / Swagger,Low,Hard (Diffing specs),High frequency breaks downstream consumers and forces unplanned refactoring work.
25. Afferent/Efferent Coupling,Measures stability (how many rely on you) vs. instability (how many you rely on).,Architecture,ArchUnit,Low,Medium (Tooling req.),High coupling makes the architecture rigid; changing one component ripples through the system.
26. Monolith Size (LOC),Total Lines of Code in a single deployable unit (if microservices strategy is desired).,Architecture,SonarQube / Git,High,Easy (LOC Count),"If the goal is decoupling, a growing monolith represents negative architectural progress."
27. Bus Factor,The minimum number of developers who would need to be hit by a bus before the project stalls.,Process,Git Analytics,Low,Hard (Algo required),"Low bus factor (e.g., 1) indicates knowledge silos and ""Knowledge Debt."""
28. Defect Backlog Age,The average time known non-critical bugs sit in the backlog.,Product,Jira,High,Medium (JQL Query),"Old bugs are rarely fixed; they clutter the view and represent ""Product Debt."""
29. Onboarding Time,"Time from ""Day 1"" to ""First PR Merged"" for a new engineer.",Process,HR + GitHub,Low,Hard (Manual data),Proxy for documentation quality and environment complexity. Long onboarding = high complexity.
30. DB Schema Version Lag,Difference between production schema version and migration scripts.,Data,Flyway,Medium,Medium (DB Tools),Indicates manual DB patches or dangerous divergence between code and database state.
31. Dead Code / Unused Tables,Methods or DB tables that have zero usage references.,Data,Sonar / Logs,Medium,Hard (Usage analysis),Clutter that confuses developers and wastes backup/storage resources.
32. Documentation Staleness,Average age of files in the /docs or Wiki compared to current date.,Process,Confluence,Medium,Medium (API Metadata),Outdated docs are worse than no docs; they mislead engineers and cause outages.
33. PR Review Time,Average time elapsed between a Pull Request being opened and merged/closed.,Process,GitHub,High,Easy (API available),Long review times block value delivery and increase merge conflicts (Process Debt).
34. Branch Lifespan,The duration a feature branch exists before being merged.,Delivery,GitHub / Git,High,Easy (Git logs),"Long-lived branches drift from main, leading to ""Merge Hell"" and integration debt."
35. Merge Conflict Rate,Percentage of PRs that require manual conflict resolution.,Delivery,GitHub / Git,Low,Medium (Git stats),High rates indicate poor communication or architectural coupling.
36. Test Execution Time,Total time to run the full test suite (local or CI).,Delivery,CI Tool,High,Easy (CI Timestamps),"If tests take 1 hour, devs won't run them locally, leading to ""Feedback Loop Debt."""
37. Environment Parity Gap,Count of configuration differences between Staging and Production.,Infrastructure,Terraform,Low,Hard (Deep diffs),"""It worked on my machine"" syndrome; parity gaps cause production-only bugs."
38. Untagged Cloud Assets,Percentage of cloud assets missing cost allocation/owner tags.,Infrastructure,AWS Config,High,Medium (Policy scan),"""FinOps Debt""—impossible to attribute costs or identify owners of zombie resources."
39. Orphaned Storage,Count/Size of EBS volumes or snapshots not attached to any compute instance.,Infrastructure,Cloud Custodian,High,Easy (API Query),Pure waste; paying for storage that is technically disconnected from the application.
40. Container Image Size,Size of the Docker/Container images being deployed.,Platform,Artifactory,High,Easy (Registry API),Bloated images slow down scaling (autoscaling latency) and increase vulnerability surface area.
41. Pod Restart Rate,"Frequency of containers crashing and restarting (OOMKilled, etc.).",Platform,Prometheus,High,Easy (Metrics),"Indicates instability, memory leaks, or improper resource limits (Configuration Debt)."
42. Serverless Cold Starts,Average latency added when a function scales from zero.,Platform,Datadog,Medium,Medium (APM),High latency here indicates poor optimization of runtimes or dependencies.
43. License Risk,"Count of dependencies with restrictive (e.g., GPL) or unknown licenses.",Legal,FOSSA,Medium,Easy (Scanners),"""Legal Debt""—risk of having to open-source proprietary code or face lawsuits."
44. TLS Protocol Lag,Percentage of endpoints supporting deprecated protocols (TLS 1.0/1.1).,Cyber,Qualys,Medium,Medium (Net scan),Security compliance debt; modern browsers/clients will eventually reject connections.
45. Missing Security Headers,"Endpoints missing standard headers (HSTS, CSP, X-Frame-Options).",Cyber,OWASP ZAP,Medium,Easy (Curl/Scan),"Low-hanging fruit for attackers; indicates lack of ""Security by Design."""
46. Dependency Tree Depth,The average depth of transitive dependencies in the project.,Architecture,Maven / npm,Low,Medium (Graph analysis),Deep trees make vulnerability patching difficult (you rely on a library that relies on a library...).
47. Layer Violations,"Instances where lower layers call upper layers (e.g., Domain layer calling UI).",Architecture,ArchUnit,Low,Hard (Custom rules),Breaks separation of concerns; creates spaghetti code that is hard to refactor.
48. Interface Segregation,Classes forced to implement methods they do not use (bloated interfaces).,Architecture,SonarQube,Low,Medium (Static Analysis),Indicates poor abstraction; makes mocking and testing specific behaviors difficult.
49. Slow Query Ratio,Percentage of database queries exceeding a specific latency threshold.,Data,DB Insights,High,Medium (DB Logs),"""Performance Debt""—often due to missing indexes or N+1 query problems."
50. Data Null Rate,Percentage of unexpected nulls or format errors in critical data columns.,Data,dbt,Low,Hard (Data testing),"""Data Trust Debt""—if data is dirty, downstream analytics and ML models are worthless."
51. ETL Failure Rate,Frequency of data pipeline job failures requiring manual retry.,Data,Airflow,High,Easy (Orchestrator),High failure rates indicate fragile data ingestion and lack of idempotency.
52. Core Web Vitals,"Scores for LCP (Loading), FID (Interactivity), CLS (Visual Stability).",Engineering,Lighthouse,Medium,Easy (CI or Browser),"""Frontend Debt""—poor scores hurt SEO and user retention."
53. Accessibility Violations,"Count of automated accessibility errors (missing alt tags, contrast).",Engineering,axe-core,Low,Easy (Scanner),Legal and ethical debt; excludes users with disabilities and risks lawsuits.
54. App Binary Size,Total size of the IPA/APK downloaded by users.,Mobile,Store Connect,High,Easy (Store Metadata),"Bloat leads to lower install conversion rates, especially in low-bandwidth markets."
55. Crash-Free Sessions,Percentage of mobile/web sessions that do not end in a crash.,Mobile,Crashlytics,High,Easy (SDK),The ultimate measure of stability; low rates destroy brand reputation.
56. API 5xx Error Rate,Percentage of server-side errors returned to clients.,Platform,Load Balancer,High,Easy (Logs),Indicates unhandled exceptions and poor error management in backend services.
57. Log Coverage %,Percentage of critical business paths that emit traceable logs.,Platform,Splunk / ELK,Low,Hard (Manual Audit),"""Observability Debt""—if you can't see it, you can't debug it during an outage."
58. Mean Time To Detection,Average time between an issue starting and an alert firing.,Delivery,PagerDuty,Low,Hard (Incident Review),"If MTTD is high, your customers are your monitoring system."
59. Sprint Spillover %,Percentage of story points committed vs. completed in a sprint.,Process,Jira,High,Medium (Jira Reports),"High spillover indicates ""Planning Debt""—poor estimation or unclear requirements."
60. Meeting Load,Average hours per day engineers spend in meetings vs. coding blocks.,Process,Calendar API,Low,Medium (API Integration),"""Focus Debt""—engineers cannot pay down technical debt if they have no deep work time."

Reading

Articles/Blogs/Essays

