% Ch.40_Discussion.tex -- Limitations of literature selection for review
In addition to overly strict filters, we also want to document the overly permissive filters in the literature selection for review. As can be seen from \hyperref[appendix:a]{Appendix A}, for example, the public software licenses with the literature identifiers L777 and L780 are almost the same with regard to their shortcoded identifiers: ``ZPL - 2.1'' and ``ZPL-2.1''. Removing such duplicates would seemingly have been simple to execute in phase 1. However, with more than 700 pieces of literature present, we decided not to give special treatment to any potential set of duplicates. While it is very likely that OSI's ``ZPL - 2.1'' is exactly equivalent to SPDX's ``ZPL-2.1'', we could not be sure without examining their contents. This may have left duplicate public software licenses in the literature selection for review, but duplicates of this type are removed in phases 2 and 3, where the public software licenses are read in full.
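Near-duplicate shortcodes of this kind could in principle be grouped mechanically by whitespace normalization. The following Python sketch illustrates the idea only; it is not the procedure used in phase 1, where such duplicates were deliberately left untreated.

```python
import re

def normalize_shortcode(shortcode: str) -> str:
    """Collapse all whitespace so variants like 'ZPL - 2.1' and 'ZPL-2.1' compare equal."""
    return re.sub(r"\s+", "", shortcode).lower()

def find_near_duplicates(shortcodes: list[str]) -> dict[str, list[str]]:
    """Group shortcodes by normalized form; groups with more than one member
    are candidate duplicates that would still need a manual content check."""
    groups: dict[str, list[str]] = {}
    for code in shortcodes:
        groups.setdefault(normalize_shortcode(code), []).append(code)
    return {key: codes for key, codes in groups.items() if len(codes) > 1}

print(find_near_duplicates(["ZPL - 2.1", "ZPL-2.1", "MIT"]))
# {'zpl-2.1': ['ZPL - 2.1', 'ZPL-2.1']}
```

Note that normalization only flags candidates; as argued above, true equivalence can only be confirmed by reading the license texts in full.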
% Miscellaneous validity issues on literature selection
To finish this subsection, we discuss some further minor validity issues that did not fit into \hyperref[results]{Chapter 3} but are nonetheless important to note for the integrity of the thesis. Stage three of the search process included a validity threat regarding the removal of duplicates. When two full license texts appeared to be duplicates, we checked the license pages of the two license listing sites for further investigation without using an internet archiver. Not relying on an internet archiver for every possible source is a recurring validity threat in this thesis. Still, archiving more than a thousand license pages and then accessing the archived copies would have been a very slow process.
The order of the Wikipedia infobox was also relied on when filling in missing licenses and during the de-duplication in stage 2, as well as in some other parts of the thesis, which constitutes a further validity threat. The second-stage licenses were fetched from ScanCode LicenseDB on 25 March 2025 at 15:30.
% why exclusion over inclusion
As can be seen in \hyperref[methods]{Chapter 2}, the regular expression string was used only as an exclusion filter. Using both an inclusion and an exclusion filter made it difficult to match all of the public software licenses; in other words, it eventually turned out to be faster to match the excludable licenses than the includable ones. The validity threat here is that using only an exclusion filter implies that the majority of the public licenses in our dataset are public software licenses. An example of a public software license that is difficult to include is the \texttt{wtfpl}, whose text contains no evidence of it being a software-specific license; it rather presents itself as a general public license. Nevertheless, the \texttt{wtfpl} is largely used in software source code, as can be seen in \hyperref[results]{Chapter 3}. Further examples backing the exclusion-only choice are the font licenses, which are considered public software licenses. With such exceptions inflating the inclusion regular expression string, we eventually decided to use only the aforementioned exclusion filter.
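The exclusion-only decision can be illustrated with a short Python sketch. The exclusion pattern below is a hypothetical placeholder, not the thesis's actual string; it only shows how exclusion-only filtering keeps licenses such as the \texttt{wtfpl} and the font licenses even though their names carry no software-specific wording.

```python
import re

# Hypothetical exclusion pattern for illustration only; the actual
# exclusion string used in the thesis is documented in Chapter 2.
EXCLUDE = re.compile(r"documentation|artwork", re.IGNORECASE)

def keep_as_software_license(name: str) -> bool:
    """Exclusion-only filter: keep everything the pattern does NOT match,
    so no positive 'software' evidence is required from a license name."""
    return EXCLUDE.search(name) is None

names = ["WTFPL", "SIL Open Font License 1.1", "GNU Free Documentation License v1.3"]
print([n for n in names if keep_as_software_license(n)])
# ['WTFPL', 'SIL Open Font License 1.1']
```

The sketch also shows the stated threat: anything not explicitly excluded is presumed to be a public software license.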
It is also worth noting that stage 3 does not benefit from recording which listing site each license came from, since the stage 3 duplicates were removed according to the Wikipedia infobox order.
As mentioned earlier in the thesis, the Wikipedia infobox order of the license listing sites plays a heavy role in the literature selection. This manifests as a validity threat, for example, in the removal of duplicates: duplicates are removed from the lattermost listing site, giving a false impression that the majority of the public licenses come from the foremost license listing sites, such as SPDX. While this may well be true, given the high volume of literature from the foremost license listing sites in the Wikipedia infobox order, it remains a threat to validity. Because of this choice in our scope, the accuracy of the recorded origins of the licenses across the search stages is lower than it could be.
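The bias introduced by order-dependent duplicate removal can be sketched as follows. The site names and their order here are illustrative assumptions, not the thesis's exact configuration.

```python
# Hypothetical listing-site order standing in for the Wikipedia infobox
# ordering described in the text.
SITE_ORDER = ["SPDX", "OSI", "ScanCode LicenseDB"]

def deduplicate_by_site_order(licenses):
    """Keep each shortcode only from the earliest-ordered site, so duplicates
    are always dropped from the lattermost listing sites."""
    ranked = sorted(licenses, key=lambda lic: SITE_ORDER.index(lic["site"]))
    seen, kept = set(), []
    for lic in ranked:
        if lic["shortcode"] not in seen:
            seen.add(lic["shortcode"])
            kept.append(lic)
    return kept

records = [
    {"shortcode": "MIT", "site": "OSI"},
    {"shortcode": "MIT", "site": "SPDX"},
    {"shortcode": "ZPL-2.1", "site": "ScanCode LicenseDB"},
]
print(deduplicate_by_site_order(records))
```

In the example, the MIT record attributed to OSI is discarded, so the surviving dataset over-represents the foremost site, exactly the origin-accuracy threat described above.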
\texttt{dejavu} and \texttt{dbg-3.0} were two further licenses that contained a space. This may indicate that the space is an accident and that the identifier is simply not found on a given license listing site. It is also worth noting that the Python script was considered a valid approach, since many of the licenses were in fact found by their shortcodes from ScanCode LicenseDB; fetching over 700 licenses by hand would have caused both time and validity issues. The Wayback Machine could have been used to carry out the actual searching as well. Not doing so is unfortunately a validity issue, but at least the source is available in the Wayback Machine.
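A minimal sketch of shortcode-based fetching is given below. The URL pattern is an assumption about ScanCode LicenseDB's page layout, not the thesis's actual script; the sketch also shows why a shortcode containing a space yields a suspect slug that has to be checked by hand.

```python
# Assumed base URL and slug pattern for ScanCode LicenseDB; the real
# fetching script may construct its requests differently.
BASE_URL = "https://scancode-licensedb.aboutcode.org"

def license_url(shortcode: str) -> str:
    """Build a per-license page URL from a shortcode. A shortcode with an
    embedded space (e.g. 'ZPL - 2.1') produces a slug with spaces, which
    signals that the identifier may be accidental or site-specific."""
    return f"{BASE_URL}/{shortcode.lower()}.html"

print(license_url("mit"))
print(license_url("ZPL - 2.1"))  # note the spaces surviving into the slug
```

Looping such a builder over all shortcodes is what makes fetching 700+ licenses tractable compared with doing it by hand.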
It is good to note that systematic does not mean automatic. Our approach uses automation (Python) specifically to assist the author's human eyesight; without it, the process would be more prone to error due to the large number of licenses.
The first combined inclusion--exclusion regular expression we used was \texttt{(source|software|program|code|module|public(s+)license|ware|(w+)ware)}. It caught entries such as the GFDL, which is how we ended up using exclusion only. Note that documentation is not software, whereas, for example, a font is.
The Python script does not work on a Windows machine due to operating-system-dependent path handling, which is a further validity threat.
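The likely failure mode can be sketched as follows. The fragile variant is a hypothetical reconstruction of the kind of hand-built path string that breaks across operating systems; the portable variant shows the standard \texttt{pathlib} remedy.

```python
from pathlib import Path

def license_path_fragile(base: str, shortcode: str) -> str:
    # Hypothetical reconstruction of the failure mode: a hard-coded '/'
    # separator can break on Windows depending on how the string is used.
    return base + "/" + shortcode + ".txt"

def license_path_portable(base: str, shortcode: str) -> Path:
    # pathlib builds the path with the correct separator for the running OS.
    return Path(base) / f"{shortcode}.txt"

print(license_path_portable("licenses", "mit"))
```

Switching to `pathlib` (or `os.path.join`) would remove this particular validity threat without changing the script's behaviour elsewhere.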
Finally, the extensive use of human eyesight in the third stage of the search process is itself a validity threat that must be documented. The duplicate removal in browser tabs was carried out as follows: the author compared license $n$ with license $n+1$ and acted when either their texts or their shortcodes looked essentially the same.
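The tab-by-tab check of license $n$ against $n+1$ can be approximated in code; here \texttt{difflib} stands in for human eyesight, and the similarity threshold is an illustrative assumption rather than a value used in the thesis.

```python
from difflib import SequenceMatcher

def flag_adjacent_duplicates(texts, threshold=0.9):
    """Compare each license text n with its neighbour n+1 and flag pairs
    whose similarity ratio meets the (assumed) threshold, mimicking the
    manual adjacent-tab comparison described in the text."""
    flagged = []
    for n in range(len(texts) - 1):
        ratio = SequenceMatcher(None, texts[n], texts[n + 1]).ratio()
        if ratio >= threshold:
            flagged.append((n, n + 1))
    return flagged

texts = [
    "Permission is hereby granted, free of charge",
    "Permission is hereby granted, free of charge.",
    "You may copy and distribute this work",
]
print(flag_adjacent_duplicates(texts))
# [(0, 1)]
```

Such an automated pre-pass would not replace the human check, but it would reduce the error-proneness that this validity threat documents.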