Language Bias in the Google Scholar Ranking Algorithm [review]

Abstract

The visibility of academic articles or conference papers depends on their being easily found in academic search engines, above all in Google Scholar. To enhance this visibility, search engine optimization (SEO) has been applied in recent years to academic search engines in order to optimize documents and, thereby, ensure they are better ranked in search pages (i.e., academic search engine optimization or ASEO). To achieve this degree of optimization, we first need to further our understanding of Google Scholar’s relevance ranking algorithm, so that, based on this knowledge, we can highlight or improve those characteristics that academic documents already present and which are taken into account by the algorithm. This study seeks to advance our knowledge in this line of research by determining whether the language in which a document is published is a positioning factor in the Google Scholar relevance ranking algorithm. Here, we employ a reverse engineering research methodology based on a statistical analysis that uses Spearman’s correlation coefficient. The results obtained point to a bias in multilingual searches conducted in Google Scholar, with documents published in languages other than English being systematically relegated to positions that make them virtually invisible. This finding has important repercussions, both for conducting searches and for optimizing positioning in Google Scholar, and is especially critical for articles on subjects that are expressed in the same way in English and other languages, as is the case, for example, with trademarks, chemical compounds, industrial products, acronyms, drugs, diseases, etc.

Keywords: ASEO; SEO; reverse engineering; citations; google scholar; algorithms; relevance ranking; citation databases; academic search engines; multilingual search

Introduction

A researcher’s professional career is heavily dependent on the visibility of, and the recognition afforded by, their scholarly output. The number of citations received and the indexes derived from it, in particular the h-index, are typically the metrics most widely used in official accreditation processes before academic boards or commissions. Indeed, the need to be cited, combined with the exponential increase in world bibliographic output, means that researchers today have to promote their own articles as the final step in the complex process of publishing their research findings. This promotion of their research output usually also implies building their personal academic brand [1,2], including the creation of complementary content that extends well beyond traditional scholarly articles for specialized publications or papers for conferences.

Among the actions researchers can take to promote both their personal brand and their scientific output are mentioning their articles on academic and non-academic social networks; creating a professional blog with complementary content (videos, presentations, PDFs); creating profiles on a range of platforms, from ORCID, Google Scholar Citations, and ResearcherID to Mendeley; depositing documents in open access repositories; and optimizing articles so that they command a good position in search engines, especially Google Scholar. Indeed, the vast majority of Internet users do not look beyond the second or third page of results [3], which means that, for a document to be found easily, it has to be optimized to ensure it appears towards the top of the first page.

Search results are ranked according to relevance, a value automatically calculated by search engines. Moreover, this ranking is usually established as the default order, because other forms of sorting—for example, by title or by date—are considered less significant for most search intentions (although these other forms of sorting are often available). Relevance is calculated using an algorithm that takes into account a range of factors, which means that for each search engine, relevance will differ to some degree, given that it is being defined by a distinct algorithm in each case.

Search engine optimization (SEO) [4,5] is, today, a well-established discipline, the goal of which is to highlight the quality of web pages and so improve their position in the results pages. This goal should not be achieved by fraudulent means, but depends rather on knowing how the algorithms that determine relevance operate, identifying the factors taken into account and, finally, optimizing these factors in one’s documents. However, Google Scholar is not always able to detect attempts at manipulation [6,7].

Today, there is a huge community of SEO experts and companies that dedicate their efforts to analyzing and discussing Google’s relevance ranking algorithm. Via blogs [8,9,10,11], online publications [12,13,14], and books [15,16,17] they advise designers and webmasters as to how they can optimize their websites so that they are easily indexed and can occupy the highest rankings in the results pages.

Google’s relevance algorithm is based on more than 200 factors [18], including the number of links received, the keywords and related terms in the title and other significant areas of the document, the download speed of the server on which the page is hosted, the length of the text, the user experience, the mobile-first design, the semantic tagging, the age of the domain, etc. Google has never released complete information about all these factors or the exact weighting attached to each; the company only provides general, incomplete information in order to avoid spam. Indeed, if all the details of how the algorithm works were known, then poor quality documents could be placed at the top of the results page.

This “black box” policy has led SEO professionals to conduct reverse engineering research in an effort to identify the specific factors involved in relevance ranking. Thus, they analyze the search results in order to infer how the algorithm works. However, it is a complicated process in which many factors intervene and it is not easy to draw any conclusive results.
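As a rough illustration of this kind of reverse engineering, the sketch below computes Spearman's correlation coefficient between the position of each result and one candidate factor. The data, the function names, and the choice of citation counts as the factor are assumptions made for illustration only; they are not values or code from this study.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1 to n (smallest = 1); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # i and j are 0-based sorted positions
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical results page: position in the ranking and citations received.
positions = [1, 2, 3, 4, 5, 6]
citations = [950, 400, 420, 120, 60, 15]

rho = spearman(positions, citations)
print(f"Spearman's rho = {rho:.3f}")  # about -0.943 for this made-up data
```

A coefficient close to -1 in such a sample would suggest that documents with more citations tend to occupy the top positions, although a correlation of this kind cannot by itself show that the factor is used by the algorithm.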

In recent years, this ecosystem of research concerned with algorithms and the subsequent publication of recommendations has been extended to Google Scholar and academic articles. On a much smaller scale, reverse engineering research has been applied to Google Scholar [19,20,21,22,23,24,25]. In addition, blogs [26,27,28,29,30], university library guidelines [31,32,33,34], and the author services of academic journal publishers [35,36,37,38,39,40] offer recommendations on how to optimize articles so that they appear at the top of Google Scholar’s results pages. This SEO applied to academic search engines has been called academic search engine optimization or ASEO [20,22,41,42,43].

This research community is still in its infancy both in terms of the quantity and quality of its output; moreover, the recommendations given for Google Scholar are often contaminated by research findings for Google. In fact, they are two quite distinct algorithms that operate on two quite distinct types of document in two very different environments. Indeed, as far as the ranking algorithm is concerned, academic documents have at least four major characteristics that clearly distinguish them from web pages: most are in PDF (and not HTML) format; they contain links based on bibliographic citations with other academic documents (not hyperlinks); once published, they are not modified; and, usually, author metadata and the date of publication are clearly identified.

Promoting a personal academic brand and the visibility of web pages, blogs, videos and other complementary content depend to a large extent on Google positioning. But the visibility of academic articles or conference papers is determined by their optimization for Google Scholar. These differences need to be clarified, and we need to further our understanding of Google Scholar’s relevance ranking algorithm, which is neither as well known nor as widely analyzed as Google’s general search algorithm.

The aim of this study is to do just that. More specifically, because of its far-reaching implications, we seek to determine whether the language in which a document is written is a key positioning factor. To the best of our knowledge, no previous study has attempted to find a relationship between positioning and language, be it for Google Scholar or for the general Google search engine.

Normally, language plays no role as a ranking factor in keyword searches, given that the language of the search word itself determines the language of the documents retrieved. If the documents are written in the same language, this factor is overridden. Language only intervenes in those few cases of keywords with the same spelling in different languages that generate multilingual lists of results. In contrast, searches by author or year are conducted independently of language and always provide what we shall refer to henceforth as multilingual results (or searches).

When searches are multilingual, that is, when results are provided in different languages for the same search, the language of the documents can be a decisive factor if it can be shown that this conditions ranking. Thus, our primary research question here is the following: In multilingual searches, is the language in which a document is written a factor in Google Scholar’s ranking algorithm?

Our hypothesis is that Google Scholar favors the English language in multilingual search results. As a result, documents in other languages have fewer possibilities of being placed at the top of the rankings for the sole reason that they are not published in English.
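To make the hypothesis concrete, the following sketch shows one simple way of summarizing a multilingual results list by language. The list of language codes is invented for illustration, and the helper names are hypothetical; this is not the procedure or the data used in the study.

```python
from statistics import median


def positions_by_language(results: list[str]) -> dict[str, list[int]]:
    """Group 1-based ranking positions by the language of each result."""
    grouped: dict[str, list[int]] = {}
    for position, language in enumerate(results, start=1):
        grouped.setdefault(language, []).append(position)
    return grouped


def english_share_in_top(results: list[str], k: int) -> float:
    """Fraction of the first k results that are written in English ("en")."""
    top = results[:k]
    if not top:
        raise ValueError("no results to evaluate")
    return sum(1 for language in top if language == "en") / len(top)


# Hypothetical ranked list for one multilingual search (language codes only).
results = ["en", "en", "en", "en", "es", "en", "es", "pt", "es", "pt"]

grouped = positions_by_language(results)
medians = {language: median(pos) for language, pos in grouped.items()}
share = english_share_in_top(results, 5)

print(medians)  # median position per language
print(share)    # share of English documents among the first five results
```

If the hypothesis holds, English documents would show lower (better) median positions and a high share of the top results, even when documents in other languages are comparable in other respects.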

In the following section, we discuss related studies in the literature, before moving on to present the applied research methodology and the method used to select our sample. Next, we analyze the results obtained from our statistical data and from observation of our scatter plots. The limitations of the study are discussed and new lines of research are proposed. Finally, in the conclusions, the repercussions of our findings are highlighted, both for searches and for the optimization of positioning in Google Scholar.


(…)


Citation

Rovira, Cristòfol; Codina, Lluís; Lopezosa, Carlos (2021). Language Bias in the Google Scholar Ranking Algorithm. Future Internet 13, no. 2: 31. https://doi.org/10.3390/fi13020031


Funding

This research was funded by the project Interactive content and creation in multimedia information communication: Audiences, design, systems and styles, CSO2012-39518-C04-02, Spanish Ministry of Economy and Competitiveness (Mineco/Feder).


References

  1. González-Solar, L. Marca personal en entornos académicos: Una perspectiva institucional. An. Doc. 2018, 21.
  2. Harzing, A.W. Building Your Academic Brand through Engagement with Social Media. 2018. Available online: https://harzing.com/blog/2018/05/building-your-academic-brand-through-engagement-with-social-media (accessed on 17 October 2020).
  3. Marcos, M.-C.; González-Caro, C. Comportamiento de los usuarios en la página de resultados de los buscadores. Un estudio basado en eye tracking. EPI 2010, 19, 348–358.
  4. Yalçın, N.; Köse, U. What is search engine optimization: SEO? Procedia Soc. Behav. Sci. 2010, 9, 487–493.
  5. Ziakis, C.; Vlachopoulou, M.; Kyrkoudis, T.; Karagkiozidou, M. Important factors for improving Google search rank. Future Internet 2019, 11, 32.
  6. López-Cózar, D.E.; Robinson-García, N.; Torres-Salinas, D. Manipulating Google Scholar citations and Google Scholar metrics: Simple, easy and tempting. arXiv 2012, arXiv:1212.0638.
  7. López-Cózar, D.E.; Robinson-García, N.; Torres-Salinas, D. The Google Scholar experiment: How to index false papers and manipulate bibliometric indicators. J. Assoc. Inf. Sci. Technol. 2014, 65, 446–454.
  8. SEMrush Blog. Available online: https://www.semrush.com/blog/ (accessed on 1 July 2020).
  9. The Moz Blog. Available online: https://moz.com/blog (accessed on 1 July 2020).
  10. Yoast SEO Blog. Available online: https://yoast.com/seo-blog/ (accessed on 1 July 2020).
  11. Developers Google. Available online: https://developers.google.com/search/blog#blog-principal-de-la-busqueda-de-google (accessed on 1 July 2020).
  12. Search Engine Journal. Available online: https://www.searchenginejournal.com/ (accessed on 1 July 2020).
  13. Search Engine Land. Available online: https://searchengineland.com/ (accessed on 1 July 2020).
  14. Search Engine Watch. Available online: https://www.searchenginewatch.com/ (accessed on 1 July 2020).
  15. Clarke, A. SEO 2020 Learn Search Engine Optimization with Smart Internet Marketing Strategies: Learn SEO with Smart Internet Marketing Strategies; Newest, Ed.; EEUU: Trenton, NJ, USA, 2019; pp. 1–244.
  16. Kent, P. SEO For Dummies; EEUU: Trenton, NJ, USA, 2019; pp. 1–478.
  17. Maciá, F. SEO Avanzado. Casi todo lo Que sé Sobre Posicionamiento Web; Anaya: Barcelona, Spain, 2020; pp. 1–416.
  18. Google. How Google Search Works. Learn How Google Discovers, Crawls, and Serves Web Pages. Available online: https://support.google.com/webmasters/answer/70897?hl=en (accessed on 1 July 2020).
  19. Beel, J.; Gipp, B. Google scholar’s ranking algorithm: An introductory overview. In Proceedings of the 12th International Conference on Scientometrics and Informetrics, ISSI’09, Istanbul, Turkey, 14–17 July 2009; pp. 230–241. Available online: https://goo.gl/c8a6YU (accessed on 23 January 2021).
  20. Beel, J.; Gipp, B. Google Scholar’s ranking algorithm: The impact of articles’ age (an empirical study). In Proceedings of the Sixth International Conference on Information Technology: New Generations, ITNG’09, Las Vegas, NV, USA, 27–29 April 2009; pp. 160–164.
  21. Beel, J.; Gipp, B. Google scholar’s ranking algorithm: The impact of citation counts (an empirical study). In Proceedings of the Third International Conference on Research Challenges in Information Science, RCIS 2009c, Nice, France, 22–24 April 2009; pp. 439–446.
  22. Beel, J.; Gipp, B.; Wilde, E. Academic search engine optimization (ASEO). Optimizing scholarly literature for Google Scholar & co. J. Sch. Publ. 2010, 41, 176–190.
  23. Rovira, C.; Guerrero-Solé, F.; Codina, L. Received citations as a main SEO factor of Google Scholar results ranking. EPI 2018, 27, 559–569.
  24. Rovira, C.; Codina, L.; Guerrero-Solé, F.; Lopezosa, C. Ranking by relevance and citation counts, a comparative study: Google Scholar, Microsoft Academic, WoS and Scopus. Future Internet 2019, 11, 202.
  25. Martín-Martín, A.; Orduña-Malea, E.; Ayllón, J.M.; López-Cózar, E.D. Does Google Scholar contain all highly cited documents (1950–2013). arXiv 2014, arXiv:1410.8464.
  26. Sparks, A. 8 Winning Hacks to Use Google Scholar for Your Research Paper, Editage. 2018. Available online: https://www.editage.com/insights/8-winning-hacks-to-use-google-scholar-for-your-research-paper (accessed on 1 July 2020).
  27. Cordero, J.J. 8 Estrategias Para Promocionar los Artículos Profesionales de tu Blog. 2018. Available online: https://www.javiercordero.com/como-promocionar-articulos-blog/ (accessed on 1 July 2020).
  28. Florido, M. Google Académico—7 Consejos para mejorar el posicionamiento. 2015. Available online: https://www.marketingandweb.es/marketing/google-academico/ (accessed on 1 July 2020).
  29. Drew, C. 11 Best Tips on How to use Google Scholar. 2020. Available online: https://helpfulprofessor.com/google-scholar/ (accessed on 1 July 2020).
  30. Miles, S. 12 Tips for Increasing Your Visibility in Google Search. 2020. Available online: https://webpublisherpro.com/12-tips-for-increasing-your-visibility-in-google-search/ (accessed on 1 July 2020).
  31. UCLA Library. Research Visibility. SEO for Authors: A How-to Guide. Available online: https://guides.library.ucla.edu/seo/author (accessed on 2 July 2020).
  32. UNED. Investiga UNED. 10 Consejos Para Difundir tu Investigación y Conseguir más Impacto. 2017. Available online: http://investigauned.uned.es/10-consejos-para-difundir-tu-investigacion-y-conseguir-mas-impacto/ (accessed on 2 July 2020).
  33. University of Pittsburgh. How to Increase the Visibility of Your Research? 2020. Available online: https://pitt.libguides.com/researchvisibility (accessed on 2 July 2020).
  34. University of Montana. Google Scholar at the University of Montana. 2020. Available online: https://libguides.lib.umt.edu/c.php?g=854135&p=6115806 (accessed on 2 July 2020).
  35. Wiley. Search Engine Optimization (SEO) for Your Article. Available online: https://authorservices.wiley.com/author-resources/Journal-Authors/Prepare/writing-for-seo.html (accessed on 2 July 2020).
  36. Emerald Publishing. How to… Make Your Research Easy to Find with SEO. Available online: https://www.emeraldgrouppublishing.com/services/authors/author-how-guides/make-your-research-easy-find-seo (accessed on 2 July 2020).
  37. Sage Publishing. Promote Your Article. Available online: https://uk.sagepub.com/en-gb/eur/promote-your-article (accessed on 2 July 2020).
  38. Plos. Spreading the Word about Your Research. Available online: https://plos.org/article-promotion/ (accessed on 2 July 2020).
  39. Taylor and Francis. Search Engine Optimization for Academic Articles. Available online: https://authorservices.taylorandfrancis.com/research-impact/search-engine-optimization-for-academic-articles/ (accessed on 2 July 2020).
  40. Elsevier. Get Found—Optimize Your Research Articles for Search Engines. 2012. Available online: https://www.elsevier.com/connect/get-found-optimize-your-research-articles-for-search-engines (accessed on 2 July 2020).
  41. Codina, L. SEO Académico: Definición, Componentes y Guía de Herramientas. 2019. Available online: https://www.lluiscodina.com/seo-academico-guia (accessed on 9 July 2020).
  42. Martín-Martín, A.; Ayllón, J.M.; Orduña-Malea, E.; López-Cózar, E.D. Google Scholar metrics released: A matter of languages and something else. arXiv 2016, arXiv:1607.06260v1.
  43. Muñoz-Martín, B. Incrementa el impacto de tus artículos y blogs: De la invisibilidad a la visibilidad. Rev. Soc. Otorrinolaringol. Castilla León Cantab. Rioja 2015, 6, 6–32. Available online: http://hdl.handle.net/10366/126907 (accessed on 23 January 2021).
  44. Giustini, D.; Boulos, M.N.K. Google Scholar is not enough to be used alone for systematic reviews. Online J. Public Health Inform. 2013, 5, 214.
  45. Walters, W.H. Google Scholar search performance: Comparative recall and precision. Portal-Libr. Acad. 2008, 9, 5–24.
  46. De-Winter, J.; Zadpoor, A.; Dodou, D. The expansion of Google Scholar versus Web of Science: A longitudinal study. Scientometrics 2014, 98, 1547–1565.
  47. Harzing, A.W. A preliminary test of Google Scholar as a source for citation data: A longitudinal study of Nobel prize winners. Scientometrics 2013, 94, 1057–1075.
  48. Harzing, A.W. A longitudinal study of Google Scholar coverage between 2012 and 2013. Scientometrics 2014, 98, 565–575.
  49. De-Groote, S.L.; Raszewski, R. Coverage of Google Scholar, Scopus, and Web of Science: A case study of the h-index in nursing. Nurs Outlook 2012, 60, 391–400.
  50. Orduña-Malea, E.; Ayllón, J.M.; Martín-Martín, A.; Delgado-López-Cózar, E. About the size of Google Scholar: Playing the numbers. arXiv 2014, arXiv:1407.6239.
  51. Orduña-Malea, E.; Ayllón, J.M.; Martín-Martín, A.; Delgado-López-Cózar, E. Methods for estimating the size of Google Scholar. Scientometrics 2015, 104, 931–949.
  52. Pedersen, L.A.; Arendt, J. Decrease in free computer science papers found through Google Scholar. Online Inf. Rev. 2014, 38, 348–361.
  53. Jamali, H.R.; Nabavi, M. Open access and sources of full-text articles in Google Scholar in different subject fields. Scientometrics 2015, 105, 1635–1651.
  54. Jamali, H.R.; Asadi, S. Google and the scholar: The role of Google in scientists’ information-seeking behaviour. Online Inf. Rev. 2010, 34, 282–294.
  55. Aguillo, I.F. Is Google Scholar useful for bibliometrics? A webometric analysis. Scientometrics 2012, 91, 343–351.
  56. Jacsó, P. Calculating the h-index and other bibliometric and scientometric indicators from Google Scholar with the Publish or Perish software. Online Inf. Rev. 2009, 33, 1189–1200.
  57. Torres-Salinas, D.; Ruiz-Pérez, R.; Delgado-López-Cózar, E. Google scholar como herramienta para la evaluación científica. EPI 2009, 18, 501–510.
  58. Beel, J.; Gipp, B. Academic search engine spam and Google Scholar’s resilience against it. J. Electron. Publ. 2010, 13.
  59. Delgado-López-Cózar, E.; Robinson-García, N.; Torres-Salinas, D. Manipular Google Scholar citations y Google Scholar metrics: Simple, sencillo y tentador. In EC3 Working Papers; Universidad De Granada: Granada, Spain, 2012; Available online: http://hdl.handle.net/10481/20469 (accessed on 1 July 2019).
  60. Meho, L.; Yang, K. Impact of Data Sources on Citation Counts and Rankings of LIS Faculty: Web of Science Versus Scopus and Google Scholar. J. Assoc. Inf. Sci. Technol. 2006, 58, 2105–2125.
  61. Martín-Martín, A.; Orduña-Malea, E.; Ayllón, J.M.; Delgado-López-Cózar, E. Back to the past: On the shoulders of an academic search engine giant. Scientometrics 2016, 107, 1477–1487.
  62. Van-Aalst, J. Using Google Scholar to estimate the impact of journal articles in education. Educ. Res. 2010, 39, 387–400.
  63. Jacsó, P. Testing the calculation of a realistic h-index in Google Scholar, Scopus, and Web of Science for FW Lancaster. Libr. Trends 2008, 56, 784–815.
  64. Jacsó, P. The pros and cons of computing the h-index using Google Scholar. Online Inf. Rev. 2008, 32, 437–452.
  65. Jacsó, P. Using Google Scholar for journal impact factors and the h-index in nationwide publishing assessments in academia: siren songs and air-raid sirens. Online Inf. Rev. 2012, 36, 462–478.
  66. Martín-Martín, A.; Orduña-Malea, E.; Harzing, A.W.; Delgado-López-Cózar, E. Can we use Google Scholar to identify highly-cited documents? J. Informetr. 2017, 11, 152–163.
  67. Farhadi, H.; Salehi, H.; Yunus, M.; Aghaei-Chadegani, A.; Farhadi, M.; Fooladi, M.; Ale-Ebra-him, N. Does it matter which citation tool is used to compare the h-index of a group of highly cited researchers? Aust. J. Basic Appl. Sci. 2013, 7, 198–202. Available online: https://ssrn.com/abstract=2259614 (accessed on 23 January 2021).
  68. Marks, T.; Le, A. Increasing article findability online: The four Cs of search engine optimization. Law Libr. J. 2017, 109, 83.
  69. Kearl, M.; Noteboom, C.; Tech, D. A proposed improvement to Google Scholar algorithms through broad topic search emergent research forum paper. Fac. Res. Publ. 2017, 6, 1–5. Available online: https://scholar.dsu.edu/bispapers/6 (accessed on 23 January 2021).
  70. Localseoguide. How Have Local Ranking Factors Changed? 2019. Available online: https://www.localseoguide.com/guides/local-seo-ranking-factors/ (accessed on 1 July 2020).
  71. Searchmetrics. Rebooting for Relevance. 2016. Available online: https://www.searchmetrics.com/knowledge-hub/studies/ranking-factors-2016/ (accessed on 1 July 2020).
  72. MOZ. Search Engine Ranking Factors 2015. Available online: https://moz.com/search-ranking-factors/correlations (accessed on 1 July 2019).
  73. Dave, D. 11 Things You Must Know About Google’s 200 Ranking Factors. 2018. Available online: https://www.searchenginejournal.com/google-200-ranking-factors-facts/265085/ (accessed on 10 September 2019).
  74. Chariton, R. Google Algorithm—What Are the 200 Variables? 2004. Available online: https://www.webmasterworld.com/google/4030020.htm (accessed on 10 September 2019).
  75. Wiggers, K. Google Details How It’s Using AI and Machine Learning to Improve Search. 2020. Available online: https://venturebeat.com/2020/10/15/google-details-how-its-using-ai-and-machine-learning-to-improve-search/ (accessed on 10 July 2020).
  76. Google. About Google Scholar. Available online: http://scholar.google.com/intl/en/scholar/about.html (accessed on 1 July 2019).
  77. Mayr, P.; Walter, A.-K. An exploratory study of google scholar. Online Inf. Rev. 2007, 31, 814–830.
  78. Gielen, M.; Rosen, J. Reverse Engineering the YouTube, tubefilter.com. 2016. Available online: http://www.tubefilter.com/2016/06/23/reverse-engineering-youtube-algorithm/ (accessed on 1 July 2019).
  79. Harzing, A.-W. Publish or Perish. 2016. Available online: https://harzing.com/resources/publish-or-perish (accessed on 1 July 2019).
  80. Harzing, A.W. The Publish Or Perish Book: Your Guide to Effective and Responsible Citation Analysis; Tarma Software Research Pty Ltd.: Melbourne, Australia, 2011; pp. 39–342. Available online: https://EconPapers.repec.org/RePEc:spr:scient:v:88:y:2011:i:1:d:10.1007_s11192-011-0388-8 (accessed on 1 July 2019).
  81. R Core Team. R: A Language and Environment for Statistical Computing. Available online: https://www.R-project.org (accessed on 1 July 2019).
  82. Revelle, W. Psych: Procedures for Personality and Psychological Research, Northwestern University. 2017. Available online: https://www.scholars.northwestern.edu/en/publications/psych-procedures-for-personality-and-psychological-research (accessed on 1 July 2019).
  83. Lemon, J. Plotrix: A package in the red light district of R. R News 2006, 6, 8–12.