Friday, July 21, 2023

Internet Search Tips

Google-fu search skill is something I’ve prided myself on ever since elementary school, when the librarian challenged the class to find things in the almanac; not infrequently, I’d win. And I can still remember the exact moment it dawned on me in high school that much of the rest of my life would be spent dealing with searches, paywalls, and broken links. The Internet is the greatest almanac of all, and to the curious, a never-ending cornucopia, so I am sad to see many fail to find things after a cursory search, or not look at all. For most people, if it’s not the first hit in Google/Google Scholar, it doesn’t exist. Below, I reveal my best Internet search tricks and try to provide a rough flowchart of how to go about an online search, explaining the subtle tricks and tacit knowledge of search-fu.

Roughly, we need to have proper tools to create an occasion for a search: we cannot search well if we avoid searching at all. Then each search will differ by which search engine & type of medium we are searching—they all have their own quirks, blind spots, and ways to modify a failed search. Often, we will run into walls, each of which has its own circumvention methods. But once we have found something, we are not done: we would often be foolish & short-sighted if we did not then make sure it stayed found. Finally, we might be interested in advanced topics like ensuring in advance resources can be found in the future if need be, or learning about new things we might want to then go find. To illustrate the overall workflow & provide examples of tacit knowledge, I include many Internet case studies of finding hard-to-find things.

Followup section to the article covering how to search the Internet effectively: >14 case studies of challenging Internet searches drawn from the past 10 years. I present the problem, and step through the process of finding it, and describe my tacit knowledge and implicit strategies. These case studies make the prior tips more understandable by showing them off in practice.

Anders Sandberg asked:

Does anybody know where the online appendix to Nordhaus’ “Two Centuries of Productivity Growth in Computing” is hiding?

I look up the title in Google Scholar; seeing a friendly psu.edu PDF link (CiteSeerx), I click. The paper says “The data used in this study are provided in a background spreadsheet available at http://www.econ.yale.edu/~nordhaus/Computers/Appendix.xls”. Sadly, this is a lie. (Sandberg would, of course, have tried that already.)

I immediately check the URL in the IA—nothing. The IA didn’t catch it at all. Maybe the official published paper website has it? Nope, it references the same URL, and doesn’t provide a copy as an appendix or supplement. (What do we pay these publishers such enormous sums of money for, exactly?) So I back off to checking http://www.econ.yale.edu/~nordhaus/, to check Nordhaus’s personal website for a newer link. The Yale personal website is empty and appears to’ve been replaced by a Google Sites personal page. It links nothing useful, so I check a more thorough index, Google, by searching site:sites.google.com/site/williamdnordhaus/. Nothing there either (and it appears almost empty, so Nordhaus has allowed most of his stuff to be deleted and bitrot). I try a broader Google: nordhaus appendix.xls. This turns up some spreadsheets, but still nothing.

Easier approaches having been exhausted, I return to the IA and I pull up all URLs archived for his original personal website: https://web.archive.org/web/*/http://www.econ.yale.edu/~nordhaus/* This pulls up way too many URLs to manually review, so I filter results for xls, which reduces to a more manageable 60 hits; reading through the hits, I spot http://www.econ.yale.edu/~nordhaus/homepage/documents/Appendix_Nordhaus_computation_update_121410.xlsx from 2014-10-10; this sounds right, albeit substantially later in time than expected (either 2010 or 2012, judging from the filename).
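The "list everything archived under a prefix, then filter" step can also be scripted. What follows is a minimal sketch, not the procedure used above: it assumes the Wayback Machine's CDX search endpoint and its `url`, `matchType`, `output`, `fl`, and `collapse` parameters, and the helper names `cdx_query_url` and `filter_urls` are made up for illustration. Only the local filtering is exercised here; fetching the listing is left to the reader.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX search service.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(prefix, limit=None):
    """Build a query URL listing captures whose URL starts with `prefix`."""
    params = {
        "url": prefix,
        "matchType": "prefix",       # match everything under the prefix
        "output": "json",
        "fl": "timestamp,original",  # only return capture date and URL
        "collapse": "urlkey",        # one row per distinct URL
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_ENDPOINT + "?" + urlencode(params)


def filter_urls(urls, keyword):
    """Keep URLs containing `keyword` (case-insensitive), without duplicates."""
    needle = keyword.lower()
    seen = set()
    hits = []
    for url in urls:
        if needle in url.lower() and url not in seen:
            seen.add(url)
            hits.append(url)
    return hits


# A tiny stand-in listing: the first URL is the one found above,
# the other two are hypothetical filler.
listing = [
    "http://www.econ.yale.edu/~nordhaus/homepage/documents/Appendix_Nordhaus_computation_update_121410.xlsx",
    "http://www.econ.yale.edu/~nordhaus/homepage/example_page.html",
    "http://www.econ.yale.edu/~nordhaus/homepage/example_paper.pdf",
]
spreadsheets = filter_urls(listing, "xls")
```

Filtering on the substring `xls` rather than an exact extension is deliberate: it catches both `.xls` and `.xlsx`, which is how the renamed spreadsheet surfaced in this case.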

Downloading it⁠, opening it up and cross-referencing with the paper, it has the same spreadsheet ‘sheets’ as mentioned, like “Manual” or “Capital_Deep”, and seems to be either the original file in question or an updated version thereof (which may be even better). The spreadsheet metadata indicates it was created “04/​09/​2001, 23:20:43, ITS Academic Media & Technology”, and modified “12/​22/​2010, 02:40:20”, so it seems to be the latter—it’s the original spreadsheet Nordhaus created when he began work several years prior to the formal 2007 publication (6 years seems reasonable given all the delays in such a process), and then was updated 3 years afterwards. Close enough.

A Redditor asked:

I was in a consignment type store once and picked up a book called “Eat fat, get thin”. Giving it a quick scan through, it was basically the same stuff as Atkins but this book was from the 50s or 60s. I wish I’d have bought it. I think I found a reference to it once online but it’s been drowned out since someone else released a book with the same name (and it wasn’t Barry Groves either).

The easiest way to find a book given a corrupted title, a date range, and the information that there are many similar titles drowning out a naive search engine query, is to skip to a specialized search engine with clean metadata (ie. a library database).

Searching in WorldCat for 1950s–1970s, “Eat fat, get thin” turns up nothing relevant. This is unsurprising, as he was unlikely to’ve remembered the title exactly, and this title doesn’t quite sound right for the era anyway (a little too punchy and ungrammatical, and ‘thin’ wasn’t a desirable word back then compared to words like ‘slim’ or ‘sleek’ or ‘svelte’). People often oversimplify titles, so I dropped back to just “Eat fat”.

This immediately turned up the book: Richard Mackarness’s 1958 Eat Fat and Grow Slim—note that it is almost the same title, with a comma serving as conjunction and ‘slim’ rather than the more contemporary ‘thin’, but just different enough to screw up an overly-literal search.

With the same trick in mind, we could also have found it in a regular Google search query by adding additional terms to hint to Google that we want old books, not recent ones: both "Eat Fat" 1950s or "Eat Fat" 1960s would have turned it up in the first 5 search results. If we didn’t use quotes, the searches get harder because broader hits get pulled in. For example, Eat fat, get thin 1950s -Hyman excludes the recent book mentioned, but you still have to go down 15 hits before finding Mackarness, and Eat fat, get thin -Hyman requires going down 18 hits.
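The query variations above follow a small grammar: quoted phrases, bare terms, and `-` exclusions. As an illustrative sketch only (the helper names `build_query` and `google_url` are my own, and the URL form assumes Google's ordinary `q` parameter), they can be assembled like this:

```python
from urllib.parse import urlencode


def build_query(phrases=(), terms=(), exclude=(), site=None):
    """Assemble a search query from exact phrases, loose terms, and exclusions."""
    parts = [f'"{phrase}"' for phrase in phrases]  # exact-phrase matches
    parts += list(terms)                           # ordinary keywords
    parts += [f"-{word}" for word in exclude]      # words that must not appear
    if site:
        parts.append(f"site:{site}")               # restrict to one website
    return " ".join(parts)


def google_url(query):
    """Turn a query string into a Google search URL."""
    return "https://www.google.com/search?" + urlencode({"q": query})


quoted = build_query(phrases=["Eat Fat"], terms=["1950s"])
unquoted = build_query(terms=["Eat fat, get thin", "1950s"], exclude=["Hyman"])
```

Here `quoted` reproduces the `"Eat Fat" 1950s` search and `unquoted` reproduces `Eat fat, get thin 1950s -Hyman`; keeping the pieces separate makes it easy to try the next variation by changing one argument.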

Bučar et al 2015, on the phenomenon of disappearing polymorphs, quotes striking transcripts from a major example of a disappearing crystal, when ~1998 Abbott suddenly became unable to manufacture the anti-retroviral drug ritonavir (Norvir™) due to a rival (and less effective) crystal form spontaneously infecting all its plants, threatening many AIDS patients, but notes:

The transcripts were originally published on the website [42] of the International Association of Physicians in AIDS Care [IAPAC], but no longer appear there.

A search using the quotes confirms that the originals have long since vanished from the open Internet, turning up only quotes of the quotations. Unfortunately, no URL is given. The Internet Archive has comprehensive mirrors of the IAPAC, but too many to easily search through. Using the filter feature, I keyword-searched for “ritonavir”, but while this turned up a number of pages from roughly the right time period, they do not mention it and none of the quotes appear. The key turned out to be to use the trademark name instead which pulls up many more pages, and after checking a few, the IAPAC turned out to have organized all the Norvir material into a single subdirectory with a convenient index.html⁠; the articles/​transcripts, in turn, were indexed under the linked “Description of the Problem” index page⁠.

I then pulled the Norvir subdirectory with a ~/.gem/ruby/2.5.0/bin/wayback_machine_downloader 'http://www.iapac.org/norvir/' command and hosted a mirror to make it visible in Google.

Nancy Lebovitz asked about a citation in a Roy Baumeister speech about sex differences:

There’s an idea I’ve seen a number of times that 80% of women have had descendants, but only 40% of men. A little research tracked it back to this⁠, but the speech doesn’t have a cite and I haven’t found a source.

This could be solved by guessing that the formal citation is given in the book, and doing keyword search to find a similar passage. The second line of the speech says:

For more information on this topic, read Dr. Baumeister’s book Is There Anything Good About Men? available in bookstores everywhere, including here.

A search of Is There Anything Good About Men in Libgen turns up a copy. Download. What are we looking for? A reminder, the key lines in the speech are:

…It’s not a trick question, and it’s not 50%. True, about half the people who ever lived were women, but that’s not the question. We’re asking about all the people who ever lived who have a descendant living today. Or, put another way, yes, every baby has both a mother and a father, but some of those parents had multiple children. Recent research using DNA analysis answered this question about two years ago. Today’s human population is descended from twice as many women as men. I think this difference is the single most under-appreciated fact about gender. To get that kind of difference, you had to have something like, throughout the entire history of the human race, maybe 80% of women but only 40% of men reproduced.

We could search for various words or phrases from this passage which seem to be relatively unique; as it happens, I chose the rhetorical “50%” (but “80%”, “40%”, “underappreciated”, etc. all would’ve worked with varying levels of efficiency since the speech is heavily based on the book), and thus jumped straight to chapter 4, “The Most Underappreciated Fact About Men”. (If these had not worked, we could have started searching for years, based on the quote “about two years ago”.) A glance tells us that Baumeister is discussing exactly this topic of reproductive differentials, so we read on and a few pages later, on page 63, we hit the jackpot:

The correct answer has recently begun to emerge from DNA studies, notably those by Jason Wilder and his colleagues. They concluded that among the ancestors of today’s human population, women outnumbered men about two to one. Two to one! In percentage terms, then, humanity’s ancestors were about 67% female and 33% male.

Who’s Wilder? A C-f for “Wilder” takes us to pg286, where we immediately read:

…The DNA studies on how today’s human population is descended from twice as many women as men have been the most requested sources from my earlier talks on this. The work is by Jason Wilder and his colleagues. I list here some sources in the mass media, which may be more accessible to laypersons than the highly technical journal articles, but for the specialists I list those also. For a highly readable introduction, you can Google the article “Ancient Man Spread the Love Around,” which was published September, 20, 2004 and is still available (last I checked) online. There were plenty of other stories in the media at about this time, when the research findings first came out. In “Medical News Today,” on the same date in 2004, a story under “Genes expose secrets of sex on the side” covered much the same material.

If you want the original sources, read Wilder, J. A., Mobasher, Z., & Hammer, M. F. (2004). “Genetic evidence for unequal effective population sizes of human females and males”⁠. Molecular Biology and Evolution, 21, 2047–2057. If that went down well, you might try Wilder, J. A., Kingan, S. B., Mobasher, Z., Pilkington, M. M., & Hammer, M. F. (2004). “Global patterns of human mitochondrial DNA and Y-chromosome structure are not influenced by higher migration rates of females versus males”⁠. Nature Genetics, 36, 1122–1125. That one was over my head, I admit. A more readable source on these is Shriver, M. D. (2005), “Female migration rate might not be greater than male rate”⁠. European Journal of Human Genetics, 13, 131–132. Shriver raises another intriguing hypothesis that could have contributed to the greater preponderance of females in our ancestors: Because couples mate such that the man is older, the generational intervals are smaller for females (ie. baby’s age is closer to mother’s than to father’s). As for the 90% to 20% differential in other species, that I believe is standard information in biology, which I first heard in one of the lectures on testosterone by the late James Dabbs, whose book Heroes, Rogues, and Lovers remains an authoritative source on the topic.

Wilder et al 2004, incidentally, fits well with Baumeister remarking in 2007 that the research was done 2 or so years ago. And of course you could’ve done the same thing using Google Books: search “Baumeister anything good about men” to get to the book, then search-within-the-book for “50%”, jump to page 53, read to page 63, do a second search-within-the-book for “Wilder” and the second hit of page 287 even luckily gives you the snippet:

Sources and References 287

…If you want the original sources, read Wilder, J. A., Mobasher, Z., & Hammer, M. F. (2004). “Genetic evidence for unequal effective population sizes of human females and males”. Molecular Biology and Evolution
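The keyword step in this case study amounts to picking the rarest distinctive term from the speech and jumping to the pages where it occurs. A toy sketch of that idea, assuming the book is already available as a list of page texts (the functions `pages_matching` and `rank_terms` and the three sample pages are invented for illustration):

```python
def pages_matching(pages, term):
    """Return 1-based page numbers whose text contains `term` (case-insensitive)."""
    needle = term.lower()
    return [number for number, text in enumerate(pages, start=1)
            if needle in text.lower()]


def rank_terms(pages, terms):
    """Order candidate terms from fewest matching pages to most.

    Terms that never occur are dropped, since they cannot lead anywhere.
    """
    hits = {term: pages_matching(pages, term) for term in terms}
    found = {term: matched for term, matched in hits.items() if matched}
    return sorted(found.items(), key=lambda item: len(item[1]))


# Invented sample pages standing in for a real book.
sample_pages = [
    "women and men in history",
    "the answer is not 50%, since women outnumbered men",
    "work by Wilder and colleagues on women and men",
]
ranking = rank_terms(sample_pages, ["women", "50%", "Wilder", "zebra"])
```

A common word such as "women" matches every page and sorts last, while "50%" and "Wilder" each point to a single page, which is why those were the useful searches here.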

Did J.K. Rowling say the Harry Potter books were about ‘death’? There are a lot of Rowling statements, but checking WP and opening up each interview link (under the theory that the key interviews are linked there) and searching for ‘death’ soon turns up a relevant quote from 2001:

Death is an extremely important theme throughout all seven books. I would say possibly the most important theme. If you are writing about Evil, which I am, and if you are writing about someone who is essentially a psychopath⁠, you have a duty to show the real evil of taking human life.

Scott Alexander posted a piece linking to an excerpt titled “Crowley on Religious Experience”.

The link was broken, but Alexander brought it up in the context of an earlier discussion where he also quoted Crowley; searching those quotes reveals that it must have been excerpts from Magick: Book 4.

Phil Goetz noted that an anti-aging conference named “SAGE” had become impossible to find in Google due to an LGBT aging conference also named SAGE.

Regular searches would fail, but a combination of tricks worked: SAGE anti-aging conference combined with restricting Google search to 2003–2005 time-range turned up a citation to its website as the fourth hit, http://www.sagecrossroads.net (which has ironically since died).

The Future of Humanity Institute (FHI) doesn’t clearly provide charity financial forms akin to the US Form 990s, making it hard to find out information about its budget or results.

FHI doesn’t show up in the CC, NPC, or GuideStar⁠, which are the first places to check for charity finances, so I went a little broader afield and tried a site search on the FHI website: budget site:fhi.ox.ac.uk. This immediately turned up FHI’s own documentation of its activities and budgets, such as the 2007 annual report; I used part of its title as a new Google search: future of humanity institute achievements report site:fhi.ox.ac.uk.

John Maxwell referred to a forgotten study on high correlation between Nobelist professors & Nobelist grad students (almost entirely a selection effect, I would bet). I was able to refind it in 7 minutes.

I wasted a few searches like factor predicting Nobel prize or Nobel prize graduate student in Google Scholar, until I searched for Nobel laureate "graduate student"; the second hit was a citation, which is a little unusual for Google Scholar and meant it was important, and it had the critical word mutual in it. Simultaneous partners in Nobel work are somewhat rare, but temporally separated teams don’t work for prizes, so I suspected that it was exactly what I was looking for. Googling the title, I soon found a PDF like “Eminent Scientists’ Demotivation in School: A symptom of an incurable disease?”, Viau2004, which confirmed it (and Viau2004 is interesting in its own right as a contribution to the Conscientiousness vs IQ question). I then followed it to a useful paragraph:

In a study conducted with 92 American winners of the Nobel Prize, Zuckerman (1977) discovered that 48 of them had worked as graduate students or assistants with professors who were themselves Nobel Prize award-winners. As pointed out by Zuckerman (1977), the fact that 11 Nobel prizewinners have had the great physicist Rutherford as a mentor is an example of just how significant a good mentor can be during one’s studies and training. It then appears that most eminent scientists did have people to stimulate them during their childhood and mentor(s) during their studies. But, what exactly is the nature of these people’s contribution.

  • Zuckerman, H. (1977). Scientific Elite: Nobel Laureates in the United States. New York: Free Press.

GS lists >900 citations of this book, so there may well be additional or followup studies covering the 40 years since. Or, also relevant is “Zuckerman, H. (1983). The scientific elite: Nobel laureates’ mutual influences. In R. S. Albert (Ed.), Genius and eminence (pp. 241–252). New York: Pergamon Press”, and “Zuckerman H. ‘Sociology of Nobel Prizes’, Scientific American 217 (5): 25& 1967.”

A link to a research article in a post by Morendil broke; he had not provided any formal citation data, and the original domain blocks all crawlers in its robots.txt so IA would not work. What to do?

The simplest solution was to search a direct quote, turning up a Scribd mirror; Scribd is a parasite website, where people upload copies from elsewhere, which ought to make one wonder where the original came from. (It often shows up before the original in any search engine, because it automatically runs OCR on submissions, making them more visible to search engines.) With a copy of the journal issue to work with, you can easily find the official HP archives and download the original PDF⁠.

If that hadn’t worked, searching for the URL without /pg_2/ in it yields the full citation, and then that can be looked up normally. Finally, somewhat more dangerous would be trying to find the article just by author surname & year.

A 2013 Medical Daily article on the effects of reading fiction omitted any link or citation to the research in question. But it is easy to find.

The article says the authors are one Kaufman & Libby, and implies it was published in the last year. So: go to Google Scholar, punch in Kaufman Libby, limit to ‘Since 2012’; and the correct paper (“Changing beliefs and behavior through experience-taking”) is the first hit with fulltext available on the right-hand side as the text link “[PDF] from tiltfactor.org” & many other domains.

Is soy milk bad for you as one study suggests? Has anyone replicated it? This is easy to look into a little if you use the power of reverse citation search!

Plug Brain aging and midlife tofu consumption into Google Scholar; one of the little links under the first hit reads “Cited by 176”; if you click on that, you can hit a checkbox for “Search within citing articles”; then you can search a query like experiment OR randomized OR blind which yields 121 results. The first result shows no negative effect and a trend to a benefit, the second is inaccessible, the next two are reviews whose abstracts suggest they would argue for benefits, and the one after that discusses sleep & mood benefits to soy diets. At least from a quick skim, this claim is not replicating, and I am dubious about it.

Does NYC really have 114,000+ homeless school children? This case study demonstrates the critical skill of noticing the need to search at all, and the search itself is almost trivial.

Won’t someone think of the children? In March 2020, as New York coronavirus cases began their exponential increase centered in Manhattan (with a similar trend to Wuhan/​Iran/​Italy), NYC Mayor Bill de Blasio refused to take social distancing/​quarantine measures like ordering the NYC public school system closed, and this delay until 16 March contributed to the epidemic’s unchecked spread in NYC; one justification was that there were “114,085 homeless children” who received social services like free laundry through the schools. This number has been widely cited in the media by the NYT, WSJ, etc, and was vaguely sourced to “state data” reported by “Advocates for Children of New York”. This is a terrible reason to not deal with a pandemic that could kill tens of thousands of New Yorkers, as there are many ways to deliver services which do not require every child in NYC to attend school & spread infections—but first, is this number even true?

Basic numeracy: implausibly-large! Activists of any stripe are untrustworthy sources, and a number like 114k should make any numerate person uneasy even without any Fermi estimation or fact-checking; “114,085” is suspiciously precise for such a difficult-to-measure or define thing like homelessness, and it’s well-known that the population of NYC is ~8m or 8,000k—is it really the case that around 1 in every 70 people living in NYC is a homeless child age ~5–18 attending a public school? They presumably have at least 1 parent, and probably younger siblings, so that would bring it up to >228k or 1 in every <35 inhabitants of NYC being homeless in general. Depending on additional factors like transiency & turnover, the fraction could go much higher still. Does that make sense? No, not really. This quoted number is either surprising, or there is something missing.
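The sanity check above is only two divisions, and spelling them out shows where the "1 in 70" and "1 in 35" figures come from. This is a back-of-the-envelope sketch using the round 8m population figure quoted above and the assumption of exactly one parent per student; the variable names are mine.

```python
NYC_POPULATION = 8_000_000   # the round ~8m figure used above
HOMELESS_STUDENTS = 114_085  # the widely quoted number


def one_in(population, count):
    """Express `count` out of `population` as '1 in N' and return N."""
    return population / count


# Students alone: roughly 1 in 70 inhabitants.
students_ratio = one_in(NYC_POPULATION, HOMELESS_STUDENTS)

# Assume each student has at least 1 parent: the count doubles to >228k,
# and every younger sibling pushes the ratio further below 1 in 35.
with_one_parent = HOMELESS_STUDENTS * 2
household_ratio = one_in(NYC_POPULATION, with_one_parent)

print(f"students: 1 in {students_ratio:.1f}")
print(f"with one parent each: {with_one_parent:,} people, 1 in {household_ratio:.1f}")
```

The point of the exercise is not precision but the order of magnitude: a claim implying that about 3% of a city's inhabitants are homeless is worth a search before repeating.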

Redefining “homeless”. Fortunately, the suspiciously-precise number and attribution make this a good place to start for a search. Searching for the number and the name of the activist group instantly turns up the source press release⁠, and the reasons for the bizarrely high number are revealed: the statistic actually redefines ‘homelessness’ to include living with relatives or friends, and counts any experience of any length in the previous year as rendering that student ‘homeless’ at the moment.

The data, which come from the New York State Education Department, show that in the 2018-2019 school year, New York City district and charter schools identified 114,085, or one in ten, students as homeless. More than 34,000 students were living in New York City’s shelters, and more than twice that number (73,750) were living ‘doubled-up’ in temporary housing situations with relatives, friends, or others…“This problem is immense. The number of New York City students who experienced homelessness last year—85% of whom are Black or Hispanic—could fill the Barclays Center six times,” said Kim Sweet, AFC’s Executive Director. “The City won’t be able to break the cycle of homelessness until we address the dismal educational outcomes for students who are homeless.”

The WSJ’s article (but not headline) confirms that ‘experienced’ does indeed mean ‘at any time in the year for any length of time’, rather than ‘at the moment’:

City district and charter schools had 114,085 students without their own homes at some point last year, topping 100,000 for the fourth year in a row, according to state data released in a report Monday from Advocates for Children of New York, a nonprofit seeking better services for the disadvantaged. Most children were black or Hispanic, and living “doubled up” with friends, relatives or others. But more than 34,000 slept in city shelters at some point, a number larger than the entire enrollment of many districts, such as Buffalo, Rochester or Yonkers.

Less than meets the eye. So the actual number of ‘homeless’ students (in the sense that everyone reading those media articles understands it) is less than a third of the quoted figure, 34k, and that 34k number is likely itself a loose estimate of how many students would be homeless at the time of a coronavirus closure. This number is far more plausible and intuitive, and while one might wonder about what the underlying NYS Education Department numbers would reveal if fact-checked further, that’s probably unnecessary for showing how ill-founded the anti-closure argument is, since even by the activists’ own description, the relevant number is far smaller than 114k.

“Evolution of the Human Brain: From Matter to Mind”, Hofman2015⁠, discusses the limits to the intelligence of increasingly large primate brains due to considerations like increasing latency and overheating. One citation attempting to extrapolate upper bounds is “Biological limits to information processing in the human brain”, Cochrane et al 1995.

The source information is merely a broken URL: http://www.cochrane.org.uk/opinion/archive/articles.phd which stands out for looking doubly-wrong: “.phd” is almost certainly a typo for “.php” (probably muscle memory on the part of Hofman from “PhD”), but it also gives a hint that the entire URL is wrong: why would an article or essay be named anything like archive/articles.php? That sounds like an index page listing all the available articles.

After trying and failing to find Cochrane’s paper in the usual places, I returned to the hint. The Internet Archive doesn’t have that page under either possible URL, but the directory strongly hints that all of the papers would exist at URLs like archive/brain.php or archive/information-processing.php, and we can look up all of the URLs the IA has under that directory—how many could there be? A lot⁠, but only one has the keyword “brain” in it, providing us the paper itself⁠.
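The two guesses made above (repair the extension, then back off to the parent directory and list it in the IA) are mechanical enough to write down. A small sketch using only the standard library; the function names are invented, and the wildcard listing URL simply mirrors the `https://web.archive.org/web/*/…/*` form used in the Nordhaus case.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def fix_extension(url, wrong=".phd", right=".php"):
    """Replace a suspected typo'd file extension at the end of the URL path."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith(wrong):
        path = path[: -len(wrong)] + right
    return urlunsplit(parts._replace(path=path))


def parent_directory(url):
    """Drop the last path component, keeping a trailing slash."""
    parts = urlsplit(url)
    parent = posixpath.dirname(parts.path.rstrip("/")) + "/"
    return urlunsplit(parts._replace(path=parent, query="", fragment=""))


def wayback_listing(url):
    """Wildcard URL listing everything archived in the same directory."""
    return "https://web.archive.org/web/*/" + parent_directory(url) + "*"


broken = "http://www.cochrane.org.uk/opinion/archive/articles.phd"
repaired = fix_extension(broken)
listing_url = wayback_listing(broken)
```

When the repaired URL is not archived either, as happened here, the directory listing is the fallback: it replaces guessing the filename with scanning the filenames that actually exist.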

If that hadn’t worked, there was at least one other version hiding in the IA. When I googled the quoted title “Biological limits to information processing in the human brain”, the hits all appeared to be useless citations repeating the original Hofman citation—but for a crucial difference, as they cite a different URL (note the shift to an ‘archive.cochrane.org’ subdomain rather than the subdirectory cochrane.org.uk/opinion/archive/, and change of extension from .html to .php):

  • hit 5:

    Biological Limits to Information Processing in the Human Brain. Retrieved from: http://archive.cochrane.org.uk/opinion/archive/articles/brain9a.php

  • hit 7:

    Biological Limits to Information Processing in the Human Brain. Available online at: http://archive.cochrane.org.uk/opinion/archive/articles/brain9a.php; Da Costa …

Aside from confirming that it was indeed a ‘.php’ extension, that URL gives you a second copy of the paper in the IA⁠. Unfortunately, the image links are broken in both versions, and the image subdirectories also seem to be empty in both IA versions, though there’s no weird JS image loading badness, so I’d guess that the image links were always broken, at least by 2004. There’s no indication it was ever published or mirrored anywhere else, so there’s not much you can do about it other than to contact Peter Cochrane (who is still alive and actively publishing although he leaves this particular article off his publication list).

A commenter who shall remain nameless wrote:

I challenge you to find an example of someone saying “this den of X” where X does not have a negative connotation.

I found a positive connotation within 5s using my Google hotkey for "this den of ", and, curious about further ones, found additional uses of the phrase in regard to dealing with rattlesnakes in Google Books.

A failure case study: The_Duck looked for but failed to find other uses of a famous Wittgenstein anecdote. His mistake was being too specific:

Yes, clearly my Google-fu is lacking. I think I searched for phrases like “sun went around the Earth,” which fails because your quote has “sun went round the Earth.”

As discussed in the search tips, when you’re formulating a search, you want to balance how many hits you get, aiming for a sweet spot of a few hundred high-quality hits to review: the broader your formulation, the more likely the hits will include your target (if it exists), but the more hits you’ll have to wade through. In The_Duck’s case, he used an overly-specific search, which would turn up only 2 hits at most; this should have been a hint to loosen the search, such as by dropping quotes or dropping keywords.

In this case, my reasoning would go something like this, laid out explicitly: ‘“Wittgenstein” is almost guaranteed to be on the same page as any instance of this quote, since the quote is about Wittgenstein; LW, however, doesn’t discuss Wittgenstein much, so there won’t be many hits in the first place; to find this quote, I only need to narrow down those hits a little, and after “Wittgenstein”, the most fundamental core word to this quote is “Earth” or “sun”, so I’ll toss one of them in and… ah, there’s the quote!’

If I were searching the general Internet, my reasoning would go more like “‘Wittgenstein’ will be on, like, a million websites; I need to narrow that down a lot to hope to find it; so maybe ‘Wittgenstein’ and ‘Earth’ and ‘Sun’… nope, nothing on the first page, so toss in 'goes around' OR 'go around'—ah there it is!”

(Actually, for the general Internet, just Wittgenstein earth sun turns up a first page mostly about this anecdote, several of which include all the details one could need.)
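The loosening strategy in this case study can be sketched as a loop: start with the most specific query, and drop the quotes and then the remembered wording until the hit count lands in a reviewable range. Everything here is a toy under stated assumptions: `candidate_queries`, `first_useful_query`, and `make_counter` are invented names, and the counter is a crude substring matcher over three made-up documents standing in for a real search engine.

```python
import re


def candidate_queries(phrase, keywords):
    """Yield queries from most specific to broadest.

    `keywords` should be ordered from most to least essential.
    """
    anchor = " ".join(keywords)
    yield f'{anchor} "{phrase}"'.strip()   # exact phrase plus keywords
    yield f"{anchor} {phrase}".strip()     # same words, quotes dropped
    for n in range(len(keywords), 0, -1):  # then keywords only, fewer each time
        yield " ".join(keywords[:n])


def first_useful_query(queries, count_hits, min_hits=1, max_hits=500):
    """Return the first (query, hits) whose hit count is within range, else None."""
    for query in queries:
        hits = count_hits(query)
        if min_hits <= hits <= max_hits:
            return query, hits
    return None


def make_counter(documents):
    """Toy stand-in for a search engine: quoted phrases and words must all appear."""
    def count_hits(query):
        phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
        words = re.sub(r'"[^"]+"', " ", query).lower().split()

        def matches(document):
            text = document.lower()
            return (all(p in text for p in phrases)
                    and all(w in text for w in words))

        return sum(1 for document in documents if matches(document))
    return count_hits


# Made-up corpus: only the first document is the anecdote being sought.
corpus = [
    "Wittgenstein asked why people thought the sun went round the Earth",
    "Wittgenstein on language games",
    "The Earth orbits the sun",
]
result = first_useful_query(
    candidate_queries("sun went around the Earth", ["Wittgenstein", "Earth", "sun"]),
    make_counter(corpus),
)
```

In this toy run the misremembered "around" sinks both of the first two queries, and the bare keywords find the anecdote, which is the same failure and fix as in The_Duck's search.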

Someone asked on IRC: “anybody here know that one artist with the really creepy art sytle [sic] that starts with a z?”

I googled: ‘that one artist with the really creepy art sytle [sic] that starts with a z’. It was hit #2, Zdzisław Beksiński⁠. (DuckDuckGo, incidentally, buries Beksiński several pages in, and I didn’t find him in Bing at all.)

Quanticle asked:

There’s a sci-fi book I’m thinking of, where the protagonist is a scout soldier fighting an endless war against an insectoid species. It reads like a cross between Ender’s Game and Starship Troopers (but is not written by John Sclazi or is The Forever War) and the main story takes place inside a frame story where two other people are actually “reading” this soldier’s memories from his salvaged battlesuit. There is a planet called “Golden”, where the soldier is allegedly from. Does anyone have any idea what I’m talking about?

The search book about a soldier from the planet golden immediately turned up John Steakley’s Armor⁠. (This was showing off a little—Armor is well-regarded and difficult to forget, and I’d read it a long time ago and already knew the answer, pace the hacker koan⁠.)

Quanticle noted that “You know, I searched for similar phrases, but I ended up fixating on the soldier’s key phrase, where he called his battle-trance ‘The Machine’, and that dragged in lots of irrelevancies.” (A good intuition for search engine use would shy away from using any word or phrase as incredibly generic as “the machine”.)

FeepingCreature asked, while designing a compiler for a custom language,

Hey, what was the official name for Lisp’s “data and code” thing?

I already knew that it is “homoiconicity”, but I bet that official name for Lisp's "data and code" thing would work if I tried it in Google. It did.

Grayson81:

One thing that’s rather shocking to those of us who used search engines (and even directories like Yahoo before they got the idea of becoming real search engines from Google) is just how good they’ve got at understanding a vague, poorly written or mistaken search.

…I remember trying to explain how Google works to my mother ten years ago and explaining why “who’s that actress? You know, the one with the eyes. Not Katy Perry” isn’t a question that a computer can answer. Now she can Google exactly that and all of the top results are telling her that she’s thinking of Zooey Deschanel!

Julia Galef tweeted:

I read a webcomic ~15 years ago that I’ve been unable to find since, even with my best google-fu. It involved a robot living a bleak life as a working stiff. At the end he cracked open his “skull” and there was a small dying creature inside. The art style was less cartoony, and more like Moebius, I think? And maybe it was wordless? And, sorry, it wasn’t a “webcomic” in the sense of a long-running thing. It was a self-contained story, maybe 15 pages long?

Ultimately rediscovering that

The comic was called “Headcase” and it was by Sam Chivers.

Unfortunately, no mirrors of it appeared online or on Chivers’s current website, and discussions of it mentioned that it was interesting for being an Adobe Flash webcomic. Worse still, nothing useful appeared in the Internet Archive for the original website: somehow the IA appeared to have missed any relevant .swf files, and ‘head’/‘case’ turned up no relevant-looking filenames. It might have been buried in the opaquely-named images, and my usual next step would be to download the IA archives and inspect every image, but in other hits, I found that an obscure comics publisher had published an anthology involving Chivers, and closer inspection confirmed that “Headcase” was in fact published in their (long out of print) 2004 anthology Prophecy Anthology: Volume 1. (Not a prophetic name inasmuch as there was no volume 2.)
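The Internet Archive check for stray Flash files or suggestive filenames can be scripted against the Wayback Machine’s CDX index instead of being clicked through by hand. This is only a minimal sketch: the domain example.com is a placeholder (the comic’s original site is not named here), cdx_query_url is a hypothetical helper name, and the code merely builds the query URL without fetching anything.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain, url_regex=None, limit=500):
    """Build a CDX query listing the distinct URLs captured under `domain`."""
    params = [
        ("url", domain),
        ("matchType", "domain"),                 # include subdomains
        ("output", "json"),
        ("fl", "timestamp,original,mimetype"),   # fields to return
        ("collapse", "urlkey"),                  # one row per distinct URL
        ("limit", str(limit)),
    ]
    if url_regex:
        # filter=field:regex keeps only rows whose field matches the regex
        params.append(("filter", "original:" + url_regex))
    return CDX_ENDPOINT + "?" + urlencode(params)

# Placeholder domain; substitute the site actually being investigated.
print(cdx_query_url("example.com", r".*\.swf"))          # any Flash files?
print(cdx_query_url("example.com", r".*(head|case).*"))  # suggestive filenames?
```

An empty result for both queries is what would justify falling back to downloading and eyeballing every archived image.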

In one of the usual ironies of linkrot, Chivers’s presumably taking down “Headcase” for print publication in Prophecy may have preserved it: while I am unable to find any digital copies, the paper version is easily obtained as a used book & scanned at modest cost.

A physics article mentioned that its authors had been unable to get an old 1973 interview in a popular magazine; as is usually the case for non-scholarly magazines, after looking thoroughly, I could find no trace of it anywhere (not even in libraries or used-magazine sellers) other than an expensive DVD collection of back issues 1970–2010 still being sold by the publisher. Reasoning that if they had digitized the archives and were even selling them as a DVD collection, they ought to provide subscribers access to them as well, I signed up. They didn’t! So I resorted to the DVD since, worst-case, I should be able to get the viewer running under WINE and screenshot the interview.

The DVDs turned out to store all the PDFs as encrypted PDFs and the metadata in an ancient opaque database format I’d never heard of. Despite WINE AppDB’s claims, the viewing software only partially worked, and I set about attacking the PDFs directly. They used actual encryption, so pdftk couldn’t strip the passwording. Given the viewing software, I hypothesized that there was either a single master password or per-PDF passwords stored in the database.

In the hopes of it being a single short master password, I installed John the Ripper (JtR) jumbo edition and extracted the hash of a random file to attack: /snap/john-the-ripper/current/run/pdf2john.pl *.pdf > ~/hash. (Note: pdf2john is not in the default JtR, and it depends on JtR internal files so you can’t easily just copy it out of the Github repo & run it, as I discovered the hard way. You need to install the jumbo edition.) The password hashes of all the PDFs indeed turned out to be the same, so it used a master password. A simple attack with default password-space could be executed as john-the-ripper ~/hash. While I waited for all of the DVDs to copy, I saw that JtR was getting something like only a hundred thousand hashes/​s on my 16 Threadripper CPU cores, and did not have any success up to 5-character passwords.

If the password wasn’t really short, CPU wouldn’t be enough. I decided to switch to Hashcat to put my 2×1080ti Nvidia GPUs to good use, as they ought to run hundreds of times faster than JtR. (To convert the JtR hash format to Hashcat hash format, you delete the colon-separated filename field at the beginning of each line.) Hashcat uses a powerful but confusing DSL for specifying the exact password-space, and I made a reasonable guess that if the original programmer was so lazy as to use a single master password, he would also use a simple alphanumeric password (uppercase + lowercase + decimal digits), and nothing harder to type or read. To specify the PDF hash type and an attack starting at 1-character alphanumeric & increasing, I wound up with the incantation hashcat -m 10500 ~/hash.cat -w 3 --force -a 3 --increment -1 '?l?u?d' ?1?1?1?1?1?1?1?1?1?1?1.
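The format conversion mentioned in passing is mechanical enough to script. A minimal sketch, assuming each line of the JtR hash file looks like filename:$pdf$… and that the filename itself contains no colon; jtr_to_hashcat is a hypothetical helper name, and the example line is a placeholder rather than a real hash.

```python
def jtr_to_hashcat(jtr_line):
    """Drop the leading 'filename:' field so the line starts at the $pdf$ hash."""
    filename, sep, rest = jtr_line.strip().partition(":")
    if not sep or not rest.startswith("$pdf$"):
        raise ValueError("expected 'filename:$pdf$...' but got: " + jtr_line[:40])
    return rest

# Placeholder line for illustration only; a real hash has many '*'-separated fields.
example = "issue.pdf:$pdf$<hash-fields>"
print(jtr_to_hashcat(example))
```

Applied to every line of ~/hash, this would produce the ~/hash.cat file that the Hashcat command reads.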

Hashcat worked much better and within an hour had bruteforced on the order of 170 billion hashes and was up to somewhere around 8 characters. This did not succeed either. At this point, another programmer thought it’d be fun to participate and, while reverse-engineering the executable to see how it decrypted PDFs, suggested that the master password was probably hardcoded as a string literal inside the viewer executable. One could just dump all the strings inside it with the CLI utility strings (strings *.exe > strings.txt) and then use the dump as a Hashcat password list. To my chagrin, when I finally got around to trying cat strings.txt | hashcat -m 10500 ~/hash.cat -w 3, it finished within 1s.
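For anyone without the strings utility at hand, the same wordlist trick can be approximated in a few lines. This is a rough sketch rather than a reimplementation of strings: it only looks for runs of printable ASCII, the function names are made up for illustration, and the sample bytes are invented, not taken from the actual viewer.

```python
import re

def printable_strings(data, min_len=4):
    """Return runs of printable ASCII at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

def alnum_candidates(strings, min_len=6, max_len=32):
    """Keep only purely alphanumeric strings of plausible password length."""
    return [s for s in strings if s.isalnum() and min_len <= len(s) <= max_len]

# Invented stand-in for an executable's bytes.
blob = b"\x00\x01MZ\x90\x00This program cannot be run\x00\x00hunter2abc\x00\xff%s.pdf\x00"
found = printable_strings(blob)
print(found)
print(alnum_candidates(found))
```

Filtering down to alphanumeric candidates is optional; a dictionary attack over every extracted string is already tiny compared to any brute-force keyspace.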

The password turned out to be B775tO11dQvu74. I was right that it was alphanumerical, but at a length of 14 characters, I doubt I would have brute-forced it. (He successfully reverse-engineered it and discovered the viewer had been used for several other magazine archives as well, apparently, and simply switched master passwords to decrypt each one; the other passwords left in the executable were PbS19LuXd2pTXw, 1386r8wRrH01, & mfU33QQNlAFGI1.)
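The arithmetic behind that doubt is easy to sketch. The assumptions: the 62-symbol alphanumeric charset from the mask, and a rough rate of 170 billion hashes per hour taken from the figure above (actual speed depends on hardware and hash mode). Note also that the mask as quoted has only 11 positions, so that incrementing attack would have stopped well short of length 14 in any case.

```python
ALPHABET = 26 + 26 + 10        # ?l?u?d: lowercase + uppercase + digits
MASK_POSITIONS = 11            # number of ?1 tokens in the mask quoted above
RATE = 170e9 / 3600            # hashes/s, assuming ~170 billion per hour

def keyspace(length, alphabet=ALPHABET):
    """Number of candidate passwords of exactly this length."""
    return alphabet ** length

def cumulative(max_len, alphabet=ALPHABET):
    """Candidates tried by an incrementing attack from length 1 up to max_len."""
    return sum(keyspace(n, alphabet) for n in range(1, max_len + 1))

def years(candidates, rate=RATE):
    """Time to exhaust the candidates at the assumed rate, in years."""
    return candidates / rate / (365.25 * 24 * 3600)

for n in (MASK_POSITIONS, 14):
    print(n, cumulative(n), "%.3g years" % years(cumulative(n)))
```

Each extra character multiplies the work by 62, which is why a 14-character password is out of reach for this kind of attack while a wordlist of strings from the executable finishes instantly.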

I then decrypted the PDFs (for PDF in *.pdf; do pdftk "$PDF" input_pw "B775tO11dQvu74" output foo.pdf && mv foo.pdf "$PDF"; done), extracted & uploaded the interview, and archived the collection elsewhere.

In a vituperative review in Nature on 1977-03-17, the Harvard professor R. C. Lewontin excoriated Richard Dawkins’s classic The Selfish Gene and sociobiology in general, giving as an example

For more than 40 years evolutionary theory has remained free of a naive selectionism, but in recent times there has been a return to the extreme form of the adaptationist program, as evolutionists have rediscovered behaviour. Beginning with the undoubted truth that behaviour must, like morphology and physiology, be subject to the force of natural selection⁠, the new Panglossians end with the old error that all describable behaviour must be the direct product of natural selection. The scientific manifestation of this trend can be seen in every issue of say, The American Naturalist, which is permeated by the language, if not the formal apparatus, of game theory⁠, and in the development of the school of ‘sociobiology’, among whose more extraordinary productions is a recent highly praised dissertation explaining fellatio and cunnilingus among the upper middle classes as an adaptive response to constant resources. The popular manifestation of this new caricature of Darwinism reaches its most extreme form in The Selfish Gene by Richard Dawkins.

As is common in book reviews, Lewontin provides no citations, and 2 biologists were curious but unable to figure out what Lewontin was referring to despite searching.

The thesis in question is easy to find in under a minute, because the context gives so many hints: Lewontin refers to it as ‘highly praised’, so it will have been widely discussed & will have many substantive citations (if only to attack it); it is ‘recent’ (and sociobiology was a heated controversy so it is unlikely to be ‘recent’ in the sense of ‘a quiet field of research still mulling over a provocative paper from 2 decades before’, but more like ‘within the past 2 or 3 years’ & certainly within 1970–1977); it is a ‘dissertation’ and so single-authored & almost certainly a PhD thesis by someone who went on to become at least a postgrad researcher (because a master’s thesis would be too low-status to be discussed or praised, or singled out for abuse in Nature: it would be unclassy for a chaired Harvard professor to attack such a junior grad student’s work there like that); and it likely uses the words “fellatio” and “cunnilingus” as technical terms & decorous Latinate scientific censoring.

If we plug into Google Scholar a date-range of 1970–1977 and the simplest possible queries (fellatio cunnilingus "evolutionary psychology" OR sociobiology, or fellatio cunnilingus sociobiology, or fellatio cunnilingus "evolutionary psychology"), we see as the only 2 hits (for the first query) or among the hits (for the others) the immediately-relevant looking “Human sociobiology: Pair-bonding and resource predictability (effects of social class and race)”, Weinrich1977, and “Human Reproductive Strategy: I. Environmental Predictability And Reproductive Strategy; Effects Of Social Class And…”, Weinrich1976, both by the same author (148+16 citations, quite healthy); the single-authorship & search.proquest.com domain for the latter immediately tells us that it’s a PhD thesis; clicking verifies that the thesis was at Harvard (which both gives it the prestige Lewontin loathes & ensures he could easily hear of it); the similarity of titles suggests that the paper is a condensed version of the thesis (reading the paper suggests this isn’t entirely true but is more of an update); Weinrich did indeed go on to a long career (at San Diego University, publishing up until 2014); there are attacks like Lande1978 showing it did not pass without notice; and even the ProQuest preview of the abstract looks consistent with Lewontin’s summary (there can’t be that many such theses!).
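A date-restricted Google Scholar search like this can be bookmarked or scripted as a plain URL. The sketch below assumes the as_ylo/as_yhi parameters that the date-range sidebar puts into the address bar; this is observed behavior, not a documented API, and scholar_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

def scholar_url(query, year_lo=None, year_hi=None):
    """Build a Google Scholar search URL, optionally limited to a year range."""
    params = {"q": query}
    if year_lo is not None:
        params["as_ylo"] = year_lo   # earliest publication year
    if year_hi is not None:
        params["as_yhi"] = year_hi   # latest publication year
    return "https://scholar.google.com/scholar?" + urlencode(params)

queries = [
    'fellatio cunnilingus "evolutionary psychology" OR sociobiology',
    "fellatio cunnilingus sociobiology",
]
for q in queries:
    print(scholar_url(q, 1970, 1977))
```

Narrowing the years is what turns a handful of rare keywords into a result list short enough to skim in seconds.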

So, we can be sure that Lewontin is referring to Weinrich1976.

Nuclear physicist Edward Teller wrote a rhyming ‘atom alphabet’ about the nuclear era, but only a few of the letters like A/​B/​S are ever quoted. Did he write a whole alphabet?

Tracing citations back to a Time magazine article & Laura Fermi’s memoir strongly implies that he did not, and only wrote A/B/F/H/S. (There has been at least one effort to write the rest.)

In a discussion of learning⁠, Andy Matuschak referenced a paper on an illusion-of-depth in reading comprehension (related to illusions of learning from using cramming rather than spaced repetition), but mentioned he had been unable to find a copy anywhere to verify it. The citation for this paper was:

Pressley, M., Ghatala, E. S., Pirie, J., & Woloshyn, V. E. (1990). “Being really, really certain you know the main idea doesn’t mean you do”⁠. National Reading Conference Yearbook, 39⁠, 249–256.

I rose to the challenge.

Standard checks. Matuschak is indeed correct that this paper does not show up in any of the usual places, nor does ‘yearbook’ #39 seem to show up; this nut will not be cracked instantly. We do not see any encouraging hints if we google the citation (only a handful of later citations to it, which are sporadic enough to suggest that they too are citing a paper they have not read & that this will be hard to find). University & ProQuest databases turn up nothing for either.

This begins to look anomalous, so I broadened the search in Google. Here I stumbled across several of the yearbooks hosted at what looks like the National Reading Conference’s website; a targeted site: search, alas, fails to turn up anything useful. They may have scanned some later yearbooks, but apparently not the 1989 one…? Unfortunately, a dead end.

Barkless dogs. So we turn to book sources, like used-book search engines. We can find many of these yearbooks used at reasonable prices, but not #39—not a trace of it! This is odd. Being the 39th yearbook, with the others often available, would imply that it is available too: such serial publications don’t usually vary that much from year to year—if the ones before & after it are easy to get, it should be too. What one notices is that the titles don’t look anything like “National Reading Conference Yearbook, 39”: this citation must be wrong, that’s not how they were titled! With this in mind, we can search for a used copy to buy & scan, but this would be premature to do: now we have explained the prior absence of hits, and need to redo our searches; we thought there was no scan online before, but we know that was misleading so it may exist after all.

Alternate titles. Knowing this, we can search more broadly in Google, and skimming search results, look what we find! The PDF snippet reveals that our quarry, “Proceedings of the Annual Meeting of The National Reading Conference (39th…)” has been hidden behind the long uncited title “Literacy Theory and Research: Analyses from Multiple Paradigms”. Well, no wonder you can’t find it normally, and also (disappointingly but unsurprisingly), no wonder everyone copies the same incorrect citation.

Screenshot of key Google search hit, revealing the Academia.edu PDF copy of ERIC scan of National Reading Conference Yearbook #39.

Alternate paths. Downloading it, Pressley et al 1990 turns out to be buried on pg256 of this PDF; now that we know what to look for, this book turns out to have been easily findable after all: we can readily find the original non-Academia.edu PDF on our old friend ERIC and can find other yearbooks easily. We can also doublecheck other strategies: for example, if we had known the full names of the authors rather than the abbreviated ones in the citation, and we had googled something like “Michael Pressley, Elizabeth Ghatala, Jennifer Pirie, Vera E. Woloshyn”, that would have matched the indexed fulltext PDFs immediately. (Since you can often find the full names of authors even if the citation abbreviates them, this is a good tactic to know.)

Thus, in this instance, it’s crucial to remember that citations can be inaccurate and one must try variations. Over-fixating on the book title can hamper efforts to locate the article, which was, in reality, merely a click away.
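The “try variations” habit can be made concrete by generating the alternative queries up front instead of improvising them one at a time. This is only an illustrative sketch: query_variants is a hypothetical helper, and the particular set of variants is my own choice rather than a fixed recipe.

```python
def query_variants(title, full_names, alt_titles=()):
    """Return search strings to try in turn when the citation as given fails."""
    surnames = [name.split()[-1] for name in full_names]
    variants = [
        '"%s"' % title,                            # exact article title
        '"%s" %s' % (title, " ".join(surnames)),   # title plus surnames
        ", ".join(full_names),                     # full names match indexed fulltext
    ]
    # The containing volume may be catalogued under a different title than cited.
    variants += ['"%s" %s' % (alt, surnames[0]) for alt in alt_titles]
    return variants

for q in query_variants(
    "Being really, really certain you know the main idea doesn't mean you do",
    ["Michael Pressley", "Elizabeth Ghatala", "Jennifer Pirie", "Vera E. Woloshyn"],
    alt_titles=["Literacy Theory and Research: Analyses from Multiple Paradigms"],
):
    print(q)
```

Here the third and fourth variants are the ones that would have short-circuited the hunt, since neither depends on the mistaken yearbook title.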


