July 21st, 2014

How can I get the URL to the Web page the clipboard was copied from?

When you copy content from a Web page to the clipboard and then paste it into OneNote, OneNote pastes the content but also annotates it “Pasted from …”. How does OneNote know where the content was copied from?

As noted in the documentation for the HTML clipboard format, Web browsers can provide an optional Source­URL property to specify the Web page the HTML was copied from.

Let’s write a Little Program that mimics what OneNote does, but just in plain text, because I don’t want to try to parse HTML. This is much easier to do in C#, because the BCL provides most of the helper functions.

using System;
using System.IO;
using System.Windows;
class Program {
 [STAThread]
 public static void Main() {
  System.Console.WriteLine(Clipboard.GetText());
  using (var sr = new StringReader(
               Clipboard.GetText(TextDataFormat.Html))) {
   string s;
   while ((s = sr.ReadLine()) != null) {
    if (s.StartsWith("SourceURL:")) {
     System.Console.WriteLine("Copied from {0}", s.Substring(10));
     break;
    }
   }
  }
 }
}

First, we get the text from the clipboard and print it. That’s the easy part.

Next, we get the HTML text from the clipboard. This is a bunch of text in a particular format. We look for an entry that specifies the Source­URL; if we find it, then we print the URL.

This code is rather sloppy. For example, if the HTML itself contains the string SourceURL:haha-fakeout, we risk misdetecting it as the source. To do this properly, we would have to verify that the string appears in the header area of the HTML (before the first StartFragment).

But this is a Little Program, so I can skip all that stuff.

Here’s a sketch of the equivalent C/C++ version:

int __cdecl main(int, char **)
{
 if (OpenClipboard(NULL)) {
  // Obtain the Unicode text and print it
  HANDLE h = GetClipboardData(CF_UNICODETEXT);
  if (h) {
   PCWSTR pszPlainText = GlobalLock(h);
   ... print pszPlainText ...
   GlobalUnlock(h);
  }
  // Obtain the HTML text and extract the SourceURL
  h = GetClipboardData(RegisterClipboardFormat(TEXT("HTML Format")));
  if (h) {
   PCSTR pszHtmlFormat = GlobalLock(h);
   ... break pszHtmlFormat into lines ...
   ... look for a line that begins with "SourceURL:" ...
   ... if found, print it ...
   GlobalUnlock(h);
  }
  CloseClipboard();
 }
 return 0;
}
Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

0 comments

Discussion are closed.