Douwe Osinga's Blog

Wednesday, July 19, 2023

Automate Research with a Neptyne Spreadsheet and OpenAI

While ChatGPT has definitely grabbed the headlines in the AI revolution, using LLMs to automate all kinds of tasks has yielded some interesting results too. A number of frameworks have sprung up, like LangChain, AutoGPT and BabyAGI. They allow users to go beyond the simple chat interface and connect all kinds of components to GPT-3.5/4, like web search, memory stores or even code writing. Very powerful tools, but not the easiest things to get started with.




In this post we'll show how you can achieve similar results in a Neptyne spreadsheet without having to worry about deployment. We're going to create a spreadsheet that autonomously does research. It's fully customizable and should work for any type of task, but in this example we'll focus on researching AI startup funding news. It looks like this:



You specify a (news) search query in C2 and the freshness in G2. The column headers in E4 to H4 are also configurable; they determine the information we want to extract from the articles we find.


After you hit the Go button, the automatic research starts by sending the search query to Bing's news search. For each result, the sheet calls out to an external service to render the page, running JavaScript and everything. It then extracts the interesting content from the news article and feeds it into ChatGPT, asking it to find the information for each of the specified columns (in this case Company, Amount, CEO and Investors). Finally, it adds a row to the spreadsheet with the information it found.



Get your own research bot


Ready to create your own AI research bot? The bot combines three different services, so you'll need to sign up for each of them, and of course you'll need a Neptyne account.


  • Bing News Search. We use this to get a list of news articles based on the query you enter in C2. If you have a Microsoft Azure account, setting this up is fairly straightforward, and it comes with a free tier of 1,000 searches per month.

  • PhantomJsCloud. This service takes a URL and renders it to HTML in the cloud; just fetching the raw HTML of a document isn't enough these days. This step is actually the slowest in our pipeline: rendering a modern web page can take time. Signing up is free, and the free tier gets you 500 page loads per day.

  • OpenAI. The current code uses GPT-3.5; you can switch to GPT-4 if you feel it is missing things, but it'll be slower and more expensive.


Sign up for all three services and note the API key for each. Now navigate to https://app.neptyne.com/-/tgjqzmjbfi and make a copy. When you hit the Go button, the system will ask you for the keys; once you've entered them, it will start running. You can interrupt the current run by hitting the button again, though it will take a little while since it finishes the current task first.

How does this work?

The main code is called from the Go button and lives in the run() method. Here's the slightly simplified code:


for item in news_search(C2, freshness=G2.value)['value']:
    article = fetch_article(item['url'])
    title, summary = extract_content(article)
    keywords = get_keywords(summary, E4:H4)



news_search returns a list of news articles. fetch_article calls PhantomJsCloud to get a rendered version of the article, and extract_content uses the python-readability library to strip everything but the actual content from the HTML.
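
For reference, here's a minimal sketch of what these two helpers can look like, assuming PhantomJsCloud's v2 browser API and python-readability's Document class; the key placeholder and the exact return values are illustrative:

import requests
from readability import Document  # pip install readability-lxml

PHANTOM_KEY = "your-phantomjscloud-key"  # placeholder

def fetch_article(url):
    # Ask PhantomJsCloud to load the page, run its JavaScript,
    # and hand back the fully rendered HTML.
    endpoint = f"https://phantomjscloud.com/api/browser/v2/{PHANTOM_KEY}/"
    response = requests.post(endpoint, json={"url": url, "renderType": "html"})
    response.raise_for_status()
    return response.text

def extract_content(html):
    # python-readability strips navigation, ads and other chrome;
    # summary() returns the cleaned-up article body as HTML.
    doc = Document(html)
    return doc.title(), doc.summary()

Finally, get_keywords uses OpenAI to extract the keywords from the article. It's the most interesting of the bunch, so let's have a closer look: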


def get_keywords(body, keywords):
    prompt = "Given this article:\n\n"
    prompt += body + "\n\n"
    prompt += "Generate a dictionary of key/value pairs in json with keys:"
    keywords = ['"' + kw + '"' for kw in keywords]
    prompt += ", ".join(keywords)
    prompt += "\nLeave out what cannot be found"
    return call_open_ai([{"role": "user", "content": prompt}])


All it does is build a prompt asking for a JSON document with the key/value pairs for the columns we specified in the spreadsheet. The AI magic does the rest.
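
call_open_ai is a thin wrapper around the chat completions endpoint. A minimal sketch of what it can look like, using the openai Python library as it existed at the time of writing:

import json
import openai  # openai<1.0 style API

def call_open_ai(messages):
    # Ask GPT-3.5 and parse the JSON document out of its reply;
    # turning the result into a row of cell values happens elsewhere.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    return json.loads(response["choices"][0]["message"]["content"])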


There's a bit of data massaging going on, of course, but once we have the data from the article and the keywords, we just insert a new row into the spreadsheet with:


B5:H.insert_row(
    0, [item['datePublished'], item['description'], item['url'], *keywords]
)


Conclusion

The integration of Neptyne spreadsheets with OpenAI opens up a vast world of possibilities, from autonomously conducting research to data extraction and analysis. This article walked through how to use a Neptyne spreadsheet together with Bing News Search, PhantomJsCloud and OpenAI to create an AI research bot. The bot retrieves and parses news articles based on user-specified queries, applying AI to identify key data points in each source. By combining these services and leveraging their unique capabilities, you can make your research process more automated, efficient and effective.


Thursday, November 10, 2022

Neptyne: Making Spreadsheets Programmable

So yeah, it's been kinda quiet around here - famous last words on any blog, but I thought I should at least post something about what I've been up to. Programmable spreadsheets. Neptyne!

We tried the whole stealth mode thing. I'm not sure about it; maybe just developing out in the open is better. But here we are, ready to talk about what we're up to.

After a bit over 1,000 pull requests, we're ready to start showing a few things. So here it is, Neptyne: the programmable spreadsheet.

Neptyne Screenshot

So this is what it looks like. You have a spreadsheet on the left, a code pane on the upper right and a REPL on the lower right. Write code in the code pane, use it directly from the spreadsheet, and type commands into the REPL to try stuff out. The code is all Python, but we've extended Python to be compatible with Excel.

So for example, you can just loop through a cell range with:

for row in A1:C10:
    for cell in row:
        if cell > 5:
            cell.set_background_color(255, 10, 10)

This makes all cells with a value greater than 5 show up red.
Python screenshot

Here's another example. We have a simple import function that fetches from Wikipedia all the countries that start with S. We dump the result into the spreadsheet by just assigning it to A1 (it will spill over to create a table), then a neat little function annotates each row with the country's emoji flag (sketched below). Finally, a little widget shows the countries on a map with the flags on mouse-over. And all in less than 20 lines of code.
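
The flag annotation is a cute bit of Unicode: flag emoji are just pairs of regional-indicator characters, which sit exactly 127397 code points above the corresponding ASCII letters. A minimal sketch (the function name is illustrative, not necessarily what the sheet uses):

def flag_emoji(country_code):
    # "NL" maps to the two regional-indicator code points
    # that together render as the Dutch flag.
    return "".join(chr(ord(c) + 127397) for c in country_code.upper())

print(flag_emoji("SE"))  # 🇸🇪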

There's a lot more neat stuff going on. Join the waitlist: https://neptyne.com/waitlist-add

Friday, April 1, 2022

The Clever Bit and the Gold Rush

Gold bars (source: Wikimedia)

 

Saturday, March 19, 2022

What’s the clever bit?

I realized that the blog linked from my website still points to this old Blogger thing, while anything I write these days goes on Medium. So I'm going to do some cross-posting here to keep the links from my website working until I come up with something better.

My last day at Sidewalk Labs, the company where I worked on making cities better for the last four years, was December 31st 2021. I’m going to do something new. It’s cool and exciting and I’ll write about it some other time.


Today I want to talk about the clever bit. If you ask me for advice about your startup, I'm going to ask you about the clever bit. What is your unique insight that nobody else has thought of so far? It's not enough to identify a problem to solve; you want to be able to solve it better than other people. Just having a clever bit is not enough of course; you do need to solve a problem for actual people, otherwise you just have a solution looking for a problem.

continue reading

Wednesday, May 10, 2017

Movie Recommendations

Today's project is a small Python notebook that recommends movies. I know, I know, there are a million of those out there, but this one is special: it is not trained on user ratings, but on the outgoing links of the movies' Wikipedia articles.


Why is that good? Two reasons. One is using diverse data. When you build a recommender system on user ratings alone, you do get an Amazon-like system of "people who liked this movie also liked that movie." But if you're not using information like the year of the movie, the genre or the director, you are throwing away a lot of relevant features that are easy to get.

The second reason is that when you start a new project, you probably don't have enough user ratings to recommend stuff from the get-go. On the other hand, for many knowledge areas it is easy to extract the relevant Wikipedia pages.

The outgoing links of a Wikipedia page make for a good signature: similar pages will often link to the same pages. Estimating the similarity between two pages by calculating the Jaccard distance of their link sets would probably already work quite well; a sketch of that baseline follows below. I went a little further and trained an embedding layer over the outgoing links.
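
For reference, the Jaccard baseline fits in a few lines. A minimal sketch with made-up link sets:

def jaccard_similarity(links_a, links_b):
    # Overlap of the two pages' outgoing-link sets,
    # normalized by the size of their union.
    a, b = set(links_a), set(links_b)
    return len(a & b) / len(a | b) if a or b else 0.0

# Hypothetical link sets for two movie articles:
matrix = {"Keanu Reeves", "science fiction", "Warner Bros."}
john_wick = {"Keanu Reeves", "action film", "Lionsgate"}
print(jaccard_similarity(matrix, john_wick))  # 0.2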

The result is not Netflix quality, but it works reasonably well. As an extra bonus, I projected the movies onto a two-dimensional plane, rendering their posters as placeholders. It's fun to explore movies that way. Go play with it.

Wednesday, February 8, 2017

Amazon Dash and Philips Hue

When the Amazon Dash came out a few years ago, I thought it was an April Fool's joke. A button that you install somewhere in your house to order one specific product from Amazon? That's crazy!


It didn't take long for people to figure out how to use the button for something other than ordering products from Amazon. No serious hacking is required, and at $4.99 a pop it's quite affordable. When not in use, the button isn't connected to the wifi, so when you click it, the first thing it does is set up that connection. A script monitoring the local network can easily detect this event and then do something arbitrary, like order beer.

There are many hacks around, but none of them do exactly what I want:
  • When the last person leaves the house, all lights should switch off
  • When the first person comes back, all lights should switch back on
So I wrote a script that doesn't just monitor the Dash button, but also the presence of my phone and my wife's phone on the local network. The basic rules are:
  • If any lights are on, switch them off when:
    • the button is pressed, or
    • no phones have been seen on the network for 20 minutes
  • If the lights were previously switched off, switch them on when:
    • the button is pressed, or
    • a phone is seen after 20 minutes of no phones
This way, the button can always be used to switch the lights on or off, but if you don't switch off the lights when leaving home, they will go off automatically. Unlike with a motion-controlled setup, this won't happen if you are home but not moving (though it will happen if your phone runs out of battery). And when you come home after switching off the lights with the button, they will come on automatically. A sketch of the core loop follows below.
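
Here's a minimal sketch of that presence loop, using scapy to ARP-scan the network and phue to talk to the Hue bridge. All the addresses are made up, and the actual script (see below) also watches for the Dash button joining the network, which this sketch leaves out:

import time
from phue import Bridge                 # pip install phue
from scapy.all import ARP, Ether, srp   # pip install scapy; needs root

PHONE_MACS = {"aa:bb:cc:dd:ee:01", "aa:bb:cc:dd:ee:02"}  # made-up phone MACs
AWAY_AFTER = 20 * 60  # seconds without any phone before the house counts as empty

def macs_on_network():
    # ARP-ping the whole subnet and collect the MAC address
    # of every device that answers.
    packet = Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst="192.168.1.0/24")
    answered, _ = srp(packet, timeout=3, verbose=False)
    return {received.hwsrc for _, received in answered}

bridge = Bridge("192.168.1.2")  # your Hue bridge's IP
last_seen = time.time()
switched_off_by_us = False

while True:
    if PHONE_MACS & macs_on_network():
        last_seen = time.time()
        if switched_off_by_us:
            bridge.set_group(0, "on", True)   # group 0 = all lights
            switched_off_by_us = False
    elif time.time() - last_seen > AWAY_AFTER and not switched_off_by_us:
        bridge.set_group(0, "on", False)
        switched_off_by_us = True
    time.sleep(30)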

To get this working, check out the code, install the requirements and run: 

python auto_lights --hue_bridge=<bridge-ip> --phone_mac=<phone-macs> --dash_mac=<dash-mac>

While running, the program will also print any new MAC addresses it detects, and for extra convenience it prints the manufacturer as well. You can use this to find the MAC address of your phone and of the Dash button: switch your phone to airplane mode, wait for things to quiet down, and when you switch airplane mode off you should see your phone's MAC address.

It works reasonably well. The longest I've seen my phone not contacting the wifi was 13 minutes, so 20 minutes seems safe. Coming home, it takes a little longer than ideal for the phone to reconnect to the wifi, but you can use the Dash button if you are in a hurry.

As always the code is on Github.

Thursday, January 26, 2017

Building a Reverse Image Search Engine using a Pretrained Image Classifier

In the Olden Days, say more than 10 years ago, building image search was really quite hard (see xkcd). In fact, when AltaVista first came out with a search engine for images, they couldn't really do it. What they did was return the image whose surrounding text best matched your query. It's a wonder that it worked at all, but it had to do for years and years.
Why we need Reverse Image Search: find more cat pictures (from: wikipedia)
How things have changed. These days neural networks have no problem detecting the actual content of pictures, in some categories even outperforming their human masters. An interesting development is reverse image search: supply a search engine with an image, and it will tell you where else this or similar images occur on the web. Most articles describing how to do this focus on things like perceptual hashing. While I'm sure that's a good way, it struck me that there is a much simpler one.

Embeddings! Algorithms like Word2Vec train a neural network for a classification task, but they don't use the learned classifier directly. Instead, they use the layer just before the classification as a representation of the word. Similarly, we can run a pre-trained image classifier over a collection of images, but rather than using the final layer to label the result, we take the layer before that as a vector representation of the image. Similar images will have similar vector representations, so finding similar images becomes just a nearest-neighbor search.
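
As an illustration, here's how you can pull such a vector out of a pre-trained VGG16 with Keras. This is a sketch, not necessarily the exact model or code behind the demo; 'fc2' is VGG16's 4096-dimensional layer just before the 1000-way softmax:

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Model
from keras.preprocessing import image

# Chop off the classification head and expose the penultimate layer.
base = VGG16(weights="imagenet")
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def image_vector(path):
    # Load, resize and normalize the image the way VGG16 expects,
    # then return its 4096-dimensional embedding.
    img = image.load_img(path, target_size=(224, 224))
    batch = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(batch)[0]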

As with a lot of things like this, getting the data to run the algorithm on is more work than getting the algorithm to run. Where do we get a set of representative images? The images from Wikipedia are a good start, but we might not want all of them. Most articles are about specific instances of things, while for a reverse image search demo, classes of things are more interesting: we're interested in cats, not in specific cats.

Luckily, Wikidata annotates its records with an 'instance of' property. If you have imported a Wikidata snapshot into Postgres, getting all the values of that property, along with how often each occurs, is a simple SQL statement:
select properties->>'instance of' as thing, count(*) as c
from wikidata
group by thing

For some of these, Wikidata also provides us with a canonical image. For the others, we have to fetch the Wikipedia page and parse the wikicode; we just grab the first image that appears on the page, nothing fancy. After an hour of crawling, we end up with a set of roughly 7,000 images.

Scikit-learn provides a k-nearest-neighbor implementation, and we're off to the races. We can spin up a Flask-based server that accepts an image as a POST request and feeds it into our pre-trained classifier. From that we get the vector representing the image; we then feed that vector into the nearest-neighbor model, and out fall the most similar images. You can see a working demo here
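
The nearest-neighbor lookup itself is only a few lines. A sketch, where vectors is assumed to be the matrix of image embeddings built earlier, paths the matching image filenames, and image_vector the embedding helper from above:

from sklearn.neighbors import NearestNeighbors

# vectors: (n_images, 4096) embedding matrix; paths: list of image filenames.
index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(vectors)

def most_similar(path):
    # Embed the query image and return the files of its closest neighbors.
    distances, indices = index.kneighbors([image_vector(path)])
    return [paths[i] for i in indices[0]]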

It mostly works well. If you feed it a cat, it will return pictures of cats, the definition of success on the Internet. On mobile, you can upload a picture straight from your phone's camera, and that seems to work too. The biggest limitation I've come across so far is that the algorithm is bad at estimating how good its guesses are: if there aren't any suitable pictures in the training set, it will return what it thinks is the closest match, even though to the human eye it can seem fairly unrelated.

As always, you can find the code on Github.